Guide to TCGA data
So you are interested in TCGA data but don't know where to start? Not sure if you need publicly available data or restricted data? You've come to the right place.
Do you need to download data?
It's possible that an online tool will be sufficient for what you are looking for.
cBioPortal is by far the most comprehensive interactive tool for analyzing TCGA data.
Some useful features of cBioPortal:
You will likely be focused on samples that have "01" or "11". It is important to note that LAML samples will be designated "03" since that is a blood-derived cancer, and SKCM has a lot of metastatic samples, so you will be dealing with a lot of "06" in that case.
But my files are named unc.edu.0046fe...!?!?!?!?!?!?
Ah yes, that is quite annoying, but nothing a Python dictionary won't solve ;)
When you download data from https://tcga-data.nci.nih.gov/tcga/ you also get a FILE_SAMPLE_MAP file which maps the patient barcodes to the files you downloaded. So a single patient in your clinical file might have an expression file for their normal tissue sample, one or more tumor samples, and maybe even a recurrent tumor or metastatic tumor.
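Here is a minimal sketch of that dictionary approach. I'm assuming the FILE_SAMPLE_MAP is a tab-separated file with a header row, the file name in the first column and the barcode in the second; check your own copy before trusting this. The function names and the demo rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_file_sample_map(handle):
    """Return {filename: barcode} from a tab-separated FILE_SAMPLE_MAP.

    Assumes a header row, filename in column 1 and barcode in column 2.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the (assumed) header row
    file_to_barcode = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        file_to_barcode[row[0]] = row[1]
    return file_to_barcode


def group_by_patient(file_to_barcode):
    """Group file names by patient, i.e. the first three barcode fields."""
    patient_to_files = defaultdict(list)
    for filename, barcode in file_to_barcode.items():
        patient = "-".join(barcode.split("-")[:3])
        patient_to_files[patient].append(filename)
    return dict(patient_to_files)


# Made-up rows standing in for a real FILE_SAMPLE_MAP.
demo = io.StringIO(
    "filename\tbarcode(s)\n"
    "unc.edu.aaaa.rsem.genes.results\tTCGA-AB-1234-01A-11R-A000-07\n"
    "unc.edu.bbbb.rsem.genes.results\tTCGA-AB-1234-11A-11R-A000-07\n"
)
file_to_barcode = load_file_sample_map(demo)
patient_to_files = group_by_patient(file_to_barcode)
```

With a real file you would pass `open("FILE_SAMPLE_MAP.txt", newline="")` instead of the demo string. Grouping by patient makes it obvious when one person has both a tumor and a normal file.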
Umm, are these gene names correct????
This is one of the main problems with Tier 3 data. The same pipeline with the same gene annotations is used in every cancer study, including newer cancer studies. And the first cancer study was like a long time ago, making these gene annotations really old.
As a result, if you are studying a lncRNA it probably won't be in TCGA Tier 3 data, and you need to check out MiTranscriptome or TANRIC.
Enough with this weak sauce, I need that Tier 1 data >:)
Okay, no problem, if you are associated with a university you can ask your advisor to apply for access.
Instructions are at https://cghub.ucsc.edu/access/get_access.html
Once you have the key file that you need, download GeneTorrent. This program has multiple dependencies, but I've seen it easily installed on a Mac and an Ubuntu server.
Once you've got that installed (preferably on a server with a good internet connection and a ton of memory and compute power), head over to https://browser.cghub.ucsc.edu.
Select the samples you want and add them to your cart (that's right, we're going shopping, and everything is free!). You will want to download the manifest file and the tsv file.
With your manifest file run this command on your server: gtdownload -v --max-children 1 -d manifest.xml -c cghub.key
#If you have a bunch of cores you can increase max-children
Download speed is fast, but for each file it takes some time to connect. So even if you are downloading a bunch of small files it might have to run overnight (make sure at the checkout you see how much data you are downloading so you don't fill your disk!).
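If you'd rather launch the download from a script, something like the following could work. It is only a sketch: the flags are the ones from the command above, while the helper names and the 20% safety margin are my own assumptions.

```python
import shutil


def build_gtdownload_cmd(manifest="manifest.xml", key="cghub.key", max_children=1):
    """Build the gtdownload argument list using the flags shown above."""
    return [
        "gtdownload", "-v",
        "--max-children", str(max_children),
        "-d", manifest,
        "-c", key,
    ]


def has_room(download_bytes, path=".", safety_factor=1.2):
    """True if the disk holding `path` has space for the download plus a margin.

    The 1.2 safety factor is an arbitrary choice, not a CGHub recommendation.
    """
    free = shutil.disk_usage(path).free
    return free >= download_bytes * safety_factor


cmd = build_gtdownload_cmd(max_children=4)
```

You would then check `has_room(...)` with the total size shown at checkout and, if it passes, hand `cmd` to `subprocess.run(cmd, check=True)`.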
With your TSV file you can map the analysis ids of the files you just downloaded to the patient barcodes.
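A rough sketch of that mapping is below. The column names `analysis_id` and `barcode` are assumptions on my part, so look at the header of your own TSV and adjust them; the demo rows are invented.

```python
import csv
import io


def analysis_id_to_barcode(handle, id_col="analysis_id", barcode_col="barcode"):
    """Return {analysis id: barcode} from a tab-separated summary file.

    The default column names are assumptions; pass your own if they differ.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: row[barcode_col] for row in reader}


# Invented rows standing in for the real TSV from the cart.
demo_tsv = io.StringIO(
    "barcode\tanalysis_id\tfilename\n"
    "TCGA-AB-1234-01A-11R-A000-07\t0000-aaaa\tsample_one.bam\n"
    "TCGA-CD-5678-11A-11R-A000-07\t1111-bbbb\tsample_two.bam\n"
)
id_to_barcode = analysis_id_to_barcode(demo_tsv)
```

Once you have this dictionary, renaming or indexing the downloaded files by patient barcode is a simple lookup.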
I changed my mind, I'm not as hardcore as I thought :'(
Yeah...working with Tier 1 data is a huge pain. Luckily people have realized this and there are some pilot programs for analyzing TCGA data in the cloud that don't require downloading the raw files (but you will still need to save your processed files, which could be just as large or larger than the raw data). I haven't used any of these services yet so I can't recommend one or provide advice.
And that should get you started analyzing TCGA data. Anything important I missed?
- Getting a quick, easy to view summary of each cancer study. This includes overall survival of the patients, demographics, overall mutation and CNA counts
- Identifying which genes are most heavily mutated in each cancer or have undergone copy number alterations
- Identifying genes most highly co-expressed to your gene and the co-occurrence of mutations and CNAs
- Exploring protein and phosphoprotein levels
- Survival analysis (disease free and time to death) with either mutations, CNAs, or expression (microarray or RNA-SEQ)
- Expression is listed as z-scores instead of the raw values
- The Onco Query Language only allows for comparison of the altered group versus unaltered group (this prevents you from comparing highest expressing patients to lowest expressing patients)
- The miRNA data in cBioPortal suffers from the fact that the TCGA Tier 3 annotations are out of date: expression is for stem-loop sequence instead of mature miRNAs
- Only contains data for tumors, cannot perform a comparison of expression in normal tissue versus tumor
- Excellent visualization with a UCSC genome browser type of layout that includes multiple clinical features along with RNA-SEQ expression and methylation probe values
- Single click statistical correlations between features such as
- Sample type (normal or tumor) and expression
- Pathologic stage and lymphocyte infiltration
- BRCA PAM50 subtype and expression
- Only contains expression (RNA-SEQ) and survival data (time to death or last follow-up)
- Displays the results for up to 21 survival analyses at a time
- Allows for interactive Kaplan-Meier plot generation and download of the exact clinical and expression data used for the plot
- This is likely the easiest method out there to get expression of your gene of interest in a cancer
- Contains updated miRNA definitions and includes MiTranscriptome beta lncRNAs
- OncoLnc does not contain any data for normal tissue
- whole genome sequencing
- RNA sequencing
- small RNA sequencing (only BAM)
- Bisulfite sequencing
- ChIP sequencing
- Clinical data
- Processed files for
- RNA-SEQ
- small RNA-SEQ
- protein (RPPA)
- methylation
- SNPs and mutations
- clinical_follow_up_v1.5
- clinical_follow_up_v2.1
- clinical_follow_up_v4.0
- clinical_patient
Code | Definition | Short Letter Code
---|---|---
01 | Primary Solid Tumor | TP
02 | Recurrent Solid Tumor | TR
03 | Primary Blood Derived Cancer - Peripheral Blood | TB
04 | Recurrent Blood Derived Cancer - Bone Marrow | TRBM
05 | Additional - New Primary | TAP
06 | Metastatic | TM
07 | Additional Metastatic | TAM
08 | Human Tumor Original Cells | THOC
09 | Primary Blood Derived Cancer - Bone Marrow | TBM
10 | Blood Derived Normal | NB
11 | Solid Tissue Normal | NT
12 | Buccal Cell Normal | NBC
13 | EBV Immortalized Normal | NEBV
14 | Bone Marrow Normal | NBM
20 | Control Analyte | CELLC
40 | Recurrent Blood Derived Cancer - Peripheral Blood | TRB
50 | Cell Lines | CELL
60 | Primary Xenograft Tissue | XP
61 | Cell Line Derived Xenograft Tissue | XCL
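To put the table to work, here is a small sketch that pulls the two-digit sample type out of a barcode and looks it up. It relies on the sample type being the first two characters of the fourth dash-separated field (for example the "01" in TCGA-AB-1234-01A). The dictionary is only a subset of the table, and the function names and example barcodes are made up.

```python
# Subset of the sample type table above; extend it with the other codes as needed.
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "03": "Primary Blood Derived Cancer - Peripheral Blood",
    "06": "Metastatic",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}

# Codes listed as normals in the table.
NORMAL_CODES = {"10", "11", "12", "13", "14"}


def sample_type_code(barcode):
    """Return the two-digit sample type code from a TCGA barcode."""
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    return fields[3][:2]


def is_normal(barcode):
    """True if the barcode's sample type is one of the normal codes."""
    return sample_type_code(barcode) in NORMAL_CODES


tumor = "TCGA-AB-1234-01A-11R-A000-07"
normal = "TCGA-AB-1234-11A-11R-A000-07"
tumor_label = SAMPLE_TYPES.get(sample_type_code(tumor), "unknown")
```

A patient-level barcode such as TCGA-AB-1234 has no sample field, so the helper raises an error instead of guessing.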