Project workings

All of the data used in this study was downloaded from https://tcga-data.nci.nih.gov/tcga/. I first went through each cancer and checked whether there were enough patients with clinical data and enough deaths to warrant inclusion in the study. For example, prostate adenocarcinoma has a lot of data but too few deaths to perform a meaningful survival analysis. I know that data-friendly formats can be found at the Broad's Firehose, but given how complicated the clinical data is to parse, I have no way of knowing whether Firehose has parsed it correctly. This may sound paranoid, but I have good reason to believe cBioPortal does not parse the clinical data correctly (see below), so I am reluctant to trust third-party sites.
The issue with the clinical data is that the files are a bit of a mess. When you download clinical data from the TCGA, you get multiple files. The "clinical_follow_up" file sounds like it should contain all the survival information, and it does contain survival information, just not all of it. Sometimes there are multiple follow_up files, and these contain nonredundant information. What's worse, even within one file a patient can be listed two or more times, so care has to be taken to use the most recent information. To complicate matters further, there is a "clinical_patient" file that also contains survival information, which may or may not be redundant. For each cancer I took care to parse the files in a way that ensured I had the most up-to-date survival information possible. I believe cBioPortal may not do this correctly because for at least several months it was not possible to make survival curves for LAML. At the time this cancer did not have any follow_up files, but did have survival information "hidden" in the clinical_patient file, which is what I used for my analysis.
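To make the merging problem concrete, here is a minimal sketch of collapsing several clinical files into one survival record per patient. It is not the parsing code used for this study. The Record fields, the latest_survival function and the rule for picking the "most recent" record (a death record wins, otherwise the longest follow-up wins) are all assumptions made for illustration, and real TCGA files need format-specific parsing first.

```python
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    """One survival entry for a patient (hypothetical, simplified fields)."""
    patient_id: str
    days: int    # days to death, or days to last contact if alive
    dead: bool


def latest_survival(*sources: Iterable[Record]) -> Dict[str, Record]:
    """Merge records from several files, keeping one record per patient.

    Assumed rule: a death record beats any "alive" record, and among
    records with the same status the one with more days is kept.
    """
    best: Dict[str, Record] = {}
    for source in sources:
        for rec in source:
            cur = best.get(rec.patient_id)
            if cur is None or (rec.dead, rec.days) > (cur.dead, cur.days):
                best[rec.patient_id] = rec
    return best


# Toy stand-ins for a clinical_patient file and a follow_up file.
patient_file = [Record("P1", 100, False), Record("P2", 50, False)]
follow_up = [
    Record("P1", 400, False),
    Record("P1", 650, True),   # same patient listed twice in one file
    Record("P2", 30, False),   # older than the clinical_patient entry
]
merged = latest_survival(patient_file, follow_up)
```

In this toy example P1 ends up with the death record from the follow_up file, while P2 keeps the clinical_patient entry because it has the longer follow-up. That second case is the one that is missed if only the follow_up file is read.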
Once all the data is parsed correctly, you then have to decide how to calculate a gene's correlation with survival. cBioPortal uses Kaplan-Meier curves, which are likely the best method of visually showing a survival effect, but they have several issues. First off, you lose a lot of statistical power by dividing the patients into only two groups. If you divide the patients down the middle, then a patient with an average expression of a gene is in the same group as a patient with an extreme expression of the gene. In addition, Kaplan-Meier plots do not account for confounders. For example, if a gene's expression correlates with age or grade, then a Kaplan-Meier plot could misleadingly suggest that the gene itself is correlated with survival. As a result, the standard method for performing survival analysis in the field is Cox regression, which allows the magnitude of a gene's expression to be considered along with covariates such as age. Cox regression has been used extensively in the literature to find prognostic genes for various cancers, but the majority of these studies have used microarray data and not RNA-SEQ. Microarray data is usually log2 transformed and has a narrow range of values; for example, a gene's expression range might be 6-10. In contrast, RNA-SEQ can have an extreme range of values, with a gene's expression ranging from 0 to hundreds of thousands of reads. In regression, extreme values can have an unduly large influence on the model, which poses a problem for RNA-SEQ data. As a result, I transformed each gene's expression values so that they follow a standard normal distribution, which can be referred to as a Blom or inverse normal transformation. In my models I did not transform the age variable, and for grades I used separate indicator terms that were either 1 or 0, as is standard practice. The Cox regression was implemented with the R survival library (and rpy2 to automate the process).
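As an illustration of the transformation step, here is a minimal sketch of a rank-based inverse normal transformation using Blom's formula, where a value with rank r out of n is mapped to the standard normal quantile at (r - 3/8) / (n + 1/4). The function name blom_transform and the handling of ties (averaging their ranks) are assumptions for this sketch, not necessarily the exact choices made in the analysis.

```python
from statistics import NormalDist


def blom_transform(values):
    """Return normal scores for `values` using Blom's rank-based formula."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


# Toy read counts for one gene across five patients, with one extreme value.
counts = [0, 12, 250000, 12, 3]
scores = blom_transform(counts)
```

Because only the ranks matter, the count of 250000 becomes just the largest score rather than a point thousands of times further out than the rest. Taming extreme values in this way is the reason for transforming before fitting the regression.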
I was interested in the effect the expression of the gene had on survival independent of the covariates, so I recorded the Cox coefficient for each gene's expression term and its associated p-value, which is derived from the standard error of the coefficient. With these lists of Cox coefficients and p-values, I was able to identify the most protective and harmful genes in each cancer. The nice thing about a Cox coefficient is that it can be positive or negative, which tells you the direction of its contribution to the hazard function. In short, a positive value indicates that higher expression of the gene increases the chance that the patient will suffer an event (death), while a negative value indicates the gene is protective. When studying prognostic genes it seemed to make sense to separate them by the sign of their Cox coefficient, because genes in the same pathway are more likely to have similar contributions to survival. Indeed, when I clustered patients by gene expression, drastic differences in expression patterns between protective and harmful genes were seen (Fig. 1b). As a result, I found gene sets with MSigDB separately for protective and harmful genes, and these were also very different for each cancer (Fig. 2b). Comparisons of the 16 cancers with these gene sets showed groups of cancers which shared large numbers of gene sets. I wondered if this result could be reproduced by an independent method, so I came up with the idea of clustering cancers by their Cox coefficients. The range of values for Cox coefficients can vary between cancers; for example, it might be -0.4 to 0.4 in one cancer but -0.9 to 0.9 in another. To adjust for this variation I decided to normalize the Cox coefficients before clustering. Clustering cancers by the Cox coefficients of individual genes (Fig. 3) or by the average Cox coefficients of pathways (Fig. 4) grouped together the cancers that shared large numbers of gene sets, indicating a replicable similarity among these cancers.
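To show why normalizing matters before comparing cancers, here is a minimal sketch that rescales each cancer's Cox coefficients to z-scores. The text above does not specify which normalization was used, so the z-score is an assumption chosen for illustration, and the function name and the coefficient values are made up.

```python
from statistics import mean, pstdev


def zscore(coefs):
    """Rescale a list of coefficients to mean 0 and standard deviation 1."""
    mu = mean(coefs)
    sigma = pstdev(coefs)
    if sigma == 0:
        raise ValueError("all coefficients are identical; cannot rescale")
    return [(c - mu) / sigma for c in coefs]


# Hypothetical coefficients for the same five genes in two cancers.
# The pattern is identical, but the second cancer has a wider range.
cancer_a = [-0.4, -0.2, 0.0, 0.2, 0.4]
cancer_b = [-0.9, -0.45, 0.0, 0.45, 0.9]

norm_a = zscore(cancer_a)
norm_b = zscore(cancer_b)
```

After rescaling, the two toy cancers have the same profile. A distance-based clustering would then compare the pattern of protective and harmful genes instead of the overall spread of the coefficients.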
What it means to be a prognostic gene is difficult to answer. The most boring possibility is that the gene is simply a passenger, i.e. it is regulated in the same way as a gene that is having an effect on survival. If a gene does truly have an effect on survival, it is impossible to know whether that is due to an intrinsic effect on pathogenesis or whether the gene's effect on survival arises in response to the treatment the patients are receiving. As a result, the similarities I found among cancers could be due to similarities in pathogenesis, responses to treatment, some combination of these, or other possibilities I have not considered. I hope my work gets researchers to think more about why certain genes are associated with survival, and what this can reveal about cancers.