How and why this project started

In today's funding and publication environment, it is important to put a clinical spin on your work, i.e., to explain why the protein you're studying is important. With resources such as cBioPortal that allow exploration of clinical data, it has never been easier to find an association between a gene and a disease, and it is becoming common to find Kaplan-Meier plots copied and pasted into articles to show the clinical relevance of a study.

Typically, a researcher will look through the various cancers on cBioPortal and play with different expression cutoffs until they get a p-value below 0.05, and of course they won't correct for the number of fidangles it took to get there. Scary stuff, I know.

More important than not correcting for the multiple fidangles, however, is the use of the logrank p-value from cBioPortal in the first place. A p-value is only meaningful if the null hypothesis is set up correctly. Researchers treat these p-values as showing that their gene has an effect on survival, but all the logrank p-value says is that the survival curves of the two groups being compared differ more than chance alone would explain. It says nothing about whether the gene is the cause of that difference.

It is not possible to set up a hypothesis test for whether a gene is responsible for cancer progression, but we can at the very least check how the gene we're studying compares with other genes in terms of correlation with survival. If our gene is the one most highly correlated with survival, then maybe the correlation does mean our gene has a role in cancer; if it's the 1000th most correlated gene, it becomes less exciting.

And if the clinical relevance of our work is so important, aren't we going about this backwards? Shouldn't we start off by studying the genes that are most correlated with survival instead of abusing statistics to retroactively show that the gene our lab studies is important? With this in mind, I went to check whether there was a resource where I might find all of the correlations with survival for TCGA data.
Surprisingly, I could not find these lists. Even when a third-party site like the Broad's Firehose says it performed Cox regression, it lists 0 genes as correlated with survival. I was familiar with TCGA data, so I went ahead and started downloading data for all the cancers and performing Cox regressions, which you can read more about in the next post.
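As a rough illustration of the cutoff-hunting problem described earlier, here is a minimal sketch of how quickly the chance of a spurious "significant" result grows with the number of attempts. It assumes every attempt is an independent test at the same threshold, which is not quite true when you re-cut the same patients at different expression levels, so treat the numbers as an upper-end intuition rather than an exact rate. The function names are my own and not part of any library.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests tests.

    Assumes the tests are independent and each uses threshold alpha.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the overall error rate near alpha."""
    return alpha / n_tests


# Hypothetical numbers of cancer/cutoff combinations a researcher might try.
for n_tests in (1, 5, 10, 20):
    rate = family_wise_error_rate(0.05, n_tests)
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:>2} tries: chance of a false hit = {rate:.3f}, "
          f"corrected threshold = {threshold:.4f}")
```

Under these assumptions, ten tries already give about a 40% chance of at least one p-value below 0.05 even when nothing real is going on.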