Why are you not using all of your data? In publications that perform survival analyses, even those from the highest-profile journals, I see researchers split the patient data into a training set and a test set, perform Cox regression on the training set, identify the genes that best classify patients, and then find that these genes can also classify patients in the test set. What a surprise! Cox regression is essentially a more complicated relative of linear regression, made necessary by the fact that the outcome (usually death) hasn't happened yet for all patients. Would you split your data into a test and a training set if you were performing a Pearson correlation of gene expression and patient age? No, you wouldn't. So why are you voluntarily reducing your statistical power when performing Cox regression?
Dividing data into test and training sets is primarily a way to check whether your model is overfitting the data. In the case of univariate regression this is not a concern, because your model could not possibly be any simpler. In the case of multivariate Cox regression, overfitting could be a concern. The general rule of thumb is to have 10-15 data points per variable (for survival data this is usually counted in events, such as deaths, rather than in patients). If you are pushing this limit you might want to think about test and training sets, but splitting will reduce your data points, and the results may not reflect how many variables the entire data set could handle. A Catch-22!
Another situation where dividing your data might be useful is if you believe your patient population is heterogeneous and some patients are unduly influencing your model. In this case it could be a good idea to randomly resample the data thousands of times to check the variability in your results and potentially identify the patients who are causing spurious results. So I ask again, why are you doing this?
I assume that this practice originates from taking genes identified in one data set and seeing whether those genes can classify patients in a different data set, which is a completely reasonable thing to do. But if you have only one data set, there likely isn't a good reason to split it into two smaller ones. As you may have guessed, OncoLnc uses all available patients, with no test or training sets.
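
For anyone who wants to try the resampling idea mentioned above, here is a minimal sketch in Python. It is not OncoLnc code. Everything in it is an assumption made for illustration: the patient data are simulated, the helper names (`cox_univariate_beta`, `simulate_patients`, `bootstrap_betas`) are hypothetical, and the Cox fit is a bare-bones Newton-Raphson on the partial likelihood with Breslow handling of ties, not a replacement for a proper survival library. It refits a one-gene Cox model on bootstrap resamples of the full cohort and reports how much the coefficient moves around.

```python
import math
import random


def cox_univariate_beta(times, events, x, max_iter=25, tol=1e-8):
    """Estimate the coefficient of a one-covariate Cox model.

    Newton-Raphson on the partial log-likelihood, Breslow handling of ties.
    """
    n = len(times)
    # Walk from the longest follow-up to the shortest so the risk set only grows.
    order = sorted(range(n), key=lambda i: times[i], reverse=True)
    beta = 0.0
    for _ in range(max_iter):
        s0 = s1 = s2 = 0.0  # running sums over the current risk set
        grad = info = 0.0   # score and observed information
        i = 0
        while i < n:
            # Add everyone sharing this time to the risk set before scoring events.
            j = i
            while j < n and times[order[j]] == times[order[i]]:
                k = order[j]
                w = math.exp(beta * x[k])
                s0 += w
                s1 += w * x[k]
                s2 += w * x[k] * x[k]
                j += 1
            for k in order[i:j]:
                if events[k]:
                    mean = s1 / s0
                    grad += x[k] - mean
                    info += s2 / s0 - mean * mean
            i = j
        if info <= 1e-12:
            break  # no usable information (e.g. a resample with no events)
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta


def simulate_patients(n, true_beta, rng):
    """Simulated cohort (an assumption): exponential survival, random censoring."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    death = [rng.expovariate(math.exp(true_beta * xi)) for xi in x]
    censor = [rng.expovariate(0.5) for _ in range(n)]
    times = [min(d, c) for d, c in zip(death, censor)]
    events = [d <= c for d, c in zip(death, censor)]
    return times, events, x


def bootstrap_betas(times, events, x, n_boot, rng):
    """Refit the model on n_boot resamples drawn with replacement."""
    n = len(times)
    betas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        betas.append(cox_univariate_beta(
            [times[i] for i in idx],
            [events[i] for i in idx],
            [x[i] for i in idx],
        ))
    return betas


rng = random.Random(0)
times, events, x = simulate_patients(200, 0.7, rng)

beta_full = cox_univariate_beta(times, events, x)  # fit on ALL patients
betas = sorted(bootstrap_betas(times, events, x, 200, rng))

# Percentile interval from the resampled coefficients.
lo = betas[int(0.025 * len(betas))]
hi = betas[int(0.975 * len(betas)) - 1]
print(f"beta (all patients): {beta_full:.3f}; bootstrap 95% range: {lo:.3f} to {hi:.3f}")
```

The sketch uses 200 resamples only to keep the run short; for a real analysis you would use thousands. Note that every resample is the same size as the original cohort, so the check costs you no patients. Spotting which patients drive a spurious result would take an extra step, such as tracking which patients are present in the resamples with extreme coefficients, and that is not shown here.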