ONE: Obtaining training data: The Experimental Design

The first work package aims to define the variables pertaining to the experiment and to design the overall procedure and guidelines for obtaining the training data from the patients involved. Because the accuracy of the treatment response prediction model is critical in clinical settings, the variables involved in obtaining the training data, i.e. the type or stage of cancer and the treatment being administered, must be defined specifically; doing so also helps drive the application of stratified, and even personalized, medicine. For the first task, the experimental variables are defined as tissues sampled from the primary as well as secondary tumors of stage IV colorectal cancer patients undergoing chemotherapy with agents containing fluorouracil (5-FU).
Tissues will be sampled before treatment initiation.

The second task further defines the type of input data to obtain from the tissue samples. The BayCount model will be applied, which factorizes a gene expression matrix to infer the heterogeneous subclones present in each tissue sample. Using RNA-Seq counts and a negative binomial model, it first estimates the total number of subclones across all samples by means of maximum likelihood. Additionally, it calculates the proportion of each subclone in every sample, as well as the relevant gene expression pattern within each subclone, while taking into account systematic variation and gene-specific bias, giving a normalized version of the data.

Finally, the third task concerns labeling the training data. After completion of the treatment plan, the patients’ response will be measured using the RECIST criteria, which grade the outcome as a complete response (CR), a partial response (PR), progressive disease (PD) or stable disease (SD).
Patients graded as PD or SD would be sampled again, as done before the treatment, to label, and also validate, which of the subclones present before the treatment had survived the regimen. These subclones should be labeled as “resistant”, and all other subclones that seem to have disappeared after treatment in the same patient should be labeled as “sensitive”. This is our training data. Because BayCount is able to report the subclonal proportions of each patient, we can also investigate whether the resistance of a subclone depends on its proportion as well as its expression.

TWO: Feature selection and data processing

The second work package focuses on feature selection, a procedure that narrows down the number of features, in this case genes, to be used as input. Because some of the genes included are irrelevant to the analysis conditions, they contribute to the “curse of dimensionality”, increase computational costs and introduce noise into the data.
Here, manual selection is used to retain the most relevant genes, those most likely to yield a good classifier. Accordingly, we can employ prior knowledge to narrow down the number of genes to those involved in the cell cycle, for example, since we are studying cancerous cells. Commonly mutated genes in cancer are oncogenes like the RAS gene, tumor suppressor genes such as the TP53 gene, and DNA repair genes. We can also include the genes that are possibly targeted by the chemotherapy. In this case, previous studies have shown that amplification of the thymidylate synthase gene rendered human colon cancer cell lines resistant to 5-FU drugs, whose mechanism depends on acting as a pyrimidine analog antimetabolite to inhibit the synthesis of deoxythymidine monophosphate (dTMP), eventually interrupting DNA synthesis.
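
The prior-knowledge filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene symbols in the panel (KRAS, NRAS, TP53, TYMS) are example choices for the gene classes named in the text, ACTB stands in for an irrelevant gene, the function name select_features is hypothetical, and the expression values are made up.

```python
# Hypothetical curated panel: illustrative symbols for the gene classes named
# in the text (RAS oncogenes, TP53, thymidylate synthase). Not a vetted list.
CURATED_GENES = {"KRAS", "NRAS", "TP53", "TYMS"}


def select_features(expression, curated=CURATED_GENES):
    """Keep only curated genes from a {gene: [value per subclone]} mapping.

    Returns the filtered mapping and the sorted list of curated genes that
    were not found in the input, so gaps in the data are visible.
    """
    kept = {gene: values for gene, values in expression.items() if gene in curated}
    missing = sorted(curated - set(kept))
    return kept, missing


# Toy expression table (made-up numbers): gene symbol -> value per subclone.
expression = {
    "KRAS": [5.1, 7.3, 4.8],
    "TP53": [2.0, 0.4, 1.9],
    "TYMS": [3.3, 9.8, 3.1],
    "ACTB": [11.0, 10.8, 11.2],  # assumed irrelevant gene, dropped by the filter
}

selected, missing = select_features(expression)
print(sorted(selected))  # ['KRAS', 'TP53', 'TYMS']
print(missing)           # ['NRAS']
```

Reporting the missing genes alongside the filtered table is a design choice of this sketch: it makes it obvious when a gene from the curated panel was not measured at all.
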
Evidently, we can also include genes known to date to be useful biomarkers for colorectal cancer, such as mutated APC and beta-catenin genes, both of which are involved in the Wnt signaling pathway, and the BRAF gene, which is involved in the MAPK pathway, where stimulation of the first pathway activates the other.

THREE: Model selection

In this section, we apply some of the well-known machine learning models to classify our data. Because the training data obtained from the patients are already labeled as either “resistant” or “sensitive”, the learning models can be applied in a supervised manner. As opposed to unsupervised learning, which aims to explore unknown classes from the inherent variation of the data, a supervised algorithm can use the labels to produce a better-fitting classification model for the patients whose treatment response will later be predicted.

There are two main tasks in this work package. The first will choose and train different algorithms to accurately classify our data, while the second will examine their performance in order to select the best one to apply to our test data.

One of the most commonly used supervised algorithms is the support vector machine, or SVM.
SVMs have the advantage of being able to compute both linear and non-linear classifications while avoiding over-fitting and retaining their generalization property. They are well supported mathematically and can perform with high accuracy, especially given a lot of training data. The SVM is also a discriminative approach to learning: it works best for predicting classes rather than interpreting the reasons behind the classification. SVMs only work with labeled data and focus only on the data points, called support vectors, that define the separating hyperplane maximizing the distance (margin) between the two classes.
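
To make the margin idea concrete, the following is a minimal sketch of a linear SVM trained by stochastic sub-gradient descent on the regularized hinge loss. It is not the implementation proposed for this work package: it covers only the linear case (no kernels), the function names, learning rate and regularization strength are assumptions chosen for illustration, and the two-feature data are made up, with +1 standing for “resistant” and -1 for “sensitive”.

```python
import random


def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200, seed=0):
    """Fit (w, b) by stochastic sub-gradient descent on the L2-regularized hinge loss.

    X is a list of feature vectors; y holds labels encoded as +1 or -1.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Regularization step: shrink the weights slightly on every update.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge-loss step: only points inside the margin push on w and b.
            if margin < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, centered expression values for two hypothetical genes (made-up numbers).
X = [[2, 2], [3, 2], [2, 3], [3, 3], [-2, -2], [-3, -2], [-2, -3], [-3, -3]]
y = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = "resistant", -1 = "sensitive"

w, b = train_linear_svm(X, y)
train_predictions = [svm_predict(w, b, x) for x in X]
print(train_predictions == y)
```

Only samples whose margin falls below 1 change the hyperplane, which mirrors the statement above that the classifier is determined by the points nearest the class boundary.
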
Another commonly used learning algorithm is the random forest, or RF, approach which, like the SVM, performs with high accuracy. RF is a variation of the decision tree methodology, a rather greedy analysis for classification or regression tasks. The power of RF lies in that it repeatedly draws random samples from the training data, with replacement, as well as random subsets of the data parameters or variables, and creates numerous trees from those subsets.
It then classifies the data based on the individual “votes”, or the averaged value, of those weak trees, giving a robust result and creating a model that mitigates the problem of over-fitting.
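
The bootstrap-and-vote mechanism can be illustrated with a deliberately simplified sketch. As an assumption made for brevity, each tree here is a depth-one decision tree (a stump) built on a single randomly chosen feature, whereas a full random forest grows deeper trees and re-draws candidate features at every split. All function names and data values are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Pick the threshold on one feature that misclassifies the fewest samples."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: with replacement
        feature = rng.randrange(n_features)         # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    """Classify one sample by majority vote over all trees."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy expression values for two hypothetical genes (made-up numbers).
X = [[1.0, 0.9], [1.2, 1.1], [0.8, 1.0], [1.1, 0.7],
     [3.0, 3.2], [3.3, 2.9], [2.8, 3.1], [3.1, 3.4]]
y = ["sensitive"] * 4 + ["resistant"] * 4

forest = fit_forest(X, y)
print(forest_predict(forest, [0.5, 0.5]))
print(forest_predict(forest, [3.5, 3.5]))
```

Because each stump sees a different bootstrap sample, individual trees may place their thresholds differently; the majority vote smooths out those differences, which is the source of the robustness described above.
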