Thesis Proposal


Feature selection and classification algorithms in a high-dimensional setting to predict cancer occurrence using gene expressions

Guiding research question

How can the occurrence of specific cancer types be predicted from a large number of gene expression profiles? Correspondingly, which classification algorithm performs best for each type of tissue sample?


This paper extends previous research on approaches to classifying cancer tissues. The intended primary audience is researchers in the medical field, pathologists, and biostatisticians, who will benefit from a comparative presentation of algorithms across tissues of several cancer types (leukemia, colon, lung). The comparison will help researchers choose an appropriate algorithm for a specific cancer type. This matters for two reasons: high dimensionality may add noise to the modelling process, and, more importantly, it is critical to find a smaller set of gene expressions that are "sufficiently informative to distinguish cells of different types" (Ben-Dor 2000).
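
To make the idea of selecting a smaller, informative gene set concrete, the following is a minimal sketch and not the method proposed in this thesis. It assumes synthetic data in place of the real leukemia, colon, and lung data sets, a simple mean-difference score for ranking genes, and a nearest-centroid classifier. All function names, sample sizes, and the effect size are hypothetical choices made for illustration only.

```python
import random
import statistics


def make_data(n_per_class, n_genes, n_informative, shift, rng):
    """Synthetic expression matrix (assumption): only the first
    n_informative genes differ between the two classes."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def score_genes(X, y):
    """Score each gene by the absolute difference in class means
    divided by the sum of the class standard deviations."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        spread = statistics.stdev(a) + statistics.stdev(b)
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / spread)
    return scores


def select_top(scores, k):
    """Return the indices of the k highest-scoring genes."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, genes):
    """Fit class centroids on the chosen genes, then report test accuracy."""
    centroids = {}
    for label in sorted(set(y_train)):
        rows = [row for row, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [statistics.mean([r[j] for r in rows]) for j in genes]
    correct = 0
    for row, lab in zip(X_test, y_test):
        pred = min(
            centroids,
            key=lambda c: sum((row[j] - m) ** 2 for j, m in zip(genes, centroids[c])),
        )
        correct += pred == lab
    return correct / len(y_test)


rng = random.Random(0)
n_genes, n_informative = 200, 5
X_train, y_train = make_data(20, n_genes, n_informative, 4.0, rng)
X_test, y_test = make_data(20, n_genes, n_informative, 4.0, rng)

# Genes are ranked on the training samples only, so the test set stays unseen.
scores = score_genes(X_train, y_train)
selected = select_top(scores, n_informative)

acc_all = nearest_centroid_accuracy(X_train, y_train, X_test, y_test, list(range(n_genes)))
acc_selected = nearest_centroid_accuracy(X_train, y_train, X_test, y_test, selected)
print("selected genes:", sorted(selected))
print("accuracy, all genes:", acc_all)
print("accuracy, selected genes:", acc_selected)
```

In this sketch the gene ranking uses the training samples only, so the held-out accuracy is not biased by the selection step. The same comparison could be repeated with other scoring rules and classifiers to see how the choice of algorithm interacts with the data set.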


Information and Resources

Previous research on gene expression and cancer classification, as applied to the colon cancer and leukemia data sets, will serve as the foundation for this research. A similar set of data inv