Thesis Proposal


Feature selection and classification algorithms in a high-dimensional setting to predict cancer occurrence using gene expressions

Guiding research question

How can the occurrence of specific cancer types be predicted from a large number of gene expression profiles? Correspondingly, which classification algorithm is best suited to each type of tissue sample?


This paper extends previous research on approaches to classifying cancer tissues. The intended primary audience is researchers in the medical field, pathologists, and biostatisticians, who will benefit from the comparative presentation of algorithms on tissues of several cancer types (leukemia, colon, lung). The comparison will help researchers choose an appropriate algorithm for a specific cancer type. Feature selection is central to that choice: the high dimensionality of the data may add noise to the modelling process, and, more importantly, it is critical to find a smaller set of gene expressions that are "sufficiently informative to distinguish cells of different types" (Ben-Dor et al., 2000).


Information and Resources

Previous research on gene expression and cancer classification, as applied to the colon cancer and leukemia data sets, will serve as the foundation for this research. A similar data set involving lung tissues will be used, and classification algorithms will be developed on these data sets. Feature selection methods will be considered, with a focus on how R and its existing packages can be used in a high-dimensional setting. Accordingly, this research is of the simulation experiment type described in the SPS Graduate Handbook.

Data Collection

The colon and leukemia data sets are available as described and used in the previous research listed in the preliminary reference section.
The lung cancer data set will be sourced from the National Center for Biotechnology Information (NCBI).

Overview of the analytical approach

Previous Research Analysis

The author will analyze existing research on cancer classification from gene expression data, focusing on the methods used (feature selection and machine learning) and the accuracy measures reported for each.

Application to the Lung Cancer Data

The author will then apply some (or all) of these feature selection methods and classification algorithms to a data set of gene expression profiles from normal and cancerous lung tissues.
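To make the feature selection step concrete, the following is a minimal sketch of one common filter-type approach: ranking genes by the absolute Welch t-statistic between normal and tumor samples and keeping the top few. It is only an illustration under stated assumptions, not the method of the cited studies. The data are synthetic, the function names (welch_t, rank_genes, make_synthetic) and all parameter values are hypothetical, and the sketch is written in Python with the standard library only, whereas the proposed work plans to use R.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Absolute Welch t-statistic between two lists of expression values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0:
        return 0.0
    return abs(statistics.mean(a) - statistics.mean(b)) / se


def rank_genes(samples, labels, top_k):
    """Return the indices of the top_k genes with the largest |t| score.

    samples: list of expression profiles (one list of floats per tissue)
    labels:  list of class labels (0 = normal, 1 = tumor)
    """
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        normal = [s[g] for s, y in zip(samples, labels) if y == 0]
        tumor = [s[g] for s, y in zip(samples, labels) if y == 1]
        scores.append((welch_t(normal, tumor), g))
    scores.sort(reverse=True)  # largest score first
    return [g for _, g in scores[:top_k]]


def make_synthetic(n_per_class=20, n_genes=500, informative=(0, 1, 2),
                   shift=4.0, seed=0):
    """Synthetic stand-in for real data (assumption): Gaussian noise for
    every gene, with the informative genes shifted upward in tumor samples."""
    rng = random.Random(seed)
    samples, labels = [], []
    for y in (0, 1):
        for _ in range(n_per_class):
            profile = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if y == 1:
                for g in informative:
                    profile[g] += shift
            samples.append(profile)
            labels.append(y)
    return samples, labels


# Many more genes (500) than samples (40), as in the high-dimensional setting.
samples, labels = make_synthetic()
selected = rank_genes(samples, labels, top_k=3)
print("Selected gene indices:", sorted(selected))
```

A univariate filter of this kind scores each gene independently, so it scales easily to thousands of genes but ignores interactions between genes; this trade-off is one reason to compare several feature selection methods.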

Development of New Machine Learning Algorithms

Other algorithms (or variants of the current algorithms) will be developed and applied to each of the lung, leukemia, and colon cancer data sets.

Comparative Analysis

The results will be summarized and the accuracy measures compared across data sets.
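One way such a comparison could be organized is sketched below: each classifier is evaluated on each data set with leave-one-out cross-validation, and the resulting accuracies are collected in a single table. This is a hedged illustration only. The two classifiers (nearest centroid and 1-nearest neighbor) are simple placeholders rather than the algorithms of the cited studies, the two data sets are synthetic stand-ins, and every name and parameter value is an assumption made for the example. The sketch uses Python with the standard library only.

```python
import math
import random


def make_synthetic(n_per_class, n_genes, n_informative, shift, seed):
    """Synthetic stand-in for an expression data set (assumption)."""
    rng = random.Random(seed)
    samples, labels = [], []
    for y in (0, 1):
        for _ in range(n_per_class):
            profile = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if y == 1:
                for g in range(n_informative):
                    profile[g] += shift
            samples.append(profile)
            labels.append(y)
    return samples, labels


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(train_x, train_y, query):
    """Predict the label whose class mean profile is closest to the query."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda label: euclidean(centroids[label], query))


def one_nearest_neighbor(train_x, train_y, query):
    """Predict the label of the single closest training sample."""
    best = min(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    return train_y[best]


def loocv_accuracy(classifier, samples, labels):
    """Leave-one-out cross-validation: hold out each sample once."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classifier(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Two hypothetical data sets: A has a strong class signal, B a weak one.
datasets = {
    "synthetic_A": make_synthetic(15, 50, 5, 4.0, seed=1),
    "synthetic_B": make_synthetic(15, 50, 5, 0.5, seed=2),
}
classifiers = {"nearest_centroid": nearest_centroid, "1-NN": one_nearest_neighbor}

results = {}
for data_name, (x, y) in datasets.items():
    for clf_name, clf in classifiers.items():
        results[(data_name, clf_name)] = loocv_accuracy(clf, x, y)
        print(f"{data_name:12s} {clf_name:17s} {results[(data_name, clf_name)]:.2f}")
```

If feature selection is part of the pipeline, it should be repeated inside each cross-validation fold; selecting genes on the full data set first and cross-validating afterwards gives optimistically biased accuracy estimates.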

How the analysis relates to the topic or research question

The objective is to develop machine learning algorithms that classify cancers in the lung cancer data set and to determine how these algorithms compare with the methods used on the leukemia and colon data sets. A comparative analysis of the classification results across the data sets will show which algorithm is best suited to each data set.

Preliminary References

Getz, G., Levine, E., Domany, E. (2000). Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences 97, 12079–12084.

Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., Yakhini, Z. (2000). Tissue classification with gene expression profiles. Journal of Computational Biology 7, 559–583.

Brazma, A., Vilo, J. (2000). Gene expression data analysis. FEBS Letters 480, 17–24.

Xing, E. P., Jordan, M. I., Karp, R. (2001). Feature selection for high-dimensional genomic microarray data. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. 601–608). Burlington, MA: Morgan Kaufmann.