  • Thesis Proposal

    Title

    Feature selection and classification algorithms in a high-dimensional setting to predict cancer occurrence using gene expressions

    Guiding research question

    How can the occurrence of specific cancer types be predicted from a large number of gene expression profiles? Correspondingly, which classification algorithm is best suited to each type of tissue sample?

    Significance

    This paper extends previous research on approaches to classifying cancer tissues. Its intended primary audience is researchers in the medical field, pathologists, and biostatisticians, who will benefit from a side-by-side comparison of algorithms across tissues from several cancer types (leukemia, colon, lung). The comparison will help researchers choose an appropriate algorithm for a specific cancer type. This matters because high dimensionality may add noise to the modelling process and, more importantly, because it is critical to find a smaller set of gene expressions that are "sufficiently informative to distinguish cells of different types" (Ben-Dor et al. 2000).

    Methodology

    Information and Resources

    Previous research on gene expression and cancer classification, as applied to the colon cancer and leukemia data sets, will serve as the foundation for this research. A similar data set involving lung tissues will be used, and classification algorithms will be developed based on these. Feature selection methods will be considered, with a focus on how R and its existing packages can be used in a high-dimensional setting. Accordingly, this research is of the simulation experiment type described in the SPS Graduate Handbook.

    Data Collection

    The colon and leukemia data sets are available as described and used in the previous studies listed in the preliminary reference section. The lung cancer data set will be sourced from the National Center for Biotechnology Information (NCBI).

    Overview of the analytical approach

    Previous Research Analysis

    The author will analyze existing research on cancer classification from gene expression data, focusing on the methods used (feature selection and machine learning) and the accuracy measures reported for each.

    Application to the Lung Cancer Data

    The author will then apply some (or all) of these feature selection methods and classification algorithms to a data set of gene expression profiles from normal and cancerous lung tissues.
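
    As an illustration of what a feature selection step could look like in a high-dimensional setting, the sketch below ranks genes by the absolute value of Welch's t-statistic between two classes and keeps the top k. It is not the method chosen for this proposal: the filter criterion, the function names (welch_t, rank_genes), and the synthetic data are all assumptions made for the example, and Python with its standard library is used only for illustration, whereas the proposal itself plans to work in R.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0:
        return 0.0
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def rank_genes(X, y, k):
    """Return the indices of the k genes with the largest |t| between classes 0 and 1.

    X is a list of samples (each a list of expression values); y holds 0/1 labels.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(a, b)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


# Synthetic stand-in data (an assumption, not one of the proposal's data sets):
# 40 samples, 200 genes, of which three are shifted in the "cancerous" class.
rng = random.Random(0)
n_samples, n_genes = 40, 200
y = [0] * 20 + [1] * 20
X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
informative = [3, 57, 120]
for row, label in zip(X, y):
    if label == 1:
        for g in informative:
            row[g] += 3.0

selected = rank_genes(X, y, k=3)
print("selected genes:", sorted(selected))
```

    With far more genes than samples, a filter of this kind reduces the input to a small candidate set before any classifier is trained. Other criteria could be substituted for the t-statistic without changing the overall structure.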

    Develop new machine learning algorithms

    Other algorithms (or variants of the current algorithms) will be developed and applied to each of the lung, leukemia, and colon cancer data sets.

    Comparative Analysis

    The results will be summarized and the accuracy measures compared across data sets.
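
    One way such a comparison could be organized is sketched below: each data set is scored by leave-one-out cross-validation, with gene selection repeated inside every fold so that the held-out sample never influences which genes are chosen. The nearest-centroid classifier, the t-statistic filter, the function names, and the two synthetic data sets are assumptions made for the example, not the algorithms or data of this proposal. Python is used only for illustration, since the proposal plans to work in R.

```python
import math
import random
import statistics


def top_k_genes(X, y, k):
    """Rank genes by absolute Welch t-statistic between classes 0 and 1."""
    def score(g):
        a = [r[g] for r, c in zip(X, y) if c == 0]
        b = [r[g] for r, c in zip(X, y) if c == 1]
        se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
        return abs(statistics.fmean(a) - statistics.fmean(b)) / se if se > 0 else 0.0
    scores = [score(g) for g in range(len(X[0]))]
    return sorted(range(len(X[0])), key=scores.__getitem__, reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose mean profile over `genes` is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, c in zip(X_train, y_train) if c == label]
        centroid = [statistics.fmean(r[g] for r in rows) for g in genes]
        dist = sum((x[g] - m) ** 2 for g, m in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy, selecting genes on the training part of each fold."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_k_genes(X_train, y_train, k)
        correct += int(nearest_centroid_predict(X_train, y_train, X[i], genes) == y[i])
    return correct / len(X)


def make_synthetic(seed, n_per_class, n_genes, shift):
    """Hypothetical two-class data: the first five genes differ by `shift`."""
    rng = random.Random(seed)
    y = [0] * n_per_class + [1] * n_per_class
    X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in y]
    for row, c in zip(X, y):
        if c == 1:
            for g in range(5):
                row[g] += shift
    return X, y


# Two synthetic stand-ins for "different data sets" (assumptions for the example).
datasets = {
    "synthetic_strong": make_synthetic(1, 15, 100, 3.0),
    "synthetic_none": make_synthetic(2, 15, 100, 0.0),
}
results = {name: loocv_accuracy(X, y, k=5) for name, (X, y) in datasets.items()}
for name, acc in results.items():
    print(f"{name}: leave-one-out accuracy = {acc:.2f}")
```

    Selecting genes inside each fold is a deliberate choice in this sketch. Choosing them once on the full data set and then cross-validating would let the held-out samples leak into the selection and tend to overstate accuracy.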

    How the analysis relates to the topic or research question

    The objective is to develop machine learning algorithms for classifying cancers in the lung cancer data set and to determine how these algorithms compare with the methods used on the leukemia and colon data sets. A comparative analysis of the classification results across the data sets will show which algorithm is best suited to each type of data.

    Preliminary References

    Getz, G., Levine, E., Domany, E. (2000). Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences 97, 12079–12084.

    Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., Yakhini, Z. (2000). Tissue Classification with Gene Expression Profiles. Journal of Computational Biology 7, 559–583.

    Brazma, A., Vilo, J. (2000). Gene expression data analysis. FEBS Letters 480, 17–24.

    Xing, E. P., Jordan, M. I., Karp, R. (2001). Feature selection for high-dimensional genomic microarray data. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. 601–608). Burlington, MA: Morgan Kaufmann.