Discovering potential key features of genome-wide profiling data using
Decision Variable Analysis
Abstract
Identifying key features related to the phenotype of interest (POI) in
high-dimensional data is one of the most important problems in omics
studies, such as those based on transcriptome or DNA methylome data.
However, these data are commonly contaminated by sources of
unwanted variation caused by platforms, batches or other types of
biological factors. Thus, the data can be considered as a combination of
variation derived from the POI and from other confounding factors.
Ignoring these factors can produce spurious associations and cause
important signals to be missed. Based on this idea, we propose a novel
feature selection method called Decision Variable Analysis (DVA) to
extract the important features related to the POI from data containing
potential confounding factors. Applying this method to both simulated
and real data, we found that DVA performed better at identifying
confounding factors than other methods, including linear regression and
surrogate variable analysis. In particular, our method is more
efficient for data in which the number of features far exceeds the
number of samples. We show that DVA yields improvements on such
high-dimensional, small-sample datasets from different platforms. The
results indicate that DVA is an
effective method for dissecting sources of variation in omics data with
potential confounding factors. DVA is freely available at
https://github.com/xvon1/DVA.