This study aims to identify, for each individual patient, the genes that are differentially expressed between tumor and normal samples. The main challenge is that each gene is measured only once under each condition. In such single-subject analyses, conventional analytics are either infeasible or underpowered to detect changes. We therefore propose a novel strategy, iDEG (Identifying individualized sets of Differentially Expressed Genes), to overcome this challenge and identify important genes effectively. The new methodology is then applied to this TCGA dataset, and the results are presented in Section \ref{sec:case-study}.

DEG Identification for Single-subject Analysis

Let the random variables \(Y_{g1}\) and \(Y_{g2}\) denote the expression counts of gene \(g\) under Condition \(1\) (e.g., normal) and Condition \(2\) (e.g., tumor), and let \(\mu_{g1}=\mathrm{E}(Y_{g1})\) and \(\mu_{g2}=\mathrm{E}(Y_{g2})\) denote their respective mean expression levels. In single-subject analyses, only one observation \(y_{g1}\) of \(Y_{g1}\) and one observation \(y_{g2}\) of \(Y_{g2}\) are available. The goal is to identify, for each single subject, the genes whose mean expression differs between the two conditions, i.e., \(\mu_{g1}\neq\mu_{g2}\).
Although there is a body of literature on methods for identifying DEGs, very few methods have been developed to identify DEGs without transcriptome replicates. Typically, when no replicates are available, investigators compare a heuristic cutoff value to the absolute difference \(|y_{g2}-y_{g1}|\) or the fold change \(y_{g2}/y_{g1}\), and genes exceeding the cutoff value are declared differentially expressed. The cutoff is usually chosen based on empirical experience. \Citeauthor{wang-2009-degseq} (\citeyear{wang-2009-degseq}) developed DEGseq, which assumes that the expression counts follow a binomial distribution. Based on the binomial distribution, they used a normal distribution to approximate the distribution of the \(\log_{2}\) fold change (\(\log_{2}Y_{g1}-\log_{2}Y_{g2}\)) at a given expression intensity (\(\log_{2}Y_{g1}+\log_{2}Y_{g2}\)) and calculated a Z-score for each gene. However, because of the binomial assumption, DEGseq is not designed to model over-dispersed count data. \citet{anders-2010-differ-expres} proposed DESeq to discover DEGs with small sample sizes. When neither condition has replicates, DESeq is still applicable: it assumes that most genes are not differentially expressed and estimates a mean-variance relationship by treating the two samples as if they were replicates. In this setting, however, it has low power and a high false negative rate. Another popular method, edgeR \cite{robinson-2007-small-sampl}, assumes that RNA-Seq data follow a negative binomial distribution whose variance, for a given mean, is determined solely by the dispersion parameter. Without replicates, edgeR assigns the same dispersion value to all genes and conducts a negative binomial (NB) exact test to compute \(p\text{-values}\). Moreover, the dispersion value is predetermined based on the investigator's biological knowledge rather than estimated from the data.
Therefore, edgeR is not reliable when the assumption of a constant dispersion across genes is invalid or the predetermined value of the dispersion is inaccurate. Overall, there appears to be a lack of work in the literature on individualized DEG identification for single-subject, single-sample RNA-Seq analyses, which can hamper advances in personalized medicine.
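To make the role of the dispersion concrete, the negative binomial model is commonly written in the mean-dispersion form below. The symbol \(\phi_{g}\) for a gene-wise dispersion is introduced here only for illustration and is not notation taken from the text above.
\[
\mathrm{Var}(Y_{gd}) = \mu_{gd} + \phi_{g}\,\mu_{gd}^{2}, \qquad \phi_{g}\ge 0 .
\]
Setting \(\phi_{g}=0\) recovers the Poisson variance \(\mu_{gd}\). Fixing a common, predetermined value \(\phi_{g}=\phi\) for all genes means that the variance is fully determined once the mean is given. For instance, with an assumed \(\phi=0.1\) and \(\mu_{gd}=100\), the variance would be \(100+0.1\times 100^{2}=1100\), eleven times the Poisson variance; these numbers are purely illustrative. A misspecified \(\phi\) therefore distorts the assumed variance of every gene at once, most severely for highly expressed genes.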
In this work, we propose a novel method, called iDEG, to identify individualized Differentially Expressed Genes without requiring transcriptome replicates for either condition. iDEG first applies an appropriate variance-stabilizing transformation (VST) to the RNA-Seq data such that, under the null hypothesis, the difference between the two transformed expression counts of every gene approximately follows the same normal distribution with mean zero and a constant variance. This bypasses the estimation of a separate variance for each gene and thereby resolves the constraint of having no replicates. Furthermore, iDEG models the gene-wise differences using a two-group mixture model and then estimates the probability of differential expression for each gene via an empirical Bayes approach. The two groups in the mixture model correspond to differentially and non-differentially expressed genes, and an empirical null distribution is computed from the data.
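As a rough illustration of why a VST removes the need for gene-wise variance estimates, consider the simplest setting. The sketch below assumes, for illustration only, that \(Y_{g1}\) and \(Y_{g2}\) are independent Poisson counts with moderately large means and that the square-root transformation \(h(y)=\sqrt{y}\) serves as the VST; the transformation actually used by iDEG may differ. By the delta method,
\[
\mathrm{Var}\bigl(\sqrt{Y_{gd}}\bigr) \approx \bigl(h'(\mu_{gd})\bigr)^{2}\,\mathrm{Var}(Y_{gd}) = \frac{1}{4\mu_{gd}}\cdot\mu_{gd} = \frac{1}{4},
\]
which no longer depends on \(\mu_{gd}\). Under the null hypothesis \(\mu_{g1}=\mu_{g2}\), the difference \(D_{g}=\sqrt{Y_{g2}}-\sqrt{Y_{g1}}\) then satisfies
\[
D_{g} \;\overset{\text{approx.}}{\sim}\; N\!\left(0,\tfrac{1}{2}\right) \quad \text{for every gene } g .
\]
A two-group mixture for these differences can be written generically as follows, where \(\pi_{0}\), \(f_{0}\), and \(f_{1}\) are hypothetical symbols for the proportion of non-differentially expressed genes, the null density, and the non-null density:
\[
f(d) = \pi_{0}\,f_{0}(d) + (1-\pi_{0})\,f_{1}(d), \qquad
\Pr(\text{gene } g \text{ is differentially expressed} \mid D_{g}=d) = 1-\frac{\pi_{0}\,f_{0}(d)}{f(d)} .
\]
Because all genes share one null distribution, the thousands of observed differences together supply the information that replicates would otherwise provide.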
In practice, investigators sometimes encounter unequal library sizes: the total starting material (input RNA) sequenced for one transcriptome exceeds that for the other, so that \(\mathrm{E}(Y_{gd})=k_{d}\mu_{gd}\) for \(d=1,2\), where \(k_{d}\) is the library size of the sample under condition \(d\) and \(k_{1}\neq k_{2}\). Then, even under the null hypothesis \(\mu_{g1}=\mu_{g2}\), we have \(\mathrm{E}(Y_{g1})\neq\mathrm{E}(Y_{g2})\) because of the unequal library sizes. The observed expression counts under the two conditions are therefore not directly comparable, and an extra normalization step is required before identifying DEGs. We first develop iDEG for equal library sizes and then extend it to unequal library sizes.
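A small numerical illustration, with values chosen purely for exposition, shows the effect. Suppose a gene is not differentially expressed, with \(\mu_{g1}=\mu_{g2}=100\), and the library sizes are \(k_{1}=1\) and \(k_{2}=2\). Then
\[
\mathrm{E}(Y_{g1}) = k_{1}\mu_{g1} = 100, \qquad
\mathrm{E}(Y_{g2}) = k_{2}\mu_{g2} = 200, \qquad
\frac{\mathrm{E}(Y_{g2})}{\mathrm{E}(Y_{g1})} = \frac{k_{2}}{k_{1}} = 2 ,
\]
so the raw counts suggest a two-fold change even though the underlying mean expression is identical. Under the null hypothesis the ratio of expected counts equals \(k_{2}/k_{1}\) for every gene, which is why a single global adjustment can, in principle, make the two samples comparable. The specific normalization used by iDEG is not assumed here.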
The rest of this article is organized as follows. Section \ref{sec:iDEG-Pois} proposes the iDEG procedure for RNA-Seq data under the Poisson distribution. Section \ref{sec:iDEG-nb} generalizes iDEG to overdispersed expression counts under the negative binomial distribution. The practical issue of unequal library sizes is addressed in Section \ref{sec:unequal-lib-size}. Section \ref{sec:algorithm} describes the computational algorithm and implementation of iDEG. Extensive numerical studies in Section \ref{sec:numerical-sudies} illustrate the performance of iDEG and compare it with existing methods. Section \ref{sec:sensitivity} demonstrates the robustness of iDEG when model assumptions are violated. Section \ref{sec:case-study} applies iDEG to the TNBC dataset described in Section \ref{sec:data-example}. A final discussion is given in Section \ref{sec:discussion}.