Unequal Library Sizes

The problem of unequal library sizes is commonly encountered in practice. If the total starting material (input RNA) sequenced for one transcriptome is different from that of the other transcriptome, then the observed expression counts under two conditions are not directly comparable. A normalization step is necessary prior to testing \(E(Y_{g1})=E(Y_{g2})\) for any \(g\).

Normalizing Poisson Data

For Poisson distribution, unequal library sizes are accounted for by normalizing the data to reads per million (RPM; \citealt{mortazavi-2008-mappin-quant}). The normalized gene count \(Y_{gd}^{*}\) is given by \(Y_{gd}^{*}=\frac{Y_{gd}}{\sum_{g=1}^{G}Y_{gd}}\times 10^{6}\) for all \(g\), and \(d=1,2\). After normalization, one can test \(E(Y_{g1}^{*})=E(Y_{g2}^{*})\). Web Figure 1 (Panels A and B) of the supplementary material illustrates the effect of normalization. Before normalization, the median of expression levels in one transcriptome is far away from that of the other transcriptome; after normalization, the two medians are approximately at the same level and comparable.
Since the normalized data no longer follow a Poisson distribution, the transformation \(h_{Pois}(\cdot)\) in (\ref{eq:tPois}) cannot be directly applied to \(Y_{gd}^{*}\). Therefore, a different transformation for \(Y_{gd}^{*}\) is needed to stabilize variance. Recall that if \(Y\sim\text{Poisson}(\mu)\), then \(\sqrt{Y}\overset{\cdot}{\sim}\text{N}(\sqrt{\mu},\frac{1}{4})\) \cite{anscombe-1948-trans-poiss}. Using this fact, we propose the transformation \(\sqrt{Y_{gd}^{*}}\), which is shown to approximately follow a normal distribution with a constant variance across all the genes in Corollary 1.
Assume \(Y_{gd}\sim\mbox{Poisson}(\mu_{gd})\) for \(g=1,\ldots,G;d=1,2\), and they are all independent. Denote the library size for samples under condition \(d\) by \(k_{d}\). Then
\begin{equation} \sqrt{Y_{gd}^{*}}\overset{\cdot}{\sim}\text{N}(\tilde{\mu}_{gd},\tilde{\sigma}_{d}^{2}),\quad g=1,\ldots,G;d=1,2,\nonumber \\ \end{equation}
where
\begin{equation} \tilde{\mu}_{gd}=\sqrt{\frac{\mu_{gd}}{\sum_{g=1}^{G}\mu_{gd}}}\times 10^{3},~{}~{}\tilde{\sigma}_{d}^{2}=\frac{1}{4}\frac{1}{k_{d}\sum_{g=1}^{G}\mu_{gd}}\times 10^{6}.\nonumber \\ \end{equation}
Corollary \ref{cor:stabilized-var-norm-pois} indicates \(\tilde{\sigma}_{d}^{2}\) does not depend on \(g\). The proof is given in the Web Appendix A. Therefore, under the null hypothesis, we have
\begin{equation} D_{g}^{*}=\sqrt{{\color[rgb]{0,0,0}Y_{g1}^{*}}}-\sqrt{{\color[rgb]{0,0,0}Y_{g2}^{*}}}\overset{\cdot}{\sim}N({\color[rgb]{0,0,0}0},\sqrt{\tilde{\sigma}_{1}^{2}+\tilde{\sigma}_{2}^{2}}),\quad\forall g\in\overline{\mathcal{G}}.\nonumber \\ \end{equation}
It is noted that \(D_{g}^{*}\) follows a normal distribution with a common variance \(\tilde{\sigma}_{1}^{2}+\tilde{\sigma}_{2}^{2}\) across all of the null genes. Following this, the proposed iDEG procedure can be applied next.

Normalizing Data from Negative Binomial

When RNA-Seq data follow the NB distribution, we propose applying a quantile adjustment procedure by Robinson and Smyth (\citeyearrobinson-2007-small-sampl) for normalization. Specifically, this technique adjusts the observed expression counts up if the library size is below the geometric mean and vice versa. For the null genes, this procedure creates pseudo-data that follow approximately an identical NB distribution. In Figure S1, the bottom row shows roughly equal library sizes of the pseudo-data after normalization. Since the pseudo-data follow NB, the VST \(h_{nb}(\cdot)\) in (\ref{eq:NBvst}) can be applied, which is then followed by the iDEG for NB data (Section \ref{sec:iDEG-nb}).