Unequal Library Sizes
The problem of unequal library sizes is commonly encountered in practice. If the total starting material (input RNA) sequenced for one transcriptome is different from that of the other transcriptome, then the observed expression counts under two conditions are not directly comparable. A normalization step is necessary prior to testing \(E(Y_{g1})=E(Y_{g2})\) for any \(g\).
Normalizing Poisson Data
For Poisson distribution, unequal library sizes are accounted for by normalizing the data to reads per million (RPM; \citealt{mortazavi-2008-mappin-quant}). The normalized gene count \(Y_{gd}^{*}\) is given by \(Y_{gd}^{*}=\frac{Y_{gd}}{\sum_{g=1}^{G}Y_{gd}}\times 10^{6}\) for all \(g\), and \(d=1,2\). After normalization, one can test \(E(Y_{g1}^{*})=E(Y_{g2}^{*})\). Web Figure 1 (Panels A and B) of the supplementary material illustrates the effect of normalization. Before normalization, the median of expression levels
in one transcriptome is far away from that of the other transcriptome; after normalization, the two medians are approximately at the same level and comparable.
Since the normalized data no longer follow a Poisson distribution, the transformation \(h_{Pois}(\cdot)\) in (\ref{eq:tPois}) cannot be directly applied to \(Y_{gd}^{*}\). Therefore, a different transformation for \(Y_{gd}^{*}\) is needed to stabilize variance. Recall that if \(Y\sim\text{Poisson}(\mu)\), then \(\sqrt{Y}\overset{\cdot}{\sim}\text{N}(\sqrt{\mu},\frac{1}{4})\) \cite{anscombe-1948-trans-poiss}. Using this fact, we propose the transformation \(\sqrt{Y_{gd}^{*}}\), which is shown to approximately follow a normal distribution with a constant variance across all the genes in Corollary 1.
Assume \(Y_{gd}\sim\mbox{Poisson}(\mu_{gd})\) for \(g=1,\ldots,G;d=1,2\), and they are all independent. Denote the library size for samples under condition \(d\) by \(k_{d}\). Then
\begin{equation}
\sqrt{Y_{gd}^{*}}\overset{\cdot}{\sim}\text{N}(\tilde{\mu}_{gd},\tilde{\sigma}_{d}^{2}),\quad g=1,\ldots,G;d=1,2,\nonumber \\
\end{equation}
where
\begin{equation}
\tilde{\mu}_{gd}=\sqrt{\frac{\mu_{gd}}{\sum_{g=1}^{G}\mu_{gd}}}\times 10^{3},~{}~{}\tilde{\sigma}_{d}^{2}=\frac{1}{4}\frac{1}{k_{d}\sum_{g=1}^{G}\mu_{gd}}\times 10^{6}.\nonumber \\
\end{equation}
Corollary \ref{cor:stabilized-var-norm-pois} indicates \(\tilde{\sigma}_{d}^{2}\) does not depend on \(g\). The proof is given in the Web Appendix A. Therefore, under the null hypothesis, we have
\begin{equation}
D_{g}^{*}=\sqrt{{\color[rgb]{0,0,0}Y_{g1}^{*}}}-\sqrt{{\color[rgb]{0,0,0}Y_{g2}^{*}}}\overset{\cdot}{\sim}N({\color[rgb]{0,0,0}0},\sqrt{\tilde{\sigma}_{1}^{2}+\tilde{\sigma}_{2}^{2}}),\quad\forall g\in\overline{\mathcal{G}}.\nonumber \\
\end{equation}
It is noted that \(D_{g}^{*}\) follows a normal distribution with a common variance \(\tilde{\sigma}_{1}^{2}+\tilde{\sigma}_{2}^{2}\) across all of the null genes. Following this, the proposed iDEG procedure can be applied next.
Normalizing Data from Negative Binomial
When RNA-Seq data follow the NB distribution, we propose applying a quantile adjustment procedure by Robinson and Smyth (\citeyearrobinson-2007-small-sampl) for normalization. Specifically, this technique adjusts the observed expression counts up if the library size is below the geometric mean and vice versa. For the null genes, this procedure creates
pseudo-data that follow approximately an identical NB distribution. In Figure S1, the bottom row shows roughly equal library sizes of the pseudo-data after normalization. Since the pseudo-data follow NB, the VST
\(h_{nb}(\cdot)\) in (\ref{eq:NBvst}) can be applied, which is then followed by the iDEG for NB data (Section \ref{sec:iDEG-nb}).