When the library sizes of the two samples under comparison are clearly different, Step 0 is applied to normalize the RNA-Seq data before the remaining steps are carried out. The normalization procedures are described in Section \ref{sec:unequal-lib-size}. In addition, the estimated \(locfdr\) reflects the probability of gene \(g\) being differentially expressed, and \citet{efron-2001-empir-bayes} have shown its close connection to the false discovery rate (FDR) controlled by the Benjamini and Hochberg procedure \cite{benjamini-1995-controlling}. The algorithm is easy to implement, and the computation is efficient for a large \(G\).
The iDEG algorithm for negative binomial data is described in Table \ref{alg}.
Remark 1: At Step 3, when there is no prior knowledge or strong evidence to suggest a constant dispersion across genes, the smoothing spline fit should be used. Our simulated experiments show that the smoothing spline can produce a nearly constant \(\hat{\delta}_{g}\) in the constant dispersion case. Furthermore, the linear regression model (\ref{eq:ols-disp}) has slightly better performance when the dispersion is constant, but considerably worse when \(\delta_{g}\) is not a constant across genes.
Remark 2: In most single-subject analyses, \(\hat{\delta}_{g}\) is small. But in rare cases, when \(\hat{\delta}_{g}\geq\frac{2}{3}\), the VST \(h_{nb}\) in Step 4 is not numerically stable. To avoid this numerical issue, we suggest replacing the VST \(h_{nb}\) by \(h_{nb}^{*}\) \cite{montgomery-2008-design},
\begin{equation*}
h_{nb}^{*}(Y_{gd}) = \frac{1}{\sqrt{\delta_{g}}}\sinh^{-1}\sqrt{Y_{gd}\,\delta_{g}}, \qquad g = 1,\cdots,G;\ d = 1,2.
\end{equation*}
Compared to \(h_{nb}\), \(h_{nb}^{*}\) is less effective in stabilizing variances when \(\mu_{gd}\) is small.
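
A first-order (delta-method) sketch suggests why a transformation of this type stabilizes the variance only when \(\mu_{gd}\) is not too small. The sketch rests on two assumptions that are not restated here: that the negative binomial variance takes the form \(\mathrm{Var}(Y_{gd})=\mu_{gd}+\delta_{g}\mu_{gd}^{2}\), and that \(h_{nb}^{*}(y)=\delta_{g}^{-1/2}\sinh^{-1}\sqrt{\delta_{g}\,y}\). Under these assumptions,
\begin{equation*}
\frac{d}{dy}\,h_{nb}^{*}(y) = \frac{1}{\sqrt{\delta_{g}}}\cdot\frac{1}{\sqrt{1+\delta_{g}y}}\cdot\frac{\sqrt{\delta_{g}}}{2\sqrt{y}} = \frac{1}{2\sqrt{y\,(1+\delta_{g}y)}},
\end{equation*}
so that
\begin{equation*}
\mathrm{Var}\bigl(h_{nb}^{*}(Y_{gd})\bigr) \approx \Bigl[\frac{d}{dy}h_{nb}^{*}(y)\Big|_{y=\mu_{gd}}\Bigr]^{2}\,\mathrm{Var}(Y_{gd}) = \frac{\mu_{gd}+\delta_{g}\mu_{gd}^{2}}{4\,\mu_{gd}\,(1+\delta_{g}\mu_{gd})} = \frac{1}{4}.
\end{equation*}
The approximate variance is free of \(\mu_{gd}\) and \(\delta_{g}\), but the first-order expansion behind it is accurate only when \(Y_{gd}\) is concentrated around \(\mu_{gd}\); for small \(\mu_{gd}\) the higher-order terms are no longer negligible.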