ROUGH DRAFT authorea.com/25921

RNA-seq analysis

TO DO

• identify data set

• complete outline

• book room?

• come up with quiz questions to identify suitable students

Intro

Library preparation methods

• polyA+

• ribosomal removal

• stranded

Sequencing

• depth vs. replicates

• Illumina (mention ABI SOLiD, nanopore, PacBio...)

• paired-end vs. single-end

• adapters, contamination, etc.

Experimental Design

control for:

• library batch effects

• barcoding bias

• lane effects: sample loading, cluster amplification, sequencing reaction
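A common way to control for the effects listed above is to spread every condition evenly across lanes (a blocked design), so that lane and condition are not confounded. The following is a minimal sketch of that idea; the sample names, the number of lanes, and the helper `assign_lanes` are hypothetical and only serve as an illustration.

```python
import random
from collections import defaultdict

def assign_lanes(samples, n_lanes, seed=0):
    """Spread each condition as evenly as possible across lanes.

    samples: list of (sample_name, condition) tuples (hypothetical names).
    Returns a dict mapping lane index -> list of sample names.
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for name, condition in samples:
        by_condition[condition].append(name)

    lanes = defaultdict(list)
    offset = 0
    for condition in sorted(by_condition):
        names = by_condition[condition]
        rng.shuffle(names)  # randomize which replicate ends up in which lane
        for i, name in enumerate(names):
            lanes[(i + offset) % n_lanes].append(name)
        offset += len(names)  # continue the round-robin where the last condition stopped
    return dict(lanes)

# Hypothetical experiment: 6 control and 6 treated libraries on 2 lanes.
samples = [(f"ctrl_{i}", "control") for i in range(1, 7)] + \
          [(f"treat_{i}", "treated") for i in range(1, 7)]
layout = assign_lanes(samples, n_lanes=2)
for lane, names in sorted(layout.items()):
    print(lane, sorted(names))
```

In this layout each lane receives three control and three treated libraries, so a lane effect cannot be mistaken for a treatment effect. The same idea applies to library preparation batches.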

Consider spiking in artificial RNA (ERCC spike-in standards) to calibrate the RNA concentrations in each sample and the measured fold changes between the two conditions (Jiang et al. 2011; Lovén et al. 2012).
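The calibration idea can be illustrated with a very simple scaling scheme: if the same amount of spike-in RNA was added to every sample, the reads assigned to the spike-ins should be equal after normalization. This is only a sketch under that assumption, not the procedure used in the cited papers; the counts and the helper `spike_in_scaling_factors` are made up for illustration.

```python
def spike_in_scaling_factors(spike_totals):
    """Compute per-sample scaling factors from spike-in read totals.

    spike_totals: dict sample -> total reads assigned to spike-in transcripts.
    Assumes the same amount of spike-in RNA was added to every sample.
    Returns dict sample -> factor by which to multiply that sample's gene counts.
    """
    reference = sum(spike_totals.values()) / len(spike_totals)
    return {sample: reference / total for sample, total in spike_totals.items()}

# Hypothetical numbers: the treated library yielded twice as many spike-in reads.
spike_totals = {"control": 10000, "treated": 20000}
factors = spike_in_scaling_factors(spike_totals)

# A gene with twice the raw count in the treated sample is unchanged after scaling.
gene_counts = {"control": 500, "treated": 1000}
normalized = {s: gene_counts[s] * factors[s] for s in gene_counts}
print(factors, normalized)
```

In this toy case the apparent two-fold difference disappears after scaling, which is the kind of correction a spike-in standard is meant to enable.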

Recommendations from Schurch et al. (2015) for the design of RNA-seq experiments:

1. At least 6 replicates per condition for all experiments.

2. At least 12 replicates per condition for experiments where identifying the majority of all DE genes is important.

3. For experiments with <12 replicates per condition, use edgeR.

4. For experiments with >12 replicates per condition, use DESeq.

5. Apply a fold-change threshold appropriate to the number of replicates per condition, in the range $$0.1 \le \mathrm{threshold} \le 0.5$$.
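To make the last recommendation concrete, the sketch below filters a table of differential expression results by both an adjusted p-value and a fold-change threshold. It assumes that the threshold is applied to the absolute log2 fold change; the gene names, values, and the helper `filter_de_genes` are hypothetical.

```python
def filter_de_genes(results, lfc_threshold=0.5, alpha=0.05):
    """Keep genes that pass both the significance and the fold-change cutoff.

    results: dict gene -> (log2_fold_change, adjusted_p_value).
    lfc_threshold: minimum absolute log2 fold change (assumed scale).
    alpha: maximum adjusted p-value.
    """
    return [
        gene
        for gene, (lfc, padj) in results.items()
        if abs(lfc) >= lfc_threshold and padj < alpha
    ]

# Hypothetical output of a DE tool.
results = {
    "geneA": (1.2, 0.001),   # large, significant change
    "geneB": (0.3, 0.001),   # significant, but below the fold-change threshold
    "geneC": (-0.8, 0.2),    # large change, not significant
    "geneD": (-0.6, 0.01),   # down-regulated, passes both cutoffs
}
selected = filter_de_genes(results, lfc_threshold=0.5)
print(selected)
```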

Gierliński et al. (2015) show that aberrant replicates can skew the entire analysis, because a significant fraction of gene counts can then no longer be described by the log-normal or negative binomial distributions. It is therefore important to have enough replicates to a) identify outlier samples and b) be able to remove them without losing too much statistical power.

“Even the best tools have limited statistical power with few replicates in each condition, unless a stringent fold‐change threshold is imposed” (Schurch et al., 2015). The inherent biological noise and the variation in gene expression set the lower limit for the fold change that can be detected in a differential gene expression (DGE) analysis. The more genes with small fold changes the experiment is meant to detect, the more replicates are needed to estimate the biological variability reliably.
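The relationship between fold change, variability, and replicate number can be sketched with the textbook sample-size formula for comparing two group means, n = 2 (z_(1-alpha/2) + z_power)^2 sigma^2 / delta^2, applied to log2 expression values. This is a rough approximation only: it assumes normally distributed log2 expression, tests a single gene, and ignores multiple-testing correction and the count-based models used by the actual DE tools. The function name and the example numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist

def replicates_needed(log2_fc, sd_log2, alpha=0.05, power=0.8):
    """Approximate replicates per condition to detect a given log2 fold change.

    log2_fc: smallest log2 fold change that should be detected.
    sd_log2: standard deviation of log2 expression between biological replicates.
    Uses the normal approximation for a two-sided, two-sample comparison.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sd_log2 ** 2 / log2_fc ** 2
    return math.ceil(n)

# Hypothetical biological variability of 0.5 log2 units.
for fc in (1.0, 0.5, 0.25):
    print(fc, replicates_needed(fc, sd_log2=0.5))
```

Because the required number of replicates grows with the inverse square of the log2 fold change, halving the detectable fold change roughly quadruples the number of replicates needed.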

FastQC

• what do reads “look” like?

• Phred scores
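A Phred score Q encodes the estimated probability p that a base call is wrong as Q = -10 log10(p). In FASTQ files each score is stored as a single ASCII character. The sketch below decodes a quality string, assuming the Phred+33 offset used by current Illumina output; the quality string itself is made up.

```python
def decode_quality(qual_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 assumes Phred+33 encoding (Sanger / Illumina 1.8+).
    """
    return [ord(char) - offset for char in qual_string]

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

# Hypothetical quality string of a 3-base read.
scores = decode_quality("I5!")
probs = [phred_to_error_prob(q) for q in scores]
print(scores)
print(probs)
```

In this example the first base has a score of 40 (1 error in 10,000 calls), the second a score of 20 (1 in 100), and the third a score of 0, which carries no confidence at all.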

RSeQC

• gene body coverage

• mapping distribution (genic categories)

• GC content distribution

• NVC (nucleotide vs. cycle) plot

• error rate

• base quality vs. position in read
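Two of the metrics above are simple enough to compute by hand, which may help students understand what the plots show. The sketch below calculates the nucleotide composition per read position (the data behind an NVC plot) and the GC content per read. It is not RSeQC's implementation; the function names and the toy reads are assumptions for illustration.

```python
from collections import Counter

def nucleotide_vs_cycle(reads):
    """Fraction of each nucleotide at every read position (sequencing cycle)."""
    length = max(len(read) for read in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            counts[pos][base] += 1
    # Each position is covered by at least one read, so the totals are never zero.
    return [
        {base: c[base] / sum(c.values()) for base in "ACGTN"}
        for c in counts
    ]

def gc_content(read):
    """Fraction of G and C bases in a single read."""
    read = read.upper()
    return (read.count("G") + read.count("C")) / len(read)

# Toy reads; real input would come from a FASTQ or BAM file.
reads = ["ACGT", "AGGT", "ACCA"]
nvc = nucleotide_vs_cycle(reads)
gc = [gc_content(read) for read in reads]
print(nvc[0])
print(gc)
```

In an unbiased library the nucleotide fractions should be roughly constant across positions. A strong skew in the first cycles, as in position 0 of the toy reads, is the kind of pattern the NVC plot is meant to reveal.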

Genome alignment

General principles and considerations

• why do we need to align?

• SAM/BAM file format/content

• RNA-seq specifics (splicing, etc.)
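To show what a SAM record contains and how splicing appears in it, the sketch below parses the first six mandatory fields of one alignment line and extracts introns from the CIGAR string, where the `N` operation marks a skipped region of the reference. The alignment line is invented, and the helpers `parse_sam_line` and `introns_from_cigar` are illustrative, not part of any library.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_sam_line(line):
    """Extract the first six mandatory fields of a SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],      # read name
        "flag": int(fields[1]),  # bitwise flag (e.g. 16 = reverse strand)
        "rname": fields[2],      # reference sequence (chromosome)
        "pos": int(fields[3]),   # 1-based leftmost mapping position
        "mapq": int(fields[4]),  # mapping quality
        "cigar": fields[5],      # alignment description
    }

def introns_from_cigar(pos, cigar):
    """Return 1-based inclusive (start, end) coordinates of every N gap."""
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return introns

# Hypothetical spliced read: 50 aligned bases, a 1000 bp intron, 50 aligned bases.
line = "\t".join(
    ["read1", "0", "chr1", "100", "255", "50M1000N50M", "*", "0", "0",
     "A" * 100, "I" * 100]
)
record = parse_sam_line(line)
introns = introns_from_cigar(record["pos"], record["cigar"])
print(record["rname"], record["pos"], introns)
```

Here the read covers reference positions 100 to 149 and 1150 to 1199, with the intron spanning 150 to 1149. A splice-aware aligner is needed to produce such gapped alignments.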