Data filtering and screening
Preliminary cluster analyses revealed retinal contamination in a subset
of brain samples from our Aripo dataset. Although opsins are expressed
at low levels in the brain, the very high opsin expression levels
(>10,000 copies) in three samples indicated retinal contamination. To
address this issue, we devised a sample filtering and screening
procedure to remove genes whose expression differences between samples
were likely dominated by retinal contamination. Briefly,
we first filtered out genes with low expression and then used contigs
annotated as known retinal genes (rhodopsin, red/green-sensitive opsins,
blue-sensitive opsins) as seed contigs to identify other
contamination-related transcripts based on high positive correlations of
their expression levels with those of the seed genes. We calculated the
gene-wise sum of
correlations between candidate genes and seed genes and performed
multiple hypothesis testing using a false discovery rate (FDR)
controlling procedure. The nominal FDR level was set to α = 0.2 to
remove presumptive contaminant contigs. Using this approach, we
identified 1,559 contigs as presumptive retina-enriched genes (~3% of
all contigs in our final assembly), which we removed from both datasets
before all subsequent analyses (Table S1). More
detailed descriptions of statistical procedures are in the Supplemental
Methods.
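
The screening steps summarized above can be illustrated with a minimal
sketch. It is not the procedure from the Supplemental Methods, and
several details are assumptions made only for illustration: Pearson
correlations computed on log2(x + 1) expression, a mean-expression
cutoff (min_mean) for the low-expression filter, p-values obtained by
permuting sample labels of the seed genes, and the Benjamini-Hochberg
procedure as the FDR-controlling step. All function and parameter names
are hypothetical.

```python
import numpy as np


def row_zscore(x):
    """Standardize each row to mean 0 and unit (population) variance."""
    x = x - x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    return x / np.where(sd == 0, 1.0, sd)


def correlation_sums(expr, seed_expr):
    """Per gene (row), sum the Pearson correlations with all seed genes."""
    n_samples = expr.shape[1]
    corr = row_zscore(expr) @ row_zscore(seed_expr).T / n_samples
    return corr.sum(axis=1)


def benjamini_hochberg(pvals, alpha):
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting the bound
        reject[order[:k + 1]] = True
    return reject


def flag_contaminants(expr, seed_idx, min_mean=10.0, alpha=0.2,
                      n_perm=1000, rng=None):
    """Return row indices of genes flagged as presumptive contaminants.

    expr is a genes x samples matrix of expression values; seed_idx lists
    the rows of the seed (known retinal) genes. min_mean and n_perm are
    illustrative defaults, not values from the paper.
    """
    rng = np.random.default_rng(rng)
    keep = expr.mean(axis=1) >= min_mean  # assumed low-expression filter
    keep[seed_idx] = False                # seeds are not candidates
    cand_idx = np.nonzero(keep)[0]
    cand = np.log2(expr[cand_idx] + 1)
    seeds = np.log2(expr[seed_idx] + 1)

    observed = correlation_sums(cand, seeds)

    # Assumed null: shuffling the sample labels of the seed genes breaks
    # any association between candidate and seed expression.
    exceed = np.zeros(len(cand_idx))
    for _ in range(n_perm):
        perm = rng.permutation(expr.shape[1])
        exceed += correlation_sums(cand, seeds[:, perm]) >= observed
    pvals = (exceed + 1) / (n_perm + 1)   # one-sided empirical p-values

    return cand_idx[benjamini_hochberg(pvals, alpha)]
```

In this sketch the test is one-sided, so only genes with a large
positive correlation sum are flagged, consistent with the focus on high
positive correlations with the seed genes. With few samples the
permutation null is coarse, so the smallest attainable p-value may limit
how many genes can pass the FDR threshold.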