Sequence Data that never made it

Over the past several years our lab has experimented with a variety of high-throughput sequencing platforms with the ultimate goal of addressing important scientific questions. On a few occasions we have been successful. Some of our early work used Roche 454 GS-FLX pyrosequencing to examine phenotypic plasticity in lake trout (Goetz et al. 2010), identify single nucleotide polymorphisms in salmon (Seeb et al. 2011) and develop genomic resources for Pacific herring (Roberts et al. 2012).

  • Morera D, Roher N, Ribas L, Balasch JC, Doñate C, Callol A, Boltaña A, Roberts SB, Goetz G, Goetz FW, Mackenzie SA. (2011) RNA-Seq reveals an integrated immune response in nucleated erythrocytes. PLoS ONE 6(10): e26998. doi:10.1371/journal.pone.0026998

  • Gavery MR* and Roberts SB. (2012) Characterizing short read sequencing for gene discovery and RNA-Seq analysis in Crassostrea gigas. Comparative Biochemistry and Physiology Part D: Genomics and Proteomics, 7:2 94-99 June 2012. doi:10.1016/j.cbd.2011.12.003

  • Burge CA, Douglas N, Conti-Jerpe I, Weil E, Roberts SB, Friedman CS, CD Harvell. (2012) Friend or foe: the association of Labyrinthulomycetes with the Caribbean sea fan, Gorgonia ventalina. Diseases of Aquatic Organisms. 101:1-12 doi:10.3354/dao02487

  • Gavery MR, Roberts SB. (2013) Predominant intragenic methylation is associated with gene expression characteristics in a bivalve mollusc. PeerJ 1:e215 doi:10.7717/peerj.215

However, a number of efforts failed early on (e.g., poor read quality) or never gained enough momentum to see the light of the peer-review system. This paper is a final attempt to share these datasets in a manner where they have the potential, however slim, to be useful to someone. We will not provide fluffy background or in-depth over-analysis, but strive to offer a sound data description. This work is organized by target species, which will often correspond to project.

Species include

  • Vibrio tubiashii
  • Olympia oyster (Ostrea lurida)
  • Yellow Perch (Perca flavescens)
  • Pacific herring (Clupea pallasii)
  • Abalone / Withering Syndrome (WS)
  • Hard clam (Mercenaria mercenaria)

#Vibrio tubiashii

According to our records, the first library we ever sequenced was from Vibrio tubiashii. The experiment compared transcriptomic changes in V. tubiashii exposed to autoclaved C. gigas and those that were not. Samples were incubated in sterile seawater (with or without autoclaved C. gigas) for 24 h at room temperature, at a concentration of 9.975 x 10^11 CFU/mL. After 24 h, V. tubiashii concentrations were calculated and two 50 mL volumes were collected from each treatment. Cells were pelleted, the supernatant was removed and the pellets were stored at -80 °C.

Total RNA was isolated from the control (5.63 x 10^11 CFU) and exposed (1.835 x 10^12 CFU) V. tubiashii samples using 10 mL of TriReagent. The remainder of the manufacturer's protocol was scaled appropriately. Ribosomal RNA (rRNA) was removed from each sample using the MICROBExpress Kit (Ambion), according to the manufacturer's protocol. Removal of rRNA from each sample was verified via formaldehyde-HEPES agarose gel and compared with untreated total RNA from each sample. Only the C. gigas-exposed V. tubiashii mRNA was submitted for single-end Illumina sequencing at the High Throughput Genomics Unit (HTGU; University of Washington).

Download Link:

As part of a Saltonstall-Kennedy Grant funded project entitled "Ocean acidification and emerging diseases in the Pacific Northwest", genomic libraries were generated and sequenced. Two different strains were partially sequenced in an effort to learn what genetic differences might be associated with different phenotypes (e.g., growth) under ocean acidification conditions.

The next occurrence of this taxon was not until years later, with two genomic DNA libraries: RE22 and ATCC 19106. These were SOLiD libraries sequenced in early 2011.

This effort was part of the thesis work of Elene Dorfmeier. The thesis, "Ocean acidification and disease: How will a changing climate impact Vibrio tubiashii growth and pathogenicity to Pacific oyster larvae?", is available online.

A detailed description of her work with respect to these libraries can be found in her thesis. Page-- Raw data is available in SRA?? and many secondary analysis files are available via figshare.

Communication with Elene (3/14/2014) says she has not submitted sequences to NCBI SRA.

#Ostrea lurida (Olympia oyster)
Olympia oyster sequencing efforts have arisen from studies examining the influence of ocean acidification and from a study examining local adaptation in Puget Sound. As of March 2014, the libraries constructed and the associated sequencing data include the following:

| ID | Stage | Notes |
|---|---|---|
| Ol-larv 400_1 | larvae | 0112; 012159; 36SE |
| Ol-larv 2000_1 | larvae | 0112; 012159; 36SE |
| Ol-larv 400_2 | larvae | 0812; 103939; 36SE |
| Ol-larv 1000_2 | larvae | 0812; 103939; 36SE |
| Ol-larv 1600_2 | larvae | 0812; 103939; 36SE |
| Ol-larv 2200_2 | larvae | 0812; 103939; 36SE |
| Ol-Femalemix_106a | gonad | 36SE |
| Ol-Malemix_106a | gonad | 36SE |
| Ol-Femalemix_108a | gonad | 36SE |
| Ol-Malemix_108a | gonad | 36SE |
| Ol-Femalemix_106A | gonad | 72PE |
| Ol-Malemix_106A | gonad | 72PE |
| Ol-Femalemix_108A | gonad | 72PE |
| Ol-Malemix_108A | gonad | 72PE |

Libraries Ol-larv 400_1 and Ol-larv 2000_1 were described as part of the publication:

Timmins-Schiffman, E. B., Friedman, C. S., Metzger, D. C., White, S. J. and Roberts, S. B. (2013), Genomic resource development for shellfish of conservation concern. Molecular Ecology Resources, 13: 295–305. doi: 10.1111/1755-0998.12052

Larvae were transferred to the University of Washington 12 h post spawning. Larvae (12 larvae/mL) were evenly distributed to six 4.5-L larval chambers. Larvae were sampled from all chambers by filtering them onto a 35 μm screen and flash freezing in liquid N2 on days one, two and three post-fertilization. Two RNA-seq libraries (Ol-larv 400_1 and Ol-larv 2000_1) were constructed from pooled mRNA (13 μg per sample).

Raw data is available in the following locations


In late 2012, four more libraries were made and sequenced. This time all four libraries were run in a single lane on the Illumina HiSeq.

Ostrea lurida larvae were subjected to four different pCO2 treatments (400, 1000, 1600 and 2200 ppm) in a Friedman Lab ocean acidification experiment. RNA was isolated from three groups of larvae from each pCO2 treatment using TriReagent (Molecular Research Center) according to the manufacturer's protocol. RNA was resuspended in 100 μL of 0.1% DEPC-treated H2O. Concentrations and quality (OD260/280 ratio) were assessed with a NanoDrop 1000 (ThermoFisher). Five micrograms of total RNA from each larval group within each pCO2 treatment were pooled. Four total RNA pools, one per pCO2 treatment, were submitted for single-end Illumina sequencing at the High Throughput Genomics Unit (HTGU; University of Washington). All four samples were sequenced together in a single lane.

Data is available in the following locations

In order to characterize the reproductive transcriptome of the Olympia oyster, four libraries were made from pooled gonad samples. The IDs were 106A_Female, 106A_Male, 108A_Female and 108A_Male.

These libraries were made in late 2012 and were sequenced on the Illumina HiSeq platform on a 72PE run. An additional lane was run as a bonus as 36SE.

Raw data is available in the following locations

#Mercenaria mercenaria (Hard Clam)

##The early library
Very early on, maybe the second library we sequenced was from the hard clam. This was done on the Illumina platform and the facility tube label (ftls) was "Illumina clam mRNA SEQ 21.3ng/uL 200bp".

####Sample Collection
Hard clam (Mercenaria mercenaria) seed from two different broodstock sources were planted in Scudder’s Lane, Massachusetts in Fall 2008. One broodstock was obtained from Barnstable Harbor in Massachusetts and had previously been exposed to a severe outbreak of QPX in . The second broodstock was obtained from Mashpee, Massachusetts, where no outbreaks of QPX had been reported. Seed clams from both broodstock cohorts were planted together in 4 separate plots. The cohorts will be referred to as BARN and MASH, respectively. Shell size and mortality were assessed by sampling clams (n=X) on 5 sampling dates over a 16-month period. In June and August of 2010, 40 clams were harvested from each cohort for histological analysis. In August 2010, gill tissue was removed from a subset of clams (n=16) using sterile procedures and stored in RNAlater. RNA was extracted from the gill tissue using TriReagent (Molecular Research Center) following the manufacturer's recommended protocol and stored at -80 °C for RNA-seq analysis.

####RNA-Seq Sample Preparation and Analysis
Total RNA samples from eight individuals from each cohort (BARN and MASH) were pooled in equal quantities. Samples were enriched for mRNA using the MicroPoly(A) Purist Kit (Ambion). Library preparation and sequencing were conducted by the University of Washington High Throughput Genomics Unit (HTGU) on the SOLiD 4 System (Applied Biosystems).

All sequence analysis was performed with CLC Genomics Workbench version 4.0 (CLC Bio). Initially, sequences were trimmed based on a quality score limit of 0.05 (Phred; Ewing and Green, 1998; Ewing et al., 1998) and the number of ambiguous nucleotides (>2 on ends). Sequences shorter than 20 bp were also removed. Quality-trimmed reads from both libraries (BARN and MASH) were de novo assembled using the following parameters: limit = 8, mismatch cost = 2, and minimum contig size = 200 bp. Consensus sequences were compared to the UniProtKB/Swiss-Prot database in order to determine putative descriptions. Comparisons were made using the BLASTx algorithm (Altschul et al. 1997). Associated Gene Ontology (GO) terms were used to classify sequences based on biological process as well as to categorize genes into parent categories (GO slim).
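For readers without access to CLC Genomics Workbench, the quality trimming step can be approximated as follows. This is only a sketch of a modified-Mott style trim, which is our understanding of how a quality limit of 0.05 is applied; it is not the CLC code, it skips the ambiguous-nucleotide rule, and the function names are hypothetical.

```python
def modified_mott_trim(quals, limit=0.05):
    """Return (start, end), a half-open interval of the region to keep.

    Each Phred score Q is converted to an error probability 10**(-Q/10).
    A running sum of (limit - p_error) is kept and reset to zero whenever
    it is not positive; the retained region ends at the highest running
    sum and starts after the last reset before that point.
    """
    running = 0.0
    best = 0.0
    start = 0
    best_start, best_end = 0, 0
    for i, q in enumerate(quals):
        p_error = 10 ** (-q / 10)
        running += limit - p_error
        if running <= 0:
            running = 0.0
            start = i + 1
        elif running > best:
            best = running
            best_start, best_end = start, i + 1
    return best_start, best_end


def trim_read(seq, quals, limit=0.05, min_len=20):
    """Trim a read by quality; return None if fewer than min_len bases remain."""
    start, end = modified_mott_trim(quals, limit)
    trimmed = seq[start:end]
    return trimmed if len(trimmed) >= min_len else None
```

A read with low-quality ends keeps only its high-quality core, and a read whose core is shorter than 20 bp is dropped.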

RNA-seq analysis was performed to determine differential gene expression patterns between the BARN and MASH libraries using the de novo assembly as a reference (CLC Genomics Workbench v4.0, CLC Bio). Expression values were measured in RPKM (reads per kilobase of exon model per million mapped reads; Mortazavi et al., 2008). Parameters for RNA-seq analysis were: unspecific match limit = 10, maximum number of mismatches = 2, minimum number of reads = 10. Differentially expressed genes were identified as having a > 2-fold change in RPKM expression values and a significance of p<0.05. Statistical comparison of RPKM values between the BARN and MASH libraries was carried out using Baggerly’s test (Baggerly et al., 2003).
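The RPKM calculation and the fold-change criterion can be sketched in a few lines. The function names are hypothetical, the p-value is assumed to come from a separate test (Baggerly's test is not implemented here), and treating a contig expressed in only one library as an infinite fold change is our assumption.

```python
def rpkm(read_count, contig_length_bp, total_mapped_reads):
    """Reads per kilobase of contig per million mapped reads."""
    return read_count * 1e9 / (contig_length_bp * total_mapped_reads)


def is_differential(rpkm_a, rpkm_b, p_value, min_fold=2.0, alpha=0.05):
    """Apply the > 2-fold change and p < 0.05 criteria to one contig."""
    low, high = sorted((rpkm_a, rpkm_b))
    if low <= 0:
        # Assumption: expression in only one library counts as infinite fold change.
        fold = float("inf") if high > 0 else 1.0
    else:
        fold = high / low
    return fold > min_fold and p_value < alpha
```

For example, 100 reads on a 1,000 bp contig in a library with 1,000,000 mapped reads gives an RPKM of 100.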

Significantly enriched GO terms were identified using the Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 (Huang et al., 2009a; Huang et al., 2009b). The UniProt accession numbers for differentially expressed genes were uploaded as a gene list, while all UniProt accession numbers for annotated contigs were used as a background. Only contigs with a corresponding e-value <1.00E-02 were used in this analysis. Significantly enriched GO terms were identified as those with p<0.05. UniProt accession numbers for significantly enriched GO terms were extracted for further analysis. Additional visualization of enriched GO terms was carried out using Revigo (Supek et al. 2011).

####Restriction site associated DNA (RAD) marker library preparation
Restriction site associated DNA (RAD) marker libraries were constructed to identify diagnostic markers among cohorts. Genomic DNA was isolated separately from the gill tissue of BARN (n=4) and MASH (n=4) clams using DNAzol (Molecular Research Center) as per the manufacturer's recommendations. Libraries were prepared as described by Miller et al. (2007). Briefly, samples (n=8) were digested with SbfI (New England Biolabs), each was hybridized with a unique barcode, and RAD adapters (P1 and P2) were ligated onto the DNA fragments. Size selection of DNA fragments was achieved by running PCR products on a 1% EZ gel (Invitrogen) with E-gel 1 kb Plus DNA ladder, followed by purification using the MinElute gel purification protocol. Subsequent library construction and sequencing were carried out by the University of Washington High Throughput Genomics Unit (HTGU) using the Illumina HiSeq2000 system.

####Restriction site associated DNA (RAD) marker library analysis
Initial sequence read processing of RAD tags was carried out as previously described by Miller et al. (2012). Quality scores were used to remove raw sequencing reads with a probability of sequencing error greater than 10%. Using custom perl scripts (Miller et al. 2012), we then grouped raw sequence reads by individual and removed barcodes and restriction sites, for a total sequence read length of 24 base pairs.
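The read processing above used custom perl scripts that are not reproduced here. As a rough illustration only, the same logic might look like the Python sketch below. Two assumptions are ours: that the 10% threshold applies to the probability that a read contains at least one error, and the barcode and restriction-site lengths, which are hypothetical parameters the caller must supply.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def read_error_probability(quals):
    """Probability that at least one base in the read is wrong."""
    p_all_correct = 1.0
    for q in quals:
        p_all_correct *= 1 - phred_to_error(q)
    return 1 - p_all_correct


def process_read(seq, quals, barcode_len, site_len, tag_len=24, max_error=0.10):
    """Return (barcode, tag) for a passing read, or None if it is rejected.

    barcode_len and site_len are hypothetical inputs; the 24 bp tag length
    is taken from the text.
    """
    if read_error_probability(quals) > max_error:
        return None
    start = barcode_len + site_len
    tag = seq[start:start + tag_len]
    if len(tag) < tag_len:
        return None
    return seq[:barcode_len], tag
```

Reads can then be grouped by individual using the returned barcode as a dictionary key.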

Two types of SNP analysis were performed: characterization of population-specific SNP variation, and identification of SNPs that could potentially distinguish populations. In order to examine population-specific SNP variation, quality-trimmed reads from each cohort (BARN and MASH) were assembled independently using the following parameters: limit = 8 and mismatch cost = 2 (Genomics Workbench 4.0; CLC Bio). SNP detection was carried out using the following parameters: maximum gap and mismatch count = 2, minimum average quality = 15, minimum central quality = 20, minimum coverage = 10, minimum variant frequency = 35% (Genomics Workbench 4.0; CLC Bio).
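To make the coverage and variant-frequency thresholds concrete, the sketch below applies them to the bases observed at a single position. It is a simplified stand-in, not the CLC SNP detection algorithm: the quality and gap/mismatch filters are omitted, and the function name and input format are hypothetical.

```python
from collections import Counter


def call_snp(bases, reference_base, min_coverage=10, min_variant_freq=0.35):
    """Return (variant_allele, frequency) if the position passes, else None.

    `bases` holds the base each aligned read shows at this position.
    """
    coverage = len(bases)
    if coverage < min_coverage:
        return None
    variants = Counter(b for b in bases if b != reference_base)
    if not variants:
        return None
    allele, count = variants.most_common(1)[0]
    frequency = count / coverage
    if frequency < min_variant_freq:
        return None
    return allele, frequency
```

With ten reads covering a position, at least four must carry the variant allele to reach the 35% threshold.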

For the second form of SNP analysis, Novoindex and Novoalign (Novocraft Technologies) were used to assemble RAD-tags and identify RAD-tags within a cohort that were identical (lacked any polymorphisms). These “isotigs” from each cohort were then compared by assembling reads and carrying out SNP detection as described above. Because the tags are monomorphic within each cohort, any SNP identified this way indicates a locus that is fixed for the individuals examined in each cohort.

####Transcriptome sequencing
Sequencing of the hard clam transcriptome yielded a total of 72,352,632 and 58,578,559 reads from the BARN and MASH libraries, respectively. All data are available in the NCBI Short Read Archive database (Accession # SUB001209). After quality trimming, 50,873,441 and 43,972,311 reads remained from the BARN and MASH libraries, respectively, with an overall average length of 36 bp. De novo assembly resulted in 59% of the reads assembling into 8,482 contigs with an N50 value of 250 bp and an average size of 259 bp. All contigs can be found in Supplementary File S1 (fasta format). A total of 2,437 contigs were annotated based on the Swiss-Prot database. Of those contigs with associated GO descriptions, the two most represented biological processes were RNA metabolism and transport (Figure 1). Specific contigs that were determined to be associated with a stress response based on GO annotation can be found in Supplementary Table S2: lab notebook.
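For anyone recomputing assembly statistics from the contig file, N50 can be calculated as sketched below. The function name is hypothetical and the input is assumed to be a plain list of contig lengths in bp.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

For contigs of 2, 3, 4, 5 and 6 bp (20 bp in total), the two longest contigs already cover 11 bp, so the N50 is 5.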


  1. Roberts SB, Hauser L, Seeb LW, Seeb JE. (2012) Development of genomic resources for Pacific herring through targeted transcriptome pyrosequencing. PLoS ONE 7: e30908

  2. Seeb JE, Pascal CE, Grau ED, Seeb LW, Templin WD, Harkins T, Roberts SB. (2011) Transcriptome sequencing and high-resolution melt analysis advance single nucleotide polymorphism discovery in duplicated salmonids. Molecular Ecology Resources 11: 335–348

  3. Goetz F, Rosauer D, Sitar S, Goetz G, Simchick C, Roberts S, Johnson R, Murphy C, Bronte CR, MacKenzie S. (2010) A genetic basis for the phenotypic differentiation between siscowet and lean lake trout (Salvelinus namaycush). Molecular Ecology 19: 176–196
