Authorea

Richard Smith-Unna try to fix structure so Authorea still functions over 9 years ago

Commit id: 2473ccef139821a4456a8c95124558df5e002b82


---
:mouse:
  :reads:
    :url:
      - 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR797/SRR797058/SRR797058_1.fastq.gz'
      - 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR797/SRR797058/SRR797058_2.fastq.gz'
    :left:
      - 'SRR797058_1.fastq'
    :right:
      - 'SRR797058_2.fastq'
  :reference:
    :url:
      - 'ftp://ftp.ensembl.org/pub/release-78/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz'
    :fa:
      - 'Mus_musculus.GRCm38.pep.all.fa'
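As a rough illustration of how a configuration like the one above could be consumed, the sketch below parses an inline copy of the mouse entry and collects its download URLs. The helper names `load_config` and `download_urls` are hypothetical and do not appear in the commit. The nesting of `:reference:` under `:mouse:` is an assumption about the original layout.

```ruby
require "yaml"

# Inline copy of the mouse entry from the configuration above.
CONFIG_TEXT = <<~CONFIG
  ---
  :mouse:
    :reads:
      :url:
        - 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR797/SRR797058/SRR797058_1.fastq.gz'
        - 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR797/SRR797058/SRR797058_2.fastq.gz'
      :left:
        - 'SRR797058_1.fastq'
      :right:
        - 'SRR797058_2.fastq'
    :reference:
      :url:
        - 'ftp://ftp.ensembl.org/pub/release-78/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz'
      :fa:
        - 'Mus_musculus.GRCm38.pep.all.fa'
CONFIG

# Hypothetical helper: keys written as ":mouse:" are parsed as Ruby symbols,
# so Symbol has to be permitted explicitly when using safe_load.
def load_config(text)
  YAML.safe_load(text, permitted_classes: [Symbol])
end

# Hypothetical helper: every URL that would need downloading for one dataset.
def download_urls(config, dataset)
  entry = config.fetch(dataset)
  entry.fetch(:reads).fetch(:url) + entry.fetch(:reference).fetch(:url)
end

config = load_config(CONFIG_TEXT)
puts download_urls(config, :mouse)
```

Using `fetch` rather than `[]` makes a missing key fail loudly, which may be preferable when a typo in the configuration would otherwise surface only much later in the analysis.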

%w(analysis bless data dependencies khmer soap trinity).each do |file|
  require "txome-feeding-paper/#{file}"
end

module TxomeFeeding

  # The Analysis class coordinates the entire analysis. It takes care of
  # downloading data and dependencies, running the read processing,
  # performing the assemblies, inspecting them, analysing the data,
  # generating figures, and finally producing the paper.
  class Analysis

    def initialize
    end

    # Run the analysis
    def run
    end

  end # Analysis

end # TxomeFeeding

module TxomeFeeding

  class Assembler
  end # Assembler

end # TxomeFeeding

module TxomeFeeding

  # The Subsampler class takes care of subsampling the raw reads for an
  # experiment, generating the various sizes of read subset.
  class Subsampler

    # sizes are given in millions of reads; the same seed is used for the
    # left and right files so that read pairs stay in sync
    def initialize(reads, sizes = [10, 20, 50, 75, 100], seed = 1337)
      @left = reads[:left]
      @right = reads[:right]
      @sizes = sizes
      @seed = seed
    end

    # Yields the pair of subsampled file names for each subset size
    def each
      @sizes.each do |n|
        # integer read count, so the number is not interpolated as a float
        subsetsize = n * 1_000_000
        samplefiles = []
        [@left, @right].each do |readfile|
          samplefile = "#{n}M_#{readfile}"
          cmd = "seqtk sample"
          cmd += " -s #{@seed}"
          cmd += " #{readfile}"
          cmd += " #{subsetsize}"
          cmd += " > #{samplefile}"
          process = Cmd.new cmd
          process.run
          unless process.status.success?
            errmsg = "While trying to subset reads file: #{readfile} " +
                     "seqtk command failed to run. Command output: " +
                     "#{process.stdout}\n#{process.stderr}"
            raise StandardError.new errmsg
          end
          samplefiles << samplefile
        end
        yield samplefiles
      end
    end

  end # Subsampler

end # TxomeFeeding
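To make the subsampling logic easier to inspect without having seqtk installed, the following sketch builds the command strings the Subsampler would run instead of executing them. The helpers `seqtk_sample_command` and `subsample_plan` are hypothetical names introduced here for illustration, and the sketch assumes that each read file is passed as a plain string rather than as an array.

```ruby
# Hypothetical helper (not part of the commit): build the seqtk command for
# one read file and one subset size, where the size is in millions of reads.
def seqtk_sample_command(readfile, millions, seed = 1337)
  subsetsize = millions * 1_000_000
  samplefile = "#{millions}M_#{readfile}"
  "seqtk sample -s #{seed} #{readfile} #{subsetsize} > #{samplefile}"
end

# Hypothetical helper: one [left_command, right_command] pair per subset size.
# The same seed is used for both files so the sampled reads remain paired.
def subsample_plan(reads, sizes = [10, 20, 50, 75, 100], seed = 1337)
  sizes.map do |n|
    [reads[:left], reads[:right]].map do |readfile|
      seqtk_sample_command(readfile, n, seed)
    end
  end
end

plan = subsample_plan({ left: "SRR797058_1.fastq", right: "SRR797058_2.fastq" })
plan.each { |pair| puts pair }
```

Separating command construction from execution in this way is one possible design. It lets the generated commands be checked in a unit test before any long-running subsampling is started.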

To demonstrate the merits of our recommendations, a number of assemblies were produced using a variety of methods. Specifically, all assembly datasets were produced by assembling a publicly available 100bp paired-end Illumina dataset (Short Read Archive ID SRR797058, \citep{Macfarlan:2012js}). This dataset was randomly subsampled into subsets of 10, 20, 50, 75, and 100 million read pairs as described in \citep{MacManes:2014io}. Reads were error corrected using the software package \textsc{bless} version 0.16 \citep{Heo:2014cb} with a k-mer size of 19, which was selected based on the developers' recommendation. Illumina sequencing adapters were removed from both ends of the sequencing reads, as were nucleotides with Phred quality $\leq$ 2, using the program Trimmomatic version 0.32 \citep{Bolger:2014ek}. The adapter- and quality-trimmed, error-corrected reads were then assembled using Trinity release r20140717 or SOAPdenovo-Trans version 1.03. Trinity was employed using default settings, while SOAPdenovo-Trans was employed after optimizing kmer size, [and those other flags i forget right now]. \\
To demonstrate the efficacy of using multiple assemblers, in this case Trinity and SOAPdenovo-Trans, we merged assemblies via the following process. [Richard fill in details]. To illustrate the shortcomings of length-based evaluation, we generated an assembly using Trinity with settings purposely designed to increase the length of contigs while sacrificing accuracy (--path\_reinforcement\_distance 1 --min\_per\_id\_same\_path 80 --max\_diffs\_same\_path 5 --min\_glue 1). \\
Assemblies were characterized using Transrate version 1.0.0.beta1. Using this software, we generated three kinds of metrics: contig metrics; mapping metrics, which used as input the same reads that were fed into the assembler for each assembly; and comparative metrics, which used as input the \textit{Mus musculus} version 75 'all' protein file downloaded from Ensembl for all assemblies.