Richard Smith-Unna try to fix structure so Authorea still functions  over 9 years ago

Commit id: 2473ccef139821a4456a8c95124558df5e002b82


---
:mouse:
  :reads:
    :url:
      - 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR797/SRR797058/SRR797058_1.fastq.gz'
      - 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR797/SRR797058/SRR797058_2.fastq.gz'
    :left:
      - 'SRR797058_1.fastq'
    :right:
      - 'SRR797058_2.fastq'
  :reference:
    :url:
      - 'ftp://ftp.ensembl.org/pub/release-78/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz'
    :fa:
      - 'Mus_musculus.GRCm38.pep.all.fa'
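As a rough illustration of how this config might be consumed, the sketch below parses an abbreviated copy of it and checks that the download URLs line up with the expected local file names. The nesting of the keys is inferred from the flattened listing, and `load_config` and `decompressed_names` are hypothetical helpers, not functions from the repository.

```ruby
require "yaml"

# Abbreviated copy of the config above; the indentation is an assumption.
CONFIG_TEXT = <<~CONFIG
  ---
  :mouse:
    :reads:
      :url:
        - 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR797/SRR797058/SRR797058_1.fastq.gz'
        - 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR797/SRR797058/SRR797058_2.fastq.gz'
      :left:
        - 'SRR797058_1.fastq'
      :right:
        - 'SRR797058_2.fastq'
CONFIG

# Hypothetical helper: parse the config, permitting Symbol keys such as :mouse.
def load_config(text)
  YAML.safe_load(text, permitted_classes: [Symbol])
end

# Hypothetical helper: the file name each URL should yield once gunzipped.
def decompressed_names(urls)
  urls.map { |url| File.basename(url, ".gz") }
end

config = load_config(CONFIG_TEXT)
reads  = config[:mouse][:reads]
puts decompressed_names(reads[:url]).inspect
```

Comparing `decompressed_names(reads[:url])` against `reads[:left] + reads[:right]` is one cheap way to catch a mismatch between a URL and the file name the later steps expect.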

%w(analysis bless data dependencies khmer soap trinity).each do |file|
  require "txome-feeding-paper/#{file}"
end

module TxomeFeeding
  # The Analysis class coordinates the entire analysis. It takes care of
  # downloading data and dependencies, running the read processing,
  # performing the assemblies, inspecting them, analysing the data,
  # generating figures, and finally producing the paper.
  class Analysis
    def initialize
    end

    # Run the analysis
    def run
    end
  end # Analysis
end # TxomeFeeding

module TxomeFeeding
  class Assembler
  end # Assembler
end # TxomeFeeding

module TxomeFeeding
  # The subsampler class takes care of subsampling the raw reads for an
  # experiment, generating the various sizes of read subset.
  class Subsampler
    # reads: hash with :left and :right read files (assumed here to be
    #        single file paths)
    # sizes: subset sizes, in millions of read pairs
    # seed:  random seed passed to seqtk
    def initialize(reads,
                   sizes = [10, 20, 50, 75, 100],
                   seed = 1337)
      @left = reads[:left]
      @right = reads[:right]
      @sizes = sizes
      @seed = seed
    end

    # Yields the pair of subsampled files for each subset size.
    def each
      @sizes.each do |n|
        subsetsize = n * 1_000_000 # integer read count
        samplefiles = []
        [@left, @right].each do |readfile|
          samplefile = "#{n}M_#{readfile}"
          # the same seed is used for both files so read pairs stay in sync
          cmd = "seqtk sample"
          cmd += " -s #{@seed}"
          cmd += " #{readfile}"
          cmd += " #{subsetsize}"
          cmd += " > #{samplefile}"
          process = Cmd.new cmd
          process.run
          unless process.status.success?
            errmsg = "While trying to subset reads file: #{readfile} " +
                     "seqtk command failed to run. Command output: " +
                     "#{process.stdout}\n#{process.stderr}"
            raise StandardError.new errmsg
          end
          samplefiles << samplefile
        end
        yield samplefiles
      end
    end
  end # Subsampler
end # TxomeFeeding
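The command string that `Subsampler#each` assembles can be inspected without running seqtk. The sketch below factors it into a standalone function for that purpose. `seqtk_sample_command` is a hypothetical helper, not part of the repository, and the use of `Shellwords.escape` is an added precaution rather than something the original code does.

```ruby
require "shellwords"

# Hypothetical helper: build the seqtk command for one read file.
# n_million is the subset size in millions of reads.
def seqtk_sample_command(readfile, n_million, seed)
  subset_size = n_million * 1_000_000
  samplefile = "#{n_million}M_#{readfile}"
  [
    "seqtk sample",
    "-s #{seed}",                        # same seed for both mates
    Shellwords.escape(readfile),
    subset_size,
    "> #{Shellwords.escape(samplefile)}"
  ].join(" ")
end

# Dry run: print the commands for the 10M subset of both read files.
%w[SRR797058_1.fastq SRR797058_2.fastq].each do |readfile|
  puts seqtk_sample_command(readfile, 10, 1337)
end
```

Using one seed for the left and right files matters because seqtk then selects the same records from each file, which keeps the pairs matched in the subsampled output.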

To demonstrate the merits of our recommendations, a number of assemblies were produced using a variety of methods. Specifically, all assembly datasets were produced by assembling a publicly available 100 bp paired-end Illumina dataset (Short Read Archive ID SRR797058, \citep{Macfarlan:2012js}). This dataset was randomly subsampled into subsets of 10, 20, 50, 75, and 100 million read pairs as described in \citep{MacManes:2014io}. Reads were error corrected using the software package \textsc{bless} version 0.16 \citep{Heo:2014cb} with a k-mer size of 19, which was selected based on the developers' recommendation. Illumina sequencing adapters were removed from both ends of the sequencing reads, as were nucleotides with Phred quality $\leq$ 2, using the program Trimmomatic version 0.32 \citep{Bolger:2014ek}. The adapter- and quality-trimmed, error-corrected reads were then assembled using Trinity release r20140717 or SOAPdenovo-Trans version 1.03. Trinity was employed using default settings, while SOAPdenovo-Trans was employed after optimizing k-mer size, [and those other flags i forget right now]. \\
To demonstrate the efficacy of using multiple assemblers, in this case Trinity and SOAPdenovo-Trans, we merged assemblies via the following process. [Richard fill in details]. To illustrate the shortcomings of length-based evaluation, we generated an assembly using Trinity with settings purposely designed to increase the length of contigs while sacrificing accuracy (--path\_reinforcement\_distance 1 --min\_per\_id\_same\_path 80 --max\_diffs\_same\_path 5 --min\_glue 1). \\
Assemblies were characterized using Transrate version 1.0.0.beta1.
Using this software, we generated three kinds of metrics: contig metrics; mapping metrics, which used as input the same reads that were fed into the assembler for each assembly; and comparative metrics, which used as input the \textit{Mus musculus} version 75 'all' protein file downloaded from Ensembl for all assemblies.
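The set of assemblies implied by the methods text can be sketched as a small job matrix: each read subset crossed with each assembler. This is only an illustration. It assumes every subset is assembled with both Trinity and SOAPdenovo-Trans, which the text does not state explicitly, and `Job`, `assembly_jobs`, and the label format are hypothetical names rather than code from the repository.

```ruby
# Subset sizes in millions of read pairs, as listed in the methods text.
SUBSET_SIZES_M = [10, 20, 50, 75, 100].freeze
# Assumption: both assemblers are run on every subset.
ASSEMBLERS = %w[trinity soapdenovo-trans].freeze

# Hypothetical record describing one assembly to produce.
Job = Struct.new(:size_m, :assembler) do
  def label
    "#{size_m}M_#{assembler}"
  end
end

# Cross every subset size with every assembler.
def assembly_jobs(sizes = SUBSET_SIZES_M, assemblers = ASSEMBLERS)
  sizes.product(assemblers).map { |size, asm| Job.new(size, asm) }
end

assembly_jobs.each { |job| puts job.label }
```

Under these assumptions the matrix has ten entries. The merged assemblies and the deliberately length-inflated Trinity run described above would be additional jobs outside it.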