Availability and requirements

Pipelines and tutorials

Project name: Sewing machine pipeline

Project home page: The Sewing machine script and tutorial are available at https://github.com/i5K-KINBRE-script-share/Irys-scaffolding/blob/master/KSU_bioinfo_lab/stitch/sewing_machine_LAB.md.

Operating system(s): Linux (tested on CentOS 7, Gentoo and Ubuntu).

Programming language: Perl, Rscript, Bash

License: Pipeline script and tutorial are available free of charge to academic and non-profit institutions.

Any restrictions to use by non-academics: Please contact authors for commercial use.

Dependencies: Sewing machine requires BioPerl and BNGCompare. RefAligner is also required between iterations and is available on request from BioNano Genomics (http://www.bionanogenomics.com/).

Project name: “Raw data-to-finished assembly and assembly analysis” pipeline

Project home page: The pipeline script and tutorial are available at https://github.com/i5K-KINBRE-script-share/Irys-scaffolding/blob/master/KSU_bioinfo_lab/assemble_XeonPhi/assemble_XeonPhi_LAB.md.

Operating system(s): Linux CentOS 7 on a Xeon Phi server with 1488 threads (6 x 60 x 4 Xeon Phi co-processor threads + 24 x 2 Xeon host threads), 256 GB of host RAM and 6 x 8 GB of Xeon Phi RAM.

Programming language: Perl, Rscript, Bash

License: Pipeline script and tutorial are available free of charge to academic and non-profit institutions.

Any restrictions to use by non-academics: Please contact authors for commercial use.

Dependencies: AssembleIrysXeonPhi.pl and AssembleIrysCluster.pl require DRMAA job submission libraries. RefAligner and Assembler are also required and are available on request from BioNano Genomics (http://www.bionanogenomics.com/).

Project name: “Raw data-to-finished de novo assembly and assembly analysis” pipeline

Project home page: The pipeline script and tutorial are available at https://github.com/i5K-KINBRE-script-share/Irys-scaffolding/blob/master/KSU_bioinfo_lab/assemble_XeonPhi/assemble_XeonPhi_de_novo_LAB.md.

Operating system(s): Linux CentOS 7 on a Xeon Phi server with 1488 threads (6 x 60 x 4 Xeon Phi co-processor threads + 24 x 2 Xeon host threads), 256 GB of host RAM and 6 x 8 GB of Xeon Phi RAM.

Programming language: Perl, Rscript, Bash

License: Pipeline script and tutorial are available free of charge to academic and non-profit institutions.

Any restrictions to use by non-academics: Please contact authors for commercial use.

Dependencies: AssembleIrysXeonPhi.pl and AssembleIrysCluster.pl require DRMAA job submission libraries. RefAligner and Assembler are also required and are available on request from BioNano Genomics (http://www.bionanogenomics.com/).

Assembly scripts

Project name: AssembleIrysXeonPhi.pl / AssembleIrysCluster.pl

Project home page: AssembleIrysXeonPhi scripts are available at https://github.com/i5K-KINBRE-script-share/Irys-scaffolding/blob/master/KSU_bioinfo_lab/assemble_XeonPhi/AssembleIrysXeonPhi.pl. The currently unsupported AssembleIrysCluster scripts are available on GitHub at https://github.com/i5K-KINBRE-script-share/Irys-scaffolding/tree/master/KSU_bioinfo_lab/assemble_SGE_cluster.

Operating system(s): For AssembleIrysXeonPhi.pl, Linux CentOS 7 on a Xeon Phi server with 1488 threads (6 x 60 x 4 Xeon Phi co-processor threads + 24 x 2 Xeon host threads), 256 GB of host RAM and 6 x 8 GB of Xeon Phi RAM; for AssembleIrysCluster.pl, an SGE Linux cluster (tested on Gentoo).

Programming language: Perl, Rscript, Bash

License: AssembleIrysXeonPhi.pl and AssembleIrysCluster.pl are available free of charge to academic and non-profit institutions.

Any restrictions to use by non-academics: Please contact authors for commercial use.

Dependencies: AssembleIrysXeonPhi.pl and AssembleIrysCluster.pl require DRMAA job submission libraries. RefAligner and Assembler are also required and are available on request from BioNano Genomics (http://www.bionanogenomics.com/).

Super scaffolding scripts

Project name: stitch.pl

Project home page: Stitch scripts are available on GitHub at https://github.com/i5K-KINBRE-script-share/Irys-scaffolding/tree/master/KSU_bioinfo_lab/stitch.

Operating system(s): Mac and Linux (tested on Gentoo and Ubuntu)

Programming language: Perl, Rscript, Bash

License: stitch.pl is available free of charge to academic and non-profit institutions.

Any restrictions to use by non-academics: Please contact authors for commercial use.

Dependencies: stitch.pl requires BioPerl. RefAligner and Assembler are also required between iterations and are available on request from BioNano Genomics (http://www.bionanogenomics.com/).

Map summary scripts

Project name: BNGCompare.pl, bnx_stats.pl, cmap_stats.pl and xmap_stats.pl

Project home page: All scripts are available on GitHub at https://github.com/i5K-KINBRE-script-share/Irys-scaffolding/tree/master/KSU_bioinfo_lab/map_tools and https://github.com/i5K-KINBRE-script-share/BNGCompare.

Operating system(s): Mac and Linux (tested on Gentoo and Ubuntu)

Programming language: Perl, Rscript, Bash

License: bnx_stats.pl, cmap_stats.pl and xmap_stats.pl are available free of charge to academic and non-profit institutions.

Any restrictions to use by non-academics: Please contact authors for commercial use.

Dependencies: bnx_stats.pl, cmap_stats.pl and xmap_stats.pl have no dependencies.

Competing interests

JMS, MCC, NH, NL and SJB declare that they have no competing interests. ETL, PS and TA are employees of BioNano Genomics and hold stock options.

Authors’ contributions

MCC isolated the high molecular weight DNA and generated the image files on the Irys. ETL and JMS developed the assembly workflow. JMS wrote most of the code in the IrysScaffolding GitHub repository (Stitch, AssembleIrysXeonPhi, AssembleIrysCluster, etc.). NH assisted with the initial code review of analyze_irys_output (the precursor to Stitch) and prepared Tcas5.0. JMS and NL manually edited Tcas5.1. JMS performed the data analyses. TA contributed to the sections discussing BioNano RefAligner and Assembler. PS contributed to the interpretation of results. JMS and SJB did most of the writing, with contributions from all authors. All authors read and approved the final manuscript.

Acknowledgements

This project was supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under grant number P20 GM103418. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.

Data for Additional file 1 and Additional file 6 was kindly made available by P.A. Larsen, J. Rogers, A.D. Yoder and the Duke Lemur Center.

The Tribolium castaneum genome project is part of the i5k Genome Sequencing Initiative for Insects and Other Arthropods.

Matthias Weissensteiner and Jochen Wolf, Uppsala University. Stephen Schaeffer from The Pennsylvania State University and Stephen Richards from the Baylor College of Medicine Human Genome Sequencing Center for the use of the D. pseudoobscura data. Mike Kanost from Kansas State University. Jeff Maughan from Brigham Young University for the use of the Amaranth data. The Udall Lab from Brigham Young University and Cotton Inc. for the use of the cotton data. Grant (NSF 1237993) for use of the Medicago data. Christopher Cunningham, University of Georgia, for the use of the Nicrophorus data. Catherine Peichel from the Fred Hutchinson Cancer Research Center and Michael White from the University of Georgia for the Gasterosteus data. Mirkó Palla, Ph.D., Wyss Institute Postdoctoral Fellow, Church Laboratory, Department of Genetics, Harvard Medical School, and George Church, Ph.D., Wyss Institute Core Faculty Member, Robert Winthrop Professor of Genetics at Harvard Medical School, Professor of Health Sciences and Technology at Harvard and MIT, and Senior Associate Member at the Broad Institute of Harvard and MIT, for the Escherichia coli data.

Tables

Single molecule maps from T. castaneum filtered by minimum length. Molecule map N50, cumulative length and number of maps are listed for all three molecule length filters for the T. castaneum genome data.

| Minimum molecule map length (kb) | Molecule map N50 (kb) | Cumulative length (Mb) | Number of molecule maps |
| --- | --- | --- | --- |
| 100 | 165.35 | 82,738.71 | 503,414 |
| 150 | 202.64 | 50,579.12 | 239,558 |
| 180 | 232.57 | 34,287.15 | 139,949 |
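
The three metrics in this table (N50, cumulative length and number of maps after a minimum-length filter) can be illustrated with a minimal Python sketch. It is not the code of bnx_stats.pl: the function names, the toy lengths and the choice to keep maps at or above the cutoff (`>=`) are assumptions made for illustration only.

```python
def n50(lengths):
    """Return the N50: the length L such that maps of length >= L
    together hold at least half of the cumulative length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # no maps passed the filter


def summarize(lengths_kb, min_length_kb):
    """Filter molecule map lengths (kb) by a minimum length and summarize.

    Assumption: maps exactly at the cutoff are kept (>=)."""
    kept = [length for length in lengths_kb if length >= min_length_kb]
    return {
        "min_length_kb": min_length_kb,
        "n50_kb": n50(kept),
        "cumulative_length_mb": sum(kept) / 1000,  # kb -> Mb
        "number_of_maps": len(kept),
    }


# Hypothetical molecule map lengths in kb, not data from the study.
toy_lengths_kb = [90, 120, 160, 200, 250, 400]
for cutoff in (100, 150, 180):
    print(summarize(toy_lengths_kb, cutoff))
```

Raising the cutoff removes short maps, so the number of maps and the cumulative length can only fall while the N50 can only stay the same or rise, which is the trend seen in the table.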

T. castaneum assembly summary. Assembly metrics for Tcas5.0 (the starting sequence scaffolds), the Tcas5.0 in silico maps, the consensus genome map of assembled molecule maps, the automated output of Stitch (Tcas5.1), the manually curated sequence assembly (Tcas5.2) and the sequence assembly produced by the BioNano Hybrid Scaffold software for the T. castaneum genome.

| Assembly | N50 (Mb) | Number | Cumulative length (Mb) |
| --- | --- | --- | --- |
| Tcas5.0 sequence scaffolds | 1.16 | 2240 | 160.74 |
| Tcas5.0 in silico maps | 1.20 | 223 | 152.53 |
| Consensus genome maps | 1.35 | 216 | 200.47 |
| Tcas5.1 sequence scaffolds | 3.85 | 2148 | 165.72 |
| Tcas5.2 sequence scaffolds | 4.46 | 2150 | 165.92 |
| Tcas BioNano hybrid scaffolds | 1.83 | 2210 | 175.54 |

Alignment of T. castaneum consensus genome maps to the in silico maps of Tcas5.0. Breadth of alignment coverage (non-redundant alignment), length of total alignment (including redundant alignments) and percent of CMAP covered (non-redundantly) were calculated for the in silico maps and the consensus genome maps of the T. castaneum genome using xmap_stats.pl.

| CMAP | Breadth of alignment coverage (Mb) | Length of total alignment (Mb) | Percent of CMAP aligned |
| --- | --- | --- | --- |
| Tcas5.0 in silico maps | 124.04 | 132.40 | 81 |
| Consensus genome maps | 131.64 | 132.34 | 67 |
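
The difference between breadth of alignment coverage (each aligned position counted once) and length of total alignment (overlapping alignments counted repeatedly) can be sketched as an interval merge. This is a hedged illustration, not the implementation in xmap_stats.pl: the function names, the single-map toy intervals and the map length are hypothetical.

```python
def total_alignment_length(intervals):
    """Sum of alignment lengths; overlapping regions are counted repeatedly."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Non-redundant length covered: merge overlapping intervals first."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps (or touches) the current block: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def percent_covered(intervals, map_length):
    """Percent of one map covered non-redundantly by its alignments."""
    return 100 * breadth_of_coverage(intervals) / map_length


# Hypothetical alignments (start, end) on one map of length 500, arbitrary units.
toy_alignments = [(0, 100), (50, 150), (300, 400)]
print(total_alignment_length(toy_alignments))  # redundant total
print(breadth_of_coverage(toy_alignments))     # non-redundant breadth
print(percent_covered(toy_alignments, 500))
```

In a real XMAP, intervals would first be grouped by map ID so that alignments on different maps are never merged; the sketch handles a single map only.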

Each T. castaneum chromosome linkage group (ChLG) before and after super scaffolding. The number of sequence scaffolds in the ordered Tcas5.0 ChLG bins and the number of sequence super scaffolds and scaffolds in the Tcas5.2 ChLG bins. The number of sequence scaffolds that were unplaced in Tcas5.0 and placed with a ChLG in Tcas5.2 is also listed.

| ChLG | Tcas5.0 scaffolds | Unplaced scaffolds added in Tcas5.2 | Tcas5.2 super scaffolds |
| --- | --- | --- | --- |
| X | 13 | +2 | 2 |
| 2 | 18 | +1 | 10 |
| 3 | 29 | +4 | 20 |
| 4 | 6 | +2 | 2 |
| 5 | 17 | +1 | 4 |
| 6 | 12 | +6 | 6 |
| 7 | 15 | - | 6 |
| 8 | 14 | +1 | 8 |
| 9 | 21 | - | 9 |
| 10 | 12 | +2 | 10 |
| Total | 157 | 19 | 77 |

Additional Files

Additional file 1 — Single molecule map stretch per scan in recent flowcells.

Bases per pixel (bpp) is plotted for scans 1..\(n\) for each flowcell of mouse lemur molecules (purple). The first scan of each flowcell is indicated with a grey dashed line. The pre-adjusted molecule map stretch was determined by aligning molecule maps to the in silico maps. Data made available by P.A. Larsen, J. Rogers, A.D. Yoder and the Duke Lemur Center.

Additional file 2 — Cumulative length and number of single molecule maps per BNX file for T. castaneum data generated over time

Detailed metrics for molecule maps per BNX file (cumulative length and number of maps). Columns include cumulative length of molecule maps \(>\) 150 kb, number of molecule maps \(>\) 150 kb and date that BNX file was generated.

Additional file 3 — Single molecule map metrics and histograms from T. castaneum DNA

Detailed metrics for molecule maps including map N50, cumulative length and number of maps. Figures show histograms of per molecule map quality metrics including length, molecule map SNR and intensity, label count, label SNR and label intensity. Molecule maps are filtered by minimum molecule lengths of 100, 150 or 180 kb.

Additional file 4 — Assembly of T. castaneum consensus genome maps with range of parameters

Detailed assembly metrics for assembled consensus genome maps. Assemblies using the strict, default and relaxed “-T” parameter (the p-value threshold) are named Strict-T, Default-T and Relaxed-T, respectively. The best “-T” parameter was used for two additional assemblies with either a relaxed minimum molecule map length (relaxed-minlen) of 100 kb, rather than the 150 kb default, or a strict minimum molecule map length (strict-minlen) of 180 kb.

Additional file 5 — ChLGs before and after super scaffolding

Alignments of Tcas5.0 and Tcas5.2 in silico maps to consensus genome maps for all ChLGs. Consensus genome maps (blue with molecule coverage shown in dark blue) aligned to the in silico maps (green with contigs overlaid as translucent colored squares). Alignments to both Tcas5.2 super scaffolds (top alignment) and Tcas5.0 scaffolds (bottom alignment) are shown.

Additional file 6 — Assembly and super scaffolding with multiple genera.

We examined experiments from 16 different genera to determine if the results seen for the Tribolium castaneum genome are typical for other genomes as well.