Efficient construction of de Bruijn and string graphs: an algorithmic view of assembly graphs in the metagenomics era

loading page

gianluca.dellavedova

Abstract

De novo assembly of large sequences from sequence reads typically employs either string or de Bruijn graphs. The former are the basis for most assemblers designed for Sanger sequences, while the latter have been extensively used in assemblers designed for second generation sequence data.

Contemporary, sequencing platforms can produce terabytes of data and the construction and navigation of assembly graphs pose significant computational challenges. In particular, the generation and maintenance of compact representations of such graphs must attain a satisfactory balance of main memory usage and runtime. These challenges will become more critical as ever larger datasets are produced, particularly in metagenomics studies, where additional factors complicate sequence assembly. Recently, several promising approaches to graph construction and storage have been proposed and implemented.

In this review, we discuss both the use of the Burrows-Wheeler Transform and the FM-index in the construction and representation of de Bruijn and string graphs, and the incorporation of other computational techniques, such as the Bloom filter, to allow compact representation of such graphs. We show, with quantified examples, that by working on compact, yet functionally complete, representations of the input data, great savings in memory usage can be obtained with negligible increases in run times. Finally, we present a comparison of string and de Bruijn graph–based metagenomics assemblers, illustrating that their relative performance varies as a function of sequencing depth and diversity of community structure.