Rationale

In theory, assembly graphs are much more complete than the contigs we end up writing to a FASTA file. So to check this we decided to do a small test with an A. thaliana w2rap-contigger assembly.

The idea is that if you look for both genes and or contiguous chunks of the genome in the assembly graph, as long as the appropriate connections exist, you can find the complete genes.

If we find this useful, we can probably include a little tool to do the alignments on the graph output of the w2rap-contigger and give you back a subset of the graph, and even stitch the paths together into an ad-hoc reference sequence chunk.

Looking for interesting (i.e. broken on contigs, path in graph) genes manually

We aligned genes to the a.lines.fasta output, and we aligned them to the edges of the large_K.final_raw.fasta which are the sequence components of the large_K.final_raw.gfa. By examining bandage outputs we have found a couple of genes that look promising.

Example 1: a loop

Gene contig
AT1G12310.1 flattened_line_162
AT1G12320.1 flattened_line_162
AT1G12330.1 flattened_line_162
AT1G12340.1 flattened_line_1247
AT1G12340.1 flattened_line_392
AT1G12350.1 flattened_line_1247
AT1G12360.1 flattened_line_1247
AT1G12370.2 flattened_line_1247
AT1G12380.1 flattened_line_1247
AT1G12390.1 flattened_line_1247
AT1G12390.1 flattened_line_392
AT1G12400.1 flattened_line_392
AT1G12410.1 flattened_line_392

If we look at the mapping of these genes to edges, so this looks like AT1G12340 is a promising candidate for being split unnecessarily (besides the whole region being not put together properly).

So, the corresponding edges on the graph are:

Gene edge
AT1G12340.1 edge187513
AT1G12340.1 edge868556
AT1G12340.1 edge868557

So we decided to go into bandage and have a look at the region and... a loop! (see figure 1) It makes sense that it has not been solved (with the parameters and algorithms we used, anyway), but it is also obvious how the region actually doesn't show nothing new vs. the reference genes.

Creating some whole-genome stats for this assembly graph

Interrupted genes in contig ends.

Method:

• Run an alignment of genes to contigs and find which genes map to the very end of them, and "hang over".
• Produce a list of those genes, and also a list of genes that map "prefectly" within a contig.
• So we have 3 sets: perfects, hanging, others.

Interrupted genes in contig ends perfectly represented in the graph.

• Take a gene from the "broken" sets:
• Find all possible locations for each kmer of the gene.
• Find all possible paths with orderder sets of kmers from the gene.
• Construct sequences for all possible paths.
• Align gene to each one of the constructed sequences, and if a perfect alignmenet is found, declare that path "winner", and mark the gene as solved.