In theory, assembly graphs are much more complete than the contigs we end up writing to a FASTA file. To check this, we ran a small test on an A. thaliana w2rap-contigger assembly.
The idea is that if you look for genes and/or contiguous chunks of the genome in the assembly graph, you can find them complete as long as the appropriate connections exist.
If this turns out to be useful, we could include a small tool that does the alignments on the graph output of the w2rap-contigger and gives you back a subset of the graph, or even stitches the paths together into an ad-hoc reference sequence chunk.
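To make the "give you back a subset of the graph" idea concrete, here is a minimal sketch of what such a tool might do. It assumes a plain GFA1 file with tab-separated `S` (segment) and `L` (link) records; the function names `parse_gfa` and `extract_subgraph` and the `radius` parameter are hypothetical, not part of the w2rap-contigger.

```python
from collections import defaultdict, deque


def parse_gfa(lines):
    """Read S (segment) and L (link) records from GFA1 lines."""
    segments = {}  # segment name -> full S record
    links = []     # (from_name, to_name, full L record)
    for line in lines:
        record = line.rstrip("\n")
        fields = record.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = record
        elif fields[0] == "L" and len(fields) >= 6:
            links.append((fields[1], fields[3], record))
    return segments, links


def extract_subgraph(segments, links, seeds, radius=1):
    """Keep the seed segments plus everything within `radius` links of them."""
    neighbours = defaultdict(set)
    for src, dst, _ in links:
        # Orientation is ignored here: we only want the neighbourhood.
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    keep = {s for s in seeds if s in segments}
    queue = deque((s, 0) for s in keep)
    while queue:
        name, dist = queue.popleft()
        if dist == radius:
            continue
        for nxt in neighbours[name]:
            if nxt in segments and nxt not in keep:
                keep.add(nxt)
                queue.append((nxt, dist + 1))

    # Emit kept segments first, then the links that connect two kept segments.
    out = [segments[name] for name in sorted(keep)]
    out += [rec for src, dst, rec in links if src in keep and dst in keep]
    return out
```

The seeds would be the edges that a gene aligned to, and the returned records form a small GFA that can be opened on its own. Stitching a path into a single sequence would additionally need the link orientations and overlaps, which this sketch deliberately leaves out.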
We aligned the genes to the a.lines.fasta output and to the edges in large_K.final_raw.fasta, which are the sequence components of large_K.final_raw.gfa. By examining the Bandage output we found a couple of genes that look promising.
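A quick way to screen for genes whose hits land on more than one edge could look like the sketch below. It assumes a tab-separated hit table in the 12-column layout of BLAST's tabular output (gene in column 1, edge in column 2, query start and end in columns 7 and 8); the aligner and output format actually used here are not stated, and `split_genes` is a hypothetical helper.

```python
from collections import defaultdict


def split_genes(hit_lines):
    """Return {gene: [edges in query order]} for genes hitting more than one edge."""
    hits = defaultdict(list)
    for line in hit_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        gene, edge = fields[0], fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        hits[gene].append((qstart, qend, edge))

    result = {}
    for gene, spans in hits.items():
        spans.sort()  # order the hits along the gene
        edges = []
        for _, _, edge in spans:
            if edge not in edges:
                edges.append(edge)
        if len(edges) > 1:
            result[gene] = edges
    return result
```

Genes reported here are only candidates: whether the edges are actually connected still has to be checked in the graph itself.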
Looking at the mapping of these genes to edges, AT1G12340 looks like a promising candidate for being split unnecessarily (besides the whole region not being put together properly).
So, the corresponding edges on the graph are:
So we went into Bandage to have a look at the region and... a loop! (see figure 1) It makes sense that it has not been resolved (with the parameters and algorithms we used, anyway), but it is also clear that the region doesn't show anything new vs. the reference genes.