ROUGH DRAFT authorea.com/149116
Main Data History
Export
Show Index Toggle 0 comments
  •  Quick Edit
  • Aligning genes to assembly graphs

    Rationale

    In theory, assembly graphs are much more complete than the contigs we end up writing to a FASTA file. So to check this we decided to do a small test with an A. thaliana w2rap-contigger assembly.

    The idea is that if you look for both genes and or contiguous chunks of the genome in the assembly graph, as long as the appropriate connections exist, you can find the complete genes.

    If we find this useful, we can probably include a little tool to do the alignments on the graph output of the w2rap-contigger and give you back a subset of the graph, and even stitch the paths together into an ad-hoc reference sequence chunk.

    Looking for interesting (i.e. broken on contigs, path in graph) genes manually

    We aligned genes to the a.lines.fasta output, and we aligned them to the edges of the large_K.final_raw.fasta which are the sequence components of the large_K.final_raw.gfa. By examining bandage outputs we have found a couple of genes that look promising.

    Example 1: a loop

    Gene contig
    AT1G12310.1 flattened_line_162
    AT1G12320.1 flattened_line_162
    AT1G12330.1 flattened_line_162
    AT1G12340.1 flattened_line_1247
    AT1G12340.1 flattened_line_392
    AT1G12350.1 flattened_line_1247
    AT1G12360.1 flattened_line_1247
    AT1G12370.2 flattened_line_1247
    AT1G12380.1 flattened_line_1247
    AT1G12390.1 flattened_line_1247
    AT1G12390.1 flattened_line_392
    AT1G12400.1 flattened_line_392
    AT1G12410.1 flattened_line_392

    If we look at the mapping of these genes to edges, so this looks like AT1G12340 is a promising candidate for being split unnecessarily (besides the whole region being not put together properly).

    So, the corresponding edges on the graph are:

    Gene edge
    AT1G12340.1 edge187513
    AT1G12340.1 edge868556
    AT1G12340.1 edge868557

    So we decided to go into bandage and have a look at the region and... a loop! (see figure 1) It makes sense that it has not been solved (with the parameters and algorithms we used, anyway), but it is also obvious how the region actually doesn't show nothing new vs. the reference genes.

    Example 2: a complex region.

    Example 3: a collapsed repeat region

    Creating some whole-genome stats for this assembly graph

    Interrupted genes in contig ends.

    Method:

    • Run an alignment of genes to contigs and find which genes map to the very end of them, and "hang over".
    • Produce a list of those genes, and also a list of genes that map "prefectly" within a contig.
    • So we have 3 sets: perfects, hanging, others.

    Interrupted genes in contig ends perfectly represented in the graph.

    • Take a gene from the "broken" sets:
      • Find all possible locations for each kmer of the gene.
      • Find all possible paths with orderder sets of kmers from the gene.
      • Construct sequences for all possible paths.
      • Align gene to each one of the constructed sequences, and if a perfect alignmenet is found, declare that path "winner", and mark the gene as solved.
    A not-so-small loop, every edge ends up being an independent contig, so you loose genome contiguity. Heuristics to unroll this loop may be too risky to use by default, but you can easily recover the contiguity in your analysis by just using the graph.