MASH (Ondov et al. 2016) is an algorithm for estimating the distance between genomes or metagenomes, supplied either as assembled sequences or as reads. MASH reduces each input sequence to a small, representative sketch, and a fingerprinting function then calculates a numerical distance between two sketches.
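To make the sketch-and-distance idea concrete, here is a minimal Python illustration of a bottom-s MinHash sketch and the distance derived from it. It is a sketch under stated assumptions, not MASH's own code: the function names are hypothetical, `hashlib.blake2b` stands in for the hash function MASH uses, and `k=21`, `s=1000` are taken here as illustrative defaults.

```python
import hashlib
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def sketch(seq, k=21, s=1000):
    """Bottom-s MinHash sketch: the s smallest distinct canonical k-mer hashes."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
            # Assumption: blake2b is a stand-in hash chosen for illustration.
            digest = hashlib.blake2b(canonical(kmer).encode(), digest_size=8).digest()
            hashes.add(int.from_bytes(digest, "big"))
    return sorted(hashes)[:s]


def sketch_distance(sketch_a, sketch_b, k=21, s=1000):
    """Estimate Jaccard similarity j from two sketches, then convert it to a
    distance with D = -1/k * ln(2j / (1 + j))."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]  # the s smallest hashes of the union
    if not merged:
        return 1.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    j = shared / len(merged)
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

Two identical sequences give a distance of 0, and two sequences with no shared k-mers give 1. Because k-mers are made canonical before hashing, a sequence and its reverse complement produce the same sketch.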
The purpose of this project is to use MASH to find regions under selection in Brassica genomes. Syntenic regions in different cultivars should show different distances depending on whether they are under positive or negative selection. Figure 1 shows the UPGMA tree for Tapidor and Darmor. The distance between Tapidor C02 and Darmor C02 is the highest, which agrees with the Tapidor paper: C02 carries the largest number of genes present in Darmor but not present in Tapidor (23 genes lost out of 73 on all pseudomolecules).
Break all available Brassica genomes into 10kb or 50kb pieces
Compare all vs. all using MASH; this should not take longer than a day
Compare distances - do we see bins containing genes that have a higher/lower distance than expected? What are those genes? Do the predicted transcripts/proteins mirror those distances?
Can we align genomes and plot the MASH distance?
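
The windowing and outlier steps above could be prototyped along the following lines. This is only a sketch under assumptions: the function names are hypothetical, the windows are non-overlapping, and "higher/lower than expected" is interpreted here as lying more than a chosen number of standard deviations from the mean distance over all bins, which is one of several reasonable definitions.

```python
from statistics import mean, stdev


def windows(seq, size=10_000):
    """Yield (start, end, piece) for consecutive non-overlapping windows.

    The final window may be shorter than `size`.
    """
    for start in range(0, len(seq), size):
        end = min(start + size, len(seq))
        yield start, end, seq[start:end]


def flag_outlier_bins(distances, z_cutoff=2.0):
    """Return indices of bins whose distance lies more than `z_cutoff`
    standard deviations above or below the mean of all bins.

    Assumption: the expected distance is taken to be the mean over all bins.
    """
    if len(distances) < 2:
        return []
    mu = mean(distances)
    sd = stdev(distances)
    if sd == 0:
        return []
    return [i for i, d in enumerate(distances) if abs(d - mu) / sd > z_cutoff]
```

For example, `list(windows(chromosome, 50_000))` would give the 50 kb pieces of a sequence held in a hypothetical `chromosome` string, and `flag_outlier_bins` applied to the per-bin distances between two cultivars would return the bins worth checking for genes. A short final window contains fewer k-mers than the others, so its distance may need to be treated with caution.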