Brassica genome comparisons using MinHash/MASH

MASH (Ondov et al. 2016) is an algorithm for estimating genome and metagenome distances from assembled genomes or raw reads. MASH reduces each input sequence to a small, representative sketch using MinHash, and a numerical distance between two sequences is then calculated by comparing their sketches.
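To make the sketch-and-compare idea concrete, here is a minimal Python sketch of a bottom-s MinHash and the Mash distance D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index. It is an illustration only, not the MASH implementation: the function names are hypothetical, the hash is BLAKE2b rather than the hash MASH uses, and only forward-strand k-mers are considered.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, a simplification)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sketch(seq, k=21, size=1000):
    """Bottom-`size` sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=1000):
    """Estimate the Jaccard index from two bottom sketches."""
    # Take the `size` smallest hashes of the merged sketches ...
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    # ... and count how many of them occur in both sketches.
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=21):
    """Mash distance from a Jaccard estimate; 1.0 when no hashes are shared."""
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

Identical sequences give a Jaccard estimate of 1 and a distance of 0, and sequences sharing no k-mers give a distance of 1. The values of `k` and `size` here are placeholders, not recommendations for Brassica data.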

The purpose of this project is to use MASH to find regions under selection in Brassica genomes. Syntenic regions in different cultivars should show different distances depending on whether they have been under positive or negative selection. Figure 1 shows the UPGMA tree for Tapidor and Darmor. The distance between Tapidor C02 and Darmor C02 is the highest, which agrees with the Tapidor paper: C02 carries the largest number of genes present in Darmor but absent in Tapidor (23 of the 73 genes lost across all pseudomolecules).

  • Break all available Brassica genomes into 10 kb or 50 kb pieces
  • Compare all pieces against each other using MASH; this should not take longer than a day
  • Compare distances: do we see bins containing genes with a higher or lower distance than expected? What are those genes? Do the predicted transcripts/proteins mirror those distances?
  • Can we align genomes and plot the MASH distance?
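The first and third steps of the plan above could be sketched as follows. This is a minimal Python illustration under stated assumptions: the helper names are hypothetical, windows are non-overlapping, and "higher/lower than expected" is interpreted as a simple z-score cutoff, which is only one of several possible definitions. The per-bin distances are assumed to have been computed separately (for example with MASH).

```python
import statistics


def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}; name is the first header word."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def windows(name, seq, size=10_000):
    """Yield (bin_id, subsequence) for consecutive non-overlapping windows."""
    for start in range(0, len(seq), size):
        end = min(start + size, len(seq))
        yield f"{name}:{start}-{end}", seq[start:end]


def flag_outliers(distances, z_cutoff=2.0):
    """Return bin ids whose distance lies more than z_cutoff SDs from the mean.

    `distances` maps bin id -> distance between that bin and its syntenic
    counterpart in another cultivar. The cutoff is an arbitrary assumption.
    """
    values = list(distances.values())
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [b for b, d in distances.items() if abs(d - mean) / sd > z_cutoff]
```

Bins returned by `flag_outliers` would then be intersected with the gene annotation to answer the "what are those genes?" question. The last window of each pseudomolecule is shorter than `size`, which may need special handling because very short pieces give noisier distance estimates.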

Figure 1: UPGMA tree based on MASH distances between Darmor and Tapidor
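For readers unfamiliar with UPGMA, the following minimal Python sketch shows how a tree topology can be derived from a pairwise distance matrix by repeatedly merging the two closest clusters and averaging their distances to all others. It is not the code used to produce the figure; the function name and the nested-tuple output are assumptions made for illustration, and branch lengths are omitted.

```python
from itertools import combinations


def upgma(names, dist):
    """Return a nested-tuple UPGMA topology.

    `names` is a list of unique labels; `dist` maps (a, b) pairs of labels
    to distances and must contain every unordered pair once.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean of the two old distances.
        for c in sizes:
            if c == a or c == b:
                continue
            d[frozenset((merged, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))
```

With pseudomolecule names as labels and MASH distances as values, the returned nesting gives the order in which sequences cluster; a plotting or phylogenetics library would be needed to draw the tree with branch lengths.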


  1. BD Ondov, TJ Treangen, P Melsted, AB Mallonee, NH Bergman, S Koren, AM Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016).