Yersinia is a diverse genus of Gram-negative Enterobacteriaceae, three species of which are well-defined human pathogens: Y. pestis, Y. pseudotuberculosis and Y. enterocolitica. Identifying a given Yersinia species and gauging its pathogenicity is important for public health monitoring. Here we used complementary bioinformatic tools to determine the species of an unknown Yersinia sample from whole-genome sequencing data, followed by genome annotation to search for known pathogenicity-associated genomic features in Yersinia. Phylogenetic analysis confirmed that the query sequence belonged to the species Y. enterocolitica. The absence of sequence from the pYV plasmid and the 'high-pathogenicity island' is indicative of a non-pathogenic biovar 1A strain. We also identified genes in the query strain that were not present in pathogenic Y. enterocolitica biovar 1B strains. This may indicate other functional differences that influence pathogenicity.


Yersinia is a genus of Gram-negative Enterobacteriaceae. Of the characterised species of Yersinia, three have been particularly well studied because they are pathogenic to humans: Y. pestis, Y. pseudotuberculosis and Y. enterocolitica. Y. pestis is highly pathogenic, causing a systemic disease ('plague') that affects multiple organ systems, including the lungs, lymph nodes and blood vessels. Conversely, Y. pseudotuberculosis and Y. enterocolitica are enteropathogens that primarily affect the gastrointestinal system, where they can cause local inflammation, diarrhoea and fever. Furthermore, while Y. pestis is transmitted through flea bites, Y. enterocolitica and Y. pseudotuberculosis infections primarily result from consuming contaminated food or water. Other Yersinia species are not thought to be pathogenic to humans.

Y. enterocolitica strains are particularly diverse, spanning a spectrum of non-pathogenic (1A), mildly pathogenic (2-5) and pathogenic (1B) biovars, which can be further differentiated by serotype. Interestingly, while biovar 1B is primarily found in North America, the mildly pathogenic biovars 2-5 are more common in Japan and Europe (Schubert 2004). Identifying the genomic features that determine the virulence of Yersinia is of major interest. Perhaps the best established is the ~70 kb pYV plasmid, which is common to all pathogenic Yersinia, including pathogenic members of Y. enterocolitica. Similarly, the yersiniabactin gene cluster, located in the 'high-pathogenicity island', is not found in non-pathogenic Yersinia (Schubert 2004).

Given the diversity within the Yersinia genus, the ability to quickly identify a Yersinia species from a sample is important for public health. This has been aided by the development of (i) high-throughput sequencing of whole bacterial genomes and (ii) curated databases of genomic features that confer pathogenicity to humans, such as YersiniaBase. In this study we apply a range of bioinformatic tools to whole-genome sequence data from an unknown Yersinia sample, in an attempt to identify the species and to gauge its pathogenicity to humans.

Materials and Methods

A total of 5,260,610 paired-end 76 bp Illumina MiSeq reads from an unknown Yersinia genome were provided to us in FASTQ format. Read quality was assessed using a combination of FastQC and the FASTX-Toolkit. All quality metrics indicated that the data were of high quality (Figure 1). Residual Illumina adapter sequence was detected and removed using fastx_clipper.
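To illustrate the kind of per-read quality summary that tools such as FastQC report, the following is a minimal Python sketch, not the code used in this study. It assumes Phred+33 quality encoding and simple four-line FASTQ records; the function names `phred_scores` and `mean_read_qualities` are hypothetical and introduced here only for illustration.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (an assumption, not stated in the text).
    """
    return [ord(ch) - offset for ch in quality_line]


def mean_read_qualities(fastq_lines):
    """Yield (read_id, mean Phred score) for each four-line FASTQ record."""
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, _seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # The read identifier is the first token after the leading '@'.
        yield header[1:].split()[0], mean(phred_scores(qual))


# Toy records for illustration only; these are not reads from the study.
example_records = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "!!II",
]

for read_id, quality in mean_read_qualities(example_records):
    print(read_id, quality)
```

For a real file, the same generator could be fed an open file handle, for example `mean_read_qualities(open("reads.fastq"))`, where the file name is a placeholder.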

Interleaved forward and reverse reads were assembled using Velvet (v1.2.09) (Zerbino 2008). The VelvetOptimiser script was used to select the optimal k-mer length (k = 53) and coverage cutoff (cov_cutoff = 1.96), optimising for N50. Forty contigs larger than 1 kbp were assembled, with a mean length of 118,677 bp (N50 = 276,703 bp). The total length of all contigs was 4,747,089 bp and the largest contig was 563,205 bp. This is broadly consistent with known Yersinia genome sizes.
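The assembly statistics quoted above can be computed from a list of contig lengths. The sketch below is illustrative only and is not the VelvetOptimiser implementation; the function names `n50` and `assembly_summary` and the toy lengths are assumptions. N50 is taken here as the length of the contig at which the running total, over contigs sorted from longest to shortest, first reaches half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


def assembly_summary(lengths, min_length=1000):
    """Summarise contigs strictly longer than min_length (in bp)."""
    kept = [n for n in lengths if n > min_length]
    if not kept:
        raise ValueError("no contigs longer than min_length")
    return {
        "contigs": len(kept),
        "total_bp": sum(kept),
        "mean_bp": sum(kept) / len(kept),
        "largest_bp": max(kept),
        "n50_bp": n50(kept),
    }


# Toy contig lengths for illustration only; not the contigs from this study.
toy_lengths = [800, 1500, 2000, 2500, 4000]
print(assembly_summary(toy_lengths))
```

As a consistency check on the reported figures, a total of 4,747,089 bp across forty contigs gives a mean of about 118,677 bp, which matches the stated mean contig length.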

To perform a phylogenetic analysis of Yersinia species, 16S ribosomal RNA (rRNA) gene sequences were downloaded from GenBank. In total, 34 RefSeq sequences from 17 different Yersinia species were analysed (Table 1). To supplement the analysis with Y. enterocolitica strains from the full spectrum of biovars (1A, 1B and 2-5), contigs from fifteen Y. enterocolitica samples reported by Reuter et al. were downloaded from the European Nucleotide Archive (Reuter 2014) (Table 2). Where not already available, 16S rRNA nucleotide sequences were extracted from assembled contigs using the RNAmmer server (v1.2) (Lagesen 2007). 16S FASTA sequences were aligned using Clustal Omega (Sievers 2011, Goujon 2010, McWilliam 2013), and the alignment files were used to construct phylogenetic trees in SeaView using a parsimony model with 100 bootstrap replicates (Gouy 2010).
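The bootstrap step can be pictured as resampling alignment columns with replacement to produce pseudo-replicate alignments, each of which would then be passed to the tree-building method. The following is a minimal sketch of that resampling only, under the assumption that the alignment is held as a dictionary of equal-length strings; it is not SeaView's implementation, it does not build trees, and the names `bootstrap_alignment` and `bootstrap_replicates` are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate by sampling columns with replacement.

    alignment maps sequence names to aligned sequences of equal length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # The same sampled column indices are applied to every sequence,
    # so each resampled column is an intact column of the original alignment.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {
        name: "".join(seq[c] for c in cols)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Generate n_replicates pseudo-replicate alignments reproducibly."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment for illustration only; not 16S sequences from this study.
toy_alignment = {
    "query": "ACGTACGT",
    "ref_a": "ACGTTCGT",
    "ref_b": "ACCTACGA",
}
replicates = bootstrap_replicates(toy_alignment, n_replicates=100, seed=1)
print(len(replicates), replicates[0])
```

Bootstrap support for a clade is then the fraction of replicate trees in which that clade appears, which is why the number of replicates (100 here) sets the resolution of the support values.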

Contigs were scaffolded against a reference Y. enterocolitica genome (GenBank: AM286415.1) using the CONTIGuator web application (Galardini 2011). Scaffold and contig assemblies were annotated with gene features using two independent tools: Prokka (Seemann 2014) and the RAST server (Aziz 2008, Overbeek 2014). Annotations in GenBank format were loaded into Artemis for visualisation (Rutherford 2000).

PathogenFinder v1.1 and ResFinder v2.1 were used to identify and rank pathogenicity-associated genes and antibiotic resistance genes, respectively (Cosentino 2013, Zankari 2012). Bacterial insertion sequences were identified using the ISfinder website (Siguier 2006).