Materials and Methods
We conducted a non-exhaustive survey via Web of Science (April 2018) to understand what methodologies were being used for core membership assignment. We limited our survey to papers containing the terms ‘core’ and ‘microbiome’ within the title or abstract, resulting in 1034 papers. We selected papers from 2008-2018, then ordered the search results by the number of citations and considered the 200 most cited papers. Of the 200 papers, 45 publications sufficiently detailed their methods for assigning core membership. These remaining papers were then subdivided into five categories according to the varying methodologies and criteria that were used to identify core taxa (Table 1). While not an exhaustive review, the survey provided an adequate representation and summary of the core assignment methodologies used in contemporary analyses. The categories are as follows: cumulative proportion of sequence reads, proportion of replicates, proportion of sequence reads and replicates, hard cutoff, and the Venn diagram method. In practice, DNA sequence reads (hereafter referred to as “reads”) within a given level of sequence similarity (e.g. 97%, 99%, or 100%) are counted as a measure of taxon abundance. The counts of all taxa found in the taxonomic table, as used here to describe the composition of microbiomes, are analogous to counts of plants, animals, or any other counts of organisms in community ecology. In microbial and other organismal ecology, the total sampling effort is finite and greater certainty of taxon relative abundance is achieved with increasing sampling effort (increasing read depth in microbiome research), with diminishing returns (Forcino, Leighton, Twerdy, & Cahill, 2015; Zaheer et al., 2018).
Here we use four out of the five methods described in our literature review on both simulated and real published datasets: cumulative proportion of sequence reads, proportion of replicates, proportion of sequence reads and replicates, and hard cutoffs (Table 1). We excluded the Venn Diagram method from our analysis because it can be considered as a more stringent version of the proportion of replicates method, in that a taxon must be present in every sample to be included in the core.
The first method, cumulative proportion of sequence reads, ranks taxa based on relative abundance and includes the most abundant taxa in the core. Briefly, this is calculated by ranking all taxa by their relative abundance, from highest to lowest. Starting with the most abundant, the percentage of total reads is summed cumulatively until 75% of the total reads are accounted for. Taxa that account for a portion of the first 75% of total reads are assigned core membership. This method was adopted from vegetation counts in classic ecology (Hanski, 1982) and accounts for the relative abundance of individual taxa and the total sampling effort. The second method, proportion of replicates, assigns a taxon to the core when that taxon is present in at least 50% of samples within a given treatment (or observational category; our study included a single treatment). This method assigns taxa to the core that are found in the majority of samples and accounts for sample size alone. The third method, proportion of sequence reads and replicates, assigns a taxon to the core if it is present in a pre-determined proportion of the total number of samples from a given treatment (or observational category) as well a predetermined proportion of the total reads (Delgado-Baquerizo et al., 2018). This third method accounts for both sample size and sampling effort simultaneously. Though various proportions of samples and reads were used in publications, for our analysis, we set these thresholds to include taxa present in at least 50% of samples within a given treatment (we only included a single treatment) and account for at least 0.02% of the total reads across all samples (adapted from Callahan et al. (2016)). For the fourth method, hard cutoffs, we implemented a cutoff similar to Lundberg et al. (2012), such that for a taxon to be considered part of the core microbiome, it must be present in at least some number of samples and with at least some number of reads. This method differs from the proportion of sequence reads and replicates method in that the hard-cutoff method a priori assigns an absolute value for the cutoff that is not based on the size of the experiment or sequencing depth achieved. In our analysis, we used the following cutoffs: the taxon must be present in five samples with at least 25 total reads across all samples (Lundberg et al., 2012).
To test each of the four core methods, we used two published and 6,250 simulated datasets. For the two published datasets (Table 2), we used 1) the rhizosphere and site M21 subset of the final rarified operational taxon table from the Arabidopsis thaliana root microbiome project ((Lundberg et al., 2012); dataset name “arabidopsis_R_M21.rda”) and 2) the fecal sample subset from The Human Microbiome Project ((Consortium, 2012); dataset name “human_stool.rda”). The two published datasets were plotted by the log-transformed taxon mean abundance and the coefficient of variance (CV) to examine the grouping of taxa assigned to the core. This was done to assess whether a clear threshold in abundance and coefficient of variance existed between core and non-core taxa for any of the examined core assignment methods, as an obvious threshold may provide support for that core assignment methodology. Additionally, we used 250 simulations for each of 25 possible combinations of 1) five levels of magnitude of difference in abundance (π) of core versus non-core taxa (represented as the πcorenon-core , ranging from 1 to 25), and 2) five levels of variance of the abundances πcoreto πnon-core among replicates (quantified by an intensity parameter θ, ranging from 1 to 50). This resulted in a total of 6,250 unique simulations to assess each assignment method. Each simulation of taxon relative abundances involved random draws from a Dirichlet distribution parameterized by the expected frequencies of all taxa (Σπi = 1, with 25 taxa parameterized by πcore and 975 by πnon-core) and a single intensity parameter (θ) that affects the precision of taxon abundances (i.e. scales the variance around expected taxon abundance defined by πcore and πnon-core), using R v3.4.2 (R Development Core Team, 2020). Across sets of simulations, we varied the relative abundance of core and non-core taxa (πcorenon-core), with πcorenon-core = 1 corresponding to a community that lacks a true core, because all taxa have equal expected abundances. This ratio acts as a control, in that core taxa should not be recovered from this dataset. On the other end, a πcore/πnon-core = 25 simulated a dataset in which core taxa had an abundance 25 times greater than non-core taxa. Further, we utilized the intensity parameter (θ) to set the precision of taxa abundances across replicates for a given set of expected frequencies (π), with θ of 50 corresponding to high precision and low variance in taxon relative abundances among replicates and a θ of 1 leading to low precision and a large variance in taxon relative abundances. All simulations with πcorenon-core > 1 were of 25 taxa as core taxa and the remainder 975 as non-core. The 25 core taxa were simulated to receive the expected abundances and precision of core taxa, while the non-core received the abundances and precision expected for non-core members. Consequently, up to 25 taxa could be detected as true core taxa (true positives), and 975 taxa as false core members (false positives) or true non-core members (true negatives). Additionally, simulations of πcorenon-core = 1 were useful in quantifying the false positive rate, as these communities did not include any core taxa. A simulation’s random draw from the Dirichlet distribution yielded a vector of sample proportions for each of 1000 taxa (P(p1, p2, …, p1000| πcore,πnon-core, θ), to which the four criteria for core membership were applied directly.
The ability of each method to accurately recover the known core was assessed using simulated taxon tables by the following metrics: true positive rate (signal), false positive rate (noise), and net assignment value (signal-noise). The true positive rate represents the proportion of known core taxa that were classified as such, regardless of the number of false negatives. This is expressed as the probability of a true core taxon being assigned as such. The false positive rate represents the proportion of non-core taxa classified as core and is represented as the probability of a non-core taxon being assigned as a member of the core. The net assignment value represents the differences between the absolute number of true positives and the number of false positives. A net assignment value of 25 represents perfect classification. A net assignment value of -975 would indicate all non-core taxa were ill-assigned to the core with no true positive classifications. The more negative the number, the more highly inflated the core is. This metric can be interpreted as the difference in signal and noise.
Next, to examine to what extent core assignments produce the same differences in community diversity (beta-diversity) as the entire dataset, we created dissimilarity matrices with two different distance metrics (Bray-Curtis and Jaccard) using each set of core assignments and the full dataset as independent inputs. We then used PermANOVA testing (adonis) in the vegan package (Oksanen et al., 2018) to examine the variance explained by categorical predictors in each of the core and full datasets. We examined three categorical predictors for the human microbiome project dataset: patient visit number (1st, 2nd, or 3rd), sex of subject (male vs. female), and where the sequencing was performed (12 different sequencing centers). In the Arabidopsis dataset we were able to examine two categorical predictors: developmental stage (young vs. old) and genotype (nine different genotypes). Results from the core subsets and total dataset were compared and differing statistical significance for categorical variables was noted.
In addition to examining differences in beta-diversity between the full dataset and core subsets, we also used cooccurrence networks to identify significant taxa. Cooccurrence networks are used in microbial ecology to determine microbial taxa that occur together in a statistically significant manner, with nodes and edges representing significant microbes and the connections between them respectively. Network analyses can be used to mine for keystone or hub taxa and can be used to identify interactions between groups of microbial taxa (Banerjee et al., 2016, 2018; Shi et al., 2020). The Arabidopsis rhizosphere microbiome graph consisted of 14,890 agents (taxa) and 288 artifacts (samples), and the human microbiome project graph consisted of 11,752 agents (taxa) and 319 artifacts (samples). From each bipartite graph we obtained the weighted bipartite projection, then extracted its signed backbone using the backbone package (Domagalski, Neal, & Sagan, 2019). Edges were retained in the backbone if their weights were statistically significant (alpha = 1e-04, Bonferroni corrected) by comparison to a null Hypergeometric Model (Neal, 2013). The corresponding nodes IDs (taxa IDs) with significant edges were then extracted and compared to the core assignments core assignment methods. Assuming core taxa drive abundance and occurrence patterns, and that rare taxa do not largely contribute to community structure, one could expect to find core taxa serving as important nodes with many significant edges (higher degree centrality) and the opposite to be true for non-core assignments.
To facilitate the use of our analytical methods by researchers curious about the validity of core microbiome assignment, we wrote an R package, CoreMicro, that can be installed through github (github.com/mayagans/coremicro). The package includes functions that accept a taxon table as an input and can be used to generate plots and tables of core inclusion by method. This functionalized approach facilitates the comparison of methods and provides a means to check for the existence of the hypothesized core:non-core divide. In addition, the package includes all data, including the full used within this study as well as the code for all simulations. Full OTU tables and metadata files of theArabidopsis and Human Microbiome Project datasets can be found at 10.5281/zenodo.4909346.