DSI Proposal

Big Picture Questions, Motivation, Relevance

Many factors shape the distribution and abundance of microbial taxa in an environment. Shifts in community composition range from large-scale events, such as the extinction and recolonization of entire groups of taxa, to comparatively minor variations in relative abundance. A robust microbial community can maintain equilibrium despite environmental fluctuations. Interestingly, large-scale changes in microbial community profiles across experimental conditions are often driven by a relatively small set of species. Identifying these key responders provides a way to understand how microbial communities maintain equilibrium, and potential leverage for restoring balance to disturbed communities. Moreover, elucidating a mechanism for community robustness could be of great utility in many fields, including organism health, ecosystem balance, and agriculture. Because host-microbe datasets share many features regardless of host species, a general set of tools for characterizing host-microbe community structure is crucial for understanding community dynamics. Here we propose compiling a set of analysis tools and validating them on several pre-existing host-microbe datasets, as well as on a benchmark dataset whose host-microbe interactions share features across different systems.



While many analysis tools are available for identifying the key sets of taxa that correspond most strongly with changes in experimental condition, we believe a principled comparison and validation of these tools is still lacking. A good set of tools must work efficiently on large-scale data that do not meet parametric assumptions; for instance, microbial connectivity patterns are high-dimensional and often not normally distributed. To discard spurious results, we need strong statistical models that identify the real dynamics of the community, such as food webs and phylogenetic signals. We have repeatedly found that commonly used statistical corrections for high dimensionality are insufficient, so we will need to explore models of different design for our data. We believe this project would particularly benefit from DSI guidance and resources in this regard.
While we have experience with data exploration, identifying descriptive metrics for seemingly disparate data types, and algorithm implementation, we do not have a comprehensive understanding of how to use statistics to validate high-dimensional data. We would greatly appreciate collaboration with the DSI to guide our comparison of analysis methods, so that our findings are credible and useful to anyone working in the field of microbial community analysis.

Format and availability of data

Our pipeline will begin with DNA that has been extracted from the microbiomes of host organisms, sequenced for the 16S rRNA gene, and grouped into taxa. We currently have access to plant and bird host samples with lists of microbial taxa and their corresponding read counts. These raw datasets are small by themselves (on the order of 200 samples x 10,000 microbial species); however, the combinatorics of taxa co-association results in high dimensionality.
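
To give a rough sense of that scale, the number of possible taxon pairs and triples can be counted directly. This is only a minimal sketch: the figure of 10,000 taxa is the order of magnitude quoted above, while the helper name and the choice to count unordered pairs and triples are illustrative assumptions rather than part of the planned pipeline.

```python
import math


def count_coassociations(n_taxa, group_size):
    """Number of unordered groups of `group_size` taxa (illustrative helper)."""
    return math.comb(n_taxa, group_size)


n_taxa = 10_000  # order of magnitude quoted for the raw datasets
pairs = count_coassociations(n_taxa, 2)
triples = count_coassociations(n_taxa, 3)

print(f"pairwise co-associations:  {pairs:,}")    # 49,995,000
print(f"three-way co-associations: {triples:,}")  # 166,616,670,000
```

With about 200 samples, candidate pairs alone outnumber observations by roughly five orders of magnitude.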


We propose to identify the most promising algorithms in terms of time and space complexity, and implement them efficiently so they will be scalable. We will test our software and validate our models using pre-existing plant microbiome and bird microbiome datasets, as well as simulated data where the structure is already known.
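
One possible way to build simulated data with known structure is to plant a group of taxa whose counts rise together under one condition. The sketch below is an assumption-laden illustration: the function name, the uniform background noise, the fixed boost of 100 reads, and the alternating control/treated design are all hypothetical choices, not a description of our benchmark.

```python
import random


def simulate_counts(n_samples=20, n_taxa=8, planted=(0, 1, 2), boost=100, seed=0):
    """Return (table, treated) where `planted` taxa are boosted in treated samples.

    table[i][j] is the read count of taxon j in sample i; treated[i] is True
    for samples in the hypothetical treatment condition.
    """
    rng = random.Random(seed)  # fixed seed so the simulation is reproducible
    table, treated = [], []
    for i in range(n_samples):
        is_treated = i % 2 == 1  # alternate control / treated samples
        row = [rng.randint(0, 20) for _ in range(n_taxa)]  # background noise
        if is_treated:
            for taxon in planted:
                row[taxon] += boost  # planted signal: the known "key responders"
        table.append(row)
        treated.append(is_treated)
    return table, treated


table, treated = simulate_counts()
```

Because the planted taxa are known in advance, any method under comparison can be scored on whether it recovers them.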

Pre-processing pipeline

To compare model outputs accurately across bird, plant, and simulated datasets, we first need to implement quality control in pre-processing the raw data. DNA sequenced from environmental extractions is subject to poor coverage or poor sequencing depth, resulting in read counts that may be assigned unevenly or spuriously across microbial taxa. An important quality-control step will be to identify samples with poor coverage or sampling depth and exclude them from the main analysis. Further, our pipeline will carry the metadata associated with each sample, so that microbial population analyses can easily be correlated with environmental traits. The resulting cleaned data will be the input for our models. We will make the pre-processing pipeline publicly available, with tutorials included.
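
The depth filter described above might look something like the following. This is a minimal sketch under stated assumptions: the nested-dictionary layout, the function name, and the threshold of 1,000 reads are hypothetical, and a real cutoff would have to be chosen per dataset.

```python
def filter_low_depth(counts, min_depth=1000):
    """Split samples into those kept and those dropped for low sequencing depth.

    counts maps sample name -> {taxon name: read count}.
    min_depth is a hypothetical threshold on total reads per sample.
    """
    kept, dropped = {}, []
    for sample, taxa in counts.items():
        depth = sum(taxa.values())  # total reads assigned to this sample
        if depth >= min_depth:
            kept[sample] = taxa
        else:
            dropped.append(sample)
    return kept, dropped


# Toy read-count table (invented values, for illustration only).
toy_counts = {
    "plant_01": {"taxon_A": 900, "taxon_B": 400},
    "plant_02": {"taxon_A": 12, "taxon_B": 3},
    "bird_01": {"taxon_A": 0, "taxon_B": 2500},
}
kept, dropped = filter_low_depth(toy_counts)
print(sorted(kept), dropped)
```

Returning the dropped sample names alongside the kept data makes it easy to report which samples were excluded and why.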

Algorithm pipeline

We will explore three main approaches in our analysis pipeline: network analysis, matrix factorization, and statistical validation.
Many real-world systems can be naturally modeled as graphs. Microbial communities, given their metabolic interactions, lend themselves naturally to representation as networks. There are many ways to generate a network from our datasets, and different networks can lead to different conclusions. For example, we can build a co-association network in which nodes are microbes and edges represent co-association patterns between them. We can also create sequence similarity networks in which edges indicate genetic relatedness. In the first case, clusters indicate groups of microbes that co-occur, whereas in the second case clusters correspond to microbes that are phylogenetically similar and therefore likely to have similar functionality. We propose to characterize the network topology of multiple datasets using graphlets, to illustrate how the basis of co-association determines the conclusions of network analysis. We will use network analysis as a tool to find patterns in microbial communities and predict the functionality of each family of microbes.
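
As a simple illustration of the co-association case, the sketch below links two taxa when they tend to be present in the same samples and then reads clusters off as connected components. The Jaccard index on presence/absence, the 0.5 threshold, and both function names are assumptions made for this example; the proposal does not fix a co-association measure, and graphlet analysis is not shown.

```python
from itertools import combinations


def cooccurrence_edges(counts, min_jaccard=0.5):
    """Edges between taxa whose presence/absence Jaccard index >= min_jaccard."""
    presence = {}  # taxon -> set of samples in which it was observed
    for sample, taxa in counts.items():
        for taxon, reads in taxa.items():
            if reads > 0:
                presence.setdefault(taxon, set()).add(sample)
    edges = {}
    for a, b in combinations(sorted(presence), 2):
        shared = presence[a] & presence[b]
        union = presence[a] | presence[b]  # never empty: each taxon was seen
        score = len(shared) / len(union)
        if score >= min_jaccard:
            edges[(a, b)] = score
    return edges


def connected_components(nodes, edges):
    """Group nodes into clusters by depth-first search over the edge list."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy data (invented): A and B co-occur, as do C and D.
toy = {
    "s1": {"A": 5, "B": 3, "C": 0},
    "s2": {"A": 2, "B": 1, "C": 0},
    "s3": {"A": 0, "B": 0, "C": 7, "D": 4},
    "s4": {"A": 1, "C": 1, "D": 2},
}
edges = cooccurrence_edges(toy)
clusters = connected_components(["A", "B", "C", "D"], edges)
print(clusters)
```

A sequence similarity network could reuse `connected_components` unchanged, with edges computed from genetic distance instead of co-occurrence.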
Matrix factorization methods are also commonly used to identify features of a dataset. In the case of microbial communities, a feature would be a set of microbial taxa that is particularly characteristic of a sample type and/or experimental condition. We propose to use matrix factorizations, including PCA, NMF, and SVD, for dimension reduction and feature extraction. We would like to identify the cases in which feature extraction by matrix factorization agrees with clusters identified by network analysis.
Our ability to outline a method for statistical validation in this proposal is limited by our current lack of statistical expertise. We would seek advice from the DSI in completing our validation design, because this step is crucial for drawing meaningful conclusions.
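
To make the idea of a feature concrete, the sketch below extracts the first principal component of a small sample-by-taxon table; taxa with large loadings on that component vary together across samples. It uses plain power iteration rather than a library PCA or SVD routine so that it stays dependency-free, and the function names, iteration count, and toy data are assumptions for illustration only.

```python
import math
import random


def center_columns(matrix):
    """Subtract each column (taxon) mean so components describe variation."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def first_principal_component(matrix, iterations=200, seed=0):
    """Leading PCA loading vector via power iteration on X^T X (X centered)."""
    centered = center_columns(matrix)
    n_taxa = len(centered[0])
    rng = random.Random(seed)
    vector = [rng.random() for _ in range(n_taxa)]  # random positive start
    for _ in range(iterations):
        # Apply X^T (X v) without forming the taxon-by-taxon matrix explicitly.
        scores = [sum(x * v for x, v in zip(row, vector)) for row in centered]
        updated = [
            sum(row[j] * s for row, s in zip(centered, scores))
            for j in range(n_taxa)
        ]
        norm = math.sqrt(sum(x * x for x in updated))
        if norm == 0.0:  # no variation in the data; nothing to extract
            break
        vector = [x / norm for x in updated]
    return vector


# Toy table (invented): rows are samples, columns are three taxa.
# Taxa 0 and 1 rise together; taxon 2 is constant.
samples = [
    [10, 20, 5],
    [20, 40, 5],
    [30, 60, 5],
    [40, 80, 5],
]
loadings = first_principal_component(samples)
print([round(value, 3) for value in loadings])
```

Here the first two taxa carry all of the loading and the constant taxon carries none, which is the kind of taxon set that could then be compared against clusters from a co-association network.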

Publishing results and implementations

All data, methods, results, and implementations will be available online for general use, together with a manual on how to use them. Jupyter notebooks are commonly used as an interactive tool for programming and data visualization, and are a good candidate format for our purposes. We will publish our findings as a comparative methods paper, preferably in an open-access journal. We anticipate this will be of great use to both biologists and computer scientists because it highlights both the implementation and the application of many disparate methods. Currently, these methods have not been compared against each other, so it is not known whether their conclusions agree.


We expect this project to take one quarter. Briefly, the first 2 weeks will be budgeted for data wrangling and assembling preprocessing scripts into Jupyter notebooks. The following 2 weeks will be spent on algorithm implementation and generating first rounds of visualization for comparison between datasets. The second month will be devoted to statistical validation and verification methods, tool optimization, and code efficiency. In the third month, we will write up the results for publication and post our code resources and data online.

Names of collaborators (+ resumes)