4.2 Instances of CBM coupled with ML for fermentation analysis and optimization
Conventionally, CBM takes genetic and environmental conditions as inputs and predicts metabolic flux distributions. Sridhara et al. instead investigated the inverse problem: whether bacterial growth conditions can be inferred from intracellular fluxomics. They carried out the prediction with a simple linear regression. The results showed that the carbon and nitrogen sources supplied in the initial culture medium could be predicted from intracellular flux values, even with a small number of impurities [114]. In a recent study, Oyetunde et al. extracted over 1,200 curated bioprocess datasets from ∼100 articles to predict the performance (yield, titer, and rate) of microbial factories. The authors generated additional flux-based features from a CBM model to augment the ML input data, and then applied ensemble methods to mitigate data challenges such as sparse, non-standardized, and incomplete datasets. The resulting ML-CBM model predicted the performance of engineered Escherichia coli with high accuracy [115]. In 2016, Wu et al. developed MFlux, an online platform for predicting bacterial central metabolism. The authors trained ML models (SVM, KNN, and decision tree) on previously reported experimental data, including substrate types, bioprocess strategies, and genetic modifications, collected from about 100 13C-MFA articles. MFlux outputs can be used as inputs for FBA to reduce the solution space, thereby improving the model’s accuracy [116].
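The last idea above, using ML-predicted fluxes to shrink the FBA solution space, can be sketched as a simple tightening of reaction bounds. This is a minimal illustration under stated assumptions, not the MFlux implementation: the function name `tighten_bounds`, the relative tolerance `rel_tol`, and the reaction IDs and flux values are all hypothetical.

```python
def tighten_bounds(default_bounds, predicted_flux, rel_tol=0.2):
    """Intersect default flux bounds with a window around ML-predicted fluxes.

    default_bounds: {reaction_id: (lower, upper)}
    predicted_flux: {reaction_id: predicted value} for a subset of reactions
    rel_tol: assumed relative half-width of the window around each prediction
    """
    tightened = {}
    for rxn, (lb, ub) in default_bounds.items():
        if rxn not in predicted_flux:
            tightened[rxn] = (lb, ub)  # no prediction: leave bounds unchanged
            continue
        v = predicted_flux[rxn]
        margin = rel_tol * abs(v)
        new_lb, new_ub = max(lb, v - margin), min(ub, v + margin)
        if new_lb > new_ub:
            # Window falls outside the original bounds: keep the originals.
            new_lb, new_ub = lb, ub
        tightened[rxn] = (new_lb, new_ub)
    return tightened


# Illustrative reaction IDs and values only (not taken from MFlux).
default_bounds = {
    "PGI": (-1000.0, 1000.0),
    "G6PDH": (0.0, 1000.0),
    "PPC": (0.0, 1000.0),
}
predicted_flux = {"PGI": 60.0, "G6PDH": 35.0}
new_bounds = tighten_bounds(default_bounds, predicted_flux)
```

The tightened bounds would then be passed to an FBA solver in place of the defaults. Reactions without a prediction (here `PPC`) keep their original range, so the feasible space can only shrink or stay the same.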
Most recently, a novel hybrid CBM-ML approach was developed for time-course control of nutrient availability in fed-batch CHO cell culture. Schinn et al. used ML to overcome CBM limitations such as the assumptions of optimal metabolism and steady state. In this study, cell density, product titer, and glucose, lactate, glutamine, and glutamate concentrations were used as constraints for the FBA solution. The metabolic model first calculated initial consumption rates of the proteinogenic amino acids. Next, a series of linear regressions was used to refine these predictions. Finally, the refined consumption rates were fitted to a time-dependent profile using a sigmoid function. The model correctly forecast the concentrations of 13 out of 18 amino acids [117].
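The final step, fitting a sigmoid to a time course, can be illustrated with a small sketch. It assumes a logistic curve with a known plateau and estimates the steepness and midpoint by ordinary least squares on logit-transformed values. This linearisation is a stand-in chosen for simplicity; it is not claimed to be the fitting procedure of Schinn et al., and the data are synthetic.

```python
import math


def sigmoid(t, plateau, k, t0):
    """Logistic curve rising from 0 to `plateau`, with midpoint t0 and steepness k."""
    return plateau / (1.0 + math.exp(-k * (t - t0)))


def fit_sigmoid(times, values, plateau):
    """Estimate k and t0, assuming the plateau is known and 0 < value < plateau.

    Uses the identity log(y / (plateau - y)) = k * t - k * t0, which turns the
    fit into a straight-line least-squares problem.
    """
    z = [math.log(y / (plateau - y)) for y in values]
    n = len(times)
    t_mean = sum(times) / n
    z_mean = sum(z) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxz = sum((t - t_mean) * (zi - z_mean) for t, zi in zip(times, z))
    k = sxz / sxx
    intercept = z_mean - k * t_mean
    return k, -intercept / k


# Synthetic, noise-free time course (culture days) generated from known parameters.
times = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
values = [sigmoid(t, plateau=5.0, k=0.8, t0=6.0) for t in times]
k_hat, t0_hat = fit_sigmoid(times, values, plateau=5.0)
```

On noise-free data the fit recovers the generating parameters. With real measurements, a nonlinear least-squares routine that also estimates the plateau would usually be a better choice, because the logit transform amplifies noise near 0 and near the plateau.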
Essential genes are genes that are critical for cell viability and growth. Gene essentiality is not an intrinsic trait of a gene; rather, it can be influenced by environmental and genetic context [118]. Nandi et al. developed an SVM-based model, named SVM-RFE, to classify Escherichia coli genes as essential or non-essential. The model inputs combined genotypic and phenotypic features, i.e., gene and protein sequences, network topology, and gene expression. The authors then employed flux coupling analysis (FCA) to generate flux-based features that account for gene adaptability under different environmental conditions. SVM-RFE was trained on 4094 reaction-gene combinations with 64 features. The model captured the minimal set of essential genes across various environmental conditions with high accuracy [119]. This study illustrates the importance of selecting and describing appropriate features in an ML study.
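Recursive feature elimination (RFE) with a linear SVM can be sketched as follows: train the classifier, drop the feature with the smallest squared weight, and repeat. The code below is a from-scratch toy, assuming a linear SVM trained by full-batch subgradient descent on the hinge loss. It is not the model of Nandi et al.; the feature names, data values, and hyperparameters are hypothetical, and in practice a library implementation with standardized features would normally be used.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:  # only margin violators contribute
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, names, n_keep):
    """Repeatedly retrain and remove the feature with the smallest squared weight."""
    active = list(range(len(names)))
    eliminated = []
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        eliminated.append(names[active.pop(worst)])
    return [names[j] for j in active], eliminated


# Toy data: +1 = essential, -1 = non-essential. Only the first (hypothetical)
# feature separates the two classes by construction.
names = ["flux_coupling_score", "expression_level", "gc_content"]
X = [
    [1.2, 0.8, 0.3], [1.0, -0.4, -0.6], [1.4, 0.5, 0.7], [1.1, 0.1, -0.2],
    [-1.3, -0.7, 0.4], [-1.0, 0.3, -0.5], [-1.2, -0.6, 0.6], [-1.1, 0.2, -0.3],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]
kept, eliminated = svm_rfe(X, y, names, n_keep=1)
```

Because linear SVM weights depend on feature scale, the resulting ranking is only meaningful when features are on comparable scales, which is one reason feature description and preprocessing matter in this kind of study.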
In the context of multi-omics integration, Zampieri et al. combined CBM and ML to predict the production of lactate, a secondary metabolite, in CHO cell culture. In this study, transcriptomics data from different culture conditions were integrated with fluxomics data from in silico genome-scale modeling to construct a data-driven framework. The results showed improved predictive performance compared with transcriptomic analysis alone [120]. Similarly, Vijayakumar et al. proposed a machine learning pipeline integrated with genome-scale modeling to improve phenotypic prediction in a lipid-producing cyanobacterium. First, they extracted RNA sequencing data from 23 different growth conditions to develop condition-specific GEMs via transcriptomics data integration. Then, FBA was performed to obtain context-specific fluxomic data. A preprocessing stage combined the fluxomic data with the experimental transcriptomic data. PCA, k-means clustering, and LASSO regression were then used to identify the key features of the dataset. As a result, a data-driven multi-view model with high phenotype-prediction accuracy was developed [121]. This strategy has also been adapted to predict the growth rate of the yeast S. cerevisiae: fluxomics generated from parsimonious flux balance analysis (pFBA) were coupled with transcriptomics to train neural networks [122].
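The preprocessing step that merges the two data views can be sketched as an early-fusion feature matrix: each view is z-scored column by column, and the views are concatenated for the growth conditions they share. This is a minimal sketch under stated assumptions; the function names, condition labels, and values are hypothetical and are not taken from the cited pipelines.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    scaled_cols = []
    for col in zip(*rows):
        mu, sd = mean(col), pstdev(col)
        scaled_cols.append([(x - mu) / sd if sd > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*scaled_cols)]


def build_multiview_matrix(transcripts, fluxes):
    """Concatenate scaled transcriptomic and fluxomic features per shared condition."""
    conditions = sorted(set(transcripts) & set(fluxes))
    t_scaled = zscore_columns([transcripts[c] for c in conditions])
    f_scaled = zscore_columns([fluxes[c] for c in conditions])
    return conditions, [t + f for t, f in zip(t_scaled, f_scaled)]


# Hypothetical toy values: two transcripts and three simulated fluxes per condition.
transcripts = {
    "high_light": [5.0, 1.0],
    "low_light": [3.0, 1.5],
    "n_limited": [4.0, 2.0],
}
fluxes = {
    "high_light": [0.9, 0.1, 0.0],
    "low_light": [0.4, 0.1, 0.2],
    "n_limited": [0.2, 0.1, 0.6],
    "dark": [0.0, 0.1, 0.0],  # no matching transcriptomic sample, so it is dropped
}
conditions, fused = build_multiview_matrix(transcripts, fluxes)
```

The fused matrix could then be passed to dimensionality reduction or sparse regression, such as the PCA and LASSO steps mentioned above. Scaling each view separately keeps one data type from dominating only because of its units.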