2.4 Machine learning
The values of GFP intensities were scaled down by five orders of magnitude before being evaluated by machine learning. In all machine learning algorithms except for principal component analysis (PCA), the data from the E1 yeast extract were reserved for independent validation calculations. The remaining data were randomly split into training and test datasets (85:15). PCA, partial least squares (PLS) regression, and random forest (RF) were performed on the Python 3.6 platform using the scikit-learn library.[20] The number of components for the PLS models was set to 6. For RF, the parameters were set as follows: max_depth = 10, max_features = 6, max_leaf_nodes = None, n_estimators = 300, and random_state = 2525 for estimating cell yields; and max_depth = 5, max_features = 169, n_estimators = 50, and random_state = 2525 for estimating GFP yield. These parameters were selected by searching for the optimal values with the grid search function of scikit-learn.
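A minimal sketch of how the PLS and RF models might be set up with scikit-learn, using the split ratio and RF parameters stated above; the synthetic placeholder data (X, y_cell) and the random_state of the split are assumptions for illustration, not values from the study:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data: 100 samples x 205 input variables (assumption).
X = np.random.rand(100, 205)
y_cell = np.random.rand(100)

# Random 85:15 split into training and test sets; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_cell, test_size=0.15, random_state=2525)

# PLS regression with 6 components, as stated above.
pls = PLSRegression(n_components=6)
pls.fit(X_train, y_train)

# Random forest for cell yield with the parameters stated above.
rf = RandomForestRegressor(max_depth=10, max_features=6, max_leaf_nodes=None,
                           n_estimators=300, random_state=2525)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # R^2 on the held-out test set
```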
The NN and DNN models were coded in Python 3.6 using TensorFlow 1.5 and the Keras library (https://keras.io/).[21] In all cases, the input shape was set to 205 parameters. To estimate the final yields, the output shape was a single parameter (cell yield or GFP yield). For time-course estimation, the output shape was set to 5 parameters, corresponding to the sampling times of each cell growth and GFP measurement. The conventional NN was composed of a single fully connected hidden layer of 100 units with hyperbolic tangent (tanh) activation. Kernel weights were initialized with the HeNormal class. The activation of the output layer was set to linear. The Adam algorithm was applied as the optimizer with the default settings of the Keras library. Learning was carried out to minimize the mean squared error (MSE) (eq 1). The number of training epochs was set at 3,000. A model checkpoint function recorded the weights of the model with the minimal MSE.
\begin{equation} \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\bar{y}_{i}\right)^{2} \tag{1} \end{equation}
where \(n\) indicates the number of data points, \(\bar{y}_{i}\) indicates the measured value of the dependent variable, and \(y_{i}\) indicates the value of the dependent variable estimated by the constructed model.
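A minimal Keras sketch of the conventional NN described above; the synthetic placeholder data and the checkpoint file name are assumptions, while the architecture, initializer, output activation, optimizer, loss (eq 1), epoch count, and checkpointing follow the text:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

# Synthetic placeholder data: 205 input variables, one output (assumption).
X = np.random.rand(100, 205)
y = np.random.rand(100, 1)

# Single fully connected hidden layer of 100 tanh units, HeNormal initializer.
model = Sequential([
    Dense(100, activation='tanh', kernel_initializer='he_normal',
          input_shape=(205,)),
    Dense(1, activation='linear', kernel_initializer='he_normal'),
])
model.compile(optimizer='adam', loss='mse')  # Adam with Keras defaults, MSE loss

# Checkpoint callback records the weights of the model with the minimal MSE.
checkpoint = ModelCheckpoint('nn_best.h5', monitor='loss',
                             save_best_only=True, save_weights_only=True)
model.fit(X, y, epochs=3000, callbacks=[checkpoint], verbose=0)
```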
The DNN was constructed with 4 hidden layers (200, 100, 50, and 20 units) with tanh activations. The number of training epochs was set at 10,000. The other DNN parameters corresponded to those of the NN.
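Continuing the previous sketch (same imports, placeholder data, and assumed checkpoint file name), the DNN differs only in its hidden stack and epoch count:

```python
# Four fully connected hidden layers (200, 100, 50, 20 units) with tanh.
dnn = Sequential([
    Dense(200, activation='tanh', kernel_initializer='he_normal',
          input_shape=(205,)),
    Dense(100, activation='tanh', kernel_initializer='he_normal'),
    Dense(50, activation='tanh', kernel_initializer='he_normal'),
    Dense(20, activation='tanh', kernel_initializer='he_normal'),
    Dense(1, activation='linear', kernel_initializer='he_normal'),
])
dnn.compile(optimizer='adam', loss='mse')
dnn.fit(X, y, epochs=10000, verbose=0,
        callbacks=[ModelCheckpoint('dnn_best.h5', monitor='loss',
                                   save_best_only=True,
                                   save_weights_only=True)])
```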
MIE calculations were performed with reference to the MDA calculation reported by Date and Kikuchi.[19] For each input variable, the values were randomly rearranged among the input data (permutation), and the rearranged data matrices were evaluated by the constructed DNN model. The model loss (MSE) obtained from the permutations was compared with the model loss of the original, unpermuted data. A relatively small influence on the MSE means that the constructed model was barely influenced by the variable, whereas a relatively large influence on the MSE means that the constructed model was strongly affected by the variable. Based on this criterion, the MIE can evaluate the importance of the variables in the constructed DNN model. In this study, permutations were repeated 60 times for each variable, and the average MSE for each variable calculated from the rearranged matrices was used as its representative importance.
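A sketch of this permutation procedure, reusing the model and data names from the earlier sketches; the function name is hypothetical, while the 60 repetitions and the MSE-based comparison follow the text:

```python
import numpy as np

def mie_importance(model, X, y, n_repeats=60, seed=2525):
    """Average MSE per variable after permuting that variable across samples."""
    rng = np.random.RandomState(seed)  # seed value is an assumption
    base_mse = np.mean((model.predict(X) - y) ** 2)  # loss on unpermuted data
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        losses = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # permute variable j
            losses.append(np.mean((model.predict(X_perm) - y) ** 2))
        importance[j] = np.mean(losses)  # representative importance of variable j
    return importance, base_mse
```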
To evaluate the effect of the important variables, a sensitivity analysis was performed in which cell growth and GFP yields were estimated while varying only a single important variable of the yeast extract composition.
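A sketch of such a sensitivity analysis, assuming a baseline composition vector and a hypothetical index for the variable of interest (the function name, index, and scan range are illustrative assumptions):

```python
import numpy as np

def sensitivity_scan(model, baseline, var_index, values):
    """Predict yields while varying a single variable of the composition."""
    X_scan = np.tile(baseline, (len(values), 1))  # replicate the baseline row
    X_scan[:, var_index] = values                 # vary only the chosen variable
    return model.predict(X_scan)

# e.g., scan hypothetical variable 42 over a range of scaled values:
# preds = sensitivity_scan(dnn, X[0], 42, np.linspace(0.0, 1.0, 20))
```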
A personal computer (PC) equipped with a graphics processing unit (GPU) was used for the calculations. PC specifications: OS, Ubuntu 16.04 LTS; CPU, Intel Core i7-8700 (3.2–4.6 GHz, 6 cores/12 threads, 12 MB cache); memory, 32 GB DDR4-2666; GPU, NVIDIA GeForce GTX 1080 Ti (11 GB).