Statistical Analysis and Empirical Modeling Methods
All experimental conditions were run in biological replicates, and the resulting data are presented with the standard error of the mean. CVC and mAb titer were normalized to the maximum CVC and titer of the respective process control conditions. Graphical analysis and standard error calculations were performed in Microsoft Excel, while one-way ANOVA and Student's t-tests were performed in SAS JMP (SAS Institute, USA).
SIMCA-P+ (Sartorius Stedim, Germany) was used for MVDA modeling; detailed methods for PCA and PLS regression are described by Wold et al. (MKS Umetrics AB, 2013). In short, multivariate methods begin with dimensional reduction, in which a high-dimensional dataset is reduced to a lower-dimensional space that can be explained by fewer variables (i.e., latent variables). The latent variables are calculated by unit-variance scaling of the data, identifying the directions of greatest variance within the data (i.e., the eigenvectors), and projecting the scaled data onto the lower-dimensional space those directions define (Mevik & Wehrens, 2007). The direction of greatest variance is described by the first latent variable, or, in the case of PCA, the first principal component. Every subsequent component within a PCA model is orthogonal (perpendicular) to the preceding components and explains the maximum remaining variance of the dataset. Accordingly, each component is a summation of the individual variable contributions, or loadings, towards the variance. Notably, Wold et al. found that the sum of each variable's contributions across all the components of a model could be represented as a variable importance in projection (VIP) score, which provides a heuristic multivariate ranking of the variables (Akarachantachote, Chadcham, & Saithanu, 2013; Prieto et al., 2014). As a result, dimensional reduction models can not only provide a holistic distribution of batches, or even individual observations, but also identify the key variable contributions that explain the separation between observations and thereby pose as potential targets for optimization efforts.
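As a minimal illustration of the dimensional reduction step described above, the following sketch computes principal components from a synthetic, autoscaled data matrix via singular value decomposition (NumPy is assumed; the batch and variable counts are placeholders, not the study's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a cell-culture dataset: 25 batches x 6 variables.
X = rng.normal(size=(25, 6))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]  # introduce some collinearity

# Autoscaling (mean-centering and unit-variance normalization).
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD: the right singular vectors are the principal-component loadings,
# ordered by the amount of variance each component explains.
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = U * s                     # projections of batches onto the components
loadings = Vt.T                    # per-variable contributions to each component
explained = s**2 / np.sum(s**2)    # fraction of total variance per component

# Successive components are orthogonal by construction.
print(np.round(explained, 3))
```

The loading matrix gives each variable's contribution to each component, which is the quantity summed (with per-component weights) in VIP-style rankings.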
In contrast, the calculation of latent variables is modified in supervised approaches such as PLS, where the latent directions are those of greatest covariance between the explanatory variables and the response variable(s). Because the first predictive component of a PLS model maximizes covariance with the response rather than variance alone, PLS models are particularly powerful when the explanatory variables are highly collinear and strongly correlated with the response. However, when the variables behave nonlinearly, as in the case of amino acid stoichiometric balances, the goodness of fit (R2) and the goodness of prediction (Q2) of a PLS model are significantly impacted, as a high degree of systematic variation in the explanatory variable space goes unaccounted for. To circumvent this issue, Orthogonal Partial Least Squares (OPLS) was used, in which a single predictive component captures the variation correlated with the response, while subsequent orthogonal components capture the remaining systematic variation in the explanatory variables that is uncorrelated with the response. Consequently, OPLS models yield better predictions and greater model interpretability for nonlinear response variables, as a greater share of the information in the explanatory variable space is accounted for (Bylesjö et al., 2006; Yamamoto et al., 2009). Accordingly, OPLS models built with stoichiometric balances were found to have higher Q2 values than the corresponding PLS models (data not presented).
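The covariance-maximizing weight calculation that distinguishes PLS from PCA can be sketched as a single NIPALS-style predictive component (synthetic data throughout; this is a minimal one-component illustration, not SIMCA's OPLS implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic explanatory matrix (e.g. stoichiometric balances) and a
# response (e.g. titer) that depends on two of the columns.
X = rng.normal(size=(25, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=25)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS weight vector: the direction in X-space with maximal
# covariance with the response (unlike PCA, which ignores y).
w = Xc.T @ yc
w /= np.linalg.norm(w)

t = Xc @ w                  # scores on the first predictive component
p = Xc.T @ t / (t @ t)      # X-loadings
q = yc @ t / (t @ t)        # y-loading (regression of y on the scores)

y_hat = t * q
r2 = 1.0 - np.sum((yc - y_hat) ** 2) / np.sum(yc ** 2)
print(round(r2, 3))
```

With a response that is strongly determined by the explanatory variables, a single covariance-driven component already recovers most of the fit, which is what makes PLS effective under collinearity.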
Lastly, to incorporate the time-dependent contributions of the amino acid consumption rates and the resulting stoichiometric balances, the training set data matrix was transformed into a batch-level model (BLM) format. In the BLM format, each variable at each measured day across the batches becomes an independent variable. As a result, each row of a BLM data matrix represents a single batch, with every cell culture variable expanded into additional columns, one per timepoint, forming a wider and shorter data matrix. The benefits of the BLM format over the untransformed dataset were two-fold: (1) each variable in the model provided a specific time-dependent contribution representative of a particular instance in a batch, and (2) the OPLS algorithm gained precision, since each component of the model is a weighted average of all the variables and the increased number of variables made the calculation of that weighted average more reliable (Vajargah, Sadeghi-Bazargani, Mehdizadeh-Esfanjani, Savadi-Oskouei, & Farhoudi, 2012; Worley & Powers, 2013; Worley & Powers, 2015). For instance, all 20 amino acid stoichiometric balances captured from the 25 training batches were measured every other day from day 0 to day 14 and interpolated for the unmeasured days, resulting in 15 observations per batch, or 375 observations across all 25 batches. When transformed into the BLM format, the 20 original stoichiometric balance variables expanded to 300 variables capturing the stoichiometric balance at each timepoint (day 0 to day 14), and the observations collapsed to 25 rows, one per batch.
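The batch-level unfolding described above amounts to reshaping a long-format matrix into one row per batch. A minimal sketch with the dimensions quoted in the text (25 batches, 15 days, 20 variables; the values are synthetic placeholders):

```python
import numpy as np

n_batches, n_days, n_vars = 25, 15, 20

rng = np.random.default_rng(2)
# Long format: one row per (batch, day) pair, one column per variable,
# i.e. 375 rows x 20 columns, as in the untransformed training set.
long = rng.normal(size=(n_batches * n_days, n_vars))

# Batch-level (BLM) unfolding: collapse each batch's time series into a
# single row, so each variable expands into one column per day,
# yielding 25 rows x 300 columns.
blm = long.reshape(n_batches, n_days * n_vars)

print(long.shape, blm.shape)
```

Row-major reshaping keeps each batch's days contiguous, so the first 20 columns of a BLM row are that batch's day-0 measurements, the next 20 its day-1 measurements, and so on.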
Dimensional reduction then collapsed the complexity of the resulting matrix into a small set of latent variables, highlighting the ability of MVDA modeling to retain time-dependent biological information for a large set of variables and to rapidly identify key modulators for improving bioprocess development efforts.