3.3.3 Random Forest Model
Random forests typically achieve high predictive accuracy. Our data set
includes two interrelated parts: the drug activity values and 204
features, which makes this a high-dimensional analysis. Random forests
can process samples with high-dimensional features and can assess the
importance of each feature in the prediction problem. Obtaining highly
correlated features increases the credibility of the study. The Pearson
correlation coefficient ranking shows that the correlation coefficient
between some features is greater than 0.4 (Figure 13). Principal
component analysis was therefore adopted in this study. Principal
component analysis has no effect when the original variables are already
orthogonal to each other, that is, when there is no correlation between
them; because some of our features are correlated, it is applicable
here. The results of two-dimensional
principal component analysis (PCA) and three-dimensional principal
component analysis (3D PCA) show that dimensionality reduction makes it
easier to find representative features (Figures 14 and 15). The 204
features are scaled, and the features with variance below 0.05 are
eliminated to obtain the most representative features. The Lasso
regression model
was then used to select nine features with low mutual correlation and
good orthogonality. In this way, the strongly correlated variables are
converted into as few new variables as possible, which replace the
original variables. These new, uncorrelated variables retain the
information in the original variables while making the high-dimensional
data tractable. The final model has a training-set mean squared error of
0.005 and an R-squared of 0.77 (Figure 16). We therefore consider the
predictions credible.
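
The screening and evaluation quantities mentioned in this section can be illustrated with a minimal sketch. It is not the code used in the study: the toy feature columns, the names `features`, `kept` and `pairs`, and the use of the absolute value of the correlation are all assumptions made for illustration. It also assumes, as in conventional variance thresholding, that the features dropped are those whose scaled variance falls below the 0.05 cutoff. The Lasso selection and the random forest itself are omitted; in practice they would come from a library.

```python
from itertools import combinations
from math import sqrt


def variance(values):
    """Population variance of one (already scaled) feature column."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # correlation is undefined for a constant column; use 0
    return cov / (norm_x * norm_y)


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: four features, six samples, already scaled to [0, 1].
features = {
    "f1": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    "f2": [0.0, 0.1, 0.5, 0.6, 0.9, 1.0],        # closely tracks f1
    "f3": [0.50, 0.52, 0.49, 0.51, 0.50, 0.48],  # nearly constant
    "f4": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}

VARIANCE_CUTOFF = 0.05     # cutoff quoted in the text
CORRELATION_CUTOFF = 0.4   # cutoff quoted in the text

# Step 1: drop features whose variance falls below the cutoff (assumption).
kept = [name for name, col in features.items()
        if variance(col) >= VARIANCE_CUTOFF]

# Step 2: list the remaining feature pairs that are strongly correlated.
# Using |r| rather than r is an assumption of this sketch.
pairs = []
for a, b in combinations(kept, 2):
    r = pearson(features[a], features[b])
    if abs(r) > CORRELATION_CUTOFF:
        pairs.append((a, b, r))

print("kept features:", kept)
print("correlated pairs:", [(a, b, round(r, 3)) for a, b, r in pairs])

# Step 3: evaluation metrics on hypothetical observed and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print("MSE:", mse(y_true, y_pred), "R-squared:", r_squared(y_true, y_pred))
```

In this toy example the nearly constant column `f3` is removed by the variance filter, and `f1` and `f2` are flagged as a strongly correlated pair, which is the situation in which dimensionality reduction or Lasso-based selection becomes useful. One option for the omitted steps is scikit-learn, which provides `Lasso` and `RandomForestRegressor`.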