2.3.7 Bagging Model
Bagging38, also known as the bootstrap aggregating method, obtains m new data sets by sampling the original data set with replacement m times: to get a new data set of size n, each sample is drawn at random from the original data set and then put back. A base learner is trained on each sampled set, and these base learners are then combined. When combining the predicted outputs, Bagging usually adopts simple voting for classification tasks and simple averaging for regression tasks. Bagging focuses on reducing variance. The algorithm is as follows (a code sketch is given after Figure 8):
A) Training sets are drawn from the original sample set: in each round, n training samples are drawn from the original sample set using the bootstrapping method, so within a single training set some samples may be drawn several times while others are never drawn. A total of m rounds are performed, yielding m training sets that are mutually independent.
B) One model is trained on each training set, giving m models in total. (Note: no specific classification or regression algorithm is prescribed here; different methods, such as decision trees or perceptrons, can be adopted according to the specific problem.)
C) For classification, the m models obtained in the previous step vote to produce the classification result; for regression, the mean of the models' outputs is taken as the final result (all models are equally important).
A simplified diagram is shown in Figure 8.
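As an illustration, the following is a minimal NumPy sketch of the procedure above. The helper names (bagging_train, bagging_classify, bagging_regress) and the base_learner_factory argument are hypothetical, not from the source; any estimator exposing fit/predict could be plugged in, and non-negative integer class labels are assumed for the voting step.

```python
import numpy as np

def bagging_train(X, y, base_learner_factory, m, seed=None):
    """Step A + B: train m base learners, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        # Draw n indices with replacement: some samples repeat, others never appear.
        idx = rng.integers(0, n, size=n)
        model = base_learner_factory()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def bagging_classify(models, X):
    """Step C, classification: simple voting, the most frequent label wins."""
    preds = np.stack([m.predict(X) for m in models]).astype(int)  # shape (m, n_test)
    # Assumes non-negative integer class labels (required by np.bincount).
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

def bagging_regress(models, X):
    """Step C, regression: simple averaging with equal weights for all models."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

With a scikit-learn DecisionTreeClassifier as the base learner, for example, bagging_train(X, y, DecisionTreeClassifier, m=25) yields the ensemble and bagging_classify(models, X_test) combines its votes.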
2.3.8 KNN Model
KNN (k-nearest neighbor)39 works as follows: there is a sample data set, also known as the training sample set, in which every data point carries a label; that is, the relationship between each data point and its class is known. When unlabeled data are input, each feature of the new data is compared with the corresponding features of the data in the sample set, and the classification label of the most similar (nearest-neighbor) data in the sample set is extracted. Generally, only the first k most similar data points in the sample set are selected, which is where the k in the k-nearest neighbor algorithm comes from; k is usually an integer smaller than 20. Finally, the class that occurs most often among the k most similar data points is chosen as the class of the new data. KNN has no explicit training process and is a representative of "lazy learning": it only stores the data in the training stage, so the training time is zero, and all processing happens after the test samples are received.
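To make the procedure concrete, below is a minimal sketch in NumPy. The function name knn_classify and the toy data are hypothetical, and Euclidean distance is assumed as the similarity measure, since the text above does not fix a specific metric.

```python
import numpy as np

def knn_classify(x_new, X_train, y_train, k=5):
    """Label x_new by majority vote among its k nearest training samples."""
    # "Training" is just storing X_train and y_train; all work happens at query time.
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]                   # indices of the k most similar samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # most frequent neighbor label

# Hypothetical toy data: two classes in a 2-D feature space.
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array([1, 1, 0, 0])
print(knn_classify(np.array([0.1, 0.0]), X_train, y_train, k=3))  # -> 0
```

Note that the function body is the entire test-time computation, which matches the lazy-learning description: nothing is fitted in advance, and the cost of classification is paid per query.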