2.3.5 Stochastic Gradient Descent Model
The SGD algorithm was proposed to address the slow training speed of the BGD algorithm [35]. The standard BGD algorithm passes over all samples in each iteration and performs one gradient update per pass over the full training set. The SGD algorithm instead randomly draws one group of samples, updates the parameters once according to the gradient computed on that group, then draws another group and updates again. When the sample size is extremely large, it may therefore not be necessary to train on all samples to obtain a model with an acceptable loss value.
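To make the contrast concrete, the following minimal NumPy sketch compares one BGD update, which uses the gradient over all samples, with one SGD update, which uses a single randomly drawn sample. It assumes a squared-error loss on a linear model; the data X, y, the learning rate lr, and the function names are illustrative assumptions, not taken from the original text.

import numpy as np

def bgd_step(w, X, y, lr=0.01):
    # Batch gradient descent: one update uses the gradient averaged over ALL samples.
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def sgd_step(w, X, y, lr=0.01):
    # Stochastic gradient descent: one update uses a single randomly drawn sample.
    i = np.random.randint(len(y))
    xi, yi = X[i], y[i]
    grad = xi * (xi @ w - yi)
    return w - lr * grad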
The core idea of stochastic gradient descent is that the gradient is an expectation, and an expectation can be approximately estimated from a small sample. Specifically, at each step of the algorithm we uniformly draw a small batch of samples $\mathbb{B} = \{x^{(1)}, \ldots, x^{(m')}\}$ from the training set. The minibatch size $m'$ is usually a relatively small number, ranging from one to a few hundred. Importantly, $m'$ is usually held fixed as the size of the training set grows: we may be fitting billions of samples while using only a few hundred samples per update.
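The sketch below illustrates the minibatch procedure described above, under the same illustrative squared-error linear model as before; the parameter name batch_size stands in for $m'$, and the step count and learning rate are assumed values chosen only for demonstration.

import numpy as np

def minibatch_sgd(w, X, y, lr=0.01, batch_size=128, steps=1000):
    # batch_size (m') stays fixed even if X holds billions of rows.
    n = len(y)
    for _ in range(steps):
        # Uniformly draw a small batch B = {x^(1), ..., x^(m')} from the training set.
        idx = np.random.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # The gradient of the loss is estimated on the minibatch only,
        # serving as an unbiased estimate of the full-batch gradient.
        grad = Xb.T @ (Xb @ w - yb) / batch_size
        w = w - lr * grad
    return w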