2.3.5 Stochastic Gradient Descent Model
The SGD algorithm was proposed to address the slow training speed of the BGD algorithm [35]. The conventional BGD algorithm passes over all samples in each iteration and performs one gradient update per pass over the full training set. SGD instead randomly draws a group of samples, performs one update according to the gradient after training on that group, then draws another group and updates again. When the sample size is extremely large, a model with an acceptable loss value may therefore be obtained without training on all samples. The core idea of stochastic gradient descent is that the gradient is an expectation, and this expectation can be estimated from a small sample. Specifically, at each step the algorithm uniformly draws a small batch of samples B = {x^(1), ..., x^(m')} from the training set. The minibatch size m' is usually a relatively small number, ranging from one to a few hundred. Importantly, m' is usually held fixed as the size of the training set grows: we may be fitting billions of samples while using only a few hundred samples per update.
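The following is a minimal sketch of this minibatch update loop, assuming a linear model trained with a mean-squared-error loss; the synthetic data, learning rate, and batch size are illustrative placeholders, not values from the text.

```python
# Sketch of minibatch SGD (assumed linear model + MSE loss for illustration).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n samples, d features (assumed for illustration).
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)      # model parameters
lr = 0.01            # learning rate (assumed)
batch_size = 64      # minibatch size m' (assumed)

for step in range(1_000):
    # Uniformly draw a minibatch B = {x^(1), ..., x^(m')} from the training set.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]

    # Gradient of the MSE loss on the minibatch: a small-sample estimate
    # of the expected (full-batch) gradient.
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)

    # One parameter update per minibatch, rather than per full pass as in BGD.
    w -= lr * grad

print("parameter estimation error:", np.linalg.norm(w - true_w))
```

Note that each update touches only m' samples, so the per-step cost stays constant even as the training set grows, which is the efficiency advantage over BGD described above.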