Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization method used in many machine learning and deep learning algorithms. It iteratively updates a model's parameters to minimize a loss function, thereby improving the accuracy of the model's predictions.

The SGD algorithm works in the same way as conventional Gradient Descent, but at each iteration it uses only a single, randomly selected data point (or a small group of data points, known as a minibatch) to calculate the gradient, instead of the entire dataset. This makes the algorithm significantly faster and allows it to work efficiently even with very large datasets.
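The contrast between the two gradient estimates can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a linear-regression model with mean-squared-error loss; the dataset, weights, and variable names are made up for this example.

```python
import numpy as np

# Hypothetical setup: a linear model y = X @ w with MSE loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # 1000 samples, 3 features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)

def full_gradient(w):
    # Conventional Gradient Descent: touches every sample on each call.
    return 2 * X.T @ (X @ w - y) / len(X)

def stochastic_gradient(w):
    # SGD: a single randomly chosen sample gives a cheap, noisy estimate.
    i = rng.integers(len(X))
    return 2 * X[i] * (X[i] @ w - y[i])
```

Both functions return a gradient of the same shape, but the stochastic version costs a single sample per call rather than a full pass over the data, which is where SGD's speed advantage comes from.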

*Figure: dl_optimizer_sgd.png*

The Stochastic Gradient Descent procedure works as follows:

1. Initialize the model parameters randomly.
2. Select a random data instance or minibatch from the training set.
3. Compute the gradient of the loss with respect to the model's parameters on that instance or minibatch.
4. Update the parameters in the direction of the negative gradient to reduce the loss. A hyperparameter called the "learning rate" determines how far each update moves along the negative gradient.
5. Repeat steps 2–4 for a certain number of iterations, or until the loss falls below a threshold or stops improving.
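The procedure above can be sketched as a short training loop. This is a minimal example, assuming a linear-regression model with mean-squared-error loss; the synthetic dataset, learning rate, and batch size are illustrative choices, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))            # synthetic training data
w_true = np.array([3.0, -2.0])
y = X @ w_true + rng.normal(scale=0.1, size=500)

w = rng.normal(size=2)                   # step 1: random initialization
learning_rate = 0.05                     # hyperparameter: step size
batch_size = 32

for step in range(2000):
    # step 2: sample a random minibatch from the training set
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    # step 3: gradient of the MSE loss w.r.t. the parameters
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    # step 4: move in the direction of the negative gradient
    w -= learning_rate * grad
```

After enough iterations the parameters settle near the values that minimize the loss; on this toy problem, close to `w_true`.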

While SGD is efficient, it can also be unstable: because each update is based on only a single data instance or minibatch, the gradient estimate is noisy, which can cause the parameters to "jump" in different directions. To mitigate this problem, variants of SGD are often used that employ additional techniques such as momentum or adaptive learning rates.
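Momentum, one of the techniques mentioned above, can be sketched by adding a velocity term to the plain SGD update. This is a minimal illustration on an assumed linear-regression problem; the decay factor `beta` and other hyperparameters are conventional but illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 2))
w_true = np.array([1.5, -0.5])
y = X @ w_true + rng.normal(scale=0.1, size=500)

w = np.zeros(2)
velocity = np.zeros(2)
learning_rate = 0.01
beta = 0.9          # how much of the previous update direction is retained

for step in range(5000):
    i = rng.integers(len(X))                  # a single random sample
    grad = 2 * X[i] * (X[i] @ w - y[i])       # noisy single-sample gradient
    # Momentum: accumulate a running average of past gradients, which
    # smooths out the "jumps" caused by individual noisy samples.
    velocity = beta * velocity - learning_rate * grad
    w += velocity
```

Because the velocity averages over many recent gradients, updates point more consistently toward the minimum than the raw single-sample gradients do.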