Gradient Descent

Gradient Descent is a popular optimization algorithm used throughout machine learning and is especially useful for training deep learning models. It's an iterative method for finding the minimum of a function.

Introduction

In machine learning, we typically define a model with some parameters and then tweak these parameters to best fit our data. The measure of how well the model fits the data is given by a loss function, or cost function. Our goal is to find the parameters that minimize this loss function.
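
For example, a linear model might use mean squared error (MSE) as its loss. A minimal sketch in Python; the function name and the choice of a linear model are illustrative assumptions, not from any particular library:

```python
import numpy as np

def mse_loss(w, b, X, y):
    """Mean squared error of a hypothetical linear model y_hat = X @ w + b."""
    y_hat = X @ w + b                 # model predictions
    return np.mean((y_hat - y) ** 2)  # average squared residual
```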

Mathematically, if we have some function $J(\theta)$, where $\theta$ represents the parameters of our model, gradient descent iteratively updates $\theta$ to find the minimum of $J(\theta)$.

Algorithm

The basic algorithm of gradient descent is as follows:

  1. Initialize the parameters $\theta$ with some values (often random)
  2. Compute the cost/loss
  3. Compute the gradient (partial derivative) of the cost/loss function with respect to each parameter
  4. Update each parameter using the rule below, then repeat from step 2 until convergence

The update rule for each parameter is given by:

$$\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$$

Here, $\alpha$ is known as the learning rate and controls how big a step we take towards the minimum at each iteration.
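
A minimal sketch of the whole loop in Python, assuming a one-dimensional loss $J(\theta) = (\theta - 3)^2$; the function and its default arguments are illustrative, not from any library:

```python
def gradient_descent(grad, theta0, alpha=0.1, n_iters=100):
    """Repeatedly apply the update rule theta := theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - alpha * grad(theta)
    return theta

# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(grad=lambda t: 2 * (t - 3), theta0=0.0)
print(theta_min)  # approaches 3.0, the minimizer of J
```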

Types of Gradient Descent

There are three main types of gradient descent, which differ in how much data we use to compute the gradient at each step.

Batch Gradient Descent

Batch Gradient Descent uses all training samples in the forward pass to calculate the cumulative error, and then performs backpropagation to update the weights using the resulting derivatives. Each update is therefore based on the exact gradient over the whole dataset, which makes individual steps accurate but expensive for large datasets.
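
A sketch of a single batch update for linear regression with an MSE loss, continuing the assumed setup from the earlier example; all names are illustrative:

```python
import numpy as np

def batch_gd_step(w, b, X, y, alpha=0.01):
    """One batch update: gradients are averaged over ALL training samples."""
    err = X @ w + b - y              # residuals, shape (n_samples,)
    grad_w = 2 * X.T @ err / len(y)  # d(MSE)/dw over the full dataset
    grad_b = 2 * err.mean()          # d(MSE)/db over the full dataset
    return w - alpha * grad_w, b - alpha * grad_b
```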

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) sits at the opposite extreme: it computes the gradient and updates the parameters using a single, randomly chosen training sample at each step. Each update is noisy but cheap, and the noise can help the optimizer escape shallow local minima.
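
A sketch of one SGD epoch under the same assumed linear-regression setup, with one parameter update per randomly ordered sample:

```python
import numpy as np

def sgd_epoch(w, b, X, y, alpha=0.01, rng=None):
    """One SGD epoch: one parameter update per (shuffled) training sample."""
    if rng is None:
        rng = np.random.default_rng()
    for i in rng.permutation(len(y)):
        err = X[i] @ w + b - y[i]       # error on a single sample
        w = w - alpha * 2 * err * X[i]  # gradient of err**2 w.r.t. w
        b = b - alpha * 2 * err         # gradient of err**2 w.r.t. b
    return w, b
```
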
Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a compromise between Batch GD and SGD. In each iteration, a mini-batch of n training samples (where 1 < n < the size of the full training set) is used to compute the gradient of the loss function.
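
A sketch of one mini-batch epoch, again under the assumed linear-regression setup; the default batch size of 32 is just a common illustrative choice:

```python
import numpy as np

def minibatch_gd_epoch(w, b, X, y, alpha=0.01, batch_size=32, rng=None):
    """One mini-batch epoch: one update per slice of the shuffled dataset."""
    if rng is None:
        rng = np.random.default_rng()
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - y[idx]                  # residuals on this batch
        w = w - alpha * 2 * X[idx].T @ err / len(idx)  # batch-averaged gradient
        b = b - alpha * 2 * err.mean()
    return w, b
```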

Advantages and Disadvantages

Gradient Descent has several advantages:

  • It's simple and easy to implement.
  • Its stochastic and mini-batch variants scale well to large datasets.
  • It can be used with any differentiable loss function.

However, it also has some disadvantages:

  • Choosing an appropriate learning rate can be difficult. A learning rate that is too small leads to slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge, as the sketch after this list illustrates.
  • It's sensitive to feature scaling.
  • It can get stuck in local minima in case of non-convex error surfaces.
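
The learning-rate trade-off in the first point can be seen by rerunning the quadratic example from earlier with different rates; the three values below are purely illustrative:

```python
def gd(alpha, theta=0.0, n_iters=50):
    for _ in range(n_iters):
        theta -= alpha * 2 * (theta - 3)  # gradient of (theta - 3)^2
    return theta

print(gd(alpha=0.01))  # too small: still well short of 3 after 50 steps
print(gd(alpha=0.5))   # reasonable: lands exactly on 3.0
print(gd(alpha=1.1))   # too large: overshoots each step and diverges
```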