Gradient Descent Optimization [Part 3]

This is a continuation of the Gradient Descent series: Part0, Part1 and Part2. So far we have seen the algorithm and worked through 1D and 2D examples to analyse how the choice of starting point and learning rate affects convergence.

In the context of deep learning, gradient descent can be classified into the following categories.

  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent
  • Batch Gradient Descent

Stochastic Gradient Descent: Suppose you want to minimize an objective function that is written as a sum of differentiable functions.

Q(w) = \sum_{i=1}^{n} Q_{i}(w)

Each term Q_{i} is usually associated with the i^{th} data point.

Standard gradient descent (batch gradient descent) is given as follows.

w = w - \eta \nabla Q(w) = w - \eta \sum_{i=1}^{n} \nabla Q_{i}(w)

where \eta is the learning rate (step size).
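To make the batch update concrete, here is a minimal NumPy sketch; the function grad_Q, the data arrays and the hyperparameter values are placeholders assumed for this illustration, not part of the original text.

```python
import numpy as np

def batch_gradient_descent(grad_Q, w0, X, y, eta=0.01, n_iters=100):
    """Batch gradient descent: every update uses all n data points.

    grad_Q(w, X, y) is assumed to return the gradient of Q(w)
    summed over the whole data set.
    """
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        w = w - eta * grad_Q(w, X, y)   # w = w - eta * sum_i grad Q_i(w)
    return w
```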

Stochastic Gradient Descent (SGD) considers only a subset of the summand functions at every iteration. This can be effective for large-scale problems. The gradient of Q(w) is approximated by the gradient at a single example:

w = w - \eta  \nabla Q_{i}(w)

This update is performed for each training example. Several passes over the training set might be necessary before the algorithm converges.

Pseudo-code for SGD:

  • Choose an initial value of w and \eta.
  • Repeat until converged:
    • Randomly shuffle the data points in the training set.
    • For i = 1, 2, 3, ..., n, do:
      • w = w - \eta \nabla Q_{i}(w)
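Here is a minimal NumPy sketch of that pseudo-code; grad_Qi is a placeholder for the per-example gradient, and the hyperparameter values are assumptions of this illustration.

```python
import numpy as np

def sgd(grad_Qi, w0, X, y, eta=0.01, n_epochs=10):
    """Stochastic gradient descent following the pseudo-code above.

    grad_Qi(w, x_i, y_i) is assumed to return the gradient of the
    per-example term Q_i at the current parameters w.
    """
    w = np.array(w0, dtype=float)
    n = len(y)
    for _ in range(n_epochs):                      # repeat until converged
        for i in np.random.permutation(n):         # randomly shuffle the data
            w = w - eta * grad_Qi(w, X[i], y[i])   # one update per example
    return w
```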

Example:

Suppose y = w_{1} + w_{2}x

The objective function is:

Q(w) = \sum_{i=1}^{n} Q_{i}(w) = \sum_{i=1}^{n} (w_{1} + w_{2}x_{i}-y_{i})^{2}

The update rule for SGD is then as follows.

\begin{bmatrix} w_{1}\\ w_{2} \end{bmatrix} := \begin{bmatrix} w_{1} \\ w_{2} \end{bmatrix} - \eta \begin{bmatrix} 2(w_{1}+w_{2}x_{i}-y_{i}) \\ 2x_{i}(w_{1}+w_{2}x_{i}-y_{i}) \end{bmatrix}
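Translating this update rule into code gives the following sketch; the synthetic data and the hyperparameter values are only placeholders for illustration.

```python
import numpy as np

def sgd_linear_fit(x, y, eta=0.01, n_epochs=50):
    """Fit y = w1 + w2*x with SGD using the update rule above."""
    w1, w2 = 0.0, 0.0                          # initial parameter values
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):
            residual = w1 + w2 * x[i] - y[i]   # (w1 + w2*x_i - y_i)
            w1 -= eta * 2 * residual           # partial derivative w.r.t. w1
            w2 -= eta * 2 * x[i] * residual    # partial derivative w.r.t. w2
    return w1, w2

# Usage with synthetic data generated around y = 1 + 2x.
x = np.linspace(0, 1, 100)
y = 1.0 + 2.0 * x + 0.1 * np.random.randn(100)
print(sgd_linear_fit(x, y))   # should come out close to (1.0, 2.0)
```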

Mini-Batch Gradient Descent:

  • Batch gradient descent uses all n data points in each iteration.
  • Stochastic gradient descent uses 1 data point in each iteration.
  • Mini-batch gradient descent uses b data points in each iteration, where b is a parameter called the mini-batch size.

Pseudo-code for Mini-Batch Gradient Descent:

  • Choose an initial value of w and \eta.
  • Choose a mini-batch size, for example b = 10.
  • Repeat until converged:
    • Randomly shuffle the data points in the training set.
    • For i = 1, 11, 21, ..., n-9, do:
      • w = w - \eta \sum_{k=i}^{i+9} \nabla Q_{k}(w)
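The same loop can be written in NumPy as below; grad_Qi is again a placeholder for the per-example gradient, and b = 10 as in the pseudo-code above.

```python
import numpy as np

def minibatch_gd(grad_Qi, w0, X, y, eta=0.01, b=10, n_epochs=10):
    """Mini-batch gradient descent: each update sums the per-example
    gradients over a batch of b data points."""
    w = np.array(w0, dtype=float)
    n = len(y)
    for _ in range(n_epochs):                  # repeat until converged
        order = np.random.permutation(n)       # randomly shuffle the data
        for start in range(0, n, b):
            batch = order[start:start + b]     # indices of one mini-batch
            grad = sum(grad_Qi(w, X[i], y[i]) for i in batch)
            w = w - eta * grad                 # w = w - eta * sum_k grad Q_k(w)
    return w
```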

A few points on the three variants:

  • SGD is noisier than the other two variants because the parameters are updated after every single example.
  • SGD is often called an online machine learning algorithm because it updates the model for each training example.
  • SGD is often computationally more expensive per epoch than the other variants because updating once per example prevents vectorizing the computation over many examples.
  • The jittery updates of SGD can make converging to the exact minimum difficult.
  • Mini-batch gradient descent allows you to take advantage of your GPU and process multiple examples in one go, thereby reducing the training time.
  • Batch gradient descent is practically infeasible unless you have a small data-set that fits into the available memory.

The choice of the mini-batch size parameter b depends on your GPU, data-set and model. A rule of thumb is to choose a size that is a multiple of 8; 32 is generally a good number to start with. If the mini-batch size is too high or too low, the algorithm can become slower: a very large mini-batch puts a heavy load on the GPU, while a very small one underutilizes it. Finding the right balance for your case is important.

Tuning the learning rate \eta:

If \eta is too high, the algorithm can diverge. If \eta is too low, the algorithm converges slowly.

A common practice is to make \eta_{n} a decreasing function of iteration number n. Following are two examples.

  • \eta_{n} = \frac{k}{n+c} , where k and c are constants.
  • \eta_{n} = \eta_{0}e^{-kn} , where \eta_{0} is the initial learning rate and k decides the steepness of the exponential decay.
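Both decay rules can be written directly as functions of the iteration number; the constant values below are arbitrary and only for illustration.

```python
import numpy as np

def eta_inverse(n, k=0.1, c=10.0):
    """eta_n = k / (n + c), where k and c are constants."""
    return k / (n + c)

def eta_exponential(n, eta0=0.1, k=0.05):
    """eta_n = eta0 * exp(-k * n), exponential decay from eta0."""
    return eta0 * np.exp(-k * n)
```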

One can also use an interval-based learning rate schedule. For example:

  • n \in [0,10): \eta = 0.01
  • n \in [10,20): \eta = 0.001
  • n \in [20,30): \eta = 0.0001

where n is the iteration number.
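The interval-based schedule above translates to a simple piecewise function:

```python
def eta_interval(n):
    """Interval-based learning rate schedule from the example above."""
    if n < 10:
        return 0.01
    elif n < 20:
        return 0.001
    else:
        return 0.0001   # n in [20, 30); kept as the final value in this sketch
```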

The first iterations cause large changes in w, while the later ones only fine-tune it.

There are also cyclical learning rate schedules, where a periodic function is used to set the learning rate. You can read more about them here.
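As an illustration only (the exact policies are described in the linked reference), a simple triangular cyclical schedule could look like the sketch below; the bounds and cycle length are assumptions made for this example.

```python
def eta_cyclical(n, eta_min=0.001, eta_max=0.01, cycle=20):
    """Triangular cyclical learning rate: eta rises linearly from
    eta_min to eta_max and back down over each cycle of iterations."""
    half = cycle / 2
    distance = abs((n % cycle) - half) / half   # 1 at cycle ends, 0 at midpoint
    return eta_min + (eta_max - eta_min) * (1 - distance)
```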

This wraps up the discussion of gradient descent and concludes the series: Part0, Part1, Part2 and Part3.

To summarize:

  • Gradient descent is an optimization algorithm that is used in deep learning to minimize the cost function w.r.t. the model parameters.
  • It does not guarantee convergence to the global minimum.
  • Convergence depends on the starting point, the learning rate and the number of iterations.
  • In practice, mini-batch gradient descent is used with a batch size value of 32 (this can vary depending on your application).
