Optimizers in Neural Networks

In any machine learning or deep learning model, the common goal is to minimize the cost function.

A well-known technique for minimizing the cost function is Gradient Descent.

In deep learning, we use Gradient Descent to optimize the weights of the connections between neurons, which are the coefficients of the model.

Standard Optimizers:

When discussing optimization, the first term that comes to mind is Gradient Descent.

In its standard (batch) form, we take the entire dataset and compute the gradient of the cost function on it.
i.e. in each epoch, the entire dataset is used to calculate the cost (for example, the MSE) and its gradient.

The disadvantage is that, because every update uses the entire dataset, each update is computationally expensive.
So it is very slow for a neural network trained on a large dataset.
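
A minimal sketch of batch gradient descent, assuming a linear model fitted with the MSE cost. The toy data, the function name batch_gradient_descent and the values of lr and epochs are illustrative assumptions, not something taken from a particular library:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=500):
    """Fit y ~ X @ w by minimising the MSE, using ALL samples for every update."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        residual = X @ w - y                 # prediction error on the entire dataset
        grad = (2.0 / n) * (X.T @ residual)  # gradient of the MSE with respect to w
        w -= lr * grad                       # exactly one weight update per epoch
    return w

# Hypothetical toy data: y = 3*x + 2, with a column of ones acting as the bias
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x + 2.0

w_batch = batch_gradient_descent(X, y)
print(w_batch)  # should be close to the true coefficients [3, 2]
```

Note that every pass through the loop touches all 200 rows of X, which is the cost the paragraph above refers to.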

In the previous technique, the disadvantage comes from using the entire dataset in every computation.
Stochastic Gradient Descent (SGD) overcomes this by using 1 random sample from the dataset for each update.

This makes each update cheap to compute.
But the problem is that, even though it usually ends up close to an optimum, the updates jump around widely (oscillate more), so it takes more iterations.
In some iterations, the update may be driven by the noise in a single sample.
So there will be a high variance in the weights.
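
A minimal sketch of SGD under the same assumptions as before (a linear model with a squared-error cost on hypothetical toy data). The function name sgd and the hyperparameter values are illustrative choices:

```python
import numpy as np

def sgd(X, y, lr=0.05, epochs=50, seed=0):
    """Stochastic gradient descent: one weight update per single sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        for i in rng.permutation(n):      # visit the samples in a random order
            residual = X[i] @ w - y[i]    # error on ONE sample only
            grad = 2.0 * residual * X[i]  # gradient of that sample's squared error
            w -= lr * grad
    return w

# Hypothetical toy data: y = 3*x + 2, with a column of ones acting as the bias
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x + 2.0

w_sgd = sgd(X, y)
print(w_sgd)  # should be close to [3, 2]
```

Each update here costs one row of X instead of all of them, but the direction of every step depends on whichever sample happened to be drawn.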

Mini-batch Gradient Descent is a hybrid of the first 2 variations.
Here we take a random set of samples (a mini-batch) and compute the gradient on it.

As we take a set of random samples instead of 1, the noise and variance are reduced, which helps achieve steadier convergence.
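
A minimal sketch of mini-batch gradient descent, again assuming a linear model with the MSE cost on hypothetical toy data. The function name minibatch_gd and the batch size of 32 are illustrative choices:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, epochs=200, batch_size=32, seed=0):
    """Mini-batch gradient descent: one weight update per small random batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of the current batch
            residual = X[idx] @ w - y[idx]
            grad = (2.0 / len(idx)) * (X[idx].T @ residual)  # MSE gradient on the batch
            w -= lr * grad
    return w

# Hypothetical toy data: y = 3*x + 2, with a column of ones acting as the bias
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x + 2.0

w_mini = minibatch_gd(X, y)
print(w_mini)  # should be close to [3, 2]
```

Setting batch_size to 1 gives SGD, and setting it to the dataset size gives batch gradient descent, which is why this variant is described as a hybrid.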

In addition to the above Gradient Descent algorithms, there are some adaptive optimization algorithms which work along with any of them.
These algorithms are gaining popularity as they make convergence quicker and smoother.

• Momentum
• Nesterov accelerated gradient (NAG) (aka Nesterov Momentum)
• Adagrad
• RMSprop

Momentum:

Momentum is a variation of standard Gradient Descent which considers the past gradients to smooth out the update. It keeps an exponentially weighted sum of the past gradients and uses that, instead of the current gradient alone, to update the weights.
So the oscillation is reduced.
It usually converges faster than the standard gradient descent algorithm.

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta) \qquad (1)$$

$$\theta = \theta - v_t \qquad (2)$$

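γ is the momentum coefficient (commonly set to around 0.9): the fraction of the previous update vector v_{t−1} that is carried over into the current update vector v_t. η is the learning rate.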
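
Equations (1) and (2) can be sketched in a few lines. The cost function below is a hypothetical quadratic bowl chosen only because its gradient is easy to write down, and the values of lr (η), gamma (γ) and steps are assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of the hypothetical cost J(theta) = 0.5 * (10*theta[0]**2 + theta[1]**2)."""
    return np.array([10.0 * theta[0], theta[1]])

def momentum_descent(theta0, lr=0.05, gamma=0.9, steps=300):
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)                # update vector, v_0 = 0
    for _ in range(steps):
        v = gamma * v + lr * grad_J(theta)  # equation (1)
        theta = theta - v                   # equation (2)
    return theta

theta_momentum = momentum_descent([1.0, 1.0])
print(theta_momentum)  # should be very close to the minimum at [0, 0]
```

Setting gamma to 0 removes the memory of past updates and reduces the loop to plain gradient descent.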

Nesterov accelerated gradient (NAG) (aka Nesterov Momentum):

Momentum is smart enough to look back at the past steps before taking a new one.

But what if we could also look at the step ahead? That’s exactly what NAG does.

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1}) \qquad (3)$$

$$\theta = \theta - v_t \qquad (4)$$

(θ − γ v_{t−1}) approximates where the parameters will be after the momentum step, so the gradient is evaluated at this look-ahead point instead of at the current θ.
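
Equations (3) and (4) differ from plain momentum only in where the gradient is evaluated. A minimal sketch, assuming the same kind of hypothetical quadratic cost and illustrative values for lr (η), gamma (γ) and steps:

```python
import numpy as np

def grad_J(theta):
    """Gradient of the hypothetical cost J(theta) = 0.5 * (10*theta[0]**2 + theta[1]**2)."""
    return np.array([10.0 * theta[0], theta[1]])

def nag_descent(theta0, lr=0.05, gamma=0.9, steps=300):
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v           # approximate next position
        v = gamma * v + lr * grad_J(lookahead)  # equation (3)
        theta = theta - v                       # equation (4)
    return theta

theta_nag = nag_descent([1.0, 1.0])
print(theta_nag)  # should be very close to the minimum at [0, 0]
```

Because the gradient is taken at the look-ahead point, the update can start correcting before the momentum term overshoots.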

In all the earlier algorithms, we did not change the learning rate.

The Adaptive Gradient algorithm (Adagrad) adapts the learning rate of each parameter as training progresses, based on the gradients seen so far.

In the previous optimization techniques, the value of η remains constant for all the iterations.

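But in Adagrad, we divide η by the square root of the accumulated squared gradients. We write G_t for the sum of the outer products of the gradients up to time-step t, and g_t for the gradient at time-step t:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\mathrm{diag}(G_t) + \epsilon}} \odot g_t$$

Dividing η by this sum directly, without taking the square root, gives much worse results.

“diag” -> only the diagonal of G_t is used; each diagonal entry is the sum of the squared past gradients of one parameter. ε is a small constant added so that the denominator will not become zero.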

As we divide the learning rate by a quantity that accumulates the squared gradients of all past iterations, the learning rate keeps decreasing as the number of iterations increases.

The main problem with this algorithm is that the learning rate eventually becomes so small that convergence becomes very slow.
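
A minimal sketch of the diagonal form of Adagrad described above. The quadratic cost, the function name adagrad and the values of lr (η), eps (ε) and steps are assumptions made for illustration:

```python
import numpy as np

def grad_J(theta):
    """Gradient of the hypothetical cost J(theta) = 0.5 * (10*theta[0]**2 + theta[1]**2)."""
    return np.array([10.0 * theta[0], theta[1]])

def adagrad(theta0, lr=0.5, eps=1e-8, steps=100):
    theta = np.array(theta0, dtype=float)
    G = np.zeros_like(theta)  # per-parameter sum of squared gradients, i.e. diag(G_t)
    for _ in range(steps):
        g = grad_J(theta)
        G += g ** 2                          # the sum only ever grows
        theta -= lr / np.sqrt(G + eps) * g   # each parameter gets its own step size
    effective_lr = lr / np.sqrt(G + eps)     # what is left of eta after scaling
    return theta, effective_lr

theta_adagrad, eff_lr = adagrad([1.0, 1.0])
print(theta_adagrad)  # should be very close to [0, 0]
print(eff_lr)         # both entries are smaller than the initial lr of 0.5
```

Since G never decreases, effective_lr can only shrink over time, which is the slowdown mentioned above.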

RMSprop (Root Mean Square Propagation):

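This optimizer tries to overcome Adagrad's shrinking learning rate by using an exponentially decaying average of the squared gradients instead of their ever-growing sum.

With this optimizer, the learning rate of each parameter is adjusted automatically, and it does not shrink toward zero merely because training has run for many iterations.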
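
A minimal sketch of RMSprop, assuming the commonly used form in which a decaying average of squared gradients replaces the running sum. The quadratic cost, the function name rmsprop and the values of lr, decay, eps and steps are illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of the hypothetical cost J(theta) = 0.5 * (10*theta[0]**2 + theta[1]**2)."""
    return np.array([10.0 * theta[0], theta[1]])

def rmsprop(theta0, lr=0.01, decay=0.9, eps=1e-8, steps=500):
    theta = np.array(theta0, dtype=float)
    avg_sq = np.zeros_like(theta)  # decaying average of squared gradients
    for _ in range(steps):
        g = grad_J(theta)
        avg_sq = decay * avg_sq + (1.0 - decay) * g ** 2  # old gradients fade out
        theta -= lr / (np.sqrt(avg_sq) + eps) * g         # per-parameter step size
    return theta

theta_rmsprop = rmsprop([1.0, 1.0])
print(theta_rmsprop)  # should end up near the minimum at [0, 0]
```

The only change from the Adagrad update is the line that maintains avg_sq: old squared gradients are gradually forgotten, so the denominator does not grow without bound.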