Optimizer


Vanilla gradient descent / Batch gradient descent

Compute the loss J(w), then its gradient g with respect to the weights w (g = delta_w(J(w))). Move the weights in the direction opposite to the gradient. Each update uses the whole dataset, so it is very slow.

w_t = w_t-1 - step_size * g
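As a minimal sketch (the least-squares problem, data, and weights below are made up for illustration, not from the post), batch gradient descent computes the gradient over the full dataset for every update:

```python
import numpy as np

# Toy least-squares problem: J(w) = mean((X @ w - y)^2).
# X, y, and true_w are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w = np.zeros(2)
step_size = 0.1
for _ in range(500):
    # Gradient over the WHOLE dataset for every single update.
    g = 2.0 * X.T @ (X @ w - y) / len(X)
    w = w - step_size * g          # w_t = w_t-1 - step_size * g
```

After enough iterations w approaches true_w; the cost is one full pass over the data per step.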


Stochastic gradient descent (SGD)

Stochastic means the sample used for each update is chosen at random from the dataset: g is calculated from a single sample at a time.

The updates fluctuate, but convergence is typically faster per unit of computation.
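On the same made-up least-squares setup, SGD estimates the gradient from one randomly chosen sample per update, so each step is cheap but noisy:

```python
import numpy as np

# Same illustrative least-squares problem as above (assumed setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w = np.zeros(2)
step_size = 0.01
for _ in range(5000):
    i = rng.integers(len(X))                  # pick ONE sample at random
    g = 2.0 * X[i] * (X[i] @ w - y[i])        # gradient from that single sample
    w = w - step_size * g
```

Each update costs O(1) samples instead of a full pass, at the price of noisy steps.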


Mini batch gradient descent

g is calculated from a small mini-batch of samples. This reduces the variance of the parameter updates and gives more stable convergence than single-sample SGD.
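A sketch of the mini-batch variant on the same assumed toy problem, averaging the gradient over a small batch each step:

```python
import numpy as np

# Same illustrative least-squares problem (assumed setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w = np.zeros(2)
step_size = 0.05
batch = 10
for _ in range(2000):
    idx = rng.integers(len(X), size=batch)     # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    g = 2.0 * Xb.T @ (Xb @ w - yb) / batch     # averaged mini-batch gradient
    w = w - step_size * g
```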



Some variants take momentum into account (an accumulation of gradients from the last few iterations) to avoid getting stuck in local minima. Another limitation of the basic methods is that the same learning rate applies to all parameter updates, which is not optimal for sparse data where features occur with very different frequencies.


Nesterov accelerated gradient (NAG)

Introduces a momentum term v. w - gamma*v_t-1 is an estimate of the next position: NAG first makes a big jump in the direction of the previous accumulated gradient, measures the gradient there, and then makes a correction.

v_t = gamma*v_t-1 + step_size*delta_w(J(w - gamma*v_t-1))

w_t = w_t-1 - v_t

gamma (the momentum coefficient) is usually around 0.9
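The two update equations above can be sketched as follows (the quadratic bowl J(w) = sum((w - target)^2) and its target are made-up assumptions for illustration):

```python
import numpy as np

target = np.array([2.0, -1.0])                 # assumed optimum of J(w) = sum((w - target)^2)
grad = lambda u: 2.0 * (u - target)            # gradient of the toy loss

w = np.zeros(2)
v = np.zeros(2)
gamma, step_size = 0.9, 0.01
for _ in range(1000):
    lookahead = w - gamma * v                  # estimate of the next position
    # v_t = gamma*v_t-1 + step_size*delta_w(J(w - gamma*v_t-1))
    v = gamma * v + step_size * grad(lookahead)
    w = w - v                                  # w_t = w_t-1 - v_t
```

Measuring the gradient at the look-ahead point is what distinguishes NAG from plain momentum.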


Adaptive Gradient Algorithm (AdaGrad)

Learning rate for each parameter. Keep a historical sum of squares in each dimension.

SumGradSquare += g^2

w_t = w_t-1 - step_size/sqrt(SumGradSquare + eps)*g, where eps is a small constant to avoid division by zero

The benefit is that it largely eliminates the need to manually tune the learning rate; a default of 0.01 is common.

Note that the square root of the sum is in the denominator, and the sum keeps increasing during training. This slows down the movement when approaching a minimum, but the effective learning rate can also become vanishingly small, so that no new knowledge is learnt.
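A per-dimension sketch of AdaGrad on an assumed toy quadratic (target is made up):

```python
import numpy as np

target = np.array([2.0, -1.0])                 # assumed optimum of J(w) = sum((w - target)^2)
w = np.zeros(2)
step_size, eps = 0.5, 1e-8
sum_grad_square = np.zeros(2)                  # historical sum of squares, per dimension
for _ in range(2000):
    g = 2.0 * (w - target)
    sum_grad_square += g ** 2                  # SumGradSquare += g^2
    # Per-parameter effective learning rate shrinks as the sum grows.
    w = w - step_size / np.sqrt(sum_grad_square + eps) * g
```

Parameters with frequently large gradients get their learning rate reduced quickly, while rarely updated parameters keep a larger one.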


Rprop

Uses only the sign of the gradient, and adapts the step size individually for each weight: w_t = w_t-1 - sign(g_t) * step_size

The gradient is calculated over the full dataset. Because gradients vary in magnitude across samples, some large and some small, Rprop uses only the sign of the full-data gradient.

For each parameter, the step size adapts individually over time. Specifically, at each iteration compare the signs of the last two gradients: if they agree, increase the step size by a small factor; if they disagree, decrease it by a small factor.
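A sketch of this sign-based rule on an assumed toy quadratic (the target, the 1.2/0.5 growth and shrink factors, and the step bounds are common choices, assumed here rather than taken from the post):

```python
import numpy as np

target = np.array([2.0, -1.0])                 # assumed optimum of J(w) = sum((w - target)^2)
w = np.zeros(2)
step = np.full(2, 0.1)                         # individual step size per weight
prev_g = np.zeros(2)
for _ in range(300):
    g = 2.0 * (w - target)                     # "full-batch" gradient of the toy loss
    agree = np.sign(g) * np.sign(prev_g)
    # Same sign twice: grow the step; sign flipped: shrink it (with bounds).
    step = np.where(agree > 0, np.minimum(step * 1.2, 1.0),
           np.where(agree < 0, np.maximum(step * 0.5, 1e-6), step))
    w = w - np.sign(g) * step                  # only the SIGN of the gradient is used
    prev_g = g
```

A sign flip means the last step overshot the minimum along that dimension, so the step size is cut back.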


Root Mean Square Propagation (RMSProp)

Central idea: keeps a moving average of g^2 for each weight. 

It is a moving average (a large portion carried over from the previous average, plus a small portion of the current g^2), and its square root appears in the update.

E[g^2]_t = beta*E[g^2]_t-1+(1-beta)*g^2

w_t = w_t-1 - step_size/sqrt(E[g^2]_t + eps)*g, where eps is a small constant to avoid division by zero

beta default 0.9

RMSProp is an extension of Rprop that keeps its robustness from using the sign of the gradient. The sign of g can be seen as g divided by its magnitude; RMSProp generalises that magnitude to the root of the moving average of g^2 over previous steps.
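The two RMSProp equations above can be sketched like this (toy quadratic and target are assumed for illustration):

```python
import numpy as np

target = np.array([2.0, -1.0])                 # assumed optimum of J(w) = sum((w - target)^2)
w = np.zeros(2)
step_size, beta, eps = 0.005, 0.9, 1e-8
avg_sq = np.zeros(2)                           # E[g^2], a moving average per weight
for _ in range(3000):
    g = 2.0 * (w - target)
    # E[g^2]_t = beta*E[g^2]_t-1 + (1-beta)*g^2
    avg_sq = beta * avg_sq + (1 - beta) * g ** 2
    w = w - step_size / np.sqrt(avg_sq + eps) * g
```

Dividing by the running RMS roughly normalises each update to the step_size scale, like Rprop's sign, but smoothly.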


Adadelta

An extension of AdaGrad similar in spirit to RMSProp: it also restricts the window of accumulated past gradients to a fixed effective size via an exponentially decaying average.

E[g^2]_t = beta*E[g^2]_t-1+(1-beta)*g^2

However, there is no need to set step_size. First define another exponentially decaying average, this time over the squared parameter updates

E[delta_w^2]_t = beta*E[delta_w^2]_t-1+(1-beta)*delta_w^2

RMS[delta_w]_t = sqrt(E[delta_w^2]_t + eps), where eps is a small number to avoid division by zero; RMS[g]_t = sqrt(E[g^2]_t + eps) is defined the same way

w_t = w_t-1 - RMS[delta_w]_t-1 / RMS[g]_t * g
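Putting the three Adadelta equations together on an assumed toy quadratic (target and hyperparameters are illustrative assumptions):

```python
import numpy as np

target = np.array([2.0, -1.0])                 # assumed optimum of J(w) = sum((w - target)^2)
w = np.zeros(2)
beta, eps = 0.9, 1e-6
avg_g2 = np.zeros(2)                           # E[g^2]
avg_dx2 = np.zeros(2)                          # E[delta_w^2]
for _ in range(3000):
    g = 2.0 * (w - target)
    avg_g2 = beta * avg_g2 + (1 - beta) * g ** 2
    # Update is the ratio of the two RMS terms times g; no step_size needed.
    delta = -np.sqrt(avg_dx2 + eps) / np.sqrt(avg_g2 + eps) * g
    w = w + delta
    avg_dx2 = beta * avg_dx2 + (1 - beta) * delta ** 2
```

Because RMS[delta_w] has the same units as w itself, the ratio gives a correctly scaled update without a manually chosen learning rate.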


Adaptive Moment Estimation (Adam)

In addition to first moment (mean) of gradients, also makes use of the average of the second moments of the gradients (the uncentered variance)

Hyperparameters: learning_rate (alpha), and beta1/beta2, the exponential decay rates for the 1st/2nd moment estimates.

v and s are exponentially decaying averages recording the first and second moments of the past gradients, i.e. estimates of E[g] and E[g^2]

v_t = beta1*v_t-1+(1-beta1)*g

s_t = beta2*s_t-1+(1-beta2)*g^2

v and s are initialised at zero, so they need to be corrected for their bias towards zero

v_t_norm = v_t/(1-beta1^t)

s_t_norm = s_t/(1-beta2^t)

the weight update is similar to AdaGrad/RMSProp, with the bias-corrected first moment in place of the raw gradient

w_t = w_t-1 - step_size/sqrt(s_t_norm+eps)*v_t_norm

eps is a small number to avoid division by zero
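The full Adam update, following the equations above, on the same kind of assumed toy quadratic (target and hyperparameters are illustrative; beta1 = 0.9, beta2 = 0.999 are the commonly cited defaults):

```python
import numpy as np

target = np.array([2.0, -1.0])                 # assumed optimum of J(w) = sum((w - target)^2)
w = np.zeros(2)
step_size, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
v = np.zeros(2)                                # first moment estimate
s = np.zeros(2)                                # second moment estimate
for t in range(1, 3001):
    g = 2.0 * (w - target)
    v = beta1 * v + (1 - beta1) * g            # v_t = beta1*v_t-1 + (1-beta1)*g
    s = beta2 * s + (1 - beta2) * g ** 2       # s_t = beta2*s_t-1 + (1-beta2)*g^2
    v_hat = v / (1 - beta1 ** t)               # bias correction towards zero
    s_hat = s / (1 - beta2 ** t)
    w = w - step_size / np.sqrt(s_hat + eps) * v_hat
```

The bias correction matters most in early iterations, when v and s are still dominated by their zero initialisation.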




