## Optimizer

Gradient Descent (GD)

Compute the loss J(w) and its gradient g with respect to the weights w, g = delta_w(J(w)). Update the weights in the direction opposite to the gradient. Batch gradient descent computes g over the whole dataset for each update, which is very slow.

w_t = w_t-1 - step_size * g
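As an illustration, a minimal NumPy sketch of this update on a toy quadratic loss J(w) = sum(w^2) (the loss and function names here are my own, not from these notes):

```python
import numpy as np

def batch_gradient_descent(grad, w0, step_size=0.1, n_steps=100):
    """Full-batch update: w_t = w_t-1 - step_size * g."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - step_size * grad(w)  # move against the gradient direction
    return w

# Toy loss J(w) = sum(w**2), whose gradient is 2*w (illustrative only).
w_min = batch_gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
```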

Stochastic Gradient Descent (SGD)

Stochastic means the sample is randomly chosen from the dataset: g is calculated from a single sample at a time.

The updates might fluctuate, but convergence is faster.

Mini-batch Gradient Descent

Mini-batch gradient descent gives more stable convergence and reduces the variance of the parameter updates: g is calculated from every mini-batch.
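A sketch of the mini-batch loop on a toy noiseless linear-regression problem (all data, sizes, and names here are assumptions for illustration; batch_size=1 recovers plain SGD):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ w_true exactly (no noise), purely illustrative.
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -2.0])
y = X @ w_true

def minibatch_sgd(X, y, batch_size=20, step_size=0.05, n_epochs=50):
    """Mini-batch SGD on mean squared loss; batch_size=1 gives plain SGD."""
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(n_epochs):
        idx = rng.permutation(n)  # shuffle: the "stochastic" sampling
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            g = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # mini-batch gradient
            w -= step_size * g
    return w

w_hat = minibatch_sgd(X, y)
```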

Some variants take momentum into account (accumulating gradients from the last few iterations) to avoid getting stuck in local minima. Plain (S)GD also applies the same learning rate to all parameter updates, which is not optimal for sparse data where features occur with different frequencies.

Nesterov Accelerated Gradient (NAG)

NAG introduces a momentum term v. w_t-1 - gamma*v_t-1 is an estimate of the next position, so NAG first makes a big jump in the direction of the previous accumulated gradient, measures the gradient there, and then makes a correction.

v_t = gamma*v_t-1 + step_size*delta_w(J(w_t-1 - gamma*v_t-1))

w_t = w_t-1 - v_t

gamma is typically around 0.9
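The two NAG equations above can be sketched as follows (toy quadratic loss J(w) = sum(w^2) assumed for illustration):

```python
import numpy as np

def nag(grad, w0, step_size=0.1, gamma=0.9, n_steps=200):
    """Nesterov accelerated gradient:
    v_t = gamma*v_t-1 + step_size*grad(w_t-1 - gamma*v_t-1); w_t = w_t-1 - v_t."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        lookahead = w - gamma * v  # estimate of the next position
        v = gamma * v + step_size * grad(lookahead)
        w = w - v
    return w

w_min = nag(lambda w: 2 * w, w0=[3.0, -4.0])
```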

AdaGrad

AdaGrad adapts the learning rate for each parameter by keeping a historical sum of squared gradients, G, in each dimension:

G_t = G_t-1 + g^2

w_t = w_t-1 - step_size/sqrt(G_t + eps) * g

The benefit is that it eliminates the need to manually tune the learning rate; the default step_size is usually 0.01.

Note that the square root of the accumulated sum is in the denominator and only increases during training. This slows down movement when approaching a minimum, but the effective step might become infinitely small, so that no new knowledge is learnt.
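A minimal AdaGrad sketch on the same assumed toy quadratic (the step size here is larger than the usual 0.01 default so the toy problem converges in few steps):

```python
import numpy as np

def adagrad(grad, w0, step_size=0.5, eps=1e-8, n_steps=500):
    """AdaGrad: per-parameter learning rate scaled by the historical
    sum of squared gradients."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)  # historical sum of g^2 per dimension
    for _ in range(n_steps):
        g = grad(w)
        G += g ** 2  # only grows, so the effective step only shrinks
        w -= step_size / np.sqrt(G + eps) * g
    return w

# Toy loss J(w) = sum(w**2), gradient 2*w (illustrative only).
w_min = adagrad(lambda w: 2 * w, w0=[3.0, -4.0])
```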

Rprop

Rprop (resilient backpropagation) uses only the sign of the gradient and adapts the step size individually for each weight: w_t = w_t-1 - sign(g_t) * step_size_t

Rprop uses the full set of samples to calculate the gradient. Because gradients vary across samples, some large and some small, it relies only on the sign of the gradient over the full data.

For each parameter, the step size adapts individually over time. Specifically, at each iteration compare the signs of the last two gradients: if they agree, increase the step size by a small factor; if not, decrease it by a small factor.
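A sketch of this sign-based rule (the growth/shrink factors 1.2 and 0.5 and the step bounds are common choices assumed here, not taken from these notes; toy quadratic loss again):

```python
import numpy as np

def rprop(grad, w0, step0=0.1, eta_plus=1.2, eta_minus=0.5,
          step_min=1e-6, step_max=1.0, n_steps=100):
    """Rprop: move each weight by sign(g) times a per-weight step size that
    grows when consecutive gradient signs agree and shrinks when they flip."""
    w = np.asarray(w0, dtype=float)
    step = np.full_like(w, step0)  # individual step size per weight
    g_prev = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)  # full-batch gradient; only its sign is used below
        step = np.where(g * g_prev > 0, step * eta_plus, step)   # signs agree
        step = np.where(g * g_prev < 0, step * eta_minus, step)  # sign flipped
        step = np.clip(step, step_min, step_max)
        w -= np.sign(g) * step
        g_prev = g
    return w

w_min = rprop(lambda w: 2 * w, w0=[3.0, -4.0])
```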

Root Mean Square Propagation (RMSProp)

Central idea: keeps a moving average of g^2 for each weight.

The moving average keeps a large portion of the previous E[g^2] and adds in a small portion of the current g^2; the update then divides by the square root of this mean.

E[g^2]_t = beta*E[g^2]_t-1+(1-beta)*g^2

w_t = w_t-1 - step_size/sqrt(E[g^2]_t + eps)*g

beta default 0.9

RMSProp is an extension of Rprop that keeps its robustness from using the sign of the gradient: sign(g) can be viewed as g divided by the magnitude of g, and RMSProp replaces that magnitude with the root mean square of g^2 over previous steps.
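The two RMSProp equations can be sketched as follows (toy quadratic loss assumed; on this noiseless problem RMSProp hovers near the minimum rather than converging exactly):

```python
import numpy as np

def rmsprop(grad, w0, step_size=0.01, beta=0.9, eps=1e-8, n_steps=2000):
    """RMSProp: divide the gradient by the root of a moving average of g^2."""
    w = np.asarray(w0, dtype=float)
    Eg2 = np.zeros_like(w)  # moving average of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        Eg2 = beta * Eg2 + (1 - beta) * g ** 2
        w -= step_size / np.sqrt(Eg2 + eps) * g
    return w

w_min = rmsprop(lambda w: 2 * w, w0=[3.0, -4.0])
```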

AdaDelta

AdaDelta is an extension of AdaGrad similar to RMSProp: it also restricts the window of accumulated past gradients to some fixed size by using an exponentially decaying average.

E[g^2]_t = beta*E[g^2]_t-1+(1-beta)*g^2

However, there is no need to set step_size. First define another exponentially decaying average, this time over squared parameter updates:

E[delta_w^2]_t = beta*E[delta_w^2]_t-1+(1-beta)*delta_w^2

RMS[delta_w]_t = sqrt(E[delta_w^2]_t + eps), where eps is a small number to avoid division by zero; similarly, RMS[g]_t = sqrt(E[g^2]_t + eps)

w_t = w_t-1 - RMS[delta_w]_t-1 / RMS[g]_t * g
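A sketch of AdaDelta following the equations above (toy quadratic loss assumed; note AdaDelta starts with very small steps that grow as RMS[delta_w] bootstraps up, so it needs many iterations here):

```python
import numpy as np

def adadelta(grad, w0, beta=0.9, eps=1e-6, n_steps=5000):
    """AdaDelta: like RMSProp, but step_size is replaced by an exponentially
    decaying RMS of the previous parameter updates."""
    w = np.asarray(w0, dtype=float)
    Eg2 = np.zeros_like(w)   # E[g^2]
    Edw2 = np.zeros_like(w)  # E[delta_w^2]
    for _ in range(n_steps):
        g = grad(w)
        Eg2 = beta * Eg2 + (1 - beta) * g ** 2
        # delta_w = RMS[delta_w]_t-1 / RMS[g]_t * g
        dw = np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g
        w -= dw
        Edw2 = beta * Edw2 + (1 - beta) * dw ** 2
    return w

w_min = adadelta(lambda w: 2 * w, w0=[3.0, -4.0])
```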

Adam (Adaptive Moment Estimation)

In addition to the first moment (mean) of the gradients, Adam also makes use of the average of the second moments of the gradients (the uncentered variance).

Hyperparameters: learning_rate (alpha), and beta1/beta2, the exponential decay rates for the 1st/2nd moment estimates.

v and s are the first and second moment estimates, exponentially decaying averages of the past gradients E[g] and squared gradients E[g^2]:

v_t = beta1*v_t-1+(1-beta1)*g

s_t = beta2*s_t-1+(1-beta2)*g^2

v and s need to be corrected for their bias towards zero (both are initialised as zeros):

v_t_norm = v_t/(1-beta1^t)

s_t_norm = s_t/(1-beta2^t)

w_t = w_t-1 - step_size/sqrt(s_t_norm+eps)*v_t_norm

eps is a small number to avoid division by zero
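Putting the Adam equations together (toy quadratic loss assumed for illustration; note the loop counter t starts at 1 so the bias-correction denominators are non-zero):

```python
import numpy as np

def adam(grad, w0, step_size=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
         n_steps=500):
    """Adam: bias-corrected first and second moment estimates of the gradient."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)  # first moment  (E[g])
    s = np.zeros_like(w)  # second moment (E[g^2])
    for t in range(1, n_steps + 1):
        g = grad(w)
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g ** 2
        v_norm = v / (1 - beta1 ** t)  # correct bias from zero initialisation
        s_norm = s / (1 - beta2 ** t)
        w -= step_size / np.sqrt(s_norm + eps) * v_norm
    return w

w_min = adam(lambda w: 2 * w, w0=[3.0, -4.0])
```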
