# Adam Algorithm
Adaptive Moment Estimation (Adam) is a first-order gradient-based optimiser that maintains adaptive learning rates for each parameter using estimates of first and second moments of the gradients.
## Algorithm Overview
Adam combines the benefits of AdaGrad (adapting to sparse gradients) and RMSprop (adapting to non-stationary objectives) by tracking exponential moving averages of both the gradient and squared gradient.
## Key Properties
| Property | Value |
|---|---|
| Type | Local, gradient-based |
| Parallelisable | No |
| Function evaluations | 1 per iteration |
| Best for | Smooth, differentiable objectives |
## Mathematical Foundation
At each iteration \(t\), Adam updates parameters \(\theta\) using:
Gradient computation: $$ g_t = \nabla_\theta f(\theta_{t-1}) $$
Biased moment estimates: $$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$ $$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$
Bias correction: $$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$ $$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
Parameter update: $$ \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$
Where \(\alpha\) is the learning rate (step size).
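As an illustration of these equations, here is a minimal NumPy sketch of a single Adam step applied to a simple quadratic objective. This is a standalone illustration, not diffid's internal implementation; the defaults mirror the parameter table below.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = beta1 * m + (1 - beta1) * g           # first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment (EMA of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = theta_0^2 + theta_1^2, whose gradient is 2 * theta.
theta = np.array([1.0, 2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # approaches [0, 0]
```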
## Parameters
| Parameter | Default | Description |
|---|---|---|
| `max_iter` | 1000 | Maximum iterations |
| `step_size` | 0.01 | Learning rate \(\alpha\) |
| `beta1` | 0.9 | First moment decay rate \(\beta_1\) |
| `beta2` | 0.999 | Second moment decay rate \(\beta_2\) |
| `eps` | 1e-8 | Numerical stability constant \(\epsilon\) |
| `threshold` | 1e-6 | Objective convergence tolerance |
| `gradient_threshold` | None | Gradient norm convergence tolerance |
| `patience` | None | Timeout in seconds |
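For completeness, a sketch of how these parameters might be set through the builder shown in the Example section. `with_gradient_threshold` and `with_patience` are assumptions named by analogy with the documented setters and should be checked against the API reference.

```python
import diffid

optimiser = (
    diffid.Adam()
    .with_max_iter(1000)       # max_iter
    .with_step_size(0.01)      # learning rate alpha
    .with_betas(0.9, 0.999)    # beta1, beta2
    .with_threshold(1e-6)      # objective convergence tolerance
    # Assumed setters (check the API reference):
    # .with_gradient_threshold(1e-6)   # gradient norm tolerance
    # .with_patience(60.0)             # timeout in seconds
)
```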
## Convergence Criteria
The algorithm terminates when any of the following conditions is met (see the sketch after the list):
- Iteration limit: `iteration >= max_iter`
- Gradient norm: \(\|g_t\| <\) `gradient_threshold`
- Objective change: change in objective falls below `threshold`
- Patience: elapsed time exceeds `patience` seconds
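A sketch of this termination logic in plain Python (names are illustrative, not diffid's internals):

```python
import time

def should_stop(iteration, grad_norm, obj_change, start_time,
                max_iter=1000, gradient_threshold=None,
                threshold=1e-6, patience=None):
    """Return True when any of the convergence criteria above is satisfied."""
    if iteration >= max_iter:                    # iteration limit
        return True
    if gradient_threshold is not None and grad_norm < gradient_threshold:
        return True                              # gradient norm
    if obj_change < threshold:                   # objective change
        return True
    if patience is not None and time.monotonic() - start_time > patience:
        return True                              # patience (timeout)
    return False
```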
## Tuning Guidance
The learning rate (`step_size`) is the most critical parameter:
- Start with a conservative step size (e.g., 1e-3)
- Sweep across orders of magnitude (1e-2 to 1e-4), as sketched below
- Too large: oscillation, divergence, or overshooting
- Too small: slow convergence
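A simple way to run such a sweep with the builder from the Example section; `problem` and the initial guess are assumed to be defined as in that example:

```python
import diffid

# Sweep the learning rate across orders of magnitude and keep every result
# so the best-converging step size can be picked afterwards.
results = {}
for step_size in (1e-2, 1e-3, 1e-4):
    optimiser = diffid.Adam().with_max_iter(5000).with_step_size(step_size)
    results[step_size] = optimiser.run(problem, initial_guess=[1.0, 2.0])
```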
Beta parameters rarely need adjustment (a rule of thumb follows this list):
- \(\beta_1 = 0.9\): controls momentum (gradient smoothing)
- \(\beta_2 = 0.999\): controls adaptive scaling (variance smoothing)
- Lower \(\beta_1\) for less momentum, more responsiveness
- Lower \(\beta_2\) for faster adaptation to gradient scale changes
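A handy rule of thumb when reasoning about these decay rates: an exponential moving average with decay \(\beta\) averages over roughly \(1 / (1 - \beta)\) recent steps.

```python
def ema_horizon(beta):
    """Approximate number of recent gradients that dominate an EMA with decay beta."""
    return 1.0 / (1.0 - beta)

print(ema_horizon(0.9))    # ~10 steps: default momentum smoothing (beta1)
print(ema_horizon(0.5))    # ~2 steps: much more responsive momentum
print(ema_horizon(0.999))  # ~1000 steps: default variance smoothing (beta2)
print(ema_horizon(0.99))   # ~100 steps: adapts faster to gradient scale changes
```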
Epsilon almost never needs tuning:
- Prevents division by zero when gradients are near zero
- Default 1e-8 works for most cases
## When to Use
Strengths:
- Fast convergence on smooth objectives
- Per-parameter adaptive learning rates
- Handles sparse gradients well
- Works with noisy gradients (e.g., mini-batches)
- Low memory overhead
Limitations:
- Requires gradient computation (automatic differentiation or numerical)
- Converges to local minima (no global search)
- Sensitive to learning rate choice
- May oscillate near optima
## Example
```python
import diffid

# `problem` is the objective to minimise, defined elsewhere.
optimiser = (
    diffid.Adam()
    .with_max_iter(5000)
    .with_step_size(0.001)
    .with_betas(0.9, 0.999)
    .with_threshold(1e-8)
)

result = optimiser.run(problem, initial_guess=[1.0, 2.0])
```
## Implementation Notes
- Supports automatic numerical gradient computation via central differences (sketched below)
- Bias correction is essential in early iterations, when the zero-initialised moment estimates are still heavily biased towards zero
- Tracks both current position and best-found position
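The central-difference scheme mentioned in the first note approximates each partial derivative as \((f(\theta + h e_i) - f(\theta - h e_i)) / (2h)\). A standalone NumPy sketch (not diffid's internal code):

```python
import numpy as np

def central_difference_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x via central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Gradient of f(x) = x0^2 + 3*x1 at [1, 2] is approximately [2, 3].
print(central_difference_gradient(lambda x: x[0] ** 2 + 3 * x[1], [1.0, 2.0]))
```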
## References
- Kingma, D.P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR 2015. arXiv:1412.6980.
- Reddi, S.J. et al. (2018). "On the Convergence of Adam and Beyond". ICLR 2018.