---
title: "Optimizers: Adam"
---

The Adam optimizer combines the benefits of two other extensions of stochastic gradient descent, AdaGrad and RMSProp. Specifically, Adam uses adaptive learning rate methods to compute an individual learning rate for each parameter.

## Key Components of the Adam Optimizer

1. **Beta Parameters (β1, β2)**: Control the exponential decay rates of the moving averages of the gradient (m) and the squared gradient (v).
2. **Learning Rate (α)**: The step size used to update the weights.
3. **Gradient (g)**: The gradient of the loss function with respect to the weights.
4. **Bias-corrected First (m̂) and Second (v̂) Moment Estimates**: Adjustments to m and v that counteract their initialization at zero.
5. **Weight Update Rule**: Uses the bias-corrected estimates to update the weights, as shown in the sketch after the diagram below.

Let's lay out these components in a dot graph:

```{dot}
digraph AdamOptimizer {
    node [shape=record];

    // Define nodes
    params [label="Parameters | {β1|β2|α}" shape=Mrecord];
    grad [label="Gradient (g)"];
    m [label="First Moment Estimate (m)"];
    v [label="Second Moment Estimate (v)"];
    m_hat [label="Bias-corrected First Moment (m̂)"];
    v_hat [label="Bias-corrected Second Moment (v̂)"];
    update [label="Weight Update"];
    weights [label="Weights"];

    // Connect nodes
    params -> m [label="β1 controls decay"];
    params -> v [label="β2 controls decay"];
    params -> update [label="α scales step"];
    grad -> m [label="updates"];
    grad -> v [label="updates"];
    m -> m_hat [label="bias correction"];
    v -> v_hat [label="bias correction"];
    m_hat -> update [label="uses"];
    v_hat -> update [label="uses"];
    update -> weights [label="applies"];

    // Styles
    params [style=filled, fillcolor=lightblue];
    update [style=filled, fillcolor=yellow];
    weights [style=filled, fillcolor=green];
}
```
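
To see how these components fit together in practice, here is a minimal sketch of a single Adam step in plain NumPy. The function name `adam_step`, the toy quadratic example, and the default hyperparameters (α = 0.001, β1 = 0.9, β2 = 0.999, plus a small ε for numerical stability) are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to weights w, given gradient g at timestep t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g                   # first moment estimate (m)
    v = beta2 * v + (1 - beta2) * g ** 2              # second moment estimate (v)
    m_hat = m / (1 - beta1 ** t)                      # bias-corrected first moment (m̂)
    v_hat = v / (1 - beta2 ** t)                      # bias-corrected second moment (v̂)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)    # weight update rule
    return w, m, v

# Toy usage (illustrative): minimize f(w) = w·w, whose gradient is 2w.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 1001):
    g = 2 * w
    w, m, v = adam_step(w, g, m, v, t)
print(w)  # converges toward [0, 0]
```

Note the bias correction factors 1 − β1^t and 1 − β2^t: because m and v start at zero, the raw moving averages are biased toward zero early in training, and dividing by these factors compensates for that.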