Stochastic gradient descent (SGD) is the engine of deep learning: compute gradients on a mini-batch, update parameters, repeat. Mini-batches make training feasible, but the gradient noise can slow convergence and sometimes trigger instability. When models become large—especially transformer-based systems—optimiser choice strongly affects training reliability, scaling behaviour, and final model quality. If you are learning large-model training in a gen AI course, it helps to treat the optimiser as part of the overall system design rather than a last-minute hyperparameter.
Most optimiser variants differ in two mechanisms: momentum (to smooth noisy gradients) and adaptivity (to adjust step sizes per parameter using recent gradient statistics). The most common choices in modern practice are SGD with momentum, RMSprop, and Adam-family methods such as AdamW.
1) SGD with momentum: the classical baseline
Plain SGD applies a single learning rate to all parameters. Momentum adds a running “velocity” that averages recent gradients. This reduces oscillation in steep directions and speeds progress along directions with consistent signal.
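As a concrete illustration, here is a minimal PyTorch-style sketch of one common momentum convention, where the velocity accumulates raw gradients and the step is the learning rate times the velocity (function name and values are illustrative, not a prescription):

    import torch

    def sgd_momentum_step(params, velocities, lr=0.1, mu=0.9):
        # One common convention: v <- mu * v + g, then p <- p - lr * v.
        # velocities is a list of zero-initialised tensors matching the parameter shapes.
        with torch.no_grad():
            for p, v in zip(params, velocities):
                if p.grad is None:
                    continue
                v.mul_(mu).add_(p.grad)   # smooth recent gradients into the velocity
                p.add_(v, alpha=-lr)      # step against the smoothed direction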
SGD with momentum is memory-light and often generalises strongly when paired with a good learning-rate schedule (for example, warmup followed by decay). The trade-off is tuning: a single global learning rate can be mismatched across layers and parameter groups, and large transformers often need careful schedules to avoid early divergence.
2) Adaptive learning rates: RMSprop, Adam, and AdamW
Adaptive optimisers rescale updates per parameter. The goal is to prevent parameters with consistently large gradients from taking overly large steps, while still allowing slow-moving parameters to update.
RMSprop maintains an exponential moving average of squared gradients and divides the current gradient by the square root of that average (plus a small epsilon for numerical stability). This normalises updates using a recent estimate of gradient scale, which can help on non-stationary objectives.
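A minimal sketch of that update, with illustrative defaults for the smoothing factor and epsilon:

    import torch

    def rmsprop_step(params, sq_avgs, lr=1e-3, alpha=0.99, eps=1e-8):
        # s <- alpha * s + (1 - alpha) * g^2, then p <- p - lr * g / (sqrt(s) + eps)
        with torch.no_grad():
            for p, s in zip(params, sq_avgs):
                if p.grad is None:
                    continue
                s.mul_(alpha).addcmul_(p.grad, p.grad, value=1 - alpha)  # EMA of squared gradients
                p.addcdiv_(p.grad, s.sqrt().add_(eps), value=-lr)        # normalise by recent gradient scale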
Adam combines a momentum-like average (the first moment) with an RMSprop-like average (the second moment), plus bias correction for early steps. It often reaches a useful loss region quickly, which is why it is common in large-model training and rapid experimentation.
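Sketched the same way, Adam keeps both averages and applies bias correction; step is the 1-based update count and the hyperparameters are the usual illustrative defaults:

    import torch

    def adam_step(params, m, v, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        with torch.no_grad():
            for p, m_t, v_t in zip(params, m, v):
                if p.grad is None:
                    continue
                m_t.mul_(b1).add_(p.grad, alpha=1 - b1)               # first moment (momentum-like)
                v_t.mul_(b2).addcmul_(p.grad, p.grad, value=1 - b2)   # second moment (RMSprop-like)
                m_hat = m_t / (1 - b1 ** step)                        # bias correction for early steps
                v_hat = v_t / (1 - b2 ** step)
                p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)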
AdamW is a practical refinement because it decouples weight decay from the adaptive update. In classic Adam, L2 regularisation can interact with adaptivity and change the effective regularisation strength across parameters. AdamW applies weight decay directly to the weights, which usually gives more predictable behaviour in transformer training.
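In PyTorch terms, the difference comes down to which optimiser class you pick; the weight_decay value below is illustrative:

    import torch

    model = torch.nn.Linear(1024, 1024)

    # Classic Adam: weight_decay is folded into the gradient (L2 style), so the decay
    # term is rescaled by the adaptive denominator along with everything else.
    adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

    # AdamW: decay is decoupled and applied directly to the weights at each step,
    # independent of the per-parameter adaptive scaling.
    adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)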
3) How optimiser choice changes large-model training dynamics
Early stability and warmup
Large transformers are sensitive during the first few thousand steps, especially in fp16/bf16 mixed precision, so learning-rate warmup is standard practice. Adaptive methods (notably AdamW) often reduce early instability because per-parameter scaling limits the impact of sporadic large gradients.
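A minimal warmup-then-cosine-decay schedule, sketched with PyTorch's LambdaLR; the step counts and base learning rate are placeholder values, not recommendations:

    import math
    import torch

    def warmup_cosine(step, warmup_steps=2000, total_steps=100_000):
        # Linear warmup to the base LR, then cosine decay towards zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    model = torch.nn.Linear(16, 16)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_cosine)
    # call sched.step() once per optimisation step, after opt.step()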
Batch size scaling and gradient noise
As batch size increases, gradient noise drops. SGD can benefit strongly if learning rate and momentum are scaled appropriately. Adaptive methods can look less sensitive at first, but they still need schedule discipline. With extremely large batches, AdamW may converge quickly yet require stronger regularisation to avoid brittle solutions that generalise poorly.
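One common heuristic for SGD-style training is to scale the learning rate roughly linearly with batch size, usually combined with warmup; treat it as a starting point to validate, not a rule:

    def scaled_lr(base_lr, base_batch_size, batch_size):
        # Linear scaling heuristic: larger batches -> proportionally larger LR.
        # A starting point only; very large batches still need warmup and validation.
        return base_lr * batch_size / base_batch_size

    print(scaled_lr(0.1, 256, 4096))  # 1.6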
Generalisation and fine-tuning behaviour
The slogan “SGD generalises better” is sometimes true, but not universal. Many pretraining pipelines use AdamW for stability and time-to-quality. Fine-tuning can differ: parameter-efficient approaches (adapters/LoRA) often pair well with AdamW, while very small datasets may benefit from more conservative updates. In a gen AI course, it is better to decide using validation metrics for the stage you are in than to rely on a single rule of thumb.
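One practical detail when only a small set of parameters is trainable (as with adapters or LoRA): build the optimiser over just those parameters, so no optimiser state is allocated for frozen weights. A minimal sketch with a stand-in model and illustrative hyperparameters:

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 512))
    for p in model[0].parameters():    # pretend the first layer is a frozen backbone
        p.requires_grad_(False)

    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)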
Systems cost: optimiser state
Adam-style methods store extra optimiser state (first and second moments), increasing memory use and checkpoint size compared with SGD. At scale, this state can constrain sharding strategy and checkpoint frequency. Memory-saving variants such as Adafactor shrink the state (for example by factorising the second-moment estimate), at the cost of some added complexity.
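A rough back-of-the-envelope for that extra state, assuming two fp32 moments per parameter (a common default; exact numbers depend on precision and sharding):

    def optimizer_state_gib(n_params, bytes_per_value=4, states_per_param=2):
        # Adam/AdamW: two moments per parameter; SGD with momentum: one; plain SGD: zero.
        return n_params * bytes_per_value * states_per_param / 1024**3

    print(f"{optimizer_state_gib(7e9):.0f} GiB")  # roughly 52 GiB of moments for a 7B-parameter model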
4) Practical selection checklist
- Start with AdamW for transformer pretraining and most large-model fine-tuning where stability and rapid iteration matter.
- Choose SGD with momentum when optimiser memory is tight, or when you have a proven schedule and want strong generalisation.
- Use RMSprop as a simpler adaptive baseline if AdamW is unusually sensitive in your setup.
- Track more than loss: monitor gradient norms, update-to-weight ratios, and validation curves (see the monitoring sketch after this list).
- Document hyperparameters and seeds; reproducibility is part of responsible practice in any gen AI course.
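A rough monitoring sketch for the gradient-norm and update-to-weight checks mentioned above, using lr * grad as a proxy for the update size (exact only for plain SGD; names are illustrative):

    import torch

    @torch.no_grad()
    def log_step_stats(model, lr):
        grad_sq, ratios = 0.0, []
        for p in model.parameters():
            if p.grad is None:
                continue
            grad_sq += p.grad.pow(2).sum().item()                           # global gradient norm (squared)
            ratios.append((lr * p.grad.norm() / (p.norm() + 1e-12)).item()) # per-tensor update/weight ratio
        if ratios:
            median = sorted(ratios)[len(ratios) // 2]
            print(f"grad_norm={grad_sq ** 0.5:.3e}  median_update_to_weight={median:.3e}")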
Conclusion
SGD, RMSprop, and Adam-family methods all aim to turn noisy mini-batch gradients into productive parameter updates, but they do so with different assumptions about noise and parameter scale. Momentum-based SGD is simple and memory-efficient, and can generalise well with strong schedules. AdamW and RMSprop typically offer smoother early training and practical robustness for large models, at the cost of additional optimiser state and occasional generalisation trade-offs. Knowing these differences helps you choose a stable, efficient setup for your next run, whether you are experimenting in a gen AI course or training at production scale.