Decoupled Weight Decay
Tags: Best Practice, Increased Accuracy, Regularization
TL;DR
L2 regularization is typically considered equivalent to weight decay, but this equivalence only holds for certain optimizer implementations. Common optimizer implementations scale the weight decay by the learning rate, which couples the two hyperparameters and complicates model tuning and hyperparameter sweeps. Implementing weight decay explicitly, separately from L2 regularization, lets the strength of regularization be tuned independently of the learning rate.
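The coupling described above can be illustrated with a minimal sketch of plain SGD on a single scalar weight. The function names and the numbers are hypothetical and chosen only for illustration; this is not the code of any particular optimizer. For plain SGD, folding an L2 penalty into the gradient matches weight decay scaled by the learning rate, while the decoupled form shrinks the weight by the same fraction regardless of the learning rate.

```python
def sgd_l2_step(w, grad, lr, l2):
    """Plain SGD with the gradient of an L2 penalty (l2 / 2) * w**2 added to the loss gradient."""
    return w - lr * (grad + l2 * w)


def sgd_coupled_decay_step(w, grad, lr, weight_decay):
    """Weight decay scaled by the learning rate (the coupled form)."""
    return w * (1.0 - lr * weight_decay) - lr * grad


def sgd_decoupled_decay_step(w, grad, lr, weight_decay):
    """Weight decay applied on its own, independent of the learning rate."""
    return w * (1.0 - weight_decay) - lr * grad


if __name__ == "__main__":
    w, grad, decay = 2.0, 0.5, 0.01  # illustrative values only
    for lr in (0.1, 0.01):
        print(
            lr,
            sgd_l2_step(w, grad, lr, decay),
            sgd_coupled_decay_step(w, grad, lr, decay),
            sgd_decoupled_decay_step(w, grad, lr, decay),
        )
```

In the first two functions, changing the learning rate also changes how strongly the weight is pulled toward zero, so a learning rate sweep silently becomes a regularization sweep as well. In the third, the two settings can be varied independently.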
Attribution
Decoupled Weight Decay Regularization, by Ilya Loshchilov and Frank Hutter. Published as a conference paper at ICLR 2019.
Code and Hyperparameters
Unlike other methods, decoupled weight decay is not implemented as an algorithm. Instead, it is provided as two optimizers that can be used in place of existing common optimizers.
DecoupledSGDW optimizer (same hyperparameters as torch.optim.SGD):
lr - Learning rate.
momentum - Momentum factor.
weight_decay - Weight decay.
dampening - Dampening for momentum.
nesterov - Nesterov momentum.
DecoupledAdamW optimizer (same hyperparameters as torch.optim.Adam):
lr - Learning rate.
betas - Coefficients used for computing running averages of gradient and its square.
eps - Term for numerical stability.
weight_decay - Weight decay.
amsgrad - Use AMSGrad variant.
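To show where these hyperparameters enter the update, here is a minimal sketch of one Adam step on a single scalar weight. It is an illustration under stated assumptions, not the DecoupledAdamW implementation: the function name, the `state` dictionary, and the `decoupled` flag are hypothetical, the AMSGrad variant is omitted, and the decoupled branch applies the decay without any learning rate scaling.

```python
import math


def adam_step(w, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One Adam update for a single scalar weight.

    decoupled=True  -> decay the weight directly, outside the adaptive update.
    decoupled=False -> fold weight_decay * w into the gradient (L2 style).
    """
    beta1, beta2 = betas
    if not decoupled:
        grad = grad + weight_decay * w
    state["step"] += 1
    # Running averages of the gradient and of its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Assumption of this sketch: the decay is not multiplied by lr.
        new_w -= weight_decay * w
    return new_w


if __name__ == "__main__":
    # With a zero loss gradient, only the regularization moves the weight.
    l2_w = adam_step(2.0, 0.0, {"step": 0, "m": 0.0, "v": 0.0},
                     weight_decay=0.01, decoupled=False)
    dec_w = adam_step(2.0, 0.0, {"step": 0, "m": 0.0, "v": 0.0},
                      weight_decay=0.01, decoupled=True)
    print(l2_w, dec_w)
```

In the L2-style branch the penalty passes through Adam's normalization by the running average of squared gradients, so on this first step the weight moves by roughly `lr` whatever the value of `weight_decay`. In the decoupled branch the weight shrinks by exactly the fraction `weight_decay`.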
Applicable Settings
Using decoupled weight decay is considered a best practice in most settings. DecoupledSGDW and DecoupledAdamW should always be used in place of their vanilla counterparts.
Implementation Details
Unlike most of our other methods, we do not implement decoupled weight decay as an algorithm, instead providing optimizers that can be used as drop-in replacements for torch.optim.SGD and torch.optim.Adam, though note that some hyperparameter tuning may be required to realize full performance improvements.
The informed reader may note that PyTorch already provides a torch.optim.AdamW variant that implements Loshchilov and Hutter's method. Unfortunately, this implementation has a fundamental bug owing to PyTorch's method of handling learning rate scheduling. In this implementation, learning rate schedulers attempt to schedule the weight decay (as Loshchilov and Hutter suggest) by tying it to the learning rate. However, this means that weight decay is now implicitly tied to the initial learning rate, resulting in unexpected behavior where runs with different learning rates also have different effective weight decays. See this line.
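The effect described above can be sketched numerically. The code below is an illustration under stated assumptions rather than the code of torch.optim.AdamW or DecoupledAdamW: the helper names and the linear schedule are hypothetical, gradient updates are ignored, and dividing the scheduled learning rate by the initial learning rate is just one possible way to let the decay follow the schedule without depending on the learning rate's magnitude.

```python
def linear_schedule(initial_lr, num_steps):
    """Learning rates that decay linearly from initial_lr toward zero."""
    return [initial_lr * (1 - t / num_steps) for t in range(num_steps)]


def total_shrinkage(initial_lr, weight_decay, num_steps, normalize_by_initial_lr):
    """Product of the per-step decay factors over a run, ignoring gradient updates."""
    shrink = 1.0
    for lr in linear_schedule(initial_lr, num_steps):
        if normalize_by_initial_lr:
            # Decay follows the shape of the schedule only.
            factor = 1 - (lr / initial_lr) * weight_decay
        else:
            # Decay follows the schedule and the magnitude of the learning rate.
            factor = 1 - lr * weight_decay
        shrink *= factor
    return shrink


if __name__ == "__main__":
    for initial_lr in (1e-3, 1e-2):
        print(
            initial_lr,
            total_shrinkage(initial_lr, 0.01, 100, normalize_by_initial_lr=False),
            total_shrinkage(initial_lr, 0.01, 100, normalize_by_initial_lr=True),
        )
```

With the same `weight_decay` setting, the learning-rate-scaled form shrinks the weights by a different total amount for each initial learning rate, while the normalized form gives the same total shrinkage for both runs.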
Considerations
There are no known negative side effects to using decoupled weight decay once it is properly tuned, as long as the original base optimizer is either torch.optim.Adam or torch.optim.SGD.
Composability
Weight decay is a regularization technique, and thus is expected to yield diminishing returns when composed with other regularization techniques.