composer.optim
DecoupledSGDW
- class composer.optim.DecoupledSGDW(params: List[torch.Tensor], lr: float = <required parameter>, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False)[source]
Bases: torch.optim.sgd.SGD
SGD optimizer with the weight decay term decoupled from the learning rate.
The standard SGD optimizer couples the weight decay term with the gradient calculation. This ties the optimal value of weight_decay to lr and can also hurt generalization in practice. For more details on why decoupling might be desirable, see "Decoupled Weight Decay Regularization".
- Parameters
params (list) – List of parameters to optimize or dicts defining parameter groups.
lr (float) – Learning rate.
momentum (float, optional) – Momentum factor. Default: 0.
dampening (float, optional) – Dampening factor applied to the momentum. Default: 0.
weight_decay (float, optional) – Decoupled weight decay factor. Default: 0.
nesterov (bool, optional) – Enables Nesterov momentum updates. Default: False.
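Example
A minimal usage sketch (the linear model and random batch are placeholders; only the constructor arguments above come from the API):
>>> import torch
>>> from composer.optim import DecoupledSGDW
>>> model = torch.nn.Linear(10, 2)
>>> # the weight decay term is applied decoupled from the learning rate, per the note above
>>> optimizer = DecoupledSGDW(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
>>> loss = model(torch.randn(8, 10)).sum()
>>> loss.backward()
>>> optimizer.step()
>>> optimizer.zero_grad()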
DecoupledAdamW
- class composer.optim.DecoupledAdamW(params: List[torch.Tensor], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False)[source]
Bases: torch.optim.adamw.AdamW
Adam optimizer with the weight decay term decoupled from the learning rate.
The standard AdamW optimizer explicitly couples the weight decay term with the learning rate. This ties the optimal value of weight_decay to lr and can also hurt generalization in practice. For more details on why decoupling might be desirable, see "Decoupled Weight Decay Regularization".
- Parameters
params (list) – List of parameters to update.
lr (float, optional) – Learning rate. Default: 1e-3.
betas (tuple, optional) – Coefficients used for computing running averages of the gradient and its square. Default: (0.9, 0.999).
eps (float, optional) – Term added to the denominator to improve numerical stability. Default: 1e-8.
weight_decay (float, optional) – Decoupled weight decay factor. Default: 1e-2.
amsgrad (bool, optional) – Enables the amsgrad variant of Adam. Default: False.
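Example
A minimal usage sketch (the linear model and random batch are placeholders; only the constructor arguments above come from the API):
>>> import torch
>>> from composer.optim import DecoupledAdamW
>>> model = torch.nn.Linear(10, 2)
>>> # weight_decay is decoupled from lr, per the note above
>>> optimizer = DecoupledAdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2)
>>> loss = model(torch.randn(8, 10)).sum()
>>> loss.backward()
>>> optimizer.step()
>>> optimizer.zero_grad()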
composer.optim.scheduler
WarmUpLR
- class composer.optim.WarmUpLR(optimizer, warmup_factor=0.3333333333333333, warmup_iters=5, warmup_method='linear', last_epoch=-1, verbose=False, interval='step')[source]
Decays the learning rate of each parameter group by either a small constant or a linearly increasing warmup factor until the number of epochs reaches a pre-defined milestone: warmup_iters. This scheduler is adapted from PyTorch but rewritten in a non-chainable form to accommodate warmup_factor=0.0. When last_epoch=-1, sets the initial lr as lr.
- Parameters
optimizer (Optimizer) – Wrapped optimizer.
warmup_factor (float) – The factor the learning rate is multiplied by in the first epoch. If the warmup method is constant, the multiplication factor stays the same for all warmup epochs; in the linear case, it increases in the following epochs. Default: 1./3.
warmup_iters (int) – The number of warmup steps. Default: 5.
warmup_method (str) – One of constant and linear. In constant mode, the learning rate is multiplied by a small constant until the milestone defined in warmup_iters. In the linear case, the multiplication factor starts at warmup_factor in the first epoch and then increases linearly to reach 1. at epoch warmup_iters. Default: linear.
last_epoch (int) – The index of the last epoch. Can be used to restore the state of the learning rate schedule. Default: -1.
verbose (bool) – If True, prints a message to stdout for each update. Default: False.
interval (str) – Frequency of step() calls, either step or epoch. Default: step.
Example
>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.025   if epoch == 0
>>> # lr = 0.03125 if epoch == 1
>>> # lr = 0.0375  if epoch == 2
>>> # lr = 0.04375 if epoch == 3
>>> # lr = 0.05    if epoch >= 4
>>> scheduler = WarmUpLR(self.opt, warmup_factor=0.5, warmup_iters=4, warmup_method="linear")
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.025 if epoch == 0
>>> # lr = 0.025 if epoch == 1
>>> # lr = 0.025 if epoch == 2
>>> # lr = 0.025 if epoch == 3
>>> # lr = 0.05  if epoch >= 4
>>> scheduler = WarmUpLR(self.opt, warmup_factor=0.5, warmup_iters=4, warmup_method="constant")
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
ConstantLR
- class composer.optim.ConstantLR(optimizer: torch.optim.optimizer.Optimizer, last_epoch: int = -1, verbose: int = False)[source]
Scheduler that does not change the optimizer’s learning rate.
- Parameters
optimizer (Optimizer) – The optimizer associated with this scheduler.
last_epoch (int, optional) – The index of the last epoch. Can be used to restore the state of the learning rate schedule. Default: -1.
verbose (bool, optional) – If True, prints a message to stdout for each update. Default: False.
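Example
A minimal usage sketch (the SGD optimizer and its learning rate are placeholders):
>>> import torch
>>> from composer.optim import ConstantLR
>>> model = torch.nn.Linear(10, 2)
>>> opt = torch.optim.SGD(model.parameters(), lr=0.05)
>>> scheduler = ConstantLR(opt)
>>> # stepping the scheduler leaves the learning rate unchanged
>>> for _ in range(3):
>>>     scheduler.step()
>>> opt.param_groups[0]["lr"]  # remains 0.05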
ComposedScheduler
- class composer.optim.ComposedScheduler(schedulers)[source]
Handles warmup for a chained list of schedulers.
With one call, will run each scheduler's step(). If WarmUpLR is in the list, will delay the stepping of schedulers that need to be silent during warmup. ComposedScheduler handles warmups, whereas ChainedScheduler only combines schedulers. CosineAnnealingLR and ExponentialLR are not stepped during the warmup period. Other schedulers, such as MultiStepLR, are still stepped to keep their milestones unchanged.
Handles running the WarmUpLR at every step if WarmUpLR.interval='batch', and other schedulers at every epoch.
- Parameters
schedulers (list) – List of chained schedulers.
Example
>>> # Assuming optimizer uses lr = 1. for all groups
>>> # lr = 0.1    if epoch == 0
>>> # lr = 0.1    if epoch == 1
>>> # lr = 0.9    if epoch == 2  # ExponentialLR effect starts here
>>> # lr = 0.81   if epoch == 3
>>> # lr = 0.729  if epoch == 4
>>> scheduler1 = WarmUpLR(self.opt, warmup_factor=0.1, warmup_iters=2, warmup_method="constant")
>>> scheduler2 = ExponentialLR(self.opt, gamma=0.9)
>>> scheduler = ComposedScheduler(zip([scheduler1, scheduler2], ["epoch", "epoch"]))
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

>>> # Assuming optimizer uses lr = 1. for all groups
>>> # lr = 0.1  if epoch == 0
>>> # lr = 0.1  if epoch == 1
>>> # lr = 1.0  if epoch == 2
>>> # lr = 1.0  if epoch == 3
>>> # lr = 0.2  if epoch == 4  # MultiStepLR effect starts here
>>> scheduler1 = WarmUpLR(self.opt, warmup_factor=0.1, warmup_iters=2, warmup_method="constant")
>>> scheduler2 = MultiStepLR(optimizer, milestones=[4], gamma=0.2)
>>> scheduler = ComposedScheduler(zip([scheduler1, scheduler2], ["epoch", "epoch"]))
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()