composer.optim

DecoupledSGDW

class composer.optim.DecoupledSGDW(params: List[torch.Tensor], lr: float, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False)[source]

Bases: torch.optim.sgd.SGD

SGD optimizer with the weight decay term decoupled from the learning rate.

The standard SGD optimizer couples the weight decay term with the gradient calculation. This ties the optimal value of weight_decay to lr and can also hurt generalization in practice. For more details on why decoupling might be desirable, see “Decoupled Weight Decay Regularization”.

Parameters
  • params (list) – List of parameters to optimize or dicts defining parameter groups.

  • lr (float) – Learning rate (required).

  • momentum (float, optional) – Momentum factor. Default: 0.

  • dampening (float, optional) – Dampening factor applied to the momentum. Default: 0.

  • weight_decay (float, optional) – Decoupled weight decay factor. Default: 0.

  • nesterov (bool, optional) – Enables Nesterov momentum updates. Default: False.
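
Example

A minimal usage sketch. The model and the hyperparameter values below are illustrative assumptions, not recommendations:

>>> import torch
>>> from composer.optim import DecoupledSGDW
>>> model = torch.nn.Linear(10, 2)  # any torch.nn.Module
>>> optimizer = DecoupledSGDW(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
>>> loss = model(torch.randn(8, 10)).sum()
>>> loss.backward()
>>> optimizer.step()
>>> optimizer.zero_grad()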

DecoupledAdamW

class composer.optim.DecoupledAdamW(params: List[torch.Tensor], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False)[source]

Bases: torch.optim.adamw.AdamW

Adam optimizer with the weight decay term decoupled from the learning rate.

The standard AdamW optimizer explicitly couples the weight decay term with the learning rate. This ties the optimal value of weight_decay to lr and can also hurt generalization in practice. For more details on why decoupling might be desirable, see “Decoupled Weight Decay Regularization”.

Parameters
  • params (list) – List of parameters to update.

  • lr (float, optional) – Learning rate. Default: 1e-3.

  • betas (tuple, optional) – Coefficients used for computing running averages of the gradient and its square. Default: (0.9, 0.999).

  • eps (float, optional) – Term added to the denominator to improve numerical stability. Default: 1e-8.

  • weight_decay (float, optional) – Decoupled weight decay factor. Default: 1e-2.

  • amsgrad (bool, optional) – Enables the amsgrad variant of Adam. Default: False.
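
Example

A minimal usage sketch; as above, the model and the hyperparameter values are illustrative assumptions:

>>> import torch
>>> from composer.optim import DecoupledAdamW
>>> model = torch.nn.Linear(10, 2)  # any torch.nn.Module
>>> optimizer = DecoupledAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
>>> loss = model(torch.randn(8, 10)).sum()
>>> loss.backward()
>>> optimizer.step()
>>> optimizer.zero_grad()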

composer.optim.scheduler

WarmUpLR

class composer.optim.WarmUpLR(optimizer, warmup_factor=0.3333333333333333, warmup_iters=5, warmup_method='linear', last_epoch=-1, verbose=False, interval='step')[source]

Scales the learning rate of each parameter group by either a small constant or a linearly increasing warmup factor until the number of epochs reaches a pre-defined milestone: warmup_iters.

This scheduler is adapted from PyTorch but rewritten in a non-chainable form to accommodate warmup_factor=0.0. When last_epoch=-1, the initial learning rate is set to lr.

Parameters
  • optimizer (Optimizer) – Wrapped optimizer.

  • warmup_factor (float) – The factor by which the learning rate is multiplied in the first epoch. If the warmup method is constant, this factor stays the same throughout warmup; in the linear case, it increases over the following epochs. Default: 1./3.

  • warmup_iters (int) – The number of warmup steps. Default: 5.

  • warmup_method (str) – One of constant or linear. In constant mode, the learning rate is multiplied by a small constant until the milestone defined by warmup_iters. In the linear case, the multiplication factor starts at warmup_factor in the first epoch and increases linearly to reach 1 at epoch warmup_iters. Default: linear.

  • last_epoch (int) – The index of the last epoch. Can be used to restore the state of the learning rate schedule. Default: -1.

  • verbose (bool) – If True, prints a message to stdout for each update. Default: False.

  • interval (str) – Frequency of step() calls, either step or epoch. Default: step.

Example

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.025    if epoch == 0
>>> # lr = 0.03125  if epoch == 1
>>> # lr = 0.0375   if epoch == 2
>>> # lr = 0.04375  if epoch == 3
>>> # lr = 0.05     if epoch >= 4
>>> scheduler = WarmUpLR(self.opt, warmup_factor=0.5, warmup_iters=4, warmup_method="linear")
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.025    if epoch == 0
>>> # lr = 0.025    if epoch == 1
>>> # lr = 0.025    if epoch == 2
>>> # lr = 0.025    if epoch == 3
>>> # lr = 0.05     if epoch >= 4
>>> scheduler = WarmUpLR(self.opt, warmup_factor=0.5, warmup_iters=4, warmup_method="constant")
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
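
The interval parameter indicates whether step() is called per batch (step) or per epoch. A sketch of per-batch stepping, assuming optimizer, train_batch, and train_loader are already defined (illustrative names, not part of the API):

>>> from composer.optim import WarmUpLR
>>> scheduler = WarmUpLR(optimizer, warmup_factor=0.1, warmup_iters=100, warmup_method="linear", interval="step")
>>> for epoch in range(10):
>>>     for batch in train_loader:
>>>         train_batch(batch)
>>>         scheduler.step()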

ConstantLR

class composer.optim.ConstantLR(optimizer: torch.optim.optimizer.Optimizer, last_epoch: int = -1, verbose: bool = False)[source]

Scheduler that does not change the optimizer’s learning rate.

Parameters
  • optimizer (Optimizer) – the optimizer associated with this scheduler.

  • last_epoch (int, optional) – The index of the last epoch. Can be used to restore the state of the learning rate schedule. Default: -1.

  • verbose (bool, optional) – If True, prints a message to stdout for each update. Default: False.
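
Example

A minimal sketch, assuming an optimizer and the train(...) / validate(...) helpers from the examples above are already defined; ConstantLR is useful as a placeholder wherever a scheduler object is expected but the learning rate should stay fixed:

>>> from composer.optim import ConstantLR
>>> scheduler = ConstantLR(optimizer)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()  # learning rate is left unchanged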

ComposedScheduler

class composer.optim.ComposedScheduler(schedulers)[source]

Handles warmup for a chained list of schedulers.

A single call to step() steps each of the wrapped schedulers. If a WarmUpLR is in the list, it delays the stepping of schedulers that need to be silent during warmup. ComposedScheduler handles warmup, whereas ChainedScheduler only combines schedulers.

CosineAnnealingLR and ExponentialLR are not stepped during the warmup period. Other schedulers, such as MultiStepLR, are still stepped to keep their milestones unchanged.

Runs the WarmUpLR at every step if WarmUpLR.interval='step', and the other schedulers at every epoch.

Parameters

  • schedulers (list) – List of chained schedulers.

Example

>>> # Assuming optimizer uses lr = 1. for all groups
>>> # lr = 0.1      if epoch == 0
>>> # lr = 0.1      if epoch == 1
>>> # lr = 0.9      if epoch == 2  # ExponentialLR effect starts here
>>> # lr = 0.81     if epoch == 3
>>> # lr = 0.729    if epoch == 4
>>> scheduler1 = WarmUpLR(self.opt, warmup_factor=0.1, warmup_iters=2, warmup_method="constant")
>>> scheduler2 = ExponentialLR(self.opt, gamma=0.9)
>>> scheduler = ComposedScheduler(zip([scheduler1, scheduler2], ["epoch", "epoch"]))
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
>>> # Assuming optimizer uses lr = 1. for all groups
>>> # lr = 0.1      if epoch == 0
>>> # lr = 0.1      if epoch == 1
>>> # lr = 1.0      if epoch == 2
>>> # lr = 1.0      if epoch == 3
>>> # lr = 0.2      if epoch == 4  # MultiStepLR effect starts here
>>> scheduler1 = WarmUpLR(self.opt, warmup_factor=0.1, warmup_iters=2, warmup_method="constant")
>>> scheduler2 = MultiStepLR(self.opt, milestones=[4], gamma=0.2)
>>> scheduler = ComposedScheduler(zip([scheduler1, scheduler2], ["epoch", "epoch"]))
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()