📉 Schedulers#

The Trainer supports both PyTorch torch.optim.lr_scheduler schedulers and our own schedulers, which take advantage of the Time representation.

For PyTorch schedulers, we step every epoch by default. To instead step every batch, set use_stepwise_scheduler=True:

from torch.optim.lr_scheduler import CosineAnnealingLR
from composer import Trainer

trainer = Trainer(
    schedulers=CosineAnnealingLR(...),
    use_stepwise_scheduler=True,
)

Note

If setting use_stepwise_scheduler to True, remember to specify the arguments to your PyTorch scheduler in units of batches, not epochs.
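
For instance, a PyTorch cosine schedule intended to span 10 epochs would set T_max in batches rather than epochs. A minimal sketch, where optimizer and train_dataloader are placeholders for your own objects:

from torch.optim.lr_scheduler import CosineAnnealingLR
from composer import Trainer

steps_per_epoch = len(train_dataloader)  # batches per epoch

trainer = Trainer(
    ...,
    schedulers=CosineAnnealingLR(optimizer, T_max=10 * steps_per_epoch),  # 10 epochs, expressed in batches
    use_stepwise_scheduler=True,
)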

Our experiments have shown better accuracy with stepwise schedulers, so they are the recommended setting in most cases.

Composer Schedulers#

Our schedulers differ from the PyTorch schedulers in two ways:

  • Time parameters can be provided in different units: samples ("sp"), tokens ("tok"), batches ("ba"), epochs ("ep"), and duration ("dur"). See Time.

  • Our schedulers are functions, not classes. Given the current trainer state and, optionally, a "scale schedule ratio" (ssr), they return a multiplier to apply to the optimizer's learning rate.

For example, the following are equivalent:

from composer import Trainer
from composer.optim.scheduler import multi_step_scheduler

# assume the trainer is configured with max_duration='50ep'

scheduler1 = lambda state: multi_step_scheduler(state, milestones=['5ep', '25ep'])
scheduler2 = lambda state: multi_step_scheduler(state, milestones=['0.1dur', '0.5dur'])

trainer = Trainer(
    ...,
    schedulers=scheduler1,
)

These schedulers typically read the state.timer to determine the trainer's progress and return a learning rate multiplier. Inside the Trainer, we convert these to torch.optim.lr_scheduler.LambdaLR schedulers. By default, our schedulers have use_stepwise_scheduler=True.
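
For example, a custom scheduler can be written as a plain function of the state. A minimal sketch, where the decay rate and the state.timer.epoch.value access are illustrative assumptions rather than a prescribed API:

import math

from composer import Trainer

def my_decay_scheduler(state):
    # Illustrative: scale the base learning rate by exp(-0.1 * epochs elapsed)
    epochs_elapsed = state.timer.epoch.value  # assumes the Time object exposes its count via .value
    return math.exp(-0.1 * epochs_elapsed)

trainer = Trainer(
    ...,
    schedulers=my_decay_scheduler,
)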

Below are the supported schedulers, found in composer.optim.scheduler:

  • step_scheduler: Decays the learning rate discretely at fixed intervals.

  • multi_step_scheduler: Decays the learning rate discretely at fixed milestones.

  • multi_step_with_warmup_scheduler: Decays the learning rate discretely at fixed milestones, with a linear warmup.

  • constant_scheduler: Maintains a fixed learning rate.

  • linear_scheduler: Adjusts the learning rate linearly.

  • linear_with_warmup_scheduler: Adjusts the learning rate linearly, with a linear warmup.

  • exponential_scheduler: Decays the learning rate exponentially.

  • cosine_annealing_scheduler: Decays the learning rate according to the decreasing part of a cosine curve.

  • cosine_annealing_with_warmup_scheduler: Decays the learning rate according to the decreasing part of a cosine curve, with a linear warmup.

  • cosine_annealing_warm_restarts_scheduler: Cyclically decays the learning rate according to the decreasing part of a cosine curve.

  • polynomial_scheduler: Sets the learning rate to be exponentially proportional to the percentage of training time left.

Scale Schedule Ratio#

The Scale Schedule Ratio (SSR) scales the length of the learning rate schedule by a constant factor, and is a powerful way to trade off training time and quality. ssr is an argument to the Trainer.
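
For example, to compress a schedule to half its original length, the ratio can also be passed through to the functional schedulers described above. A minimal sketch, assuming cosine_annealing_scheduler accepts the ratio via its ssr keyword:

from composer import Trainer
from composer.optim.scheduler import cosine_annealing_scheduler

# halve the schedule length relative to max_duration
scheduler = lambda state: cosine_annealing_scheduler(state, ssr=0.5)

trainer = Trainer(
    ...,
    schedulers=scheduler,
)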