Schedulers#
The Trainer supports both PyTorch torch.optim.lr_scheduler schedulers and our own schedulers, which take advantage of the Time representation.

For PyTorch schedulers, we step every epoch by default. To instead step every batch, set use_stepwise_scheduler=True:
from torch.optim.lr_scheduler import CosineAnnealingLR
from composer import Trainer

trainer = Trainer(
    schedulers=CosineAnnealingLR(...),
    use_stepwise_scheduler=True,
)
Note

If setting use_stepwise_scheduler to True, remember to specify the arguments to your PyTorch scheduler in units of batches, not epochs.

Our experiments have shown better accuracy with stepwise schedulers, so stepwise scheduling is the recommended setting in most cases.
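For example, if each epoch has 100 batches and training lasts 10 epochs, a stepwise CosineAnnealingLR should decay over 10 * 100 = 1000 steps rather than 10. A minimal sketch (the batch counts and the optimizer variable are illustrative, not part of the original example):

from torch.optim.lr_scheduler import CosineAnnealingLR
from composer import Trainer

# 10 epochs x 100 batches/epoch = 1000 total optimizer steps,
# so T_max is specified in batches, not epochs
scheduler = CosineAnnealingLR(optimizer, T_max=10 * 100)

trainer = Trainer(
    ...,
    schedulers=scheduler,
    use_stepwise_scheduler=True,
)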
Composer Schedulers#
Our schedulers differ from the PyTorch schedulers in two ways:

- Time parameters can be provided in different units: samples ("sp"), tokens ("tok"), batches ("ba"), epochs ("ep"), and duration ("dur"). See Time.
- Our schedulers are functions, not classes. They return a multiplier to apply to the optimizer's learning rate, given the current trainer state, and optionally a "scale schedule ratio" (ssr).
For example, the following are equivalent:

from composer.optim.scheduler import step_scheduler

# assume the trainer has max_duration of 50 epochs, so
# '0.1dur' == '5ep' and '0.5dur' == '25ep'
scheduler1 = lambda state: step_scheduler(state, step_size=['5ep', '25ep'])
scheduler2 = lambda state: step_scheduler(state, step_size=['0.1dur', '0.5dur'])

trainer = Trainer(
    ...,
    schedulers=scheduler1,
)
These schedulers typically read the state.timer to determine the trainer's progress and return a learning rate multiplier. Inside the Trainer, we convert these to torch.optim.lr_scheduler.LambdaLR schedulers. By default, our schedulers have use_stepwise_scheduler=True.
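Because a scheduler is just a function of the trainer state, you can also write your own. The sketch below is illustrative only: state.timer.epoch and its conversion to an int are assumptions about the Timer API, not confirmed attributes.

from composer import Trainer

def halving_scheduler(state, ssr=1.0):
    # Halve the multiplier every 10 epochs; the interval stretches or
    # shrinks with the scale schedule ratio (ssr).
    interval = max(1, int(10 * ssr))
    current_epoch = int(state.timer.epoch)  # assumed Timer attribute
    return 0.5 ** (current_epoch // interval)

trainer = Trainer(
    ...,
    schedulers=halving_scheduler,
)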
Below are the supported schedulers, found in composer.optim.scheduler:
- Decays the learning rate discretely at fixed intervals.
- Decays the learning rate discretely at fixed milestones.
- Decays the learning rate discretely at fixed milestones, with a linear warmup.
- Maintains a fixed learning rate.
- Adjusts the learning rate linearly.
- Adjusts the learning rate linearly, with a linear warmup.
- Decays the learning rate exponentially.
- Decays the learning rate according to the decreasing part of a cosine curve.
- Decays the learning rate according to the decreasing part of a cosine curve, with a linear warmup.
- Cyclically decays the learning rate according to the decreasing part of a cosine curve.
- Sets the learning rate to be exponentially proportional to the percentage of training time left.
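For reference, the cosine-based schedulers above follow the standard annealing shape. Assuming the multiplier decays from 1 to 0 over the schedule, the multiplier at elapsed training fraction $\tau \in [0, 1]$ is:

$$\alpha(\tau) = \tfrac{1}{2}\left(1 + \cos(\pi \tau)\right)$$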
Scale Schedule Ratio#
The Scale Schedule Ratio (SSR) scales the learning rate schedule by a factor and is a powerful way to trade off training time and quality. ssr is an argument to the Trainer.
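For example, an ssr of 0.5 compresses the schedule to half its original length, trading some quality for a 2x reduction in training time. A minimal sketch, reusing scheduler1 from above (how ssr interacts with max_duration is not shown here):

trainer = Trainer(
    ...,
    schedulers=scheduler1,
    ssr=0.5,  # run the 50-epoch schedule in 25 epochs
)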