composer.optim.scheduler#
Stateless learning rate schedulers.
Stateless schedulers solve some of the problems associated with PyTorch’s built-in schedulers provided in
torch.optim.lr_scheduler
. The primary design goal of the schedulers provided in this module is to allow
schedulers to interface directly with Composer’s time
abstraction. This means that schedulers can
be configured using arbitrary but explicit time units.
See ComposerScheduler
for more information on stateless schedulers.
Functions
| compile_composer_scheduler | Converts a stateless scheduler into a PyTorch scheduler object. |
Classes
| ComposerScheduler | Specification for a stateless scheduler function. |
| ConstantScheduler | Maintains a fixed learning rate. |
| CosineAnnealingScheduler | Decays the learning rate according to the decreasing part of a cosine curve. |
| CosineAnnealingWarmRestartsScheduler | Cyclically decays the learning rate according to the decreasing part of a cosine curve. |
| CosineAnnealingWithWarmupScheduler | Decays the learning rate according to the decreasing part of a cosine curve, with an initial warmup. |
| ExponentialScheduler | Decays the learning rate exponentially. |
| LinearScheduler | Adjusts the learning rate linearly. |
| LinearWithWarmupScheduler | Adjusts the learning rate linearly, with an initial warmup. |
| MultiStepScheduler | Decays the learning rate discretely at fixed milestones. |
| MultiStepWithWarmupScheduler | Decays the learning rate discretely at fixed milestones, with an initial warmup. |
| PolynomialScheduler | Sets the learning rate to be proportional to a power of the fraction of training time left. |
| StepScheduler | Decays the learning rate discretely at fixed intervals. |
- class composer.optim.scheduler.ComposerScheduler(*args, **kwargs)[source]#
Bases:
Protocol
Specification for a stateless scheduler function.
While this specification is provided as a Python class, an ordinary function can implement this interface as long as it matches the signature of this interface’s
__call__()
method.
For example, a scheduler that halves the learning rate after 10 epochs could be written as:
def ten_epoch_decay_scheduler(state: State) -> float:
    if state.timer.epoch < 10:
        return 1.0
    return 0.5

# ten_epoch_decay_scheduler is a valid ComposerScheduler
trainer = Trainer(
    schedulers=[ten_epoch_decay_scheduler],
    ...
)
To allow schedulers to be configured, schedulers may also be written as callable classes:
class VariableEpochDecayScheduler(ComposerScheduler):

    def __init__(self, num_epochs: int):
        self.num_epochs = num_epochs

    def __call__(self, state: State) -> float:
        if state.time.epoch < self.num_epochs:
            return 1.0
        return 0.5

ten_epoch_decay_scheduler = VariableEpochDecayScheduler(num_epochs=10)

# ten_epoch_decay_scheduler is also a valid ComposerScheduler
trainer = Trainer(
    schedulers=[ten_epoch_decay_scheduler],
    ...
)
The constructions of
ten_epoch_decay_scheduler
in each of the examples above are equivalent. Note that neither scheduler uses the
scale_schedule_ratio
parameter. As long as this parameter is not used when initializing
Trainer
, it is not required that any schedulers implement that parameter.
- __call__(state, ssr=1.0)[source]#
Calculate the current learning rate multiplier \(\alpha\).
A scheduler function should be a pure function that returns a multiplier to apply to the optimizer’s provided learning rate, given the current trainer state, and optionally a “scale schedule ratio” (SSR). A typical implementation will read
state.timer
, and possibly other fields like
state.max_duration
, to determine the trainer’s latest temporal progress.
Note
All instances of
ComposerScheduler
output a multiplier for the learning rate, rather than the learning rate directly. By convention, we use the symbol \(\alpha\) to refer to this multiplier. This means that the learning rate \(\eta\) at time \(t\) can be represented as \(\eta(t) = \eta_i \times \alpha(t)\), where \(\eta_i\) represents the learning rate used to initialize the optimizer.
Note
It is possible to use multiple schedulers, in which case their effects will stack multiplicatively.
The
ssr
parameter indicates that the schedule should be “stretched” accordingly. In symbolic terms, where \(\alpha_\sigma(t)\) represents the scheduler output at time \(t\) using scale schedule ratio \(\sigma\):
\[\alpha_{\sigma}(t) = \alpha(t / \sigma)\]
- Parameters
state (State) – The current Composer Trainer state.
ssr (float, optional) – The scale schedule ratio to apply. Default: 1.0.
- Returns
alpha (float) – A multiplier to apply to the optimizer’s provided learning rate.
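As an illustration, a scheduler function can account for the scale schedule ratio by stretching its own milestones. The following sketch (not part of the library; the nominal 100-epoch schedule and its halfway milestone are arbitrary assumptions) mirrors the ten_epoch_decay_scheduler example above:
def halfway_decay_scheduler(state: State, ssr: float = 1.0) -> float:
    # Decay to 0.5 once half of a nominal 100-epoch schedule has elapsed.
    # The milestone is stretched by the scale schedule ratio.
    milestone_epochs = 50 * ssr
    if state.timer.epoch < milestone_epochs:
        return 1.0
    return 0.5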
- class composer.optim.scheduler.ConstantScheduler(alpha=1.0, t_max='1dur')[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Maintains a fixed learning rate.
This scheduler is based on
ConstantLR
from PyTorch.
The default settings for this scheduler simply maintain a learning rate factor of 1 for the entire training duration. However, both the factor and the duration of this scheduler can be configured.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \begin{cases} \alpha, & \text{if } t < t_{max} \\ 1.0 & \text{otherwise} \end{cases}\]
Where \(\alpha\) represents the learning rate multiplier to maintain while this scheduler is active, and \(t_{max}\) represents the duration of this scheduler.
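For example, holding the learning rate multiplier at 0.5 for the first half of training and then returning it to 1.0 could be sketched as follows (the values here are illustrative, not the defaults):
from composer.optim.scheduler import ConstantScheduler

# alpha(t) = 0.5 while t < 0.5dur; alpha(t) = 1.0 for the rest of training.
scheduler = ConstantScheduler(alpha=0.5, t_max='0.5dur')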
- class composer.optim.scheduler.CosineAnnealingScheduler(t_max='1dur', alpha_f=0.0)[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Decays the learning rate according to the decreasing part of a cosine curve.
See also
This scheduler is based on
CosineAnnealingLR
from PyTorch.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \alpha_f + (1 - \alpha_f) \times \frac{1}{2} (1 + \cos(\pi \times \tau))\]
Given \(\tau\), the fraction of time elapsed (clipped to the interval \([0, 1]\)), as:
\[\tau = t / t_{max}\]
Where \(t_{max}\) represents the duration of this scheduler, and \(\alpha_f\) represents the learning rate multiplier to decay to.
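A brief usage sketch with the default arguments spelled out:
from composer.optim.scheduler import CosineAnnealingScheduler

# alpha follows the decreasing half of a cosine: 1.0 at the start of training,
# 0.5 at the halfway point, and alpha_f = 0.0 at the end.
scheduler = CosineAnnealingScheduler(t_max='1dur', alpha_f=0.0)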
- class composer.optim.scheduler.CosineAnnealingWarmRestartsScheduler(t_0, t_mult=1.0, alpha_f=0.0)[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Cyclically decays the learning rate according to the decreasing part of a cosine curve.
See also
This scheduler is based on
CosineAnnealingWarmRestarts
from PyTorch.
This scheduler resembles a regular cosine annealing curve, as seen in
CosineAnnealingScheduler
, except that after the curve first completes
t_0
time, the curve resets to the start. The durations of subsequent cycles are each multiplied by
t_mult
.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \alpha_f + (1 - \alpha_f) \times \frac{1}{2}(1 + \cos(\pi \times \tau_i))\]
Given \(\tau_i\), the fraction of time elapsed through the \(i^\text{th}\) cycle, as:
\[\tau_i = (t - \sum_{j=0}^{i-1} t_0 t_{mult}^j) / (t_0 t_{mult}^i)\]
Where \(t_0\) represents the period of the first cycle, \(t_{mult}\) represents the multiplier for the duration of successive cycles, and \(\alpha_f\) represents the learning rate multiplier to decay to.
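For instance, a schedule whose cycles double in length might be sketched as follows (the epoch counts are illustrative):
from composer.optim.scheduler import CosineAnnealingWarmRestartsScheduler

# The first cycle lasts 10 epochs; subsequent cycles last 20, 40, 80, ... epochs.
# Within each cycle, alpha decays from 1.0 toward alpha_f = 0.0, then resets.
scheduler = CosineAnnealingWarmRestartsScheduler(t_0='10ep', t_mult=2.0, alpha_f=0.0)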
- class composer.optim.scheduler.CosineAnnealingWithWarmupScheduler(t_warmup, t_max='1dur', alpha_f=0.0)[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Decays the learning rate according to the decreasing part of a cosine curve, with an initial warmup.
See also
This scheduler is based on
CosineAnnealingScheduler
, with an added warmup.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \begin{cases} t / t_{warmup}, & \text{if } t < t_{warmup} \\ \alpha_f + (1 - \alpha_f) \times \frac{1}{2} (1 + \cos(\pi \times \tau_w)) & \text{otherwise} \end{cases}\]
Given \(\tau_w\), the fraction of post-warmup time elapsed (clipped to the interval \([0, 1]\)), as:
\[\tau_w = (t - t_{warmup}) / t_{max}\]
Where \(t_{warmup}\) represents the warmup time, \(t_{max}\) represents the duration of this scheduler, and \(\alpha_f\) represents the learning rate multiplier to decay to.
Warning
Initial warmup time is not scaled according to any provided scale schedule ratio! However, the duration of the scheduler is still scaled accordingly. To achieve this, after warmup, the scheduler’s “pace” will be slightly distorted from what would otherwise be expected.
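A brief usage sketch (the warmup length is an arbitrary example):
from composer.optim.scheduler import CosineAnnealingWithWarmupScheduler

# Linear warmup to the full learning rate over the first epoch, followed by a
# cosine decay to alpha_f = 0.0 over the remainder of training.
scheduler = CosineAnnealingWithWarmupScheduler(t_warmup='1ep', t_max='1dur', alpha_f=0.0)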
- class composer.optim.scheduler.ExponentialScheduler(gamma, decay_period='1ep')[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Decays the learning rate exponentially.
See also
This scheduler is based on
ExponentialLR
from PyTorch.
Exponentially decays the learning rate such that it decays by a factor of
gamma
every
decay_period
time.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \gamma ^ {t / \rho}\]
Where \(\rho\) represents the decay period, and \(\gamma\) represents the multiplicative decay factor.
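For example, a 10% decay per epoch could be sketched as follows (the value of gamma is illustrative):
from composer.optim.scheduler import ExponentialScheduler

# alpha(t) = 0.9 ** (epochs elapsed): the learning rate shrinks by 10% every epoch.
scheduler = ExponentialScheduler(gamma=0.9, decay_period='1ep')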
- class composer.optim.scheduler.LinearScheduler(alpha_i=1.0, alpha_f=0.0, t_max='1dur')[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Adjusts the learning rate linearly.
See also
This scheduler is based on
LinearLR
from PyTorch.
Warning
Note that the defaults for this scheduler differ from the defaults for
LinearLR
. The PyTorch scheduler, by default, linearly increases the learning rate multiplier from 1/3 to 1.0, whereas this implementation, by default, linearly decreases the multiplier from 1.0 to 0.0.
Linearly adjusts the learning rate multiplier from
alpha_i
to
alpha_f
over
t_{max}
time.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \alpha_i + (\alpha_f - \alpha_i) \times \tau\]
Given \(\tau\), the fraction of time elapsed (clipped to the interval \([0, 1]\)), as:
\[\tau = t / t_{max}\]
Where \(\alpha_i\) represents the initial learning rate multiplier, \(\alpha_f\) represents the learning rate multiplier to decay to, and \(t_{max}\) represents the duration of this scheduler.
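A brief usage sketch with the defaults spelled out explicitly:
from composer.optim.scheduler import LinearScheduler

# alpha decreases linearly from 1.0 at the start of training to 0.0 at the end.
scheduler = LinearScheduler(alpha_i=1.0, alpha_f=0.0, t_max='1dur')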
- class composer.optim.scheduler.LinearWithWarmupScheduler(t_warmup, alpha_i=1.0, alpha_f=0.0, t_max='1dur')[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Adjusts the learning rate linearly, with an initial warmup.
See also
This scheduler is based on
LinearScheduler
, with an added warmup.
Linearly adjusts the learning rate multiplier from
alpha_i
to
alpha_f
over
t_{max}
time.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \begin{cases} t / t_{warmup}, & \text{if } t < t_{warmup} \\ \alpha_i + (\alpha_f - \alpha_i) \times \tau_w & \text{otherwise} \end{cases}\]
Given \(\tau_w\), the fraction of post-warmup time elapsed (clipped to the interval \([0, 1]\)), as:
\[\tau_w = (t - t_{warmup}) / t_{max}\]
Where \(t_{warmup}\) represents the warmup time, \(\alpha_i\) represents the initial learning rate multiplier, \(\alpha_f\) represents the learning rate multiplier to decay to, and \(t_{max}\) represents the duration of this scheduler.
Warning
Initial warmup time is not scaled according to any provided scale schedule ratio! However, the duration of the scheduler is still scaled accordingly. To achieve this, after warmup, the scheduler’s “pace” will be slightly distorted from what would otherwise be expected.
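A brief usage sketch (the warmup length is an arbitrary example):
from composer.optim.scheduler import LinearWithWarmupScheduler

# Linear warmup over the first 500 batches, then a linear decay of alpha from 1.0
# to 0.0 over the remainder of the schedule.
scheduler = LinearWithWarmupScheduler(t_warmup='500ba', alpha_i=1.0, alpha_f=0.0)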
- class composer.optim.scheduler.MultiStepScheduler(milestones, gamma=0.1)[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Decays the learning rate discretely at fixed milestones.
See also
This scheduler is based on
MultiStepLR
from PyTorch.
Decays the learning rate by a factor of
gamma
whenever a time milestone in
milestones
is reached.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \gamma ^ x\]
Where \(x\) represents the number of milestones that have been reached, and \(\gamma\) represents the multiplicative decay factor.
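For example, the common recipe of dividing the learning rate by 10 at epochs 30 and 60 could be sketched as follows (the milestones are illustrative):
from composer.optim.scheduler import MultiStepScheduler

# alpha = 1.0 before epoch 30, 0.1 from epoch 30 to 60, and 0.01 thereafter.
scheduler = MultiStepScheduler(milestones=['30ep', '60ep'], gamma=0.1)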
- class composer.optim.scheduler.MultiStepWithWarmupScheduler(t_warmup, milestones, gamma=0.1)[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Decays the learning rate discretely at fixed milestones, with an initial warmup.
See also
This scheduler is based on
MultiStepScheduler
, with an added warmup.
Starts with a linear warmup over
t_warmup
time, then decays the learning rate by a factor of
gamma
whenever a time milestone in
milestones
is reached.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \begin{cases} t / t_{warmup}, & \text{if } t < t_{warmup} \\ \gamma ^ x & \text{otherwise} \end{cases}\]
Where \(t_{warmup}\) represents the warmup time, \(x\) represents the number of milestones that have been reached, and \(\gamma\) represents the multiplicative decay factor.
Warning
All milestones should be greater than
t_warmup
; otherwise, they will have no effect on the computed learning rate multiplier until the warmup has completed.
Warning
Initial warmup time is not scaled according to any provided scale schedule ratio! However, the milestones will still be scaled accordingly.
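A brief usage sketch (the warmup length and milestones are illustrative):
from composer.optim.scheduler import MultiStepWithWarmupScheduler

# Linear warmup over the first epoch, then a 10x decay at epochs 30 and 60.
scheduler = MultiStepWithWarmupScheduler(t_warmup='1ep', milestones=['30ep', '60ep'], gamma=0.1)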
- class composer.optim.scheduler.PolynomialScheduler(power, t_max='1dur', alpha_f=0.0)[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Sets the learning rate to be proportional to a power of the fraction of training time left.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \alpha_f + (1 - \alpha_f) \times (1 - \tau) ^ {\kappa}\]
Given \(\tau\), the fraction of time elapsed (clipped to the interval \([0, 1]\)), as:
\[\tau = t / t_{max}\]
Where \(\kappa\) represents the exponent to be used for the proportionality relationship, \(t_{max}\) represents the duration of this scheduler, and \(\alpha_f\) represents the learning rate multiplier to decay to.
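A brief usage sketch (the exponent is illustrative):
from composer.optim.scheduler import PolynomialScheduler

# Quadratic decay: alpha = (1 - tau)^2, falling from 1.0 to alpha_f = 0.0 over training.
scheduler = PolynomialScheduler(power=2.0, t_max='1dur', alpha_f=0.0)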
- class composer.optim.scheduler.StepScheduler(step_size, gamma=0.1)[source]#
Bases:
composer.optim.scheduler.ComposerScheduler
Decays the learning rate discretely at fixed intervals.
See also
This scheduler is based on
StepLR
from PyTorch.
Decays the learning rate by a factor of
gamma
periodically, with a frequency determined by
step_size
.
Specifically, the learning rate multiplier \(\alpha\) can be expressed as:
\[\alpha(t) = \gamma ^ {\text{floor}(t / \rho)}\]
Where \(\rho\) represents the time between changes to the learning rate (the step size), and \(\gamma\) represents the multiplicative decay factor.
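A brief usage sketch (the step size is illustrative):
from composer.optim.scheduler import StepScheduler

# Decay the learning rate by 10x every 30 epochs: alpha = 0.1 ** floor(epoch / 30).
scheduler = StepScheduler(step_size='30ep', gamma=0.1)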
- composer.optim.scheduler.compile_composer_scheduler(scheduler, state, ssr=1.0)[source]#
Converts a stateless scheduler into a PyTorch scheduler object.
While the resulting scheduler provides a
.step()
interface similar to other PyTorch schedulers, the scheduler is also given a bound reference to the current
State
. This means that any internal state updated by
.step()
can be ignored, and the scheduler can instead simply use the bound state to recalculate the current learning rate.
- Parameters
scheduler (ComposerScheduler) – A stateless scheduler, provided as a
ComposerScheduler
object.
state (State) – The Composer Trainer’s state.
- Returns
compiled_scheduler (PyTorchScheduler) – The scheduler, in a form compatible with PyTorch scheduler interfaces.
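A minimal sketch of how the compiled scheduler might be used, assuming state is an existing Composer State whose optimizers are already configured (ordinarily the Trainer performs this compilation itself):
from composer.optim.scheduler import CosineAnnealingScheduler, compile_composer_scheduler

# Wrap a stateless scheduler so it exposes a PyTorch-style .step() interface
# bound to the trainer's state. Here `state` is assumed to exist already.
pytorch_scheduler = compile_composer_scheduler(CosineAnnealingScheduler(), state)
pytorch_scheduler.step()  # recomputes the learning rate from the bound state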