alibi#
Core ALiBi classes and functions.
Functions
apply_alibi: Removes position embeddings and replaces the attention function and attention mask as per Alibi.
Classes
Alibi: ALiBi (Attention with Linear Biases; Press et al., 2021) dispenses with position embeddings and instead directly biases attention matrices such that nearby tokens attend to one another more strongly.
- class composer.algorithms.alibi.alibi.Alibi(max_sequence_length, train_sequence_length_scaling=0.25)[source]#
Bases:
composer.core.algorithm.Algorithm
ALiBi (Attention with Linear Biases; Press et al., 2021) dispenses with position embeddings and instead directly biases attention matrices such that nearby tokens attend to one another more strongly.
ALiBi yields excellent extrapolation to unseen sequence lengths compared to other position embedding schemes. We leverage this extrapolation capability by training with shorter sequence lengths, which reduces the memory and computation load.
This algorithm runs on Event.INIT to modify the model before the model has been moved to accelerators. It also runs on Event.AFTER_DATALOADER to modify the shape of a batch of data after the model and data have been moved to accelerators. See the Method Card for more details.
Example:
from composer.algorithms import Alibi
from composer.trainer import Trainer

alibi = Alibi(
    max_sequence_length=512,
    train_sequence_length_scaling=0.25,
)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",
    algorithms=[alibi],
)
- Parameters
max_sequence_length (int) – Maximum sequence length that the model will be able to accept. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.

train_sequence_length_scaling (float, optional) – Amount by which to scale the training sequence length. One batch of training data will be reshaped from shape \((sequence\_length, batch)\) to \((sequence\_length \times train\_sequence\_length\_scaling, \frac{batch}{train\_sequence\_length\_scaling})\). Default: 0.25.
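The reshaping above can be sketched in plain Python (a hedged illustration, not Composer's internal implementation): with train_sequence_length_scaling=0.25, a batch of 8 sequences of length 1024 becomes 32 sequences of length 256, leaving the total token count unchanged.

```python
# Toy illustration of train_sequence_length_scaling=0.25 (not Composer code).
scaling = 0.25
batch_size, seq_len = 8, 1024

# A batch of token-id sequences, here just dummy integers.
batch = [[0] * seq_len for _ in range(batch_size)]

# Each sequence becomes 4x shorter; the batch dimension grows 4x.
new_seq_len = int(seq_len * scaling)
flat = [tok for seq in batch for tok in seq]
reshaped = [flat[i:i + new_seq_len] for i in range(0, len(flat), new_seq_len)]

print(len(reshaped), len(reshaped[0]))  # 32 256
```

Because the total number of tokens per batch is preserved, this trades sequence length for batch size, which is what reduces memory and compute during training.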
- composer.algorithms.alibi.alibi.apply_alibi(model, max_sequence_length, optimizers=None)[source]#
Removes position embeddings and replaces the attention function and attention mask as per Alibi. Note that the majority of the training speed-up from using ALiBi comes from being able to train on shorter sequence lengths; this function does not scale the training sequence length as Alibi does, so little speedup will be observed from using it alone. See the Method Card for more details. This function should be called after the model is instantiated and before training begins.

Example:
import composer.functional as cf

cf.apply_alibi(
    model=model,
    max_sequence_length=512,
)
- Parameters
model (Module) – Model to transform.

max_sequence_length (int) – Maximum sequence length that the model will be able to accept. Internally, the transformations applied by ALiBi change sequence-shaped tensors to handle sequences up to max_sequence_length. Depending on max_sequence_length and model, these changes could increase or decrease the model's maximum sequence length.

At minimum, max_sequence_length should be set to the sequence length used during training. However, if evaluating on sequence lengths longer than those used in training, max_sequence_length should be set accordingly.

Note that a larger max_sequence_length means a larger memory footprint for the model. So, it is best to set this parameter equal to the longest sequence length that will be seen during training and/or evaluation.

optimizers (Optimizer | Sequence[Optimizer], optional) – Existing optimizers bound to model.parameters(). All optimizers that have already been constructed with model.parameters() must be specified here so that they will optimize the correct parameters.

If the optimizer(s) are constructed after calling this function, then it is safe to omit this parameter. These optimizers will see the correct model parameters.
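The reason pre-existing optimizers must be passed in can be sketched with a toy example (pure Python, not Composer or PyTorch code): an optimizer keeps direct references to parameter objects, so module surgery that swaps parameters out, as ALiBi's attention replacement does, leaves an already-constructed optimizer pointing at stale objects.

```python
# Toy illustration (not Composer code) of why optimizers built before
# module surgery must be handed to apply_alibi.
class ToyOptimizer:
    def __init__(self, params):
        # Holds direct references to parameter objects, as torch optimizers do.
        self.params = list(params)

model_params = {"attn.weight": [1.0, 2.0]}
opt = ToyOptimizer(model_params.values())

# "Surgery" replaces a parameter object; the optimizer was never told.
model_params["attn.weight"] = [0.0, 0.0]

# The optimizer still references the old parameter, not the new one.
print(opt.params[0] is model_params["attn.weight"])  # False
```

Passing the optimizer to apply_alibi lets the function repoint such references at the post-surgery parameters; an optimizer constructed afterwards never holds stale references, so it may be omitted.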