gated_linear_units
Functions

apply_gated_linear_units – Replaces the Linear layers in the feed-forward network with Gated Linear Units.

from_BertIntermediate – Defines a replacement policy from a transformers.models.bert.modeling_bert.BertIntermediate to a torch.nn.Identity.

from_BertOutput – Defines a replacement policy from a transformers.models.bert.modeling_bert.BertOutput to a composer.algorithms.gated_linear_units.gated_linear_unit_layers.BERTGatedFFOutput.
Classes

Algorithm – Base class for algorithms.

BERTGatedFFOutput – Defines a single feed-forward block that uses Gated Linear Units.

BertForMaskedLM – Bert Model with a language modeling head on top.

BertForSequenceClassification – Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.

Event – Enum to represent training loop events.

GatedLinearUnits – Replaces all instances of Linear layers in the feed-forward subnetwork with a Gated Linear Unit.

HuggingFaceModel – A wrapper class that converts 🤗 Transformers models to composer models.

Logger – An interface to record training data.

State – The state of the trainer.
Exceptions

MissingConditionalImportError – Handles errors for external packages that might not be installed.

NoEffectWarning – Warns when an algorithm did not have an effect.
class composer.algorithms.gated_linear_units.gated_linear_units.GatedLinearUnits(act_fn=None, gated_layer_bias=False, non_gated_layer_bias=False)

Bases: composer.core.algorithm.Algorithm
Replaces all instances of Linear layers in the feed-forward subnetwork with a Gated Linear Unit. The Gated Linear Units provide a more expressive form for the same number of parameters, at the cost of a slight degradation in throughput. A sketch of the resulting feed-forward block follows the example below.

Runs on Event.INIT, so it can swap the Linear layers in the FFN for GLUs before the model is DDP-wrapped.

Parameters

act_fn (Callable[[Tensor], Tensor], optional) – Optionally, the activation function to use. If None, the algorithm will use the existing activation function in the model.

gated_layer_bias (bool, optional) – Whether to use biases in the gated linear layers within the GLU. Default: False.

non_gated_layer_bias (bool, optional) – Whether to use biases in the non-gated linear layers within the GLU. Default: False.
Example

from composer import Trainer
from composer.algorithms import GatedLinearUnits

algorithm = GatedLinearUnits()
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",
    algorithms=[algorithm],
    optimizers=[optimizer],
)
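For intuition, the block that gets swapped in follows the GLU feed-forward pattern of Shazeer (2020): two parallel input projections, where one branch is passed through the activation and multiplied elementwise into the other. Below is a minimal, self-contained sketch of that pattern; the class name GatedFeedForward is hypothetical, not Composer's actual BERTGatedFFOutput, which also wires in the BERT block's dropout, LayerNorm, and residual connection.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFeedForward(nn.Module):
    """Illustrative GLU feed-forward block (not Composer's exact implementation)."""

    def __init__(self, d_model, d_ff, act_fn=F.gelu,
                 gated_layer_bias=False, non_gated_layer_bias=False):
        super().__init__()
        # Two parallel projections over the same input: the "gated" branch is
        # passed through the activation; the "non-gated" branch scales it.
        self.gated_layer = nn.Linear(d_model, d_ff, bias=gated_layer_bias)
        self.non_gated_layer = nn.Linear(d_model, d_ff, bias=non_gated_layer_bias)
        self.wo = nn.Linear(d_ff, d_model)
        self.act_fn = act_fn

    def forward(self, x):
        # GLU: act(x W) * (x V), then project back to d_model.
        return self.wo(self.act_fn(self.gated_layer(x)) * self.non_gated_layer(x))

glu_ffn = GatedFeedForward(d_model=768, d_ff=3072)
out = glu_ffn(torch.randn(2, 16, 768))  # (batch, seq, d_model)

Note that the gated branch adds a second d_model x d_ff weight matrix; GLU variants that hold the parameter count fixed typically shrink d_ff to compensate (Shazeer, 2020).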
composer.algorithms.gated_linear_units.gated_linear_units.apply_gated_linear_units(model, optimizers, act_fn=None, gated_layer_bias=False, non_gated_layer_bias=False)
Replaces the Linear layers in the feed-forward network with Gated Linear Units.
Parameters

model (torch.nn.Module) – The model to modify in-place.

optimizers (torch.optim.Optimizer | Sequence[torch.optim.Optimizer], optional) – Existing optimizers bound to model.parameters(). All optimizers that have already been constructed with model.parameters() must be specified here so that they will optimize the correct parameters. If the optimizer(s) are constructed after calling this function, then it is safe to omit this parameter; those optimizers will see the correct model parameters.

act_fn (Callable[[Tensor], Tensor], optional) – Optionally, the activation function to use. If None, the algorithm will use the existing activation function in the model.

gated_layer_bias (bool, optional) – Whether to use biases in the gated linear layers within the GLU. Default: False.

non_gated_layer_bias (bool, optional) – Whether to use biases in the non-gated linear layers within the GLU. Default: False.
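As a usage sketch, the functional form can be applied directly to a model before the optimizer is built, in which case optimizers can be left as None. Composer re-exports this function through composer.functional; the checkpoint name below is only an example.

import torch
import composer.functional as cf
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Surgery happens before the optimizer is constructed, so `optimizers`
# can safely be None here.
cf.apply_gated_linear_units(model, optimizers=None)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)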
composer.algorithms.gated_linear_units.gated_linear_units.from_BertIntermediate(layer, module_index)

Defines a replacement policy from a transformers.models.bert.modeling_bert.BertIntermediate to a torch.nn.Identity. The identity effectively acts as a no-op, since the intermediate projection and activation are folded into the new gated feed-forward block (see the sketch at the end of this section).
composer.algorithms.gated_linear_units.gated_linear_units.from_BertOutput(layer, module_index, act_fn, gated_layer_bias=False, non_gated_layer_bias=False)

Defines a replacement policy from a transformers.models.bert.modeling_bert.BertOutput to a composer.algorithms.gated_linear_units.gated_linear_unit_layers.BERTGatedFFOutput.
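Taken together, the two policies describe the swap as a mapping from module class to replacement function, which GatedLinearUnits hands to Composer's module-surgery machinery. The following is a hedged sketch of that wiring, assuming replace_module_classes invokes each policy as policy(module, module_index), and with GELU standing in for the activation (the algorithm normally reuses the activation already present in the model); the checkpoint name is only an example.

import functools

import torch
from transformers import BertForMaskedLM
from transformers.models.bert.modeling_bert import BertIntermediate, BertOutput

from composer.algorithms.gated_linear_units.gated_linear_units import (
    from_BertIntermediate,
    from_BertOutput,
)
from composer.utils import module_surgery

model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # example checkpoint

policies = {
    # BertIntermediate -> Identity: its projection and activation move into
    # the gated feed-forward block created below.
    BertIntermediate: from_BertIntermediate,
    # BertOutput -> BERTGatedFFOutput, built around the chosen activation.
    BertOutput: functools.partial(
        from_BertOutput,
        act_fn=torch.nn.functional.gelu,  # assumption for illustration
    ),
}

module_surgery.replace_module_classes(model, policies=policies)

If optimizers have already been constructed against model.parameters(), pass them via the optimizers argument of replace_module_classes, mirroring the optimizers parameter of apply_gated_linear_units above.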