composer.algorithms
We describe programmatic modifications to the model or training process as “algorithms.” Examples include smoothing the labels and adding Squeeze-and-Excitation blocks, among many others.
Algorithms can be used in two ways:
- Using Algorithm objects. These objects provide callbacks to be run in the training loop.
- Using algorithm-specific functions and classes, such as smooth_labels or SqueezeExcite2d.
The former are easier to compose, since they all share the same public interface and work automatically with the Composer Trainer. The latter are easier to integrate piecemeal into an existing codebase.
See Algorithm for more information.
The following algorithms are available in Composer:
Algorithm to apply ALiBi to the model.
Object that does AugMix (Hendrycks et al. (2020), AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty).
Algorithm to apply BlurPool to the model.
ChannelsLast algorithm runs on Event.TRAINING_START and changes the memory format of the model to torch.channels_last.
Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level.
Algorithm to apply Ghost Batch Normalization to the model.
Applies label smoothing during before_loss, then restores the original labels during after_loss.
Algorithm to apply Layer Freezing to the model.
Applies MixUp algorithm by modifying the images and labels during Event.AFTER_DATALOADER.
Applies the 'progressive resizing' data augmentation algorithm to speed up training.
Object that does RandAugment (Cubuk et al. (2019), RandAugment: Practical automated data augmentation with a reduced search space).
Applies SAM by wrapping existing optimizers with the SAMOptimizer.
Scale the learning rate schedule.
Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.
Algorithm to replace a specified block with a stochastic version of the block.
Apply Stochastic Weight Averaging.
Alibi
Algorithm
- class composer.algorithms.alibi.Alibi(position_embedding_attribute, attention_module_name, attr_to_replace, alibi_attention, mask_replacement_function, heads_per_layer, max_sequence_length, train_sequence_length_scaling)[source]
Algorithm to apply ALiBi to the model. Runs on Event.INIT. This algorithm should be applied before the model has been moved to accelerators.
- Parameters
heads_per_layer (int) – number of attention heads per layer
max_sequence_length (int) – maximum sequence length that the model will be able to accept without returning an error
position_embedding_attribute (str) – attribute for position embeddings. For example in HuggingFace’s GPT2, the position embeddings are “transformer.wpe”.
attention_module_name (str) – module/class that will have its self-attention function replaced. For example, in HuggingFace’s GPT2, the self-attention module is “transformers.models.gpt2.modeling_gpt2.GPT2Attention”.
attr_to_replace (str) – attribute that self-attention function will replace. For example, in HuggingFace’s GPT2, the self-attention function is “_attn”.
alibi_attention (str) – Path to new self-attention function in which ALiBi is implemented. Used to replace “{attention_module}.{attr_to_replace}”.
mask_replacement_function (str) – Path to function to replace model’s attention mask. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.
train_sequence_length_scaling (float) – Amount by which to scale training sequence length. One batch of training data will be reshaped from size (sequence_length, batch) to (sequence_length*train_sequence_length_scaling, batch/train_sequence_length_scaling).
- class composer.algorithms.alibi.AlibiHparams(position_embedding_attribute: 'str', attention_module_name: 'str', attr_to_replace: 'str', alibi_attention: 'str', mask_replacement_function: 'Union[str, None]' = None, heads_per_layer: 'Union[int, Optional[None]]' = None, max_sequence_length: 'int' = 8192, train_sequence_length_scaling: 'float' = 0.25)[source]
Standalone
- composer.algorithms.alibi.apply_alibi(model, heads_per_layer, max_sequence_length, position_embedding_attribute, attention_module, attr_to_replace, alibi_attention, mask_replacement_function)[source]
- Applies ALiBi to the provided model. Removes position embeddings and replaces
the attention function and attention mask.
- Parameters
model (torch.nn.Module) – model to transform
heads_per_layer (int) – number of attention heads per layer
max_sequence_length (int) – maximum sequence length that the model will be able to accept without returning an error
position_embedding_attribute (str) – attribute for position embeddings. For example in HuggingFace’s GPT2, the position embeddings are “transformer.wpe”.
attention_module (str) – module/class that will have its self-attention function replaced. For example, in HuggingFace’s GPT2, the self-attention module is transformers.models.gpt2.modeling_gpt2.GPT2Attention.
attr_to_replace (str) – attribute that self-attention function will replace. For example, in HuggingFace’s GPT2, the self-attention function is “_attn”.
alibi_attention (callable) – new self-attention function in which ALiBi is implemented. Used to replace “{attention_module}.{attr_to_replace}”.
mask_replacement_function (callable) – function to replace model’s attention mask. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.
- Return type
Augmix
Algorithm
- class composer.algorithms.augmix.AugMix(severity=3, depth=- 1, width=3, alpha=1.0, augmentation_set='all')[source]
- Object that does AugMix (Hendrycks et al. (2020), AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty). Can be passed as a transform to torchvision.transforms.Compose().
- Parameters
severity – Severity of augmentation operators (between 1 and 10).
width – Width of augmentation chains (number of parallel augmentations).
depth – Depth of each augmentation chain. -1 enables stochastic depth, sampled uniformly from [1, 3].
alpha – Probability coefficient for Beta and Dirichlet distributions. Sampling from Dirichlet determines the relative weights of each augmented image; sampling from Beta determines the relative weights of the unaugmented and augmented images.
augmentation_set – String, one of [“augmentations_all”, “augmentations_corruption_safe”, “augmentations_original”]. Set of augmentations to use. “augmentations_corruption_safe” excludes transforms that are part of the ImageNet-C/CIFAR10-C test sets. “augmentations_original” uses all augmentations, but some of the implementations are identical to the original github repo, which appears to contain implementation specificities for the augmentations “color”, “contrast”, “sharpness”, and “brightness”.
Standalone
- composer.algorithms.augmix.augment_and_mix(img=None, severity=3, depth=-1, width=3, alpha=1.0, augmentation_set=[<function autocontrast>, <function equalize>, <function posterize>, <function rotate>, <function solarize>, <function shear_x>, <function shear_y>, <function translate_x>, <function translate_y>, <function color>, <function contrast>, <function brightness>, <function sharpness>])[source]
Perform augmentations.
BlurPool
Algorithm
Standalone
- class composer.algorithms.blurpool.BlurConv2d(in_channels, out_channels, kernel_size, stride=None, padding=0, dilation=1, groups=1, bias=True, blur_first=True)[source]
This module is a drop-in replacement for PyTorch’s Conv2d, but with an anti-aliasing filter applied.
It should be used only to replace strided convolutions.
See the associated paper for more details, experimental results, etc.
See also: blur_2d().
- class composer.algorithms.blurpool.BlurMaxPool2d(kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False)[source]
This module is a (nearly) drop-in replacement for PyTorch’s MaxPool2d, but with an anti-aliasing filter applied.
The only API difference is that the parameter return_indices is not available, because it is ill-defined when using anti-aliasing.
See the associated paper for more details, experimental results, etc.
See also: blur_2d().
- class composer.algorithms.blurpool.BlurPool2d(stride=2, padding=1)[source]
Apply a spatial low-pass filter.
The filter used is:
[1 2 1]
[2 4 2] * 1/16
[1 2 1]
This module is a thin wrapper around blur_2d().
- composer.algorithms.blurpool.blur_2d(input, stride=1, filter=None)[source]
Apply a spatial low-pass filter.
- Parameters
input (torch.Tensor) – a 4d tensor in either NCHW or NHWC format.
stride (Union[int, Tuple[int, int]]) – stride(s) along H and W axes. If a single value is passed, this value is used for both dimensions.
padding – implicit zero-padding to use. For the default 3x3 low-pass filter, padding=1 (the default) returns output of the same size as the input.
filter (Optional[torch.Tensor]) – a 2d or 4d tensor to be cross-correlated with the input tensor at each spatial position, within each channel. If 4d, the structure is required to be (C, 1, kH, kW) where C is the number of channels in the input tensor and kH and kW are the spatial sizes of the filter.
- Return type
By default, the filter used is:
[1 2 1]
[2 4 2] * 1/16
[1 2 1]
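The default filter can be illustrated with a minimal pure-Python sketch. It assumes a single-channel image stored as a list of rows and zero padding of 1. The name blur_2d_sketch is hypothetical, and this is not Composer's implementation, which operates on 4d torch tensors.

```python
def blur_2d_sketch(image, stride=1):
    """Cross-correlate a 2d image with the 3x3 low-pass filter, zero-padded by 1."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # divided by 16 below
    height, width = len(image), len(image[0])
    output = []
    for r in range(0, height, stride):
        row = []
        for c in range(0, width, stride):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Positions outside the image count as zero (implicit padding).
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += kernel[dr + 1][dc + 1] * image[rr][cc]
            row.append(acc / 16.0)
        output.append(row)
    return output
```

On a constant image the interior pixels are unchanged, while border pixels are attenuated because of the zero padding. With stride=2 the output is half the input size along each axis.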
- composer.algorithms.blurpool.apply_blurpool(model, replace_convs=True, replace_maxpools=True, blur_first=True)[source]
Applies BlurPool algorithm to the provided model. Performs an in-place replacement of eligible convolution and pooling layers.
- Parameters
model (torch.nn.Module) – model to transform
replace_convs (bool) – replace eligible Conv2D with BlurConv2d. Default: True.
replace_maxpools (bool) – replace eligible MaxPool2d with BlurMaxPool2d. Default: True.
blur_first (bool) – for replace_convs, blur input before conv. Default: True
- Return type
Channels Last
Algorithm
ColOut
Algorithm
Standalone
- composer.algorithms.colout.colout(img, p_row, p_col)[source]
Drops random rows and columns from a single image.
- Parameters
img (torch.Tensor or PIL Image) – An input image as a torch.Tensor or PIL image
p_row (float) – Fraction of rows to drop (drop along H).
p_col (float) – Fraction of columns to drop (drop along W).
- Returns
torch.Tensor or PIL Image – A smaller image with rows and columns dropped
- Return type
Union[torch.Tensor, PIL.Image.Image]
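As a rough illustration of dropping rows and columns, here is a pure-Python sketch on a 2d list. The name colout_sketch, the rng argument, and the rounding of the kept row and column counts are assumptions; Composer's function works on tensors and PIL images and may round differently.

```python
import random

def colout_sketch(image, p_row, p_col, rng=None):
    """Keep a random subset of rows and columns, preserving their original order."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Assumption: the number of kept rows/columns is truncated to an integer.
    keep_rows = int((1.0 - p_row) * height)
    keep_cols = int((1.0 - p_col) * width)
    rows = sorted(rng.sample(range(height), keep_rows))
    cols = sorted(rng.sample(range(width), keep_cols))
    return [[image[r][c] for c in cols] for r in rows]
```

For an 8x8 input with p_row = p_col = 0.25, the result is a 6x6 image.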
CutOut
Algorithm
- class composer.algorithms.cutout.CutOut(n_holes, length)[source]
Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level.
- Parameters
X (Tensor) – Batch Tensor image of size (B, C, H, W).
n_holes – Integer number of holes to cut out
length – Side length of the square hole to cut out.
Standalone
- composer.algorithms.cutout.cutout(X, n_holes, length)[source]
Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level. Adapted from the implementation in https://github.com/uoguelph-mlrg/Cutout
- Parameters
- Returns
X_cutout – Image with n_holes of dimension length x length cut out of it.
- Return type
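The hole-cutting step can be sketched in pure Python for a single-channel image. The name cutout_sketch and the rng argument are hypothetical. Clipping holes at the image border is an assumption modeled on the referenced Cutout repository, not a statement about Composer's code.

```python
import random

def cutout_sketch(image, n_holes, length, rng=None):
    """Return a copy of a 2d image with n_holes square regions set to zero."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    output = [row[:] for row in image]  # leave the input untouched
    for _ in range(n_holes):
        # Sample the hole centre, then clip the square to the image bounds.
        cy, cx = rng.randrange(height), rng.randrange(width)
        y1, y2 = max(0, cy - length // 2), min(height, cy + length // 2)
        x1, x2 = max(0, cx - length // 2), min(width, cx + length // 2)
        for r in range(y1, y2):
            for c in range(x1, x2):
                output[r][c] = 0.0
    return output
```

Because holes are clipped at the border, a hole centred near an edge masks fewer than length x length pixels.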
Ghost Batch Normalization
Algorithm
- class composer.algorithms.ghost_batchnorm.GhostBatchNorm(ghost_batch_size=32)[source]
Algorithm to apply Ghost Batch Normalization to the model.
This entails replacing all of the batch normalization modules with ghost batch normalization modules on Event.INIT.
- Parameters
ghost_batch_size – size of sub-batches to normalize over
Standalone
- composer.algorithms.ghost_batchnorm.apply_ghost_batchnorm(model, ghost_batch_size)[source]
Replaces batch normalization modules with ghost batch normalization modules
This algorithm should be applied before the model has been moved to accelerators, and before the model’s parameters have been passed to an optimizer.
- Parameters
model (torch.nn.modules.module.Module) – model to transform
ghost_batch_size (int) – size of sub-batches to normalize over
- Return type
torch.nn.modules.module.Module
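The core idea, normalizing each sub-batch with its own statistics, can be sketched for a single feature in pure Python. The name ghost_batchnorm_sketch and the eps value are assumptions. Learnable scale and shift parameters and running statistics are omitted for brevity.

```python
def ghost_batchnorm_sketch(values, ghost_batch_size, eps=1e-5):
    """Normalize a list of scalars in consecutive chunks of ghost_batch_size."""
    normalized = []
    for start in range(0, len(values), ghost_batch_size):
        chunk = values[start:start + ghost_batch_size]
        mean = sum(chunk) / len(chunk)
        var = sum((v - mean) ** 2 for v in chunk) / len(chunk)
        # Each chunk uses its own mean and (biased) variance.
        normalized.extend((v - mean) / (var + eps) ** 0.5 for v in chunk)
    return normalized
```

With values [0, 2, 10, 14] and ghost_batch_size=2, both sub-batches normalize to roughly [-1, 1], whereas normalizing the full batch at once would not.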
Label Smoothing
Algorithm
- class composer.algorithms.label_smoothing.LabelSmoothing(alpha)[source]
Applies label smoothing during before_loss, then restores the original labels during after_loss.
- Parameters
alpha (float) – Strength of the label smoothing, between [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored)
Standalone
- composer.algorithms.label_smoothing.smooth_labels(logits, targets, alpha)[source]
Shrinks targets towards a prior distribution to counteract label noise.
This is computed by (1 - alpha) * targets + alpha * smoothed_targets where smoothed_targets is a pre-specified vector of class probabilities.
Introduced in: https://arxiv.org/abs/1512.00567 Evaluated in: https://arxiv.org/abs/1906.02629
- Parameters
logits (torch.Tensor) – Output of the model. Tensor of shape (N, C, d1, …, dn) for N examples and C classes, and d1, …, dn extra dimensions.
targets (torch.Tensor) – Tensor of shape (N) containing integers 0 <= i <= C-1 specifying the target labels for each example.
alpha (float) – Strength of the label smoothing, between [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored)
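The formula (1 - alpha) * targets + alpha * smoothed_targets can be sketched for a single label in pure Python. This sketch assumes a uniform prior over classes as smoothed_targets. The name smooth_one_hot is hypothetical and is not part of Composer.

```python
def smooth_one_hot(target_index, num_classes, alpha):
    """Blend a one-hot target with a uniform distribution over classes."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    uniform = 1.0 / num_classes  # assumed prior: every class equally likely
    return [
        (1.0 - alpha) * (1.0 if k == target_index else 0.0) + alpha * uniform
        for k in range(num_classes)
    ]
```

With alpha=0 the one-hot target is returned unchanged, and with alpha=1 the result is the uniform prior, so the target is ignored.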
Layer Freezing
Algorithm
- class composer.algorithms.layer_freezing.LayerFreezing(freeze_start=0.5, freeze_level=1.0)[source]
Algorithm to apply Layer Freezing to the model. Runs on Event.EPOCH_END. During training, progressively freezes the layers of the network, starting with the earlier layers. Freezing starts after the fraction of epochs specified by freeze_start has run. The fraction of frozen layers then increases linearly until it reaches freeze_level at the final epoch.
- Parameters
freeze_start – The fraction of epochs to run before freezing begins.
freeze_level – The maximum fraction of layers to freeze.
Standalone
- composer.algorithms.layer_freezing.freeze_layers(model, optimizers, current_epoch, max_epochs, freeze_start, freeze_level, logger)[source]
Implements the layer freezing algorithm. During training, progressively freeze the layers of the network starting with the earlier layers.
- Parameters
model (torch.nn.modules.module.Module) – An instance of the model being trained.
optimizers (Union[torch.optim.optimizer.Optimizer, Tuple[torch.optim.optimizer.Optimizer, ...]]) – The optimizers used during training.
current_epoch (int) – Integer specifying the current epoch.
max_epochs (int) – The max number of epochs training will run for.
freeze_start (float) – The fraction of epochs to run before freezing begins.
freeze_level (float) – The maximum fraction of layers to freeze.
logger (composer.core.logging.logger.Logger) –
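The freezing schedule can be sketched as a small function, assuming a linear ramp from 0 at freeze_start to freeze_level at the final epoch. The name frozen_fraction is hypothetical, and the exact rounding Composer applies when mapping this fraction to a layer count is not shown here.

```python
def frozen_fraction(current_epoch, max_epochs, freeze_start, freeze_level):
    """Fraction of layers to freeze at the given epoch under a linear ramp."""
    progress = current_epoch / max_epochs
    if freeze_start >= 1.0 or progress <= freeze_start:
        return 0.0  # freezing has not begun yet
    # Position within the freezing window, clamped to [0, 1].
    ramp = min(1.0, (progress - freeze_start) / (1.0 - freeze_start))
    return freeze_level * ramp
```

For example, with freeze_start=0.5 and freeze_level=0.8, nothing is frozen at the halfway point, 40% of layers are frozen three quarters of the way through, and 80% at the final epoch.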
MixUp
Algorithm
Standalone
- composer.algorithms.mixup.mixup_batch(x, y, interpolation_lambda, n_classes, indices=None)[source]
Implements mixup on a single batch of data.
This constructs a new batch of data given an original batch. This is done through the convex combination of x with a randomly permuted copy of x. The interpolation parameter lambda should be chosen from a beta distribution with parameter alpha. Note that the same lambda is used for all examples within the batch.
Both the original and shuffled labels are returned. This is done because for many loss functions (such as cross entropy) the targets are given as indices, so interpolation must be handled separately.
- Parameters
x (torch.Tensor) – Input tensor of shape (B, d1, d2, …, dn), B is batch size, d1-dn are feature dimensions.
y (torch.Tensor) – Target tensor of shape (B, f1, f2, …, fm), B is batch size, f1-fm are possible target dimensions.
interpolation_lambda (float) – Amount of interpolation based on alpha.
n_classes (int) – Total number of classes.
indices (Optional[torch.Tensor]) – Tensor of shape (B). Permutation of the batch indices. Used for permuting without randomness.
- Returns
x_mix – Batch of inputs after mixup has been applied.
y_mix – Labels after mixup has been applied.
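The convex combination can be sketched in pure Python for inputs stored as lists of feature vectors and integer class labels. The name mixup_sketch and the rng argument are hypothetical, and returning interpolated one-hot labels is an assumption about the form of y_mix.

```python
import random

def mixup_sketch(x, y, lam, n_classes, indices=None, rng=None):
    """Mix each example with a permuted partner using one shared lambda."""
    if indices is None:
        rng = rng or random.Random()
        indices = list(range(len(x)))
        rng.shuffle(indices)

    def one_hot(label):
        return [1.0 if k == label else 0.0 for k in range(n_classes)]

    x_mix, y_mix = [], []
    for i, j in enumerate(indices):
        # Same interpolation weight for inputs and (one-hot) labels.
        x_mix.append([lam * a + (1.0 - lam) * b for a, b in zip(x[i], x[j])])
        y_mix.append([lam * a + (1.0 - lam) * b
                      for a, b in zip(one_hot(y[i]), one_hot(y[j]))])
    return x_mix, y_mix
```

In practice lam would be drawn once per batch, for instance with random.betavariate(alpha, alpha) if a symmetric Beta distribution is assumed.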
Progressive Resizing
Algorithm
- class composer.algorithms.progressive_resizing.ProgressiveResizing(mode, initial_scale, finetune_fraction, resize_targets)[source]
Applies the ‘progressive resizing’ data augmentation algorithm to speed up training. See Training a State-of-the-Art Model (https://github.com/fastai/fastbook/blob/780b76bef3127ce5b64f8230fce60e915a7e0735/07_sizing_and_tta.ipynb).
“Progressive resizing” initially scales inputs down to speed up early training. Throughout training, the scaling factor is gradually increased, yielding larger inputs up to the original input size. A final finetuning period is then run to finetune the model using the full-sized inputs.
- Parameters
mode (str) – Type of scaling to perform. Value must be one of ‘crop’ or ‘resize’. ‘crop’ performs a random crop, whereas ‘resize’ performs a bilinear interpolation. Default: ‘resize’.
initial_scale (float) – Initial scale factor used to shrink the inputs. Must be a value in between 0 and 1.
finetune_fraction (float) – Fraction of training to reserve for finetuning on the full-sized inputs. Must be a value in between 0 and 1.
resize_targets (bool) – If True, resize targets also.
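One way to picture the schedule is the sketch below, which assumes the scale factor grows linearly from initial_scale to 1.0 over the first (1 - finetune_fraction) of training. The name resize_scale and the linear shape of the ramp are assumptions; the text above only says the factor is increased gradually.

```python
def resize_scale(progress, initial_scale, finetune_fraction):
    """Scale factor at a training progress in [0, 1], assuming a linear ramp."""
    ramp_end = 1.0 - finetune_fraction
    if ramp_end <= 0.0 or progress >= ramp_end:
        return 1.0  # finetuning period: full-sized inputs
    return initial_scale + (1.0 - initial_scale) * (progress / ramp_end)
```

With initial_scale=0.5 and finetune_fraction=0.2, inputs start at half size, reach 75% at 40% of training, and are full-sized for the final 20%.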
Standalone
- composer.algorithms.progressive_resizing.resize_inputs(X, y, scale_factor, mode='resize', resize_targets=False)[source]
Resize inputs and optionally outputs by cropping or interpolating.
- Parameters
X (torch.Tensor) – Input tensor of shape (N, C, H, W). Resizing will be done along dimensions H and W using the constant factor scale_factor.
y (torch.Tensor) – If resize_targets is True, output tensor of shape (N, C, H, W) that will also be resized.
scale_factor (float) – Scaling coefficient for the height and width of the input/output tensor. 1.0 keeps the original size.
mode (str) – Type of scaling to perform. Value must be one of ‘crop’ or ‘resize’. ‘crop’ performs a random crop, whereas ‘resize’ performs a bilinear interpolation. Default: ‘resize’.
resize_targets (bool) – Resize the targets, y, as well. Default: False.
- Returns
X_sized (torch.Tensor) – Resized input tensor of shape (N, C, H * scale_factor, W * scale_factor).
y_sized (torch.Tensor) – If resize_targets is True, resized output tensor of shape (N, C, H * scale_factor, W * scale_factor). Otherwise, returns the original y.
- Return type
Tuple[torch.Tensor, torch.Tensor]
RandAugment
Algorithm
- class composer.algorithms.randaugment.RandAugment(severity=9, depth=2, augmentation_set='all')[source]
- Object that does RandAugment (Cubuk et al. (2019), RandAugment: Practical automated data augmentation with a reduced search space). Can be passed as a transform to torchvision.transforms.Compose().
- Parameters
severity (int) – Severity of augmentation operators (between 1 and 10). M in the original paper. Default = 9.
depth (int) – Depth of augmentation chain. N in the original paper. Default = 2.
augmentation_set (str) – One of [“augmentations_all”, “augmentations_corruption_safe”, “augmentations_original”]. Set of augmentations to use. “augmentations_corruption_safe” excludes transforms that are part of the ImageNet-C/CIFAR10-C test sets. “augmentations_original” uses all augmentations, but some of the implementations are identical to the original github repo, which appears to contain implementation specificities for the augmentations “color”, “contrast”, “sharpness”, and “brightness”.
Standalone
- composer.algorithms.randaugment.randaugment(img=None, severity=9, depth=2, augmentation_set=[<function autocontrast>, <function equalize>, <function posterize>, <function rotate>, <function solarize>, <function shear_x>, <function shear_y>, <function translate_x>, <function translate_y>, <function color>, <function contrast>, <function brightness>, <function sharpness>])[source]
Perform augmentations.
Sequence Length Warmup
Algorithm
Standalone
Scaling the Learning Rate Schedule
Algorithm
Standalone
Selective Backpropagation
Algorithm
- class composer.algorithms.selective_backprop.SelectiveBackprop(start, end, keep, scale_factor, interrupt)[source]
Selectively backprop on a subset of each batch.
Selective Backprop (SB) prunes minibatches according to the difficulty of the individual training examples, and only computes weight gradients over the pruned subset, reducing iteration time and speeding up training. The fraction of the minibatch that is kept for gradient computation is specified by the argument 0 <= keep <= 1.
See Accelerating Deep Learning by Focusing on the Biggest Losers (https://arxiv.org/abs/1910.00762).
To speed up SB’s selection forward pass, the argument scale_factor can be used to downsample input image tensors. The full-sized inputs will still be used for the weight gradient computation.
To preserve convergence, SB can be interrupted with vanilla minibatch gradient steps every interrupt steps. When interrupt=0, SB will be used at every step during the SB interval. When interrupt=2, SB will alternate with vanilla minibatch steps.
- Parameters
start (float) – SB interval start as fraction of training duration
end (float) – SB interval end as fraction of training duration
keep (float) – fraction of minibatch to select and keep for gradient computation
scale_factor (float) – scale for downsampling input for selection forward pass
interrupt (int) – interrupt SB with a vanilla minibatch step every ‘interrupt’ batches
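The selection and interruption logic can be sketched in pure Python. Both function names are hypothetical. Keeping the highest-loss examples deterministically is a simplification for illustration, since the paper and Composer may select examples probabilistically. Treating every step that is a multiple of interrupt as the vanilla step is also an assumption.

```python
def select_hardest(losses, keep):
    """Indices of the highest-loss examples, keeping a fraction `keep` of the batch."""
    n_keep = max(1, int(keep * len(losses)))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:n_keep])

def use_selective_backprop(step, progress, start, end, interrupt):
    """Whether SB applies at this step, given the SB interval and interrupt period."""
    in_interval = start <= progress < end
    # Assumption: every `interrupt`-th step is a vanilla minibatch step.
    is_vanilla_step = interrupt > 0 and step % interrupt == 0
    return in_interval and not is_vanilla_step
```

With interrupt=2 the second function alternates between SB and vanilla steps inside the interval, and with interrupt=0 it selects SB on every step inside the interval.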
Squeeze-and-Excitation
Algorithm
- class composer.algorithms.squeeze_excite.SqueezeExcite(latent_channels=64, min_channels=128)[source]
Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.
- Parameters
latent_channels – The dimensionality of the hidden layer within the added MLP.
min_channels – An SE block is added after a Conv2d module conv only if min(conv.in_channels, conv.out_channels) >= min_channels. For models that reduce spatial size and increase channel count deeper in the network, this parameter can be used to only add SE blocks deeper in the network. This may be desirable because SE blocks add less overhead when their inputs have smaller spatial size.
Standalone
- class composer.algorithms.squeeze_excite.SqueezeExcite2d(num_features, latent_channels=0.125)[source]
- class composer.algorithms.squeeze_excite.SqueezeExciteConv2d(*args, latent_channels=0.125, conv=None, **kwargs)[source]
Helper class used to add a Squeeze-and-Excitation block after a Conv2d.
- Parameters
conv (torch.nn.Conv2d) –
- composer.algorithms.squeeze_excite.apply_se(model, latent_channels, min_channels)[source]
Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.
- Parameters
model (torch.nn.modules.module.Module) – A module containing one or more torch.nn.Conv2d modules.
latent_channels (float) – The dimensionality of the hidden layer within the added MLP.
min_channels (int) – An SE block is added after a Conv2d module conv only if min(conv.in_channels, conv.out_channels) >= min_channels. For models that reduce spatial size and increase channel count deeper in the network, this parameter can be used to only add SE blocks deeper in the network. This may be desirable because SE blocks add less overhead when their inputs have smaller spatial size.
Stochastic Depth
Algorithm
- class composer.algorithms.stochastic_depth.StochasticDepth(stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', drop_warmup=0.0, use_same_gpu_seed=True)[source]
Algorithm to replace a specified block with a stochastic version of the block.
The stochastic block will randomly drop either samples or the layer itself depending on the stochastic method specified. The layer-wise version follows the original paper (https://arxiv.org/abs/1603.09382). The sample-wise version follows the implementation used for EfficientNet in the Tensorflow/TPU repo (https://github.com/tensorflow/tpu).
- Parameters
stochastic_method – The version of stochastic depth to use. “block” randomly drops blocks during training. “sample” randomly drops samples within a block during training.
target_layer_name (str) – Which block to replace with a stochastic block equivalent. The name must be registered in STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, must be one of [‘ResNetBottleneck’].
drop_rate (float) – The base probability of dropping a layer or a sample. Must be between 0.0 and 1.0.
drop_distribution (str) – How drop_rate is distributed across layers. Value must be either ‘uniform’ or ‘linear’. ‘uniform’ assigns the same drop_rate across all layers. ‘linear’ linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.
drop_warmup (float) – Fraction of training epochs over which to linearly increase the drop probability to drop_rate. Must be between 0.0 and 1.0.
use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set false to have each GPU drop a different set of layers. Only used with “block” stochastic method.
Standalone
- class composer.algorithms.stochastic_depth.StochasticBottleneck(drop_rate, module_id, module_count, use_same_gpu_seed, use_same_depth_across_gpus, rand_generator, **kwargs)[source]
Stochastic ResNet Bottleneck layer. This layer has a probability of skipping the transformation section of the layer and scales the transformation section output by (1 - drop probability) during inference.
- Parameters
drop_rate (float) – Probability of dropping the layer. Must be between 0.0 and 1.0.
module_id (int) – The placement of the layer within a network e.g. 0 for the first layer in the network.
module_count (int) – The total number of layers of this type in the network.
use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set to false to have each GPU drop a different set of layers.
use_same_depth_across_gpus (bool) – Set to true to have the same number of layers dropped across GPUs. Set to true when drop_distribution is ‘uniform’ and set to false for ‘linear’.
rand_generator (torch._C.Generator) –
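How module_id and module_count could translate into a per-layer drop probability is sketched below. The name layer_drop_rate is hypothetical. The linear formula, which reaches drop_rate exactly at the last layer, is an assumption based on the description of drop_distribution and may differ from Composer's exact expression.

```python
def layer_drop_rate(module_id, module_count, drop_rate, drop_distribution="linear"):
    """Drop probability for one layer under a 'uniform' or 'linear' distribution."""
    if drop_distribution == "uniform":
        return drop_rate
    if drop_distribution == "linear":
        if module_count <= 1:
            return drop_rate
        # 0 for the first layer, drop_rate for the last, linear in between.
        return drop_rate * module_id / (module_count - 1)
    raise ValueError("drop_distribution must be 'uniform' or 'linear'")
```

With five layers and drop_rate=0.2, the linear distribution gives the first layer a drop probability of 0, the middle layer 0.1, and the last layer 0.2.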
- composer.algorithms.stochastic_depth.apply_stochastic_depth(model, stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', use_same_gpu_seed=True)[source]
Applies Stochastic Depth algorithm to the specified model.
The algorithm replaces the specified target layer with a stochastic version of the layer. The stochastic layer will randomly drop either samples or the layer itself depending on the stochastic method specified. The layer-wise version follows the original paper (https://arxiv.org/abs/1603.09382). The sample-wise version follows the implementation used for EfficientNet in the Tensorflow/TPU repo (https://github.com/tensorflow/tpu).
- Parameters
stochastic_method (str) – The version of stochastic depth to use. “block” randomly drops blocks during training. “sample” randomly drops samples within a block during training.
target_layer_name (str) – Block to replace with a stochastic block equivalent. The name must be registered in STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, must be one of [‘ResNetBottleneck’].
drop_rate (float) – The base probability of dropping a layer or sample. Must be between 0.0 and 1.0.
drop_distribution (str) – How drop_rate is distributed across layers. Value must be one of ‘uniform’ or ‘linear’. ‘uniform’ assigns the same drop_rate across all layers. ‘linear’ linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.
use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set to false to have each GPU drop a different set of layers. Only used with “block” stochastic method.
model (torch.nn.modules.module.Module) –
- Return type
Stochastic Weight Averaging
Algorithm
- class composer.algorithms.swa.SWA(swa_start=0.8, anneal_epochs=10, swa_lr=None)[source]
Apply Stochastic Weight Averaging
Stochastic Weight Averaging (SWA) averages model weights sampled towards the end of training. This can lead to better generalization than conventional training.
See Averaging Weights Leads to Wider Optima and Better Generalization (https://arxiv.org/abs/1803.05407).
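The averaging step itself can be sketched as an incremental mean over weight snapshots, with weights represented as flat lists of floats. The name update_swa_average is hypothetical. The swa_start, anneal_epochs, and swa_lr scheduling handled by the SWA class is not modeled here.

```python
def update_swa_average(avg_weights, new_weights, n_averaged):
    """Fold one more weight snapshot into a running mean of n_averaged snapshots."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg_weights, new_weights)]

def average_snapshots(snapshots):
    """Equal-weight average of a sequence of weight snapshots."""
    average = list(snapshots[0])
    for n, weights in enumerate(snapshots[1:], start=1):
        average = update_swa_average(average, weights, n)
    return average
```

The incremental form avoids storing every snapshot: only the running average and the count of snapshots seen so far are needed.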