composer.algorithms

We describe programmatic modifications to the model or training process as “algorithms.” Examples include smoothing the labels and adding Squeeze-and-Excitation blocks, among many others.

Algorithms can be used in two ways:

  • Using Algorithm objects. These objects provide callbacks to be run in the training loop.

  • Using algorithm-specific functions and classes, such as smooth_labels or SqueezeExcite2d.

The former are easier to compose together, since they all share the same public interface and work automatically with the Composer Trainer. The latter are easier to integrate piecemeal into an existing codebase.

See Algorithm for more information.
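
For example, the two usage patterns might look as follows. This is a minimal sketch: the Trainer arguments shown are illustrative placeholders, not a complete training configuration.

    import torch
    from composer.algorithms.label_smoothing import smooth_labels

    # 1) Composed: pass Algorithm objects to the Composer Trainer, which runs
    #    their callbacks at the appropriate events of the training loop:
    #
    #   from composer.algorithms.label_smoothing import LabelSmoothing
    #   from composer.trainer import Trainer
    #   trainer = Trainer(model=model, train_dataloader=train_dl,
    #                     max_epochs=10, algorithms=[LabelSmoothing(alpha=0.1)])
    #   trainer.fit()

    # 2) Standalone: call the functional form directly in an existing loop.
    logits = torch.randn(4, 10)              # model outputs, (N, C)
    targets = torch.randint(0, 10, (4,))     # integer class labels
    smoothed = smooth_labels(logits, targets, alpha=0.1)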

The following algorithms are available in Composer:

Alibi

Algorithm to apply ALiBi to the model.

AugMix

Object that does AugMix (Hendrycks et al. (2020), AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty).

BlurPool

Algorithm to apply BlurPool to the model.

ChannelsLast

ChannelsLast algorithm runs on Event.TRAINING_START and changes the memory format of the model to torch.channels_last.

ColOut

Drops random rows and columns from input images.

CutOut

Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level.

GhostBatchNorm

Algorithm to apply Ghost Batch Normalization to the model.

LabelSmoothing

Applies label smoothing during before_loss, then restores the original labels during after_loss.

LayerFreezing

Algorithm to apply Layer Freezing to the model.

MixUp

Applies the MixUp algorithm by modifying the images and labels during Event.AFTER_DATALOADER.

ProgressiveResizing

Applies the 'progressive resizing' data augmentation algorithm to speed up training.

RandAugment

Object that does RandAugment (Cubuk et al. (2019), RandAugment: Practical automated data augmentation with a reduced search space).

SAM

Applies SAM by wrapping existing optimizers with the SAMOptimizer.

ScaleSchedule

Scales the learning rate schedule.

SqueezeExcite

Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.

StochasticDepth

Algorithm to replace a specified block with a stochastic version of the block.

SWA

Apply Stochastic Weight Averaging

Alibi

Algorithm

class composer.algorithms.alibi.Alibi(position_embedding_attribute, attention_module_name, attr_to_replace, alibi_attention, mask_replacement_function, heads_per_layer, max_sequence_length, train_sequence_length_scaling)[source]

Algorithm to apply ALiBi to the model. Runs on Event.INIT. This algorithm should be applied before the model has been moved to accelerators.

Parameters
  • heads_per_layer (int) – number of attention heads per layer

  • max_sequence_length (int) – maximum sequence length that the model will be able to accept without returning an error

  • position_embedding_attribute (str) – attribute for position embeddings. For example in HuggingFace’s GPT2, the position embeddings are “transformer.wpe”.

  • attention_module_name (str) – module/class that will have its self-attention function replaced. For example, in HuggingFace’s GPT, the self-attention module is “transformers.models.gpt2.modeling_gpt2.GPT2Attention”.

  • attr_to_replace (str) – attribute that self-attention function will replace. For example, in HuggingFace’s GPT2, the self-attention function is “_attn”.

  • alibi_attention (str) – Path to new self-attention function in which ALiBi is implemented. Used to replace “{attention_module}.{attr_to_replace}”.

  • mask_replacement_function (str) – Path to function to replace model’s attention mask. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.

  • train_sequence_length_scaling (float) – Amount by which to scale training sequence length. One batch of training data will be reshaped from size (sequence_length, batch) to (sequence_length * train_sequence_length_scaling, batch / train_sequence_length_scaling).

class composer.algorithms.alibi.AlibiHparams(position_embedding_attribute: 'str', attention_module_name: 'str', attr_to_replace: 'str', alibi_attention: 'str', mask_replacement_function: 'Union[str, None]' = None, heads_per_layer: 'Optional[int]' = None, max_sequence_length: 'int' = 8192, train_sequence_length_scaling: 'float' = 0.25)[source]
Parameters
  • position_embedding_attribute (str) –

  • attention_module_name (str) –

  • attr_to_replace (str) –

  • alibi_attention (str) –

  • mask_replacement_function (Optional[str]) –

  • heads_per_layer (Optional[int]) –

  • max_sequence_length (int) –

  • train_sequence_length_scaling (float) –

Return type

None

Standalone

composer.algorithms.alibi.apply_alibi(model, heads_per_layer, max_sequence_length, position_embedding_attribute, attention_module, attr_to_replace, alibi_attention, mask_replacement_function)[source]
Applies ALiBi to the provided model. Removes position embeddings and replaces the attention function and attention mask.

Parameters
  • model (torch.nn.Module) – model to transform

  • heads_per_layer (int) – number of attention heads per layer

  • max_sequence_length (int) – maximum sequence length that the model will be able to accept without returning an error

  • position_embedding_attribute (str) – attribute for position embeddings. For example in HuggingFace’s GPT2, the position embeddings are “transformer.wpe”.

  • attention_module (str) – module/class that will have its self-attention function replaced. For example, in HuggingFace’s GPT, the self-attention module is transformers.models.gpt2.modeling_gpt2.GPT2Attention.

  • attr_to_replace (str) – attribute that self-attention function will replace. For example, in HuggingFace’s GPT2, the self-attention function is “_attn”.

  • alibi_attention (callable) – new self-attention function in which ALiBi is implemented. Used to replace “{attention_module}.{attr_to_replace}”.

  • mask_replacement_function (callable) – function to replace model’s attention mask. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.

Return type

None
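
A sketch for a HuggingFace GPT-2 model. Here alibi_gpt2_attention and replace_gpt2_mask are hypothetical stand-ins for the user-provided ALiBi attention function and mask-replacement function described above; only the attribute and module names follow the GPT-2 examples in the parameter list.

    from transformers import GPT2LMHeadModel
    from composer.algorithms.alibi import apply_alibi

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    apply_alibi(
        model=model,
        heads_per_layer=12,
        max_sequence_length=8192,
        position_embedding_attribute="transformer.wpe",
        attention_module="transformers.models.gpt2.modeling_gpt2.GPT2Attention",
        attr_to_replace="_attn",
        alibi_attention=alibi_gpt2_attention,         # hypothetical ALiBi attention fn
        mask_replacement_function=replace_gpt2_mask,  # hypothetical mask replacer
    )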

AugMix

Algorithm

class composer.algorithms.augmix.AugMix(severity=3, depth=-1, width=3, alpha=1.0, augmentation_set='all')[source]

Object that does AugMix (Hendrycks et al. (2020), AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty). Can be passed as a transform to torchvision.transforms.Compose() (see the example following the parameter list).

Parameters
  • severity – Severity of augmentation operators (between 1 and 10).

  • width – Width of augmentation chains (number of parallel augmentations)

  • depth – Depth of each augmentation chain. -1 enables a stochastic depth sampled uniformly from [1, 3].

  • alpha – Probability coefficient for the Beta and Dirichlet distributions. Sampling from the Dirichlet distribution determines the relative weights of each augmented image; sampling from the Beta distribution determines the relative weights of the unaugmented and augmented images.

  • augmentation_set – String, one of [“augmentations_all”, “augmentations_corruption_safe”, “augmentations_original”]. Set of augmentations to use. “augmentations_corruption_safe” excludes transforms that are part of the ImageNet-C/CIFAR10-C test sets. “augmentations_original” uses all augmentations, but some of the implementations are kept identical to the original GitHub repo, which appears to contain implementation quirks in the augmentations “color”, “contrast”, “sharpness”, and “brightness”.
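
As noted above, AugMix can be used as a torchvision transform; a minimal sketch:

    from torchvision import transforms
    from composer.algorithms.augmix import AugMix

    train_transform = transforms.Compose([
        AugMix(severity=3, depth=-1, width=3, alpha=1.0, augmentation_set="all"),
        transforms.ToTensor(),
    ])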

class composer.algorithms.augmix.AugMixHparams(severity: int = 3, depth: int = - 1, width: int = 3, alpha: float = 1.0, augmentation_set: str = 'all')[source]
Parameters
  • severity (int) –

  • depth (int) –

  • width (int) –

  • alpha (float) –

  • augmentation_set (str) –

Return type

None

Standalone

composer.algorithms.augmix.augment_and_mix(img=None, severity=3, depth=-1, width=3, alpha=1.0, augmentation_set=[autocontrast, equalize, posterize, rotate, solarize, shear_x, shear_y, translate_x, translate_y, color, contrast, brightness, sharpness])[source]

Applies the AugMix transformation to an image.

Parameters
  • img (Optional[PIL.Image.Image]) –

  • severity (int) –

  • depth (int) –

  • width (int) –

  • alpha (float) –

  • augmentation_set (List) –

Return type

PIL.Image.Image

BlurPool

Algorithm

class composer.algorithms.blurpool.BlurPool(replace_convs, replace_maxpools, blur_first)[source]

Algorithm to apply BlurPool to the model. Runs on Event.INIT. This algorithm should be applied before the model has been moved to devices.

Parameters
  • replace_convs (bool) – replace eligible Conv2d layers with BlurConv2d. Default: True.

  • replace_maxpools (bool) – replace eligible MaxPool2d layers with BlurMaxPool2d. Default: True.

  • blur_first (bool) – for replace_convs, blur the input before the convolution. Default: True.

class composer.algorithms.blurpool.BlurPoolHparams(replace_convs: 'bool' = True, replace_maxpools: 'bool' = True, blur_first: 'bool' = True)[source]
Parameters
  • replace_convs (bool) –

  • replace_maxpools (bool) –

  • blur_first (bool) –

Return type

None

Standalone

class composer.algorithms.blurpool.BlurConv2d(in_channels, out_channels, kernel_size, stride=None, padding=0, dilation=1, groups=1, bias=True, blur_first=True)[source]

This module is a drop-in replacement for PyTorch’s Conv2d, but with an anti-aliasing filter applied.

It should be used only to replace strided convolutions.

See the associated paper for more details, experimental results, etc.

See also: blur_2d().

Parameters
  • in_channels (int) –

  • out_channels (int) –

  • kernel_size (Union[int, Tuple[int, int]]) –

  • stride (Union[int, Tuple[int, int]]) –

  • padding (Union[int, Tuple[int, int]]) –

  • dilation (Union[int, Tuple[int, int]]) –

  • groups (int) –

  • bias (bool) –

  • blur_first (bool) –

class composer.algorithms.blurpool.BlurMaxPool2d(kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False)[source]

This module is a (nearly) drop-in replacement for PyTorch’s MaxPool2d, but with an anti-aliasing filter applied.

The only API difference is that the parameter return_indices is not available, because it is ill-defined when using anti-aliasing.

See the associated paper for more details, experimental results, etc.

See also: blur_2d().

Parameters
  • kernel_size (Union[int, Tuple[int, int]]) –

  • stride (Optional[Union[int, Tuple[int, int]]]) –

  • padding (Union[int, Tuple[int, int]]) –

  • dilation (Union[int, Tuple[int, int]]) –

  • ceil_mode (bool) –

class composer.algorithms.blurpool.BlurPool2d(stride=2, padding=1)[source]

Apply a spatial low-pass filter.

The filter used is:

[1 2 1]
[2 4 2] * 1/16
[1 2 1]

This module is a thin wrapper around blur_2d().

Parameters
  • stride (int) –

  • padding (int) –

Return type

None

composer.algorithms.blurpool.blur_2d(input, stride=1, filter=None)[source]

Apply a spatial low-pass filter.

Parameters
  • input (torch.Tensor) – a 4d tensor in either NCHW or NHWC format.

  • stride (Union[int, Tuple[int, int]]) – stride(s) along H and W axes. If a single value is passed, this value is used for both dimensions.

  • padding – implicit zero-padding to use. For the default 3x3 low-pass filter, padding=1 (the default) returns output of the same size as the input.

  • filter (Optional[torch.Tensor]) – a 2d or 4d tensor to be cross-correlated with the input tensor at each spatial position, within each channel. If 4d, the structure is required to be (C, 1, kH, kW) where C is the number of channels in the input tensor and kH and kW are the spatial sizes of the filter.

Return type

torch.Tensor

By default, the filter used is:

[1 2 1]
[2 4 2] * 1/16
[1 2 1]

composer.algorithms.blurpool.apply_blurpool(model, replace_convs=True, replace_maxpools=True, blur_first=True)[source]

Applies BlurPool algorithm to the provided model. Performs an in-place replacement of eligible convolution and pooling layers.

Parameters
  • model (torch.nn.Module) – model to transform

  • replace_convs (bool) – replace eligible Conv2d layers with BlurConv2d. Default: True.

  • replace_maxpools (bool) – replace eligible MaxPool2d layers with BlurMaxPool2d. Default: True.

  • blur_first (bool) – for replace_convs, blur the input before the convolution. Default: True.

Return type

None
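
For instance, on a torchvision model (a sketch; replacement happens in place, per the description above):

    import torchvision.models as models
    from composer.algorithms.blurpool import apply_blurpool

    model = models.resnet18()
    apply_blurpool(model, replace_convs=True, replace_maxpools=True, blur_first=True)
    # Eligible strided Conv2d layers are now BlurConv2d; MaxPool2d layers are BlurMaxPool2d.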

Channels Last

Algorithm

class composer.algorithms.channels_last.ChannelsLast(*args, **kwargs)[source]

ChannelsLast algorithm runs on Event.TRAINING_START and changes the memory format of the model to torch.channels_last. This algorithm has no hyperparameters.

class composer.algorithms.channels_last.ChannelsLastHparams[source]

ChannelsLast algorithm has no hyperparameters.

Return type

None
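
In effect, the algorithm performs PyTorch's standard memory-format conversion at Event.TRAINING_START; a rough functional equivalent is the following sketch:

    import torch
    import torchvision.models as models

    model = models.resnet18()
    # What ChannelsLast does, in effect: convert parameters to NHWC layout.
    model = model.to(memory_format=torch.channels_last)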

ColOut

Algorithm

class composer.algorithms.colout.ColOut(p_row=0.15, p_col=0.15, batch=True)[source]

Drops random rows and columns from input images.

Parameters
  • p_row (float) – Fraction of rows to drop (drop along H). Default: 0.15.

  • p_col (float) – Fraction of columns to drop (drop along W). Default: 0.15.

  • batch (bool) – If True, run ColOut at the batch level, dropping the same rows and columns from every image in the batch. Default: True.

class composer.algorithms.colout.ColOutHparams(p_row: 'float' = 0.15, p_col: 'float' = 0.15, batch: 'bool' = True)[source]
Parameters
  • p_row (float) –

  • p_col (float) –

  • batch (bool) –

Return type

None

Standalone

composer.algorithms.colout.colout(img, p_row, p_col)[source]

Drops random rows and columns from a single image.

Parameters
  • img (torch.Tensor or PIL Image) – An input image as a torch.Tensor or PIL image

  • p_row (float) – Fraction of rows to drop (drop along H).

  • p_col (float) – Fraction of columns to drop (drop along W).

Returns

torch.Tensor or PIL Image – A smaller image with rows and columns dropped

Return type

Union[torch.Tensor, PIL.Image.Image]
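
For example (a sketch; the output size is approximate because rows and columns are dropped randomly):

    from PIL import Image
    from composer.algorithms.colout import colout

    img = Image.new("RGB", (224, 224))
    out = colout(img, p_row=0.15, p_col=0.15)
    # out is roughly 190x190: about 15% of rows and columns were dropped.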

CutOut

Algorithm

class composer.algorithms.cutout.CutOut(n_holes, length)[source]

Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level.

Parameters
  • n_holes (int) – Integer number of holes to cut out.

  • length (int) – Side length of the square hole to cut out.

class composer.algorithms.cutout.CutOutHparams(n_holes: int, length: int)[source]
Parameters
  • n_holes (int) –

  • length (int) –

Return type

None

Standalone

composer.algorithms.cutout.cutout(X, n_holes, length)[source]

Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level. Adapted from the implementation in https://github.com/uoguelph-mlrg/Cutout

Parameters
  • X (Tensor) – Batch Tensor image of size (B, C, H, W).

  • n_holes (int) – Integer number of holes to cut out

  • length (int) – Side length of the square hole to cut out.

Returns

X_cutout – Image with n_holes of dimension length x length cut out of it.

Return type

torch.Tensor
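
For example, the following sketch masks one 8x8 square region out of each image in a batch:

    import torch
    from composer.algorithms.cutout import cutout

    X = torch.rand(16, 3, 32, 32)            # (B, C, H, W)
    X_cutout = cutout(X, n_holes=1, length=8)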

Ghost Batch Normalization

Algorithm

class composer.algorithms.ghost_batchnorm.GhostBatchNorm(ghost_batch_size=32)[source]

Algorithm to apply Ghost Batch Normalization to the model.

This entails replacing all of the batch normalization modules with ghost batch normalization modules on Event.INIT.

Parameters

ghost_batch_size – size of sub-batches to normalize over

class composer.algorithms.ghost_batchnorm.GhostBatchNormHparams(ghost_batch_size: 'int')[source]
Parameters

ghost_batch_size (int) –

Return type

None

Standalone

composer.algorithms.ghost_batchnorm.apply_ghost_batchnorm(model, ghost_batch_size)[source]

Replaces batch normalization modules with ghost batch normalization modules.

This algorithm should be applied before the model has been moved to accelerators, and before the model’s parameters have been passed to an optimizer.

Parameters
  • model (torch.nn.modules.module.Module) – model to transform

  • ghost_batch_size (int) – size of sub-batches to normalize over

Return type

torch.nn.modules.module.Module
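
For example (a sketch, applied before moving the model to accelerators and before building the optimizer, as noted above):

    import torchvision.models as models
    from composer.algorithms.ghost_batchnorm import apply_ghost_batchnorm

    model = models.resnet18()
    model = apply_ghost_batchnorm(model, ghost_batch_size=32)
    # BatchNorm statistics are now computed over sub-batches of 32 samples.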

Label Smoothing

Algorithm

class composer.algorithms.label_smoothing.LabelSmoothing(alpha)[source]

Applies label smoothing during before_loss, then restores the original labels during after_loss.

Parameters

alpha (float) – Strength of the label smoothing, between [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored)

class composer.algorithms.label_smoothing.LabelSmoothingHparams(alpha: float)[source]
Parameters

alpha (float) –

Return type

None

Standalone

composer.algorithms.label_smoothing.smooth_labels(logits, targets, alpha)[source]

Shrinks targets towards a prior distribution to counteract label noise.

This is computed by (1 - alpha) * targets + alpha * smoothed_targets where smoothed_targets is a pre-specified vector of class probabilities.

Introduced in: https://arxiv.org/abs/1512.00567 Evaluated in: https://arxiv.org/abs/1906.02629

Parameters
  • logits (torch.Tensor) – Output of the model. Tensor of shape (N, C, d1, …, dn) for N examples and C classes, and d1, …, dn extra dimensions.

  • targets (torch.Tensor) – Tensor of shape (N) containing integers 0 <= i <= C-1 specifying the target labels for each example.

  • alpha (float) – Strength of the label smoothing, between [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored)

Layer Freezing

Algorithm

class composer.algorithms.layer_freezing.LayerFreezing(freeze_start=0.5, freeze_level=1.0)[source]

Algorithm to apply Layer Freezing to the model. Runs on Event.EPOCH_END. During training, progressively freeze the layers of the network starting with the earlier layers. Freezing starts after the fraction of epochs specified by freeze_start has run. The fraction of frozen layers increases linearly until it reaches freeze_level at the final epoch.

Parameters
  • freeze_start – The fraction of epochs to run before freezing begins.

  • freeze_level – The maximum fraction of levels to freeze.
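
A minimal construction sketch; the scheduling follows the description above:

    from composer.algorithms.layer_freezing import LayerFreezing

    # Freezing starts halfway through training; the fraction of frozen layers
    # then grows linearly, reaching 100% (freeze_level=1.0) at the final epoch.
    algorithm = LayerFreezing(freeze_start=0.5, freeze_level=1.0)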

class composer.algorithms.layer_freezing.LayerFreezingHparams(freeze_start: 'float' = 0.5, freeze_level: 'float' = 1.0)[source]
Parameters
  • freeze_start (float) –

  • freeze_level (float) –

Return type

None

Standalone

composer.algorithms.layer_freezing.freeze_layers(model, optimizers, current_epoch, max_epochs, freeze_start, freeze_level, logger)[source]

Implements the layer freezing algorithm. During training, progressively freeze the layers of the network starting with the earlier layers.

Parameters
  • model (torch.nn.modules.module.Module) – An instance of the model being trained.

  • optimizers (Union[torch.optim.optimizer.Optimizer, Tuple[torch.optim.optimizer.Optimizer, ...]]) – The optimizers used during training.

  • current_epoch (int) – Integer specifying the current epoch.

  • max_epochs (int) – The max number of epochs training will run for.

  • freeze_start (float) – The fraction of epochs to run before freezing begins.

  • freeze_level (float) – The maximum fraction of levels to freeze.

  • logger (composer.core.logging.logger.Logger) –

MixUp

Algorithm

class composer.algorithms.mixup.MixUp(alpha)[source]

Applies the MixUp algorithm by modifying the images and labels during Event.AFTER_DATALOADER.

Parameters

alpha (float) – Parameter of the Beta distribution from which the interpolation coefficient lambda is sampled.

class composer.algorithms.mixup.MixUpHparams(alpha: float)[source]
Parameters

alpha (float) –

Return type

None

Standalone

composer.algorithms.mixup.mixup_batch(x, y, interpolation_lambda, n_classes, indices=None)[source]

Implements mixup on a single batch of data.

This constructs a new batch of data given an original batch. This is done through a convex combination of x with a randomly permuted copy of x. The interpolation parameter lambda should be chosen from a Beta distribution with parameter alpha. Note that the same lambda is used for all examples within the batch.

Both the original and shuffled labels are returned. This is done because for many loss functions (such as cross entropy) the targets are given as indices, so interpolation must be handled separately.

Parameters
  • x (torch.Tensor) – Input tensor of shape (B, d1, d2, …, dn), B is batch size, d1-dn are feature dimensions.

  • y (torch.Tensor) – Target tensor of shape (B, f1, f2, …, fm), B is batch size, f1-fm are possible target dimensions.

  • interpolation_lambda (float) – Amount of interpolation based on alpha.

  • n_classes (int) – Total number of classes.

  • indices (Optional[torch.Tensor]) – Tensor of shape (B). Permutation of the batch indices. Used for permuting without randomness.

Returns
  • x_mix – Batch of inputs after mixup has been applied.

  • y_mix – Labels after mixup has been applied.
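
A sketch of calling mixup_batch, assuming the two return values listed above; in normal use, interpolation_lambda is drawn from a Beta(alpha, alpha) distribution, as the description notes.

    import torch
    from composer.algorithms.mixup import mixup_batch

    x = torch.rand(16, 3, 32, 32)        # input batch
    y = torch.randint(0, 10, (16,))      # integer labels
    x_mix, y_mix = mixup_batch(x, y, interpolation_lambda=0.3, n_classes=10)
    # x_mix blends x with a permuted copy of x,
    # e.g. lambda * x + (1 - lambda) * x[permutation] under the usual convention.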

Progressive Resizing

Algorithm

class composer.algorithms.progressive_resizing.ProgressiveResizing(mode, initial_scale, finetune_fraction, resize_targets)[source]

Applies the ‘progressive resizing’ data augmentation algorithm to speed up training. See Training a State-of-the-Art Model (https://github.com/fastai/fastbook/blob/780b76bef3127ce5b64f8230fce60e915a7e0735/07_sizing_and_tta.ipynb).

“Progressive resizing” initially scales inputs down to speed up early training. Throughout training, the scaling factor is gradually increased, yielding larger inputs up to the original input size. A final finetuning period is then run to finetune the model using the full-sized inputs.

Parameters
  • mode (str) – Type of scaling to perform. Value must be one of ‘crop’ or ‘resize’. ‘crop’ performs a random crop, whereas ‘resize’ performs a bilinear interpolation. Default: ‘resize’.

  • initial_scale (float) – Initial scale factor used to shrink the inputs. Must be a value in between 0 and 1.

  • finetune_fraction (float) – Fraction of training to reserve for finetuning on the full-sized inputs. Must be a value in between 0 and 1.

  • resize_targets (bool) – If True, resize targets also.

class composer.algorithms.progressive_resizing.ProgressiveResizingHparams(mode='resize', initial_scale=0.5, finetune_fraction=0.2, resize_targets=False)[source]

Hyperparameters for the ‘progressive resizing’ algorithm

Parameters
  • mode (str) –

  • initial_scale (float) –

  • finetune_fraction (float) –

  • resize_targets (bool) –

Return type

None

Standalone

composer.algorithms.progressive_resizing.resize_inputs(X, y, scale_factor, mode='resize', resize_targets=False)[source]

Resize inputs and optionally outputs by cropping or interpolating.

Parameters
  • X (torch.Tensor) – Input tensor of shape (N, C, H, W). Resizing will be done along dimensions H and W using the constant factor scale_factor.

  • y (torch.Tensor) – If resize_targets is True, output tensor of shape (N, C, H, W) that will also be resized.

  • scale_factor (float) – Scaling coefficient for the height and width of the input/output tensor. 1.0 keeps the original size.

  • mode (str) – Type of scaling to perform. Value must be one of ‘crop’ or ‘resize’. ‘crop’ performs a random crop, whereas ‘resize’ performs a bilinear interpolation. Default: ‘crop’.

  • resize_targets (bool) – Resize the targets, y, as well. Default: False.

Returns
  • X_sized (torch.Tensor) – Resized input tensor of shape (N, C, H * scale_factor, W * scale_factor).

  • y_sized (torch.Tensor) – If resize_targets is True, resized output tensor of shape (N, C, H * scale_factor, W * scale_factor); otherwise, the original y.

Return type

Tuple[torch.Tensor, torch.Tensor]
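
For example, the following sketch halves the input resolution while leaving the targets alone:

    import torch
    from composer.algorithms.progressive_resizing import resize_inputs

    X = torch.rand(8, 3, 224, 224)
    y = torch.randint(0, 10, (8,))
    X_sized, y_sized = resize_inputs(X, y, scale_factor=0.5, mode="resize", resize_targets=False)
    # X_sized has shape (8, 3, 112, 112); y is returned unchanged.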

RandAugment

Algorithm

class composer.algorithms.randaugment.RandAugment(severity=9, depth=2, augmentation_set='all')[source]

Object that does RandAugment (Cubuk et al. (2019), RandAugment: Practical automated data augmentation with a reduced search space). Can be passed as a transform to torchvision.transforms.Compose().

Parameters
  • severity (int) – Severity of augmentation operators (between 1 and 10). M in the original paper. Default = 9.

  • depth (int) – Depth of augmentation chain. N in the original paper. Default = 2.

  • augmentation_set (str) – One of [“augmentations_all”, “augmentations_corruption_safe”, “augmentations_original”]. Set of augmentations to use. “augmentations_corruption_safe” excludes transforms that are part of the ImageNet-C/CIFAR10-C test sets. “augmentations_original” uses all augmentations, but some of the implementations are kept identical to the original GitHub repo, which appears to contain implementation quirks in the augmentations “color”, “contrast”, “sharpness”, and “brightness”.

class composer.algorithms.randaugment.RandAugmentHparams(severity: int = 9, depth: int = 2, augmentation_set: str = 'all')[source]
Parameters
  • severity (int) –

  • depth (int) –

  • augmentation_set (str) –

Return type

None

Standalone

composer.algorithms.randaugment.randaugment(img=None, severity=9, depth=2, augmentation_set=[autocontrast, equalize, posterize, rotate, solarize, shear_x, shear_y, translate_x, translate_y, color, contrast, brightness, sharpness])[source]

Applies RandAugment to an image.

Parameters
  • img (Optional[PIL.Image.Image]) –

  • severity (int) –

  • depth (int) –

  • augmentation_set (List) –

Return type

PIL.Image.Image
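
For example (a sketch using a blank image; in practice the input is a training image):

    from PIL import Image
    from composer.algorithms.randaugment import randaugment

    img = Image.new("RGB", (224, 224))
    augmented = randaugment(img=img, severity=9, depth=2)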

Sequence Length Warmup

Algorithm

Standalone

Sharpness-Aware Minimization

Algorithm

class composer.algorithms.sam.SAM(rho=0.05, epsilon=1e-12, interval=1)[source]

Applies SAM by wrapping existing optimizers with the SAMOptimizer.

Parameters
  • rho (float) – The neighborhood size parameter of SAM. Must be greater than 0.

  • epsilon (float) – A small value added to the gradient norm for numerical stability.

  • interval (int) – SAM will run once per interval steps. A value of 1 causes SAM to run every step.

class composer.algorithms.sam.SAMHparams(rho: 'float' = 0.05, epsilon: 'float' = 1e-12, interval: 'int' = 1)[source]
Parameters
  • rho (float) –

  • epsilon (float) –

  • interval (int) –
Return type

None
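
A construction sketch; the wrapping itself happens when the algorithm runs inside the Trainer:

    from composer.algorithms.sam import SAM

    algorithm = SAM(rho=0.05, epsilon=1e-12, interval=1)
    # When training runs, the existing optimizers are wrapped with the
    # SAMOptimizer, and the SAM update is applied once every `interval` steps.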

Scaling the Learning Rate Schedule

Algorithm

class composer.algorithms.scale_schedule.ScaleSchedule(ratio, method='epoch')[source]

Scales the learning rate schedule.

Parameters
  • ratio (float) – Ratio by which to scale the length of the training schedule.

  • method (str, optional) – Whether to scale by ‘step’ or ‘epoch’. Defaults to ‘epoch’.

class composer.algorithms.scale_schedule.ScaleScheduleHparams(ratio: float, method: str = 'epoch')[source]
Parameters
  • ratio (float) –

  • method (str) –

Return type

None

Standalone

composer.algorithms.scale_schedule.scale_scheduler(scheduler, ssr, orig_max_epochs=None)[source]

Scales an existing learning rate scheduler by the scale schedule ratio ssr.

Parameters
  • scheduler (torch.optim.lr_scheduler._LRScheduler) –

  • ssr (float) –

  • orig_max_epochs (Optional[int]) –
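
A sketch of the expected usage, assuming the scheduler is modified in place (the in-place vs. return behavior is not documented above):

    import torch
    from composer.algorithms.scale_schedule import scale_scheduler

    params = [torch.nn.Parameter(torch.zeros(1))]
    opt = torch.optim.SGD(params, lr=0.1)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[30, 60], gamma=0.1)
    scale_scheduler(sched, ssr=0.5)
    # With ssr=0.5, one would expect the schedule to compress,
    # e.g. milestones becoming [15, 30].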

Selective Backpropagation

Algorithm

class composer.algorithms.selective_backprop.SelectiveBackprop(start, end, keep, scale_factor, interrupt)[source]

Selectively backprop on a subset of each batch.

Selective Backprop (SB) prunes minibatches according to the difficulty of the individual training examples, and only computes weight gradients over the pruned subset, reducing iteration time and speeding up training. The fraction of the minibatch that is kept for gradient computation is specified by the argument 0 <= keep <= 1.

See Accelerating Deep Learning by Focusing on the Biggest Losers (https://arxiv.org/abs/1910.00762).

To speed up SB’s selection forward pass, the argument scale_factor can be used to downsample input image tensors. The full-sized inputs will still be used for the weight gradient computation.

To preserve convergence, SB can be interrupted with vanilla minibatch gradient steps every interrupt steps. When interrupt=0, SB will be used at every step during the SB interval. When interrupt=2, SB will alternate with vanilla minibatch steps.

Parameters
  • start (float) – SB interval start as fraction of training duration

  • end (float) – SB interval end as fraction of training duration

  • keep (float) – fraction of minibatch to select and keep for gradient computation

  • scale_factor (float) – scale for downsampling input for selection forward pass

  • interrupt (int) – interrupt SB with a vanilla minibatch step every ‘interrupt’ batches
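
A construction sketch matching the description above:

    from composer.algorithms.selective_backprop import SelectiveBackprop

    # Between 50% and 90% of training, keep half of each minibatch (selected by
    # example difficulty) for the weight-gradient computation, run the selection
    # forward pass at half resolution, and take a vanilla step every 2 batches.
    algorithm = SelectiveBackprop(start=0.5, end=0.9, keep=0.5, scale_factor=0.5, interrupt=2)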

class composer.algorithms.selective_backprop.SelectiveBackpropHparams(start: 'float', end: 'float', keep: 'float', scale_factor: 'float', interrupt: 'int')[source]
Parameters
  • start (float) –

  • end (float) –

  • keep (float) –

  • scale_factor (float) –

  • interrupt (int) –
Return type

None

Squeeze-and-Excitation

Algorithm

class composer.algorithms.squeeze_excite.SqueezeExcite(latent_channels=64, min_channels=128)[source]

Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.

Parameters
  • latent_channels – The dimensionality of the hidden layer within the added MLP.

  • min_channels – An SE block is added after a Conv2d module conv only if min(conv.in_channels, conv.out_channels) >= min_channels. For models that reduce spatial size and increase channel count deeper in the network, this parameter can be used to only add SE blocks deeper in the network. This may be desirable because SE blocks add less overhead when their inputs have smaller spatial size.

class composer.algorithms.squeeze_excite.SqueezeExciteHparams(latent_channels: 'float' = 64, min_channels: 'int' = 128)[source]
Parameters
  • latent_channels (float) –

  • min_channels (int) –

Return type

None

Standalone

class composer.algorithms.squeeze_excite.SqueezeExcite2d(num_features, latent_channels=0.125)[source]

Squeeze-and-Excitation block.

Parameters
  • num_features (int) – Number of input channels.

  • latent_channels (float) – Dimensionality of the hidden layer within the added MLP. If less than 1, interpreted as a fraction of num_features.

class composer.algorithms.squeeze_excite.SqueezeExciteConv2d(*args, latent_channels=0.125, conv=None, **kwargs)[source]

Helper class used to add a Squeeze-and-Excitation block after a Conv2d.

Parameters

conv (torch.nn.Conv2d) –

composer.algorithms.squeeze_excite.apply_se(model, latent_channels, min_channels)[source]

Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.

Parameters
  • model (torch.nn.modules.module.Module) – A module containing one or more torch.nn.Conv2d modules.

  • latent_channels (float) – The dimensionality of the hidden layer within the added MLP.

  • min_channels (int) – An SE block is added after a Conv2d module conv only if min(conv.in_channels, conv.out_channels) >= min_channels. For models that reduce spatial size and increase channel count deeper in the network, this parameter can be used to only add SE blocks deeper in the network. This may be desirable because SE blocks add less overhead when their inputs have smaller spatial size.
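
For example (a sketch; whether the model is modified in place or returned is not stated above, so the result is not reassigned here):

    import torchvision.models as models
    from composer.algorithms.squeeze_excite import apply_se

    model = models.resnet18()
    apply_se(model, latent_channels=64, min_channels=128)
    # SE blocks are added only after Conv2d modules where
    # min(in_channels, out_channels) >= 128, i.e. deeper in the network.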

Stochastic Depth

Algorithm

class composer.algorithms.stochastic_depth.StochasticDepth(stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', drop_warmup=0.0, use_same_gpu_seed=True)[source]

Algorithm to replace a specified block with a stochastic version of the block.

The stochastic block will randomly drop either samples or the layer itself depending on the stochastic method specified. The layer-wise version follows the original paper (https://arxiv.org/abs/1603.09382). The sample-wise version follows the implementation used for EfficientNet in the TensorFlow/TPU repo (https://github.com/tensorflow/tpu).

Parameters
  • stochastic_method – The version of stochastic depth to use. “block” randomly drops blocks during training. “sample” randomly drops samples within a block during training.

  • target_layer_name (str) – Which block to replace with a stochastic block equivalent. The name must be registered in STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, must be one of [‘ResNetBottleneck’].

  • drop_rate (float) – The base probability of dropping a layer or a sample. Must be between 0.0 and 1.0.

  • drop_distribution (str) – How drop_rate is distributed across layers. Value must be either ‘uniform’ or ‘linear’. ‘uniform’ assigns the same drop_rate across all layers. ‘linear’ linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.

  • drop_warmup (float) – Percentage of training epochs to linearly increase the drop probability to drop_rate. Must be between 0.0 and 1.0.

  • use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set false to have each GPU drop a different set of layers. Only used with “block” stochastic method.

class composer.algorithms.stochastic_depth.StochasticDepthHparams(stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', use_same_gpu_seed=True, drop_warmup=0.0)[source]

Hyperparameters for the Stochastic Depth algorithm

Parameters
  • stochastic_method (str) –

  • target_layer_name (str) –

  • drop_rate (float) –

  • drop_distribution (str) –

  • use_same_gpu_seed (bool) –

  • drop_warmup (float) –

Return type

None

Standalone

class composer.algorithms.stochastic_depth.StochasticBottleneck(drop_rate, module_id, module_count, use_same_gpu_seed, use_same_depth_across_gpus, rand_generator, **kwargs)[source]

Stochastic ResNet Bottleneck layer. This layer has a probability of skipping the transformation section of the layer and scales the transformation section output by (1 - drop probability) during inference.

Parameters
  • drop_rate (float) – Probability of dropping the layer. Must be between 0.0 and 1.0.

  • module_id (int) – The placement of the layer within a network e.g. 0 for the first layer in the network.

  • module_count (int) – The total number of layers of this type in the network.

  • use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set to false to have each GPU drop a different set of layers.

  • use_same_depth_across_gpus (bool) – Set to true to have the same number of layers dropped across GPUs. Set to true when drop_distribution is ‘uniform’ and set to false for ‘linear’.

  • rand_generator (torch._C.Generator) –

composer.algorithms.stochastic_depth.apply_stochastic_depth(model, stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', use_same_gpu_seed=True)[source]

Applies Stochastic Depth algorithm to the specified model.

The algorithm replaces the specified target layer with a stochastic version of the layer. The stochastic layer will randomly drop either samples or the layer itself depending on the stochastic method specified. The layer-wise version follows the original paper (https://arxiv.org/abs/1603.09382). The sample-wise version follows the implementation used for EfficientNet in the TensorFlow/TPU repo (https://github.com/tensorflow/tpu).

Parameters
  • stochastic_method (str) – The version of stochastic depth to use. “block” randomly drops blocks during training. “sample” randomly drops samples within a block during training.

  • target_layer_name (str) – Block to replace with a stochastic block equivalent. The name must be registered in STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, must be one of [‘ResNetBottleneck’].

  • drop_rate (float) – The base probability of dropping a layer or sample. Must be between 0.0 and 1.0.

  • drop_distribution (str) – How drop_rate is distributed across layers. Value must be one of ‘uniform’ or ‘linear’. ‘uniform’ assigns the same drop_rate across all layers. ‘linear’ linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.

  • use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set to false to have each GPU drop a different set of layers. Only used with “block” stochastic method.

  • model (torch.nn.modules.module.Module) –

Return type

None
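
A sketch for a ResNet-style model; whether torchvision's bottleneck blocks are registered under ‘ResNetBottleneck’ in STOCHASTIC_LAYER_MAPPING is an assumption here:

    import torchvision.models as models
    from composer.algorithms.stochastic_depth import apply_stochastic_depth

    model = models.resnet50()
    apply_stochastic_depth(
        model,
        stochastic_method="block",
        target_layer_name="ResNetBottleneck",
        drop_rate=0.2,
        drop_distribution="linear",
    )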

Stochastic Weight Averaging

Algorithm

class composer.algorithms.swa.SWA(swa_start=0.8, anneal_epochs=10, swa_lr=None)[source]

Apply Stochastic Weight Averaging

Stochastic Weight Averaging (SWA) averages model weights sampled towards the end of training. This leads to better generalization than conventional training.

See Averaging Weights Leads to Wider Optima and Better Generalization (https://arxiv.org/abs/1803.05407).

Parameters
  • swa_start (float) – Fraction of training completed before stochastic weight averaging is applied.

  • anneal_epochs (int) – Number of epochs over which to anneal the learning rate.

  • swa_lr (float, optional) – The final learning rate to anneal towards.
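
A construction sketch:

    from composer.algorithms.swa import SWA

    # Begin averaging weights after 80% of training; anneal the learning rate
    # toward swa_lr over 10 epochs.
    algorithm = SWA(swa_start=0.8, anneal_epochs=10, swa_lr=0.001)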

class composer.algorithms.swa.SWAHparams(swa_start: 'float' = 0.8, anneal_epochs: 'int' = 10, swa_lr: 'Optional[float]' = None)[source]
Parameters
  • swa_start (float) –

  • anneal_epochs (int) –

  • swa_lr (Optional[float]) –

Return type

None