composer.algorithms

We describe programmatic modifications to the model or training process as “algorithms.” Examples include smoothing the labels and adding Squeeze-and-Excitation blocks, among many others.

Algorithms can be used in two ways:

  • Using Algorithm objects. These objects provide callbacks to be run in the training loop.

  • Using algorithm-specific functions and classes, such as smooth_labels or SqueezeExcite2d.

The former are easier to compose together, since they all have the same public interface and work automatically with the Composer Trainer. The latter are easier to integrate piecemeal into an existing codebase.
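
For illustration, a minimal sketch of both modes is shown below. The Trainer keyword arguments used here (train_dataloader, max_epochs, algorithms) and the model and dataloader placeholders are assumptions; consult the Trainer documentation for the exact signature.

# Composing Algorithm objects via the Trainer (keyword names are assumptions):
from composer import Trainer
from composer.algorithms import BlurPool, LabelSmoothing

trainer = Trainer(
    model=model,                        # a ComposerModel defined elsewhere
    train_dataloader=train_dataloader,  # a torch DataLoader defined elsewhere
    max_epochs=10,
    algorithms=[BlurPool(replace_convs=True, replace_maxpools=True, blur_first=True),
                LabelSmoothing(alpha=0.1)],
)
trainer.fit()

# Calling an algorithm-specific function directly inside an existing training loop:
from composer.algorithms.label_smoothing import smooth_labels

smoothed_targets = smooth_labels(logits, targets, alpha=0.1)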

See Algorithm for more information.

The following algorithms are available in Composer:

Alibi

ALiBi (Attention with Linear Biases) dispenses with position embeddings and instead directly biases attention matrices such that nearby tokens attend to one another more strongly.

AugMix

AugMix creates width sequences of depth image augmentations, applies each sequence with random intensity, and returns a convex combination of the width augmented images and the original image.

BlurPool

BlurPool adds anti-aliasing filters to convolutional layers to increase accuracy and invariance to small shifts in the input.

ChannelsLast

Changes the memory format of the model to torch.channels_last.

ColOut

Drops a fraction of the rows and columns of an input image.

CutOut

Cutout is a data augmentation technique that works by masking out one or more square regions of an input image.

GhostBatchNorm

Replaces batch normalization modules with Ghost Batch Normalization modules that simulate the effect of using a smaller batch size.

LabelSmoothing

Shrinks targets towards a uniform distribution to counteract label noise as in Szegedy et al.

LayerFreezing

Progressively freeze the layers of the network during training, starting with the earlier layers.

MixUp

MixUp trains the network on convex combinations of pairs of examples and targets rather than individual examples and targets.

ProgressiveResizing

Apply Fastai's progressive resizing data augmentation to speed up training.

RandAugment

Randomly applies a sequence of image data augmentations (Cubuk et al. 2019).

SAM

Adds sharpness-aware minimization (Foret et al. 2020) by wrapping an existing optimizer with a SAMOptimizer.

ScaleSchedule

Makes the learning rate schedule take a different number of epochs.

SelectiveBackprop

Selectively backpropagate gradients from a subset of each batch (Jiang et al. 2019).

SqueezeExcite

Adds Squeeze-and-Excitation blocks (Hu et al. 2019) after the Conv2d modules in a neural network.

StochasticDepth

Applies Stochastic Depth (Huang et al.) to the specified model.

SWA

Apply Stochastic Weight Averaging (Izmailov et al.).

Alibi

Algorithm

class composer.algorithms.alibi.Alibi(position_embedding_attribute: str, attention_module_name: str, attr_to_replace: str, alibi_attention: str, mask_replacement_function: str, heads_per_layer: int, max_sequence_length: int, train_sequence_length_scaling: float)[source]

ALiBi (Attention with Linear Biases) dispenses with position embeddings and instead directly biases attention matrices such that nearby tokens attend to one another more strongly.

ALiBi yields excellent extrapolation to unseen sequence lengths compared to other position embedding schemes. We leverage this extrapolation capability by training with shorter sequence lengths, which reduces the memory and computation load.

This algorithm modifies the model and runs on Event.INIT. This algorithm should be applied before the model has been moved to accelerators.

Parameters
  • heads_per_layer – number of attention heads per layer

  • max_sequence_length – maximum sequence length that the model will be able to accept without returning an error

  • position_embedding_attribute – attribute for position embeddings. For example in HuggingFace’s GPT2, the position embeddings are “transformer.wpe”.

  • attention_module_name – module/class that will have its self-attention function replaced. For example, in HuggingFace’s GPT, the self-attention module is “transformers.models.gpt2.modeling_gpt2.GPT2Attention”.

  • attr_to_replace – attribute that self-attention function will replace. For example, in HuggingFace’s GPT2, the self-attention function is “_attn”.

  • alibi_attention – Path to new self-attention function in which ALiBi is implemented. Used to replace “{attention_module}.{attr_to_replace}”.

  • mask_replacement_function – Path to function to replace model’s attention mask. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.

  • train_sequence_length_scaling – Amount by which to scale the training sequence length. One batch of training data will be reshaped from size (sequence_length, batch) to (sequence_length*train_sequence_length_scaling, batch/train_sequence_length_scaling).

class composer.algorithms.alibi.AlibiHparams(position_embedding_attribute: str, attention_module_name: str, attr_to_replace: str, alibi_attention: str, mask_replacement_function: Optional[str] = None, heads_per_layer: Optional[int] = None, max_sequence_length: int = 8192, train_sequence_length_scaling: float = 0.25)[source]

See Alibi

Standalone

composer.algorithms.alibi.apply_alibi(model: torch.nn.modules.module.Module, heads_per_layer: int, max_sequence_length: int, position_embedding_attribute: str, attention_module: torch.nn.modules.module.Module, attr_to_replace: str, alibi_attention: Callable, mask_replacement_function: Optional[Callable]) None[source]

Removes position embeddings and replaces the attention function and attention mask according to ALiBi.

Parameters
  • model – model to transform

  • heads_per_layer – number of attention heads per layer

  • max_sequence_length – maximum sequence length that the model will be able to accept without returning an error

  • position_embedding_attribute – attribute for position embeddings. For example in HuggingFace’s GPT2, the position embeddings are “transformer.wpe”.

  • attention_module – module/class that will have its self-attention function replaced. For example, in HuggingFace’s GPT, the self-attention module is transformers.models.gpt2.modeling_gpt2.GPT2Attention.

  • attr_to_replace – attribute that self-attention function will replace. For example, in HuggingFace’s GPT2, the self-attention function is “_attn”.

  • alibi_attention – new self-attention function in which ALiBi is implemented. Used to replace “{attention_module}.{attr_to_replace}”.

  • mask_replacement_function – function to replace model’s attention mask. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.

Augmix

Algorithm

class composer.algorithms.augmix.AugMix(severity: int = 3, depth: int = - 1, width: int = 3, alpha: float = 1.0, augmentation_set: str = 'all')[source]

AugMix creates width sequences of depth image augmentations, applies each sequence with random intensity, and returns a convex combination of the width augmented images and the original image.

The coefficients for mixing the augmented images are drawn from a uniform Dirichlet(alpha, alpha, ...) distribution. The coefficient for mixing the combined augmented image and the original image is drawn from a Beta(alpha, alpha) distribution, using the same alpha.

Runs on Event.TRAINING_START.

Parameters
  • severity – severity of augmentations; ranges from 0 (no augmentation) to 10 (most severe).

  • width – number of augmentation sequences

  • depth – number of augmentations per sequence. -1 enables stochastic depth sampled uniformly from [1, 3].

  • alpha – pseudocount for Beta and Dirichlet distributions. Must be > 0. Higher values yield mixing coefficients closer to uniform weighting. As the value approaches 0, the mixing coefficients approach using only one version of each image.

  • augmentation_set

    must be one of the following options:

    • "augmentations_all"

      Uses all augmentations from the paper.

    • "augmentations_corruption_safe"

      Like "augmentations_all", but excludes transforms that are part of the ImageNet-C/CIFAR10-C test sets

    • "augmentations_original"

      Like "augmentations_all", but some of the implementations are identical to the original Github repository, which contains implementation specificities for the augmentations "color", "contrast", "sharpness", and "brightness".

class composer.algorithms.augmix.AugMixHparams(severity: int = 3, depth: int = - 1, width: int = 3, alpha: float = 1.0, augmentation_set: str = 'all')[source]

See AugMix

Standalone

composer.algorithms.augmix.augment_and_mix(img: Optional[PIL.Image.Image] = None, severity: int = 3, depth: int = -1, width: int = 3, alpha: float = 1.0, augmentation_set: List = [<function autocontrast>, <function equalize>, <function posterize>, <function rotate>, <function solarize>, <function shear_x>, <function shear_y>, <function translate_x>, <function translate_y>, <function color>, <function contrast>, <function brightness>, <function sharpness>]) PIL.Image.Image[source]

Applies AugMix (Hendrycks et al.) data augmentation to an image. See AugMix for details.
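
A minimal usage sketch of the standalone function, using a synthetic placeholder image:

from PIL import Image
from composer.algorithms.augmix import augment_and_mix

img = Image.new("RGB", (64, 64), color=(128, 128, 128))  # placeholder image
augmented = augment_and_mix(img, severity=3, depth=-1, width=3, alpha=1.0)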

BlurPool

Algorithm

class composer.algorithms.blurpool.BlurPool(replace_convs: bool, replace_maxpools: bool, blur_first: bool)[source]

BlurPool adds anti-aliasing filters to convolutional layers to increase accuracy and invariance to small shifts in the input.

Runs on Event.INIT and should be applied both before the model has been moved to accelerators and before the model’s parameters have been passed to an optimizer.

Parameters
  • replace_convs – replace strided torch.nn.Conv2d modules with BlurConv2d modules

  • replace_maxpools – replace eligible torch.nn.MaxPool2d modules with BlurMaxPool2d modules.

  • blur_first – when replace_convs is True, blur input before the associated convolution. When set to False, the convolution is applied with a stride of 1 before the blurring, resulting in significant overhead (though more closely matching the paper). See BlurConv2d for further discussion.

class composer.algorithms.blurpool.BlurPoolHparams(replace_convs: bool = True, replace_maxpools: bool = True, blur_first: bool = True)[source]

See BlurPool

Standalone

class composer.algorithms.blurpool.BlurConv2d(in_channels: int, out_channels: int, kernel_size: Union[int, Tuple[int, int]], stride: Optional[Union[int, Tuple[int, int]]] = None, padding: Union[int, Tuple[int, int]] = 0, dilation: Union[int, Tuple[int, int]] = 1, groups: int = 1, bias: bool = True, blur_first: bool = True)[source]

This module is a drop-in replacement for PyTorch’s Conv2d, but with an anti-aliasing filter applied.

The one new parameter is blur_first. When set to True, the anti-aliasing filter is applied before the underlying convolution, and vice-versa when set to False. This mostly makes a difference when the stride is greater than one. In the former case, the only overhead is the cost of doing the anti-aliasing operation. In the latter case, the Conv2d is applied with a stride of one to the input, and then the anti-aliasing is applied with the provided stride to the result. Setting the stride of the convolution to 1 can greatly increase the computational cost. E.g., replacing a stride of (2, 2) with a stride of 1 increases the number of operations by a factor of (2/1) * (2/1) = 4. However, this approach most closely matches the behavior specified in the paper.

This module should only be used to replace strided convolutions.

See the associated paper for more details, experimental results, etc.

See also: blur_2d().

class composer.algorithms.blurpool.BlurMaxPool2d(kernel_size: Union[int, Tuple[int, int]], stride: Optional[Union[int, Tuple[int, int]]] = None, padding: Union[int, Tuple[int, int]] = 0, dilation: Union[int, Tuple[int, int]] = 1, ceil_mode: bool = False)[source]

This module is a (nearly) drop-in replacement for PyTorch’s MaxPool2d, but with an anti-aliasing filter applied.

The only API difference is that the parameter return_indices is not available, because it is ill-defined when using anti-aliasing.

See the associated paper for more details, experimental results, etc.

See also: blur_2d().

class composer.algorithms.blurpool.BlurPool2d(stride: Union[int, Tuple[int, int]] = 2, padding: Union[int, Tuple[int, int]] = 1)[source]

This module just calls blur_2d() in forward using the provided arguments.

composer.algorithms.blurpool.blur_2d(input: torch.Tensor, stride: Union[int, Tuple[int, int]] = 1, filter: Optional[torch.Tensor] = None) torch.Tensor[source]

Apply a spatial low-pass filter.

Parameters
  • input – a 4d tensor of shape NCHW

  • stride – stride(s) along H and W axes. If a single value is passed, this value is used for both dimensions.

  • padding – implicit zero-padding to use. For the default 3x3 low-pass filter, padding=1 (the default) returns output of the same size as the input.

  • filter – a 2d or 4d tensor to be cross-correlated with the input tensor at each spatial position, within each channel. If 4d, the structure is required to be (C, 1, kH, kW) where C is the number of channels in the input tensor and kH and kW are the spatial sizes of the filter.

By default, the filter used is:

[1 2 1]
[2 4 2] * 1/16
[1 2 1]
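
A minimal sketch of calling blur_2d() directly on a batch of images with the default filter:

import torch
from composer.algorithms.blurpool import blur_2d

x = torch.rand(8, 3, 32, 32)    # NCHW input
blurred = blur_2d(x, stride=1)  # with the default filter and padding, output matches the input size
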
composer.algorithms.blurpool.apply_blurpool(model: torch.nn.modules.module.Module, replace_convs: bool = True, replace_maxpools: bool = True, blur_first: bool = True) None[source]

Add anti-aliasing filters to the strided torch.nn.Conv2d and/or torch.nn.MaxPool2d modules within model.

Must be run before the model has been moved to accelerators and before the model’s parameters have been passed to an optimizer.

Parameters
  • model – model to modify

  • replace_convs – replace strided torch.nn.Conv2d modules with BlurConv2d modules

  • replace_maxpools – replace eligible torch.nn.MaxPool2d modules with BlurMaxPool2d modules.

  • blur_first – for replace_convs, blur input before the associated convolution. When set to False, the convolution is applied with a stride of 1 before the blurring, resulting in significant overhead (though more closely matching the paper). See BlurConv2d for further discussion.
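
A minimal sketch based on the signature above; the toy model is illustrative only:

import torch.nn as nn
from composer.algorithms.blurpool import apply_blurpool

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),  # strided conv, eligible for BlurConv2d
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # eligible for BlurMaxPool2d
)
apply_blurpool(model, replace_convs=True, replace_maxpools=True, blur_first=True)
# Only afterwards move the model to an accelerator and construct the optimizer.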

Channels Last

Algorithm

class composer.algorithms.channels_last.ChannelsLast(*args, **kwargs)[source]

Changes the memory format of the model to torch.channels_last. This usually yields improved GPU utilization.

Runs on Event.TRAINING_START and has no hyperparameters.

class composer.algorithms.channels_last.ChannelsLastHparams[source]

ChannelsLast has no hyperparameters, so this class has no member variables.
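
For reference, the transformation this algorithm applies corresponds roughly to the standard PyTorch conversion below (shown on a toy module); this is an informal sketch, not the algorithm's exact implementation:

import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3)
model = model.to(memory_format=torch.channels_last)  # convert parameters to the channels_last format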

ColOut

Algorithm

class composer.algorithms.colout.ColOut(p_row: float = 0.15, p_col: float = 0.15, batch: bool = True)[source]

Drops a fraction of the rows and columns of an input image. If the fraction of rows/columns dropped isn’t too large, this does not significantly alter the content of the image, but reduces its size and provides extra variability.

Parameters
  • p_row – Fraction of rows to drop (drop along H).

  • p_col – Fraction of columns to drop (drop along W).

  • batch – Run ColOut at the batch level.

class composer.algorithms.colout.ColOutHparams(p_row: float = 0.15, p_col: float = 0.15, batch: bool = True)[source]

See ColOut

Standalone

composer.algorithms.colout.colout(img: Union[torch.Tensor, PIL.Image.Image], p_row: float, p_col: float) Union[torch.Tensor, PIL.Image.Image][source]

Drops random rows and columns from a single image.

Parameters
  • img (torch.Tensor or PIL Image) – An input image as a torch.Tensor or PIL image

  • p_row (float) – Fraction of rows to drop (drop along H).

  • p_col (float) – Fraction of columns to drop (drop along W).

Returns

torch.Tensor or PIL Image – A smaller image with rows and columns dropped
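
A minimal sketch for a single image, assuming a (C, H, W) tensor layout:

import torch
from composer.algorithms.colout import colout

img = torch.rand(3, 224, 224)                  # single image, assumed (C, H, W)
smaller = colout(img, p_row=0.15, p_col=0.15)  # roughly 15% of rows and columns dropped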

CutOut

Algorithm

class composer.algorithms.cutout.CutOut(n_holes: int, length: int)[source]

Cutout is a data augmentation technique that works by masking out one or more square regions of an input image.

This implementation cuts out the same square from all images in a batch.

Parameters
  • X (Tensor) – Batch Tensor image of size (B, C, H, W).

  • n_holes – Integer number of holes to cut out

  • length – Side length of the square hole to cut out.

class composer.algorithms.cutout.CutOutHparams(n_holes: int, length: int)[source]

See CutOut

Standalone

composer.algorithms.cutout.cutout(X: torch.Tensor, n_holes: int, length: int) torch.Tensor[source]

See CutOut.

Parameters
  • X (Tensor) – Batch Tensor image of size (B, C, H, W).

  • n_holes – Integer number of holes to cut out

  • length – Side length of the square hole to cut out.

Returns

X_cutout – Image with n_holes of dimension length x length cut out of it.
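
A minimal sketch based on the signature above:

import torch
from composer.algorithms.cutout import cutout

X = torch.rand(8, 3, 32, 32)               # batch of images, (B, C, H, W)
X_cutout = cutout(X, n_holes=1, length=8)  # masks the same 8x8 square in every image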

Ghost Batch Normalization

Algorithm

class composer.algorithms.ghost_batchnorm.GhostBatchNorm(ghost_batch_size: int = 32)[source]

Replaces batch normalization modules with Ghost Batch Normalization modules that simulate the effect of using a smaller batch size.

Works by splitting the input into chunks of ghost_batch_size samples and running batch normalization on each chunk separately. Dim 0 is assumed to be the sample axis.

Runs on Event.INIT and should be applied both before the model has been moved to accelerators and before the model’s parameters have been passed to an optimizer.

Parameters

ghost_batch_size – size of sub-batches to normalize over

class composer.algorithms.ghost_batchnorm.GhostBatchNormHparams(ghost_batch_size: int)[source]

See GhostBatchNorm

Standalone

composer.algorithms.ghost_batchnorm.apply_ghost_batchnorm(model: torch.nn.modules.module.Module, ghost_batch_size: int) torch.nn.modules.module.Module[source]

Replace batch normalization modules with ghost batch normalization modules.

Must be run before the model has been moved to accelerators and before the model’s parameters have been passed to an optimizer.

Parameters
  • model – model to transform

  • ghost_batch_size – size of sub-batches to normalize over
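
A minimal sketch based on the signature above; the toy model is illustrative only:

import torch.nn as nn
from composer.algorithms.ghost_batchnorm import apply_ghost_batchnorm

model = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3), nn.BatchNorm2d(16), nn.ReLU())
model = apply_ghost_batchnorm(model, ghost_batch_size=32)
# Only afterwards move the model to an accelerator and construct the optimizer.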

Label Smoothing

Algorithm

class composer.algorithms.label_smoothing.LabelSmoothing(alpha: float)[source]

Shrinks targets towards a uniform distribution to counteract label noise as in Szegedy et al.

This is computed by (1 - alpha) * targets + alpha * smoothed_targets where smoothed_targets is a vector of ones.

Introduced in Rethinking the Inception Architecture for Computer Vision.

Parameters

alpha – Strength of the label smoothing, in [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored).

class composer.algorithms.label_smoothing.LabelSmoothingHparams(alpha: float)[source]

See LabelSmoothing

Standalone

composer.algorithms.label_smoothing.smooth_labels(logits: torch.Tensor, targets: torch.Tensor, alpha: float)[source]

Shrinks targets towards a uniform distribution to counteract label noise as in Szegedy et al.

This is computed by (1 - alpha) * targets + alpha * smoothed_targets where smoothed_targets is a vector of ones.

Parameters
  • logits – Output of the model. Tensor of shape (N, C, d1, …, dn) for N examples and C classes, and d1, …, dn extra dimensions.

  • targets – Tensor of shape (N) containing integers 0 <= i <= C-1 specifying the target labels for each example.

  • alpha – Strength of the label smoothing, in [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored).
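
A minimal sketch based on the signature above:

import torch
from composer.algorithms.label_smoothing import smooth_labels

logits = torch.randn(16, 10)           # (N, C) model outputs: 16 examples, 10 classes
targets = torch.randint(0, 10, (16,))  # (N,) integer class labels
smoothed = smooth_labels(logits, targets, alpha=0.1)  # smoothed version of the targets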

Layer Freezing

Algorithm

class composer.algorithms.layer_freezing.LayerFreezing(freeze_start: float = 0.5, freeze_level: float = 1.0)[source]

Progressively freeze the layers of the network during training, starting with the earlier layers.

Freezing starts after the fraction of epochs specified by freeze_start have run. The fraction of layers frozen increases linearly until it reaches freeze_level at the final epoch.

This freezing schedule is most similar to FreezeOut and Freeze Training.

Runs on Event.EPOCH_END.

Parameters
  • freeze_start – the fraction of epochs to run before freezing begins

  • freeze_level – the maximum fraction of layers to freeze

class composer.algorithms.layer_freezing.LayerFreezingHparams(freeze_start: float = 0.5, freeze_level: float = 1.0)[source]

See LayerFreezing

Standalone

composer.algorithms.layer_freezing.freeze_layers(model: torch.nn.modules.module.Module, optimizers: Union[torch.optim.optimizer.Optimizer, Tuple[torch.optim.optimizer.Optimizer, ...]], current_epoch: int, max_epochs: int, freeze_start: float, freeze_level: float, logger: Optional[composer.core.logging.logger.Logger] = None) torch.nn.modules.module.Module[source]

Progressively freeze the layers of the network during training, starting with the earlier layers.

Parameters
  • model – an instance of the model being trained

  • optimizers – the optimizers used during training

  • current_epoch – integer specifying the current epoch

  • max_epochs – the max number of epochs training will run for

  • freeze_start – the fraction of epochs to run before freezing begins

  • freeze_level – the maximum fraction of layers to freeze
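
A minimal sketch based on the signature above, as it might be called once per epoch from a custom training loop; the toy model and optimizer are illustrative only:

import torch
import torch.nn as nn
from composer.algorithms.layer_freezing import freeze_layers

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# e.g. at the end of epoch 7 of 10:
model = freeze_layers(
    model=model,
    optimizers=optimizer,
    current_epoch=7,
    max_epochs=10,
    freeze_start=0.5,
    freeze_level=1.0,
)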

MixUp

Algorithm

class composer.algorithms.mixup.MixUp(alpha: float)[source]

MixUp trains the network on convex combinations of pairs of examples and targets rather than individual examples and targets.

This is done by taking a convex combination of a given batch X with a randomly permuted copy of X. The mixing coefficient is drawn from a Beta(alpha, alpha) distribution.

Training in this fashion reduces generalization error.

Parameters

alpha – the pseudocount for the Beta distribution used to sample interpolation parameters. As alpha grows, the two samples in each pair tend to be weighted more equally. As alpha approaches 0 from above, the combination approaches using only one element of the pair.

class composer.algorithms.mixup.MixUpHparams(alpha: float)[source]

See MixUp

Standalone

composer.algorithms.mixup.mixup_batch(x: torch.Tensor, y: torch.Tensor, interpolation_lambda: float, n_classes: int, indices: Optional[torch.Tensor] = None) Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Create new samples using convex combinations of pairs of samples.

This is done by taking a convex combination of x with a randomly permuted copy of x. The interpolation parameter lambda should be chosen from a Beta(alpha, alpha) distribution for some parameter alpha > 0. Note that the same lambda is used for all examples within the batch.

Both the original and shuffled labels are returned. This is done because for many loss functions (such as cross entropy) the targets are given as indices, so interpolation must be handled separately.

Parameters
  • x – input tensor of shape (B, d1, d2, …, dn), B is batch size, d1-dn are feature dimensions.

  • y – target tensor of shape (B, f1, f2, …, fm), B is batch size, f1-fm are possible target dimensions.

  • interpolation_lambda – amount of interpolation based on alpha.

  • n_classes – total number of classes.

  • indices – Permutation of the batch indices 1..B. Used for permuting without randomness.

Returns
  • x_mix – batch of inputs after mixup has been applied

  • y_mix – labels after mixup has been applied

  • perm – the permutation used

Example

from composer import functional as CF

for X, y in dataloader:
    l = CF.gen_interpolation_lambda(alpha=0.2)
    X, y, _ = CF.mixup_batch(X, y, l, nclasses)
    pred = model(X)
    loss = loss_fun(pred, y)  # loss_fun must accept dense labels (i.e. NOT indices)

Progressive Resizing

Algorithm

class composer.algorithms.progressive_resizing.ProgressiveResizing(mode: str = 'resize', initial_scale: float = 0.5, finetune_fraction: float = 0.2, resize_targets: bool = False)[source]

Apply Fastai's progressive resizing data augmentation to speed up training.

Progressive resizing initially reduces input resolution to speed up early training. Throughout training, the downsampling factor is gradually increased, yielding larger inputs up to the original input size. A final finetuning period is then run to finetune the model using the full-sized inputs.

Parameters
  • mode – Type of scaling to perform. Value must be one of 'crop' or 'resize'. 'crop' performs a random crop, whereas 'resize' performs a bilinear interpolation.

  • initial_scale – Initial scale factor used to shrink the inputs. Must be a value in between 0 and 1.

  • finetune_fraction – Fraction of training to reserve for finetuning on the full-sized inputs. Must be a value in between 0 and 1.

  • resize_targets – If True, resize targets also.

class composer.algorithms.progressive_resizing.ProgressiveResizingHparams(mode: str = 'resize', initial_scale: float = 0.5, finetune_fraction: float = 0.2, resize_targets: bool = False)[source]

See ProgressiveResizing

Standalone

composer.algorithms.progressive_resizing.resize_inputs(X: torch.Tensor, y: torch.Tensor, scale_factor: float, mode: str = 'resize', resize_targets: bool = False) Tuple[torch.Tensor, torch.Tensor][source]

Resize inputs and optionally outputs by cropping or interpolating.

Parameters
  • X – input tensor of shape (N, C, H, W). Resizing will be done along dimensions H and W using the constant factor scale_factor.

  • y – output tensor of shape (N, C, H, W) that will also be resized if resize_targets is True.

  • scale_factor – scaling coefficient for the height and width of the input/output tensor. 1.0 keeps the original size.

  • mode – type of scaling to perform. Value must be one of 'crop' or 'resize'. 'crop' performs a random crop, whereas 'resize' performs a bilinear interpolation.

  • resize_targets – whether to resize the targets, y, as well

Returns
  • X_sized – resized input tensor of shape (N, C, H * scale_factor, W * scale_factor).

  • y_sized – if resize_targets is True, resized output tensor of shape (N, C, H * scale_factor, W * scale_factor). Otherwise returns the original y.
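
A minimal sketch based on the signature above, using dense (N, C, H, W) targets so that they can also be resized:

import torch
from composer.algorithms.progressive_resizing import resize_inputs

X = torch.rand(8, 3, 224, 224)  # inputs, (N, C, H, W)
y = torch.rand(8, 1, 224, 224)  # dense targets, (N, C, H, W)
X_sized, y_sized = resize_inputs(X, y, scale_factor=0.5, mode='resize', resize_targets=True)
# X_sized has spatial size 112 x 112; y_sized is resized as well because resize_targets=True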

RandAugment

Algorithm

class composer.algorithms.randaugment.RandAugment(severity: int = 9, depth: int = 2, augmentation_set: str = 'all')[source]

Randomly applies a sequence of image data augmentations (Cubuk et al. 2019).

Parameters
  • severity (int) – Severity of augmentation operators (between 1 and 10). M in the original paper. Default = 9.

  • depth (int) – Depth of augmentation chain. N in the original paper. Default = 2.

  • augmentation_set (str) – One of [“augmentations_all”, “augmentations_corruption_safe”, “augmentations_original”]. Set of augmentations to use. “augmentations_corruption_safe” excludes transforms that are part of the ImageNet-C/CIFAR10-C test sets. “augmentations_original” uses all augmentations, but some of the implementations are identical to the original GitHub repo, which appears to contain implementation-specific behavior for the augmentations “color”, “contrast”, “sharpness”, and “brightness”. The original implementations have an intensity sampling scheme that samples a value bounded by 0.118 at a minimum, and a maximum value of intensity*0.18 + .1, which ranges from 0.28 (intensity = 1) to 1.9 (intensity = 10). These augmentations have different effects depending on whether they are < 0 or > 0 (or < 1 or > 1). “augmentations_all” uses implementations of “color”, “contrast”, “sharpness”, and “brightness” that account for diverging effects around 0 (or 1).

class composer.algorithms.randaugment.RandAugmentHparams(severity: int = 9, depth: int = 2, augmentation_set: str = 'all')[source]

See RandAugment

Standalone

composer.algorithms.randaugment.randaugment(img: Optional[PIL.Image.Image] = None, severity: int = 9, depth: int = 2, augmentation_set: List = [<function autocontrast>, <function equalize>, <function posterize>, <function rotate>, <function solarize>, <function shear_x>, <function shear_y>, <function translate_x>, <function translate_y>, <function color>, <function contrast>, <function brightness>, <function sharpness>]) PIL.Image.Image[source]

Randomly applies a sequence of image data augmentations (Cubuk et al. 2019). See RandAugment for details.
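
A minimal usage sketch of the standalone function, using a synthetic placeholder image:

from PIL import Image
from composer.algorithms.randaugment import randaugment

img = Image.new("RGB", (64, 64), color=(128, 128, 128))  # placeholder image
augmented = randaugment(img, severity=9, depth=2)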

Sequence Length Warmup

Algorithm

class composer.algorithms.seq_length_warmup.SeqLengthWarmup(duration: float = 0.3, min_seq_length: int = 8, max_seq_length: int = 1024, step_size: int = 8, truncate: bool = True)[source]

Progressively increases the sequence length during training.

Changes the sequence length of all tensors in the input batch. The sequence length increases from min_seq_length to max_seq_length in steps of step_size during the first duration fraction of training.

The sequence length is then kept at max_seq_length for the rest of training.

Tensors are either truncated (truncate=True) or reshaped to create new examples from the extra tokens (truncate=False).

Note

step_size should be a multiple of eight for GPUs

Note

Variable input lengths can create CUDA OOM errors. To avoid this, we follow PyTorch notes and pre-allocate the memory with a blank forward and backward pass.

Parameters
  • duration (float) – fraction of total training for sequential length learning.

  • min_seq_length (int) – Minimum sequence length to start the warmup.

  • max_seq_length (int) – Maximum sequence length to stop the warmup.

  • step_size (int) – Step size of sequence length.

  • truncate (bool) – Truncate tensors or reshape extra tokens to new examples

class composer.algorithms.seq_length_warmup.SeqLengthWarmupHparams(duration: float = 0.3, min_seq_length: int = 8, max_seq_length: int = 1024, step_size: int = 8, truncate: bool = True)[source]

Standalone

composer.algorithms.seq_length_warmup.apply_seq_length_warmup(batch: Dict[str, torch.Tensor], curr_seq_len: int, truncate: bool) Union[Sequence[Union[torch.Tensor, Tuple[torch.Tensor, ...], List[torch.Tensor]]], Dict[str, torch.Tensor], torch.Tensor][source]

Progressively increases the sequence length during training.

Changes the sequence length of all tensors in the provided dictionary to curr_seq_len, by either truncating the tensors (truncate=True) or reshaping the tensors to create new examples from the extra tokens (truncate=False).

The schedule for curr_seq_len over training time should be managed out of this function.

Parameters
  • batch – The input batch to the model, must be a dictionary.

  • curr_seq_len (int) – The desired sequence length to apply.

  • truncate (bool) – Truncate sequences early, or reshape tensors to create new examples out of the extra tokens.

Returns

batch – a Mapping of input tensors to the model, where all tensors have curr_seq_len in the second dimension.
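
A minimal sketch based on the signature above; the dictionary keys and vocabulary size are illustrative assumptions:

import torch
from composer.algorithms.seq_length_warmup import apply_seq_length_warmup

batch = {
    "input_ids": torch.randint(0, 30000, (4, 256)),          # (batch, seq_len), illustrative key name
    "attention_mask": torch.ones(4, 256, dtype=torch.long),  # illustrative key name
}
short_batch = apply_seq_length_warmup(batch, curr_seq_len=64, truncate=True)
# Every tensor in short_batch now has 64 in its second dimension.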

Sharpness-Aware Minimization

Algorithm

class composer.algorithms.sam.SAM(rho: float = 0.05, epsilon: float = 1e-12, interval: int = 1)[source]

Adds sharpness-aware minimization (Foret et al. 2020) by wrapping an existing optimizer with a SAMOptimizer.

Parameters
  • rho – The neighborhood size parameter of SAM. Must be greater than 0.

  • epsilon – A small value added to the gradient norm for numerical stability.

  • interval – SAM will run once per interval steps. A value of 1 will cause SAM to run every step. Steps on which SAM runs take roughly twice as much time to complete.

class composer.algorithms.sam.SAMHparams(rho: float = 0.05, epsilon: float = 1e-12, interval: int = 1)[source]

See SAM

Scaling the Learning Rate Schedule

Algorithm

class composer.algorithms.scale_schedule.ScaleSchedule(ratio: float, method: str = 'epoch')[source]

Makes the learning rate schedule take a different number of epochs.

Training for less time is a strong baseline approach to speeding up training, provided that the training still gets through the entire learning rate schedule. E.g., training for half as long often yields little accuracy degradation, provided that the learning rate schedule is rescaled to take half as long as well. In contrast, if the schedule is not rescaled, training for half as long would mean simply stopping halfway through the training curve, which does not reach nearly as high an accuracy.

To see the difference, consider training for half as long using a cosine annealing learning rate schedule. If the schedule is not rescaled, training ends while the learning rate is still ~0.5. If the schedule is rescaled, training ends after passing through the full cosine curve, at a learning rate on the order of 0.01 or smaller.

Parameters
  • ratio – The factor by which to scale the duration of the schedule. E.g., 0.5 makes the schedule take half as many epochs and 2.0 makes it take twice as many epochs.

  • method – Currently only "epochs" is supported.

class composer.algorithms.scale_schedule.ScaleScheduleHparams(ratio: float, method: str = 'epoch')[source]

See ScaleSchedule

Standalone

composer.algorithms.scale_schedule.scale_scheduler(scheduler: torch.optim.lr_scheduler._LRScheduler, ssr: float, orig_max_epochs: Optional[int] = None)[source]

Makes a learning rate schedule take a different number of epochs.

See ScaleSchedule for more information.

Parameters
  • scheduler

    A learning rate schedule object. Must be one of:

    • torch.optim.lr_scheduler.CosineAnnealingLR

    • torch.optim.lr_scheduler.CosineAnnealingWarmRestarts

    • torch.optim.lr_scheduler.ExponentialLR

    • torch.optim.lr_scheduler.MultiStepLR

    • torch.optim.lr_scheduler.StepLR

  • ssr – the factor by which to scale the duration of the schedule. E.g., 0.5 makes the schedule take half as many epochs and 2.0 makes it take twice as many epochs.

  • orig_max_epochs – the current number of epochs spanned by scheduler. Used along with ssr to determine the new number of epochs scheduler should span.

Raises

ValueError – If scheduler is not an instance of one of the above types.
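
A minimal sketch based on the signature above; the placeholder parameter and optimizer are illustrative only:

import torch
from composer.algorithms.scale_schedule import scale_scheduler

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
optimizer = torch.optim.SGD(params, lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Rescale the schedule to span half as many epochs:
scale_scheduler(scheduler, ssr=0.5, orig_max_epochs=100)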

Selective Backpropagation

Algorithm

class composer.algorithms.selective_backprop.SelectiveBackprop(start: float, end: float, keep: float, scale_factor: float, interrupt: int)[source]

Selectively backpropagate gradients from a subset of each batch (Jiang et al. 2019).

Selective Backprop (SB) prunes minibatches according to the difficulty of the individual training examples, and only computes weight gradients over the pruned subset, reducing iteration time and speeding up training. The fraction of the minibatch that is kept for gradient computation is specified by the argument keep, where 0 <= keep <= 1.

To speed up SB’s selection forward pass, the argument scale_factor can be used to spatially downsample input image tensors. The full-sized inputs will still be used for the weight gradient computation.

To preserve convergence, SB can be interrupted with vanilla minibatch gradient steps every interrupt steps. When interrupt=0, SB will be used at every step during the SB interval. When interrupt=2, SB will alternate with vanilla minibatch steps.

Parameters
  • start – SB interval start as fraction of training duration

  • end – SB interval end as fraction of training duration

  • keep – fraction of minibatch to select and keep for gradient computation

  • scale_factor – scale for downsampling input for selection forward pass

  • interrupt – interrupt SB with a vanilla minibatch step every interrupt batches

class composer.algorithms.selective_backprop.SelectiveBackpropHparams(start: float, end: float, keep: float, scale_factor: float, interrupt: int)[source]

See SelectiveBackprop

Squeeze-and-Excitation

Algorithm

class composer.algorithms.squeeze_excite.SqueezeExcite(latent_channels: float = 64, min_channels: int = 128)[source]

Adds Squeeze-and-Excitation blocks (Hu et al. 2019) after the Conv2d modules in a neural network.

See SqueezeExcite2d for more information.

Parameters
  • latent_channels – Dimensionality of the hidden layer within the added MLP. If less than 1, interpreted as a fraction of num_features.

  • min_channels – An SE block is added after a Conv2d module conv only if min(conv.in_channels, conv.out_channels) >= min_channels. For models that reduce spatial size and increase channel count deeper in the network, this parameter can be used to only add SE blocks deeper in the network. This may be desirable because SE blocks add less overhead when their inputs have smaller spatial size.

class composer.algorithms.squeeze_excite.SqueezeExciteHparams(latent_channels: float = 64, min_channels: int = 128)[source]

See SqueezeExcite

Standalone

class composer.algorithms.squeeze_excite.SqueezeExcite2d(num_features: int, latent_channels: float = 0.125)[source]

Squeeze-and-Excitation block from (Hu et al. 2019)

This block applies global average pooling to the input, feeds the resulting vector to a single-hidden-layer fully-connected network (MLP), and uses the output of this MLP as attention coefficients to rescale the input. This allows the network to take into account global information about each input, as opposed to only local receptive fields like in a convolutional layer.

Parameters
  • num_features – Number of features or channels in the input

  • latent_channels – Dimensionality of the hidden layer within the added MLP. If less than 1, interpreted as a fraction of num_features.

class composer.algorithms.squeeze_excite.SqueezeExciteConv2d(*args, latent_channels=0.125, conv: Optional[torch.nn.modules.conv.Conv2d] = None, **kwargs)[source]

Helper class used to add a SqueezeExcite2d module after a Conv2d.

composer.algorithms.squeeze_excite.apply_se(model: torch.nn.modules.module.Module, latent_channels: float, min_channels: int)[source]

See SqueezeExcite
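
A minimal sketch based on the signature above and the min_channels rule described for SqueezeExcite; the toy model is illustrative only:

import torch.nn as nn
from composer.algorithms.squeeze_excite import apply_se

model = nn.Sequential(
    nn.Conv2d(3, 128, kernel_size=3),    # min(3, 128) < min_channels, so no SE block is added here
    nn.Conv2d(128, 256, kernel_size=3),  # min(128, 256) >= min_channels, so an SE block is added
)
apply_se(model, latent_channels=64, min_channels=128)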

Stochastic Depth

Algorithm

class composer.algorithms.stochastic_depth.StochasticDepth(stochastic_method: str, target_layer_name: str, drop_rate: float = 0.2, drop_distribution: str = 'linear', drop_warmup: float = 0.0, use_same_gpu_seed: bool = True)[source]

Applies Stochastic Depth (Huang et al.) to the specified model.

The algorithm replaces the specified target layer with a stochastic version of the layer. The stochastic layer will randomly drop either samples or the layer itself depending on the stochastic method specified. The block-wise version follows the original paper. The sample-wise version follows the implementation used for EfficientNet in the Tensorflow/TPU repo.

Parameters
  • stochastic_method – The version of stochastic depth to use. "block" randomly drops blocks during training. "sample" randomly drops samples within a block during training.

  • target_layer_name – Block to replace with a stochastic block equivalent. The name must be registered in STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, only 'ResNetBottleneck' is supported.

  • drop_rate – The base probability of dropping a layer or sample. Must be between 0.0 and 1.0.

  • drop_distribution – How drop_rate is distributed across layers. Value must be one of "uniform" or "linear". "uniform" assigns the same drop_rate across all layers. "linear" linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.

  • drop_warmup – Percentage of training epochs to linearly increase the drop probability to linear_drop_rate. Must be between 0.0 and 1.0.

  • use_same_gpu_seed – Set to True to have the same layers dropped across GPUs when using multi-GPU training. Set to False to have each GPU drop a different set of layers. Only used with "block" stochastic method.

class composer.algorithms.stochastic_depth.StochasticDepthHparams(stochastic_method: str, target_layer_name: str, drop_rate: float = 0.2, drop_distribution: str = 'linear', use_same_gpu_seed: bool = True, drop_warmup: float = 0.0)[source]

See StochasticDepth

Standalone

class composer.algorithms.stochastic_depth.StochasticBottleneck(drop_rate: float, module_id: int, module_count: int, use_same_gpu_seed: bool, use_same_depth_across_gpus: bool, rand_generator: torch._C.Generator, **kwargs)[source]

Stochastic ResNet Bottleneck block. During training, this block has a probability of skipping the transformation section of the layer; during inference, it scales the transformation section output by (1 - drop probability).

Parameters
  • drop_rate – Probability of dropping the block. Must be between 0.0 and 1.0.

  • module_id – The placement of the block within a network e.g. 0 for the first layer in the network.

  • module_count – The total number of blocks of this type in the network

  • use_same_gpu_seed – Set to True to have the same layers dropped across GPUs when using multi-GPU training. Set to False to have each GPU drop a different set of layers. Only used with "block" stochastic method.

  • use_same_depth_across_gpus – Set to True to have the same number of blocks dropped across GPUs. Should be set to True when drop_distribution is "uniform" and set to False for "linear".

composer.algorithms.stochastic_depth.apply_stochastic_depth(model: torch.nn.modules.module.Module, stochastic_method: str, target_layer_name: str, drop_rate: float = 0.2, drop_distribution: str = 'linear', use_same_gpu_seed: bool = True) None[source]

Applies Stochastic Depth (Huang et al.) to the specified model.

The algorithm replaces the specified target layer with a stochastic version of the layer. The stochastic layer will randomly drop either samples or the layer itself depending on the stochastic method specified. The block-wise version follows the original paper. The sample-wise version follows the implementation used for EfficientNet in the Tensorflow/TPU repo.

Parameters
  • model – model containing modules to be replaced with stochastic versions

  • stochastic_method – The version of stochastic depth to use. "block" randomly drops blocks during training. "sample" randomly drops samples within a block during training.

  • target_layer_name – Block to replace with a stochastic block equivalent. The name must be registered in STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, only 'ResNetBottleneck' is supported.

  • drop_rate – The base probability of dropping a layer or sample. Must be between 0.0 and 1.0.

  • drop_distribution – How drop_rate is distributed across layers. Value must be one of "uniform" or "linear". "uniform" assigns the same drop_rate across all layers. "linear" linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.

  • use_same_gpu_seed – Set to True to have the same layers dropped across GPUs when using multi-GPU training. Set to False to have each GPU drop a different set of layers. Only used with "block" stochastic method.
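
A minimal sketch based on the signature above. It assumes torchvision is installed and that the registered 'ResNetBottleneck' target matches the bottleneck blocks in torchvision's ResNet-50; verify against the installed version:

import torchvision.models as models
from composer.algorithms.stochastic_depth import apply_stochastic_depth

model = models.resnet50()  # assumed to contain blocks matching the 'ResNetBottleneck' target
apply_stochastic_depth(
    model,
    stochastic_method='block',
    target_layer_name='ResNetBottleneck',
    drop_rate=0.2,
    drop_distribution='linear',
)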

Stochastic Weight Averaging

Algorithm

class composer.algorithms.swa.SWA(swa_start: float = 0.8, anneal_epochs: int = 10, swa_lr: Optional[float] = None)[source]

Apply Stochastic Weight Averaging (Izmailov et al.).

Stochastic Weight Averaging (SWA) averages model weights sampled at different times near the end of training. This leads to better generalization than just using the final trained weights.

Because this algorithm needs to maintain both the current value of the weights and the average of all of the sampled weights, it doubles the model’s memory consumption. Note that this does not mean that the total memory required doubles, however, since stored activations and the optimizer state are not doubled.

Parameters
  • swa_start – fraction of training completed before stochastic weight averaging is applied

  • swa_lr – the final learning rate used for weight averaging

Note that ‘anneal_epochs’ is not used in the current implementation.

class composer.algorithms.swa.SWAHparams(swa_start: float = 0.8, anneal_epochs: int = 10, swa_lr: Optional[float] = None)[source]

See SWA