composer.algorithms

We describe programmatic modifications to the model or training process as “algorithms.” Examples include smoothing the labels and adding Squeeze-and-Excitation blocks, among many others.

Algorithms can be used in two ways:

Using Algorithm objects. These objects provide callbacks to be run in the training loop.
Using algorithm-specific functions and classes, such as smooth_labels or SqueezeExcite2d.

The former are easier to compose, since they all share the same public interface and work automatically with the Composer Trainer. The latter are easier to integrate piecemeal into an existing codebase.

See Algorithm for more information.

The following algorithms are available in Composer:

`Alibi`	Algorithm to apply ALiBi to the model.
`AugMix`	Object that does AugMix (Hendrycks et al. (2020), AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty).
`BlurPool`	Algorithm to apply BlurPool to the model.
`ChannelsLast`	ChannelsLast algorithm runs on Event.TRAINING_START and changes the memory format of the model to torch.channels_last.
`ColOut`
`CutOut`	Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level.
`GhostBatchNorm`	Algorithm to apply Ghost Batch Normalization to the model.
`LabelSmoothing`	Applies label smoothing during before_loss, then restores the original labels during after_loss.
`LayerFreezing`	Algorithm to apply Layer Freezing to the model.
`MixUp`	Applies MixUp algorithm by modifying the images and labels during Event.AFTER_DATALOADER.
`ProgressiveResizing`	Applies the 'progressive resizing' data augmentation algorithm to speed up training.
`RandAugment`	Object that does RandAugment (Cubuk et al. (2019), RandAugment: Practical automated data augmentation with a reduced search space).
`SAM`	Applies SAM by wrapping existing optimizers with the SAMOptimizer.
`ScaleSchedule`	Scale the learning rate schedule
`SqueezeExcite`	Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.
`StochasticDepth`	Algorithm to replace a specified block with a stochastic version of the block.
`SWA`	Apply Stochastic Weight Averaging

Alibi

Algorithm

class composer.algorithms.alibi.Alibi(position_embedding_attribute, attention_module_name, attr_to_replace, alibi_attention, mask_replacement_function, heads_per_layer, max_sequence_length, train_sequence_length_scaling)[source]

Algorithm to apply ALiBi to the model. Runs on Event.INIT. This algorithm should be applied before the model has been moved to accelerators.

Parameters

heads_per_layer (int) – number of attention heads per layer
max_sequence_length (int) – maximum sequence length that the model will be able to accept without returning an error
position_embedding_attribute (str) – attribute for position embeddings. For example in HuggingFace’s GPT2, the position embeddings are “transformer.wpe”.
attention_module_name (str) – module/class that will have its self-attention function replaced. For example, in HuggingFace’s GPT2, the self-attention module is “transformers.models.gpt2.modeling_gpt2.GPT2Attention”.
attr_to_replace (str) – attribute that self-attention function will replace. For example, in HuggingFace’s GPT2, the self-attention function is “_attn”.
alibi_attention (str) – Path to new self-attention function in which ALiBi is implemented. Used to replace “{attention_module}.{attr_to_replace}”.
mask_replacement_function (str) – Path to function to replace model’s attention mask. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.
train_sequence_length_scaling (float) – Amount by which to scale training sequence length. One batch of training data will be reshaped from size (sequence_length, batch) to (sequence_length * train_sequence_length_scaling, batch / train_sequence_length_scaling).
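The effect of train_sequence_length_scaling on the batch shape can be sketched as plain arithmetic. The helper name scaled_batch_shape is hypothetical and not part of Composer; the sketch assumes the scaled sequence length and batch size come out as whole numbers.

```python
from typing import Tuple


def scaled_batch_shape(sequence_length: int, batch: int, scaling: float) -> Tuple[int, int]:
    """Return the (sequence_length, batch) shape after sequence length scaling.

    Hypothetical helper for illustration only; the total token count is preserved.
    """
    new_sequence_length = int(sequence_length * scaling)
    new_batch = int(batch / scaling)
    # The reshape is only valid if no tokens are lost or invented.
    if new_sequence_length * new_batch != sequence_length * batch:
        raise ValueError("scaling does not divide the batch evenly")
    return new_sequence_length, new_batch


# With the default scaling of 0.25, sequences get 4x shorter and the batch 4x larger.
print(scaled_batch_shape(1024, 8, 0.25))  # (256, 32)
```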

class composer.algorithms.alibi.AlibiHparams(position_embedding_attribute: 'str', attention_module_name: 'str', attr_to_replace: 'str', alibi_attention: 'str', mask_replacement_function: 'Union[str, None]' = None, heads_per_layer: 'Union[int, Optional[None]]' = None, max_sequence_length: 'int' = 8192, train_sequence_length_scaling: 'float' = 0.25)[source]

Parameters

position_embedding_attribute (str) –
attention_module_name (str) –
attr_to_replace (str) –
alibi_attention (str) –
mask_replacement_function (Optional[str]) –
heads_per_layer (Optional[int]) –
max_sequence_length (int) –
train_sequence_length_scaling (float) –

Return type

None

Standalone

composer.algorithms.alibi.apply_alibi(model, heads_per_layer, max_sequence_length, position_embedding_attribute, attention_module, attr_to_replace, alibi_attention, mask_replacement_function)[source]

Applies ALiBi to the provided model. Removes position embeddings and replaces the attention function and attention mask.

Parameters

model (torch.nn.Module) – model to transform
heads_per_layer (int) – number of attention heads per layer
max_sequence_length (int) – maximum sequence length that the model will be able to accept without returning an error
position_embedding_attribute (str) – attribute for position embeddings. For example in HuggingFace’s GPT2, the position embeddings are “transformer.wpe”.
attention_module (str) – module/class that will have its self-attention function replaced. For example, in HuggingFace’s GPT2, the self-attention module is transformers.models.gpt2.modeling_gpt2.GPT2Attention.
attr_to_replace (str) – attribute that self-attention function will replace. For example, in HuggingFace’s GPT2, the self-attention function is “_attn”.
alibi_attention (callable) – new self-attention function in which ALiBi is implemented. Used to replace “{attention_module}.{attr_to_replace}”.
mask_replacement_function (callable) – function to replace model’s attention mask. This is sometimes necessary for evaluating on sequence lengths longer than the model was initialized to accommodate.

Return type

None

AugMix

Algorithm

class composer.algorithms.augmix.AugMix(severity=3, depth=-1, width=3, alpha=1.0, augmentation_set='all')[source]

Object that does AugMix (Hendrycks et al. (2020), AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty). Can be passed as a transform to torchvision.transforms.Compose().

Parameters

severity – Severity of augmentation operators (between 1 and 10).
width – Width of augmentation chains (number of parallel augmentations).
depth – Depth of each augmentation chain. -1 samples the depth uniformly from [1, 3].
alpha – Probability coefficient for Beta and Dirichlet distributions. Sampling from the Dirichlet distribution determines the relative weights of each augmented image; sampling from the Beta distribution determines the relative weights of the unaugmented and augmented images.
augmentation_set – String, one of [“augmentations_all”, “augmentations_corruption_safe”, “augmentations_original”]. Set of augmentations to use. “augmentations_corruption_safe” excludes transforms that are part of the ImageNet-C/CIFAR10-C test sets. “augmentations_original” uses all augmentations, but some of the implementations are identical to the original github repo, which appears to contain implementation specificities for the augmentations “color”, “contrast”, “sharpness”, and “brightness”.
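The roles of width, depth, and alpha can be illustrated with the standard library alone. This is a sketch, not Composer's implementation: augmix_weights and mix_pixel are hypothetical names, and a single float stands in for an image.

```python
import random


def augmix_weights(width=3, depth=-1, alpha=1.0, rng=None):
    """Sample the random quantities AugMix needs for one image (illustrative)."""
    if rng is None:
        rng = random.Random()
    # A Dirichlet(alpha, ..., alpha) sample is a set of normalized Gamma draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(width)]
    total = sum(gammas)
    chain_weights = [g / total for g in gammas]
    # Beta(alpha, alpha) sets the balance between the original and the mixture.
    original_weight = rng.betavariate(alpha, alpha)
    # depth=-1 draws each chain's depth uniformly from {1, 2, 3}.
    depths = [rng.randint(1, 3) if depth == -1 else depth for _ in range(width)]
    return chain_weights, original_weight, depths


def mix_pixel(original, augmented, chain_weights, original_weight):
    """Blend one pixel value with the weighted sum of its augmented versions."""
    combined = sum(w * a for w, a in zip(chain_weights, augmented))
    return original_weight * original + (1.0 - original_weight) * combined
```

Because the chain weights sum to 1 and the Beta draw lies in [0, 1], the result is a convex combination of the original and augmented values.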

class composer.algorithms.augmix.AugMixHparams(severity: int = 3, depth: int = -1, width: int = 3, alpha: float = 1.0, augmentation_set: str = 'all')[source]

Parameters

severity (int) –
depth (int) –
width (int) –
alpha (float) –
augmentation_set (str) –

Return type

None

Standalone

composer.algorithms.augmix.augment_and_mix(img=None, severity=3, depth=-1, width=3, alpha=1.0, augmentation_set=[<function autocontrast>, <function equalize>, <function posterize>, <function rotate>, <function solarize>, <function shear_x>, <function shear_y>, <function translate_x>, <function translate_y>, <function color>, <function contrast>, <function brightness>, <function sharpness>])[source]

Perform augmentations.

Parameters

img (Optional[PIL.Image.Image]) –
severity (int) –
depth (int) –
width (int) –
alpha (float) –
augmentation_set (List) –

Return type

PIL.Image.Image

BlurPool

Algorithm

class composer.algorithms.blurpool.BlurPool(replace_convs, replace_maxpools, blur_first)[source]

Algorithm to apply BlurPool to the model. Runs on Event.INIT. This algorithm should be applied before the model has been moved to devices.

Parameters

replace_convs (bool) – replace eligible Conv2d with BlurConv2d. Default: True.
replace_maxpools (bool) – replace eligible MaxPool2d with BlurMaxPool2d. Default: True.
blur_first (bool) – for replace_convs, blur input before conv. Default: True.

class composer.algorithms.blurpool.BlurPoolHparams(replace_convs: 'bool' = True, replace_maxpools: 'bool' = True, blur_first: 'bool' = True)[source]

Parameters

replace_convs (bool) –
replace_maxpools (bool) –
blur_first (bool) –

Return type

None

Standalone

class composer.algorithms.blurpool.BlurConv2d(in_channels, out_channels, kernel_size, stride=None, padding=0, dilation=1, groups=1, bias=True, blur_first=True)[source]

This module is a drop-in replacement for PyTorch’s Conv2d, but with an anti-aliasing filter applied.

It should be used only to replace strided convolutions.

See the associated paper for more details, experimental results, etc.
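To see why an anti-aliasing filter matters for strided layers, consider a one-dimensional sketch. The [1, 2, 1] / 4 kernel and the edge-replication padding are assumptions made for illustration; BlurConv2d itself operates on 2D feature maps and its exact filter is not stated here.

```python
def blur_then_stride_1d(signal, stride=2):
    """Low-pass filter a 1D signal, then subsample it (illustrative only)."""
    kernel = [0.25, 0.5, 0.25]  # assumed binomial blur kernel
    # Replicate the edge values so the blurred signal keeps the input length.
    padded = [signal[0]] + list(signal) + [signal[-1]]
    blurred = [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]
    return blurred[::stride]


alternating = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
# Naive striding sees only the zeros and loses the signal entirely.
print(alternating[::2])
# Blurring first keeps information about the neighbouring samples.
print(blur_then_stride_1d(alternating))
```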

Channels Last

Algorithm

class composer.algorithms.channels_last.ChannelsLast(*args, **kwargs)[source]

ChannelsLast algorithm runs on Event.TRAINING_START and changes the memory format of the model to torch.channels_last. This algorithm has no hyperparameters.

class composer.algorithms.channels_last.ChannelsLastHparams[source]

ChannelsLast algorithm has no hyperparameters.

Return type

None

ColOut

Algorithm

class composer.algorithms.colout.ColOut(p_row=0.15, p_col=0.15, batch=True)[source]

class composer.algorithms.colout.ColOutHparams(p_row: 'float' = 0.15, p_col: 'float' = 0.15, batch: 'bool' = True)[source]

Parameters

p_row (float) –
p_col (float) –
batch (bool) –

Return type

None

Standalone

composer.algorithms.colout.colout(img, p_row, p_col)[source]

Drops random rows and columns from a single image.

Parameters

img (torch.Tensor or PIL Image) – An input image as a torch.Tensor or PIL image
p_row (float) – Fraction of rows to drop (drop along H).
p_col (float) – Fraction of columns to drop (drop along W).

Returns

torch.Tensor or PIL Image – A smaller image with rows and columns dropped

Return type

Union[torch.Tensor, PIL.Image.Image]
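A minimal sketch of the row and column dropping, using a nested list in place of a tensor or PIL image. The name colout_sketch and the rounding of the number of dropped rows and columns are assumptions, not Composer's implementation.

```python
import random


def colout_sketch(img, p_row=0.15, p_col=0.15, rng=None):
    """Drop a random fraction of rows and columns from an H x W nested list."""
    if rng is None:
        rng = random.Random()
    height, width = len(img), len(img[0])
    # Number of rows/columns to keep (the drop count is rounded down, by assumption).
    n_rows = height - int(p_row * height)
    n_cols = width - int(p_col * width)
    # Sample which indices survive, then sort to preserve the spatial order.
    keep_rows = sorted(rng.sample(range(height), n_rows))
    keep_cols = sorted(rng.sample(range(width), n_cols))
    return [[img[r][c] for c in keep_cols] for r in keep_rows]


image = [[r * 8 + c for c in range(8)] for r in range(8)]
smaller = colout_sketch(image, p_row=0.25, p_col=0.25, rng=random.Random(0))
print(len(smaller), len(smaller[0]))  # 6 6
```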

CutOut

Algorithm

class composer.algorithms.cutout.CutOut(n_holes, length)[source]

Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level.

Parameters

n_holes (int) – Integer number of holes to cut out.
length (int) – Side length of the square hole to cut out.

class composer.algorithms.cutout.CutOutHparams(n_holes: int, length: int)[source]

Parameters

n_holes (int) –
length (int) –

Return type

None

Standalone

composer.algorithms.cutout.cutout(X, n_holes, length)[source]

Implements CutOut augmentation from https://arxiv.org/abs/1708.04552 on the batch level. Adapted from the implementation in https://github.com/uoguelph-mlrg/Cutout

Parameters

X (Tensor) – Batch Tensor image of size (B, C, H, W).
n_holes (int) – Integer number of holes to cut out
length (int) – Side length of the square hole to cut out.

Returns

X_cutout – Image with n_holes of dimension length x length cut out of it.

Return type

torch.Tensor
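The masking step can be sketched on a single 2D image stored as a nested list. The name cutout_sketch, the zero fill value, and the clipping of holes at the image border are assumptions made for illustration; the function documented above works on a (B, C, H, W) tensor.

```python
import random


def cutout_sketch(image, n_holes, length, rng=None):
    """Zero out n_holes square patches of side `length` in an H x W nested list."""
    if rng is None:
        rng = random.Random()
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # leave the input untouched
    for _ in range(n_holes):
        # Pick a random centre; the hole is clipped where it crosses the border.
        cy, cx = rng.randrange(height), rng.randrange(width)
        y1, y2 = max(0, cy - length // 2), min(height, cy + length // 2)
        x1, x2 = max(0, cx - length // 2), min(width, cx + length // 2)
        for y in range(y1, y2):
            for x in range(x1, x2):
                result[y][x] = 0.0
    return result


ones = [[1.0] * 8 for _ in range(8)]
masked = cutout_sketch(ones, n_holes=1, length=4, rng=random.Random(0))
```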

Ghost Batch Normalization

Algorithm

class composer.algorithms.ghost_batchnorm.GhostBatchNorm(ghost_batch_size=32)[source]

Algorithm to apply Ghost Batch Normalization to the model.

This entails replacing all of the batch normalization modules with ghost batch normalization modules on Event.INIT.

Parameters

ghost_batch_size (int) – size of sub-batches to normalize over

class composer.algorithms.ghost_batchnorm.GhostBatchNormHparams(ghost_batch_size: 'int')[source]

Parameters

ghost_batch_size (int) –

Return type

None

Standalone

composer.algorithms.ghost_batchnorm.apply_ghost_batchnorm(model, ghost_batch_size)[source]

Replaces batch normalization modules with ghost batch normalization modules.

This algorithm should be applied before the model has been moved to accelerators, and before the model’s parameters have been passed to an optimizer.

Parameters

model (torch.nn.modules.module.Module) – model to transform
ghost_batch_size (int) – size of sub-batches to normalize over

Return type

torch.nn.modules.module.Module
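The idea of normalizing over sub-batches can be sketched for a single feature. The function name ghost_batch_normalize and the eps value are hypothetical, and the learned scale and shift of a real batch normalization layer are omitted.

```python
def ghost_batch_normalize(values, ghost_batch_size, eps=1e-5):
    """Normalize each sub-batch of `values` with its own mean and variance."""
    normalized = []
    for start in range(0, len(values), ghost_batch_size):
        chunk = values[start:start + ghost_batch_size]
        mean = sum(chunk) / len(chunk)
        variance = sum((v - mean) ** 2 for v in chunk) / len(chunk)
        # Statistics come from this sub-batch only, not from the full batch.
        normalized.extend((v - mean) / (variance + eps) ** 0.5 for v in chunk)
    return normalized


batch = [1.0, 2.0, 3.0, 4.0, 101.0, 102.0, 103.0, 104.0]
out = ghost_batch_normalize(batch, ghost_batch_size=4)
```

With a ghost batch size of 4, the two halves of the batch normalize to the same values even though their raw means differ by 100.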

Label Smoothing

Algorithm

class composer.algorithms.label_smoothing.LabelSmoothing(alpha)[source]

Applies label smoothing during before_loss, then restores the original labels during after_loss.

Parameters

alpha (float) – Strength of the label smoothing, in [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored).

class composer.algorithms.label_smoothing.LabelSmoothingHparams(alpha: float)[source]

Parameters

alpha (float) –

Return type

None

Standalone

composer.algorithms.label_smoothing.smooth_labels(logits, targets, alpha)[source]

Shrinks targets towards a prior distribution to counteract label noise.

This is computed by (1 - alpha) * targets + alpha * smoothed_targets where smoothed_targets is a pre-specified vector of class probabilities.

Introduced in: https://arxiv.org/abs/1512.00567 Evaluated in: https://arxiv.org/abs/1906.02629

Parameters

logits (torch.Tensor) – Output of the model. Tensor of shape (N, C, d1, …, dn) for N examples and C classes, and d1, …, dn extra dimensions.
targets (torch.Tensor) – Tensor of shape (N) containing integers 0 <= i <= C-1 specifying the target labels for each example.
alpha (float) – Strength of the label smoothing, in [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored).
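The formula (1 - alpha) * targets + alpha * smoothed_targets can be worked through for one example. The sketch assumes a uniform prior over classes for smoothed_targets; the helper name smooth_one_hot is hypothetical.

```python
def smooth_one_hot(target, n_classes, alpha):
    """Return the smoothed target distribution for one integer label."""
    uniform = 1.0 / n_classes  # assumed prior: every class equally likely
    return [
        (1.0 - alpha) * (1.0 if c == target else 0.0) + alpha * uniform
        for c in range(n_classes)
    ]


# Class 2 of 4 with alpha=0.1: the true class keeps most of the mass,
# and the remainder is spread evenly over all classes.
print(smooth_one_hot(2, 4, 0.1))
```

The two limiting cases match the parameter description: alpha=0 returns the one-hot target, and alpha=1 returns the prior regardless of the target.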

Layer Freezing

Algorithm

class composer.algorithms.layer_freezing.LayerFreezing(freeze_start=0.5, freeze_level=1.0)[source]

Algorithm to apply Layer Freezing to the model. Runs on Event.EPOCH_END. During training, progressively freezes the layers of the network, starting with the earlier layers. Freezing starts after the fraction of epochs specified by freeze_start has run. The fraction of frozen layers then increases linearly until it reaches freeze_level at the final epoch.

Parameters

freeze_start – The fraction of epochs to run before freezing begins.
freeze_level – The maximum fraction of layers to freeze.
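One way to read the freezing schedule is as a piecewise-linear function of training progress. The helper frozen_fraction below is a hypothetical sketch of that reading, not Composer's code.

```python
def frozen_fraction(current_epoch, max_epochs, freeze_start=0.5, freeze_level=1.0):
    """Fraction of layers frozen at the end of `current_epoch` (illustrative)."""
    progress = current_epoch / max_epochs
    # Nothing is frozen until freeze_start of training has elapsed.
    if progress <= freeze_start or freeze_start >= 1.0:
        return 0.0
    # Afterwards the fraction grows linearly, reaching freeze_level at the end.
    return freeze_level * (progress - freeze_start) / (1.0 - freeze_start)


for epoch in (50, 75, 100):
    print(epoch, frozen_fraction(epoch, max_epochs=100))
```

With the defaults, no layers are frozen at epoch 50 of 100, half of them at epoch 75, and all of them at epoch 100.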

class composer.algorithms.layer_freezing.LayerFreezingHparams(freeze_start: 'float' = 0.5, freeze_level: 'float' = 1.0)[source]

Parameters

freeze_start (float) –
freeze_level (float) –

Return type

None

Standalone

composer.algorithms.layer_freezing.freeze_layers(model, optimizers, current_epoch, max_epochs, freeze_start, freeze_level, logger)[source]

Implements the layer freezing algorithm. During training, progressively freeze the layers of the network starting with the earlier layers.

Parameters

model (torch.nn.modules.module.Module) – An instance of the model being trained.
optimizers (Union[torch.optim.optimizer.Optimizer, Tuple[torch.optim.optimizer.Optimizer, ...]]) – The optimizers used during training.
current_epoch (int) – Integer specifying the current epoch.
max_epochs (int) – The max number of epochs training will run for.
freeze_start (float) – The fraction of epochs to run before freezing begins.
freeze_level (float) – The maximum fraction of layers to freeze.
logger (composer.core.logging.logger.Logger) –

MixUp

Algorithm

class composer.algorithms.mixup.MixUp(alpha)[source]

Applies MixUp algorithm by modifying the images and labels during Event.AFTER_DATALOADER.

class composer.algorithms.mixup.MixUpHparams(alpha: float)[source]

Parameters

alpha (float) –

Return type

None

Standalone

composer.algorithms.mixup.mixup_batch(x, y, interpolation_lambda, n_classes, indices=None)[source]

Implements mixup on a single batch of data.

This constructs a new batch of data given an original batch. This is done through the convex combination of x with a randomly permuted copy of x. The interpolation parameter lambda should be chosen from a beta distribution with parameter alpha. Note that the same lambda is used for all examples within the batch.

Both the original and shuffled labels are returned. This is done because for many loss functions (such as cross entropy) the targets are given as indices, so interpolation must be handled separately.

Parameters

x (torch.Tensor) – Input tensor of shape (B, d1, d2, …, dn), B is batch size, d1-dn are feature dimensions.
y (torch.Tensor) – Target tensor of shape (B, f1, f2, …, fm), B is batch size, f1-fm are possible target dimensions.
interpolation_lambda (float) – Amount of interpolation based on alpha.
n_classes (int) – Total number of classes.
indices (Optional[torch.Tensor]) – Tensor of shape (B). Permutation of the batch indices. Used for permuting without randomness.

Returns

x_mix – Batch of inputs after mixup has been applied.
y_mix – Labels after mixup has been applied.
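The convex combination can be sketched with nested lists in place of tensors. The name mixup_sketch is hypothetical, and the sketch assumes integer class labels that are one-hot encoded before interpolation; in practice interpolation_lambda would be drawn from a Beta(alpha, alpha) distribution, for example with random.betavariate.

```python
def mixup_sketch(x, y, interpolation_lambda, n_classes, indices):
    """Mix each example with the example at the permuted index (illustrative)."""
    lam = interpolation_lambda

    def one_hot(label):
        return [1.0 if c == label else 0.0 for c in range(n_classes)]

    def blend(a, b):
        # The same lambda is applied to every example in the batch.
        return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

    x_mix = [blend(x[i], x[j]) for i, j in enumerate(indices)]
    y_mix = [blend(one_hot(y[i]), one_hot(y[j])) for i, j in enumerate(indices)]
    return x_mix, y_mix


x = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]
x_mix, y_mix = mixup_sketch(x, y, 0.75, n_classes=2, indices=[1, 0])
print(x_mix)  # [[0.25, 0.25], [0.75, 0.75]]
print(y_mix)  # [[0.75, 0.25], [0.25, 0.75]]
```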

Progressive Resizing

Algorithm

class composer.algorithms.progressive_resizing.ProgressiveResizing(mode, initial_scale, finetune_fraction, resize_targets)[source]

Applies the ‘progressive resizing’ data augmentation algorithm to speed up training. See Training a State-of-the-Art Model (https://github.com/fastai/fastbook/blob/780b76bef3127ce5b64f8230fce60e915a7e0735/07_sizing_and_tta.ipynb).

“Progressive resizing” initially scales inputs down to speed up early training. Throughout training, the scaling factor is gradually increased, yielding larger inputs up to the original input size. A final finetuning period is then run to finetune the model using the full-sized inputs.

Parameters

mode (str) – Type of scaling to perform. Value must be one of ‘crop’ or ‘resize’. ‘crop’ performs a random crop, whereas ‘resize’ performs a bilinear interpolation. Default: ‘resize’.
initial_scale (float) – Initial scale factor used to shrink the inputs. Must be a value in between 0 and 1.
finetune_fraction (float) – Fraction of training to reserve for finetuning on the full-sized inputs. Must be a value in between 0 and 1.
resize_targets (bool) – If True, resize targets also.
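How initial_scale and finetune_fraction interact can be sketched as a schedule over training progress. The linear ramp and the helper name resize_scale are assumptions; the description only states that the scale factor increases gradually.

```python
def resize_scale(progress, initial_scale=0.5, finetune_fraction=0.2):
    """Scale factor for inputs at a given fraction of training (illustrative)."""
    ramp_end = 1.0 - finetune_fraction
    # The final finetune_fraction of training uses full-sized inputs.
    if progress >= ramp_end:
        return 1.0
    # Before that, grow linearly from initial_scale towards 1.0 (assumed shape).
    return initial_scale + (1.0 - initial_scale) * progress / ramp_end


for progress in (0.0, 0.4, 0.9):
    print(progress, resize_scale(progress))
```

With the values shown, inputs start at half size, reach roughly three-quarters size at 40% of training, and are full-sized throughout the finetuning period.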

class composer.algorithms.progressive_resizing.ProgressiveResizingHparams(mode='resize', initial_scale=0.5, finetune_fraction=0.2, resize_targets=False)[source]

Hyperparameters for the ‘progressive resizing’ algorithm

Parameters

mode (str) –
initial_scale (float) –
finetune_fraction (float) –
resize_targets (bool) –

Return type

None

Standalone

composer.algorithms.progressive_resizing.resize_inputs(X, y, scale_factor, mode='resize', resize_targets=False)[source]

Resize inputs and optionally outputs by cropping or interpolating.

Parameters

X (torch.Tensor) – Input tensor of shape (N, C, H, W). Resizing will be done along dimensions H and W using the constant factor scale_factor.
y (torch.Tensor) – If resize_targets is True, output tensor of shape (N, C, H, W) that will also be resized.
scale_factor (float) – Scaling coefficient for the height and width of the input/output tensor. 1.0 keeps the original size.
mode (str) – Type of scaling to perform. Value must be one of ‘crop’ or ‘resize’. ‘crop’ performs a random crop, whereas ‘resize’ performs a bilinear interpolation. Default: ‘resize’.
resize_targets (bool) – Resize the targets, y, as well. Default: False.

Returns

X_sized (torch.Tensor) – Resized input tensor of shape (N, C, H * scale_factor, W * scale_factor).
y_sized (torch.Tensor) – If resize_targets is True, resized output tensor of shape (N, C, H * scale_factor, W * scale_factor). Otherwise, the original y.

Return type

Tuple[torch.Tensor, torch.Tensor]

RandAugment

Algorithm

class composer.algorithms.randaugment.RandAugment(severity=9, depth=2, augmentation_set='all')[source]

Object that does RandAugment (Cubuk et al. (2019), RandAugment: Practical automated data augmentation with a reduced search space). Can be passed as a transform to torchvision.transforms.Compose().

Parameters

severity (int) – Severity of augmentation operators (between 1 and 10). M in the original paper. Default = 9.
depth (int) – Depth of augmentation chain. N in the original paper. Default = 2.
augmentation_set (str) – One of [“augmentations_all”, “augmentations_corruption_safe”, “augmentations_original”]. Set of augmentations to use. “augmentations_corruption_safe” excludes transforms that are part of the ImageNet-C/CIFAR10-C test sets. “augmentations_original” uses all augmentations, but some of the implementations are identical to the original github repo, which appears to contain implementation specificities for the augmentations “color”, “contrast”, “sharpness”, and “brightness”.
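The meaning of depth and severity can be sketched with toy operations on a list of pixel values. The operations brighten, darken, and invert are hypothetical stand-ins for the real augmentation set, and randaugment_sketch is not Composer's implementation.

```python
import random


def brighten(pixels, severity):
    return [min(1.0, p + 0.03 * severity) for p in pixels]


def darken(pixels, severity):
    return [max(0.0, p - 0.03 * severity) for p in pixels]


def invert(pixels, severity):
    return [1.0 - p for p in pixels]  # this toy op ignores severity


def randaugment_sketch(pixels, severity=9, depth=2,
                       augmentation_set=(brighten, darken, invert), rng=None):
    """Apply `depth` randomly chosen operations in sequence, all at `severity`."""
    if rng is None:
        rng = random.Random()
    for _ in range(depth):
        operation = rng.choice(augmentation_set)
        pixels = operation(pixels, severity)
    return pixels


augmented = randaugment_sketch([0.2, 0.8], rng=random.Random(0))
```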

class composer.algorithms.randaugment.RandAugmentHparams(severity: int = 9, depth: int = 2, augmentation_set: str = 'all')[source]

Parameters

severity (int) –
depth (int) –
augmentation_set (str) –

Return type

None

Standalone

composer.algorithms.randaugment.randaugment(img=None, severity=9, depth=2, augmentation_set=[<function autocontrast>, <function equalize>, <function posterize>, <function rotate>, <function solarize>, <function shear_x>, <function shear_y>, <function translate_x>, <function translate_y>, <function color>, <function contrast>, <function brightness>, <function sharpness>])[source]

Perform augmentations.

Parameters

img (Optional[PIL.Image.Image]) –
severity (int) –
depth (int) –
augmentation_set (List) –

Return type

PIL.Image.Image

Sequence Length Warmup

Algorithm

Standalone

Sharpness-Aware Minimization

Algorithm

class composer.algorithms.sam.SAM(rho=0.05, epsilon=1e-12, interval=1)[source]

Applies SAM by wrapping existing optimizers with the SAMOptimizer.

class composer.algorithms.sam.SAMHparams(rho: 'float' = 0.05, epsilon: 'float' = 1e-12, interval: 'int' = 1)[source]

Parameters

rho (float) –
epsilon (float) –
interval (int) –

Return type

None

Scaling the Learning Rate Schedule

Algorithm

class composer.algorithms.scale_schedule.ScaleSchedule(ratio, method='epoch')[source]

Scale the learning rate schedule

Parameters

ratio (float) – Ratio of full training schedule
method (str, optional) – Either step or epoch. Default: ‘epoch’.
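As a rough illustration of what scaling a schedule by ratio can mean, the sketch below shrinks the epoch count and the milestones of a step-style schedule. Both helper names are hypothetical, and rounding down is an assumption.

```python
def scale_max_epochs(max_epochs, ratio):
    """Shorten (or lengthen) training to `ratio` of its original length."""
    return int(max_epochs * ratio)


def scale_milestones(milestones, ratio):
    """Move each learning rate milestone to the same relative point in training."""
    return [int(milestone * ratio) for milestone in milestones]


# Halving a 90-epoch schedule with learning rate drops at epochs 30, 60 and 80.
print(scale_max_epochs(90, 0.5))            # 45
print(scale_milestones([30, 60, 80], 0.5))  # [15, 30, 40]
```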

class composer.algorithms.scale_schedule.ScaleScheduleHparams(ratio: float, method: str = 'epoch')[source]

Parameters

ratio (float) –
method (str) –

Return type

None

Standalone

composer.algorithms.scale_schedule.scale_scheduler(scheduler, ssr, orig_max_epochs=None)[source]

Parameters

scheduler (torch.optim.lr_scheduler._LRScheduler) –
ssr (float) –
orig_max_epochs (Optional[int]) –

Selective Backpropagation

Algorithm

class composer.algorithms.selective_backprop.SelectiveBackprop(start, end, keep, scale_factor, interrupt)[source]

Selectively backprop on a subset of each batch.

Selective Backprop (SB) prunes minibatches according to the difficulty of the individual training examples, and only computes weight gradients over the pruned subset, reducing iteration time and speeding up training. The fraction of the minibatch that is kept for gradient computation is specified by the argument 0 <= keep <= 1.

See Accelerating Deep Learning by Focusing on the Biggest Losers (https://arxiv.org/abs/1910.00762).

To speed up SB’s selection forward pass, the argument scale_factor can be used to downsample input image tensors. The full-sized inputs will still be used for the weight gradient computation.

To preserve convergence, SB can be interrupted with vanilla minibatch gradient steps every interrupt steps. When interrupt=0, SB will be used at every step during the SB interval. When interrupt=2, SB will alternate with vanilla minibatch steps.

Parameters

start (float) – SB interval start as fraction of training duration
end (float) – SB interval end as fraction of training duration
keep (float) – fraction of minibatch to select and keep for gradient computation
scale_factor (float) – scale for downsampling input for selection forward pass
interrupt (int) – interrupt SB with a vanilla minibatch step every ‘interrupt’ batches
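The interaction of start, end, interrupt, and keep can be sketched as two small helpers. Both names are hypothetical. The deterministic top-k selection is a simplification chosen for clarity and is not claimed to match Composer's selection rule; the interrupt logic is one reading of the description above.

```python
def use_selective_backprop(progress, batch_idx, start, end, interrupt):
    """Decide whether this step prunes the batch (illustrative reading)."""
    in_interval = start <= progress < end
    # Every `interrupt`-th step is a vanilla step; interrupt=0 disables this.
    is_vanilla_step = interrupt > 0 and (batch_idx + 1) % interrupt == 0
    return in_interval and not is_vanilla_step


def select_examples(losses, keep):
    """Return the indices of the highest-loss examples, keeping a fraction `keep`."""
    n_keep = max(1, int(keep * len(losses)))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:n_keep])


print(select_examples([0.1, 2.0, 0.5, 1.5], keep=0.5))  # [1, 3]
```

Only the selected examples would then be used for the weight gradient computation.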

class composer.algorithms.selective_backprop.SelectiveBackpropHparams(start: 'float', end: 'float', keep: 'float', scale_factor: 'float', interrupt: 'int')[source]

Parameters

start (float) –
end (float) –
keep (float) –
scale_factor (float) –
interrupt (int) –

Return type

None

Squeeze-and-Excitation

Algorithm

class composer.algorithms.squeeze_excite.SqueezeExcite(latent_channels=64, min_channels=128)[source]

Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.

Parameters

latent_channels – The dimensionality of the hidden layer within the added MLP.
min_channels – An SE block is added after a Conv2d module conv only if min(conv.in_channels, conv.out_channels) >= min_channels. For models that reduce spatial size and increase channel count deeper in the network, this parameter can be used to only add SE blocks deeper in the network. This may be desirable because SE blocks add less overhead when their inputs have smaller spatial size.
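The min_channels rule can be checked without building a model. The helper should_add_se and the list of layer shapes are hypothetical; only the condition itself comes from the description above.

```python
def should_add_se(in_channels, out_channels, min_channels=128):
    """Apply the rule min(conv.in_channels, conv.out_channels) >= min_channels."""
    return min(in_channels, out_channels) >= min_channels


# (in_channels, out_channels) of the Conv2d layers in a made-up network.
conv_shapes = [(3, 64), (64, 128), (128, 256), (256, 512)]
eligible = [shape for shape in conv_shapes if should_add_se(*shape)]
print(eligible)  # [(128, 256), (256, 512)]
```

Only the deeper, wider layers qualify, which is the behavior the parameter is meant to allow.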

class composer.algorithms.squeeze_excite.SqueezeExciteHparams(latent_channels: 'float' = 64, min_channels: 'int' = 128)[source]

Parameters

latent_channels (float) –
min_channels (int) –

Return type

None

Standalone

class composer.algorithms.squeeze_excite.SqueezeExcite2d(num_features, latent_channels=0.125)[source]

Squeeze-and-Excitation block

class composer.algorithms.squeeze_excite.SqueezeExciteConv2d(*args, latent_channels=0.125, conv=None, **kwargs)[source]

Helper class used to add a Squeeze-and-Excitation block after a Conv2d.

Parameters

conv (torch.nn.Conv2d) –

composer.algorithms.squeeze_excite.apply_se(model, latent_channels, min_channels)[source]

Adds Squeeze-and-Excitation (SE) blocks (https://arxiv.org/abs/1709.01507) after the Conv2d layers of a neural network.

Parameters

model (torch.nn.modules.module.Module) – A module containing one or more torch.nn.Conv2d modules.
latent_channels (float) – The dimensionality of the hidden layer within the added MLP.
min_channels (int) – An SE block is added after a Conv2d module conv only if min(conv.in_channels, conv.out_channels) >= min_channels. For models that reduce spatial size and increase channel count deeper in the network, this parameter can be used to only add SE blocks deeper in the network. This may be desirable because SE blocks add less overhead when their inputs have smaller spatial size.

Stochastic Depth

Algorithm

class composer.algorithms.stochastic_depth.StochasticDepth(stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', drop_warmup=0.0, use_same_gpu_seed=True)[source]

Algorithm to replace a specified block with a stochastic version of the block.

The stochastic block will randomly drop either samples or the layer itself depending on the stochastic method specified. The layer-wise version follows the original paper (https://arxiv.org/abs/1603.09382). The sample-wise version follows the implementation used for EfficientNet in the Tensorflow/TPU repo (https://github.com/tensorflow/tpu).

Parameters

stochastic_method – The version of stochastic depth to use. “block” randomly drops blocks during training. “sample” randomly drops samples within a block during training.
target_layer_name (str) – Which block to replace with a stochastic block equivalent. The name must be registered in STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, must be one of [‘ResNetBottleneck’].
drop_rate (float) – The base probability of dropping a layer or a sample. Must be between 0.0 and 1.0.
drop_distribution (str) – How drop_rate is distributed across layers. Value must be either ‘uniform’ or ‘linear’. ‘uniform’ assigns the same drop_rate across all layers. ‘linear’ linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.
drop_warmup (float) – Fraction of training epochs over which to linearly increase the drop probability to drop_rate. Must be between 0.0 and 1.0.
use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set false to have each GPU drop a different set of layers. Only used with “block” stochastic method.
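The difference between the two drop_distribution values can be sketched as a list of per-layer drop rates. The helper layer_drop_rates is hypothetical, and the exact interpolation (first layer at 0, last layer at drop_rate) is an assumption based on the parameter description.

```python
def layer_drop_rates(n_layers, drop_rate=0.2, drop_distribution="linear"):
    """Per-layer drop probabilities, ordered from first to last layer."""
    if drop_distribution == "uniform":
        return [drop_rate] * n_layers
    if drop_distribution == "linear":
        if n_layers == 1:
            return [drop_rate]
        # Increase linearly with depth, from 0 up to drop_rate.
        return [drop_rate * i / (n_layers - 1) for i in range(n_layers)]
    raise ValueError("drop_distribution must be 'uniform' or 'linear'")


print(layer_drop_rates(5, 0.2, "uniform"))
print(layer_drop_rates(5, 0.2, "linear"))
```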

class composer.algorithms.stochastic_depth.StochasticDepthHparams(stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', use_same_gpu_seed=True, drop_warmup=0.0)[source]

Hyperparameters for the Stochastic Depth algorithm

Parameters

stochastic_method (str) –
target_layer_name (str) –
drop_rate (float) –
drop_distribution (str) –
use_same_gpu_seed (bool) –
drop_warmup (float) –

Return type

None

Standalone

class composer.algorithms.stochastic_depth.StochasticBottleneck(drop_rate, module_id, module_count, use_same_gpu_seed, use_same_depth_across_gpus, rand_generator, **kwargs)[source]

Stochastic ResNet Bottleneck layer. This layer has a probability of skipping the transformation section of the layer and scales the transformation section output by (1 - drop probability) during inference.

Parameters

drop_rate (float) – Probability of dropping the layer. Must be between 0.0 and 1.0.
module_id (int) – The placement of the layer within a network e.g. 0 for the first layer in the network.
module_count (int) – The total number of layers of this type in the network.
use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set to false to have each GPU drop a different set of layers.
use_same_depth_across_gpus (bool) – Set to true to have the same number of layers dropped across GPUs. Set to true when drop_distribution is ‘uniform’ and set to false for ‘linear’.
rand_generator (torch._C.Generator) –

composer.algorithms.stochastic_depth.apply_stochastic_depth(model, stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', use_same_gpu_seed=True)[source]

Applies Stochastic Depth algorithm to the specified model.

The algorithm replaces the specified target layer with a stochastic version of the layer. The stochastic layer will randomly drop either samples or the layer itself depending on the stochastic method specified. The layer-wise version follows the original paper (https://arxiv.org/abs/1603.09382). The sample-wise version follows the implementation used for EfficientNet in the Tensorflow/TPU repo (https://github.com/tensorflow/tpu).

Parameters

stochastic_method (str) – The version of stochastic depth to use. “block” randomly drops blocks during training. “sample” randomly drops samples within a block during training.
target_layer_name (str) – Block to replace with a stochastic block equivalent. The name must be registered in STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, must be one of [‘ResNetBottleneck’].
drop_rate (float) – The base probability of dropping a layer or sample. Must be between 0.0 and 1.0.
drop_distribution (str) – How drop_rate is distributed across layers. Value must be one of ‘uniform’ or ‘linear’. ‘uniform’ assigns the same drop_rate across all layers. ‘linear’ linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.
use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set to false to have each GPU drop a different set of layers. Only used with “block” stochastic method.
model (torch.nn.modules.module.Module) –

Return type

None

Stochastic Weight Averaging

Algorithm

class composer.algorithms.swa.SWA(swa_start=0.8, anneal_epochs=10, swa_lr=None)[source]

Apply Stochastic Weight Averaging

Stochastic Weight Averaging (SWA) averages model weights sampled towards the end of training. This can lead to better generalization than conventional training.

See Averaging Weights Leads to Wider Optima and Better Generalization (https://arxiv.org/abs/1803.05407).

Parameters

swa_start (float) – fraction of training completed before stochastic weight averaging is applied
anneal_epochs (int) – number of epochs over which to anneal the learning rate
swa_lr (float) – the final learning rate to anneal towards
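The averaging itself can be sketched with plain lists of weights. The helper swa_average is hypothetical; it assumes one weight snapshot per epoch and an equal-weight running average over the snapshots taken after swa_start, and it leaves out the learning rate annealing.

```python
def swa_average(weight_snapshots, swa_start=0.8):
    """Average the snapshots taken after `swa_start` of training (illustrative)."""
    first = int(swa_start * len(weight_snapshots))
    averaged, count = None, 0
    for weights in weight_snapshots[first:]:
        count += 1
        if averaged is None:
            averaged = list(weights)
        else:
            # Running mean: avoids storing every snapshot at once.
            averaged = [a + (w - a) / count for a, w in zip(averaged, weights)]
    return averaged


snapshots = [[0.0, 0.0], [10.0, 10.0], [1.0, 2.0], [3.0, 4.0]]
print(swa_average(snapshots, swa_start=0.5))  # [2.0, 3.0]
```

Snapshots from before the averaging window, such as the first two above, do not affect the result.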

class composer.algorithms.swa.SWAHparams(swa_start: 'float' = 0.8, anneal_epochs: 'int' = 10, swa_lr: 'Optional[float]' = None)[source]

Parameters

swa_start (float) –
anneal_epochs (int) –
swa_lr (Optional[float]) –

Return type

None