⏮️ Selective Backprop#
Tags: Vision, NLP, Decreased Accuracy, Increased GPU Throughput, Method, Curriculum, Speedup
TL;DR#
Selective Backprop prioritizes examples with high loss at each iteration, skipping examples with low loss. This speeds up training with limited impact on generalization.
Attribution#
Accelerating Deep Learning by Focusing on the Biggest Losers by Angela H. Jiang, Daniel L. K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary C. Lipton, and Padmanabhan Pillai.
Applicable Settings#
Selective Backprop is broadly applicable across problems and modalities. It's one implementation among a class of methods that train first on the easy examples and focus on the hard examples later in training.
Hyperparameters#
start - The fraction of training epochs elapsed at which to start pruning examples. For example, start=0.5 with 100 total training epochs would have Selective Backprop begin at epoch 50.
end - The fraction of training epochs elapsed at which to stop pruning examples. This must be larger than the value for start.
keep - The fraction of examples in each batch that should be kept.
interrupt - To mitigate potential negative impacts on model performance, examples are not pruned on a subset of batches within the interval that Selective Backprop is active. The interrupt parameter specifies the number of batches between these "vanilla", unpruned batches (see the sketch after this list).
scale_factor - The pruning process requires an additional forward pass in order to realize any speedup. Depending on the situation, this forward pass may be performed on a downsampled version of the input. The scale_factor parameter controls the amount of downsampling.
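As a rough illustration of how start, end, and interrupt interact, here is a hedged sketch of one way the per-batch pruning decision could be gated. The function name and the exact alternation rule are assumptions for illustration, not Composer's internal logic:

```python
# Hypothetical helper showing when pruning would be active; not Composer's
# actual implementation.
def should_prune(train_frac: float, batch_idx: int,
                 start: float = 0.5, end: float = 0.9, interrupt: int = 2) -> bool:
    """Decide whether to prune this batch.

    train_frac: fraction of total training elapsed (e.g., epoch / max_epochs).
    batch_idx: index of the batch within training.
    """
    in_interval = start <= train_frac < end           # the pruning window
    # Every `interrupt` batches, run an unpruned "vanilla" batch.
    is_vanilla = interrupt > 0 and batch_idx % interrupt == 0
    return in_interval and not is_vanilla
```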
Example Effects#
Depending on the precise hyperparameters chosen, we see decreases in training time of around 10% without any degradation in performance. Larger speedups are possible but run into the speed-accuracy tradeoffs described below.
Implementation Details#
The goal of Selective Backprop is to reduce the number of examples the model sees to only those that still have high loss. This lets the model learn on fewer examples, speeding up forward and back propagation with limited impact on final model quality. To determine the per-example loss and which examples to skip, an additional, initial forward pass must be performed. These loss values are then used to weight a random sample of examples to use for training. For some data types, including images, it's possible to use a lower resolution version of the input for this additional forward pass. This minimizes the extra computation while maintaining a good estimate of which examples are difficult for the model.
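To make these mechanics concrete, here is a minimal PyTorch sketch of the selection step. It assumes image inputs and a loss function that accepts reduction='none'; sampling in proportion to per-example loss is one simple weighting scheme, and Composer's implementation may weight examples differently:

```python
import torch
import torch.nn.functional as F

def select_batch(X, y, model, loss_fun, keep=0.5, scale_factor=0.5):
    """Pick the subset of (X, y) to train on, weighted by per-example loss."""
    with torch.no_grad():
        # Extra forward pass, optionally on a spatially downsampled copy of
        # the input (for images) to keep the cost of this pass low.
        X_sel = F.interpolate(X, scale_factor=scale_factor) if scale_factor < 1 else X
        losses = loss_fun(model(X_sel), y, reduction='none')  # per-example losses

    # Sample without replacement, favoring high-loss (hard) examples.
    probs = losses / losses.sum()
    n_keep = max(1, int(keep * X.shape[0]))
    idx = torch.multinomial(probs, n_keep, replacement=False)
    return X[idx], y[idx]  # forward/backward then runs only on this subset
```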
Suggested Hyperparameters#
start: Default: 0.5. The default is a good value for most use cases. It lets the model train normally for most of the run while still providing a large boost in time-to-train.
end: Default: 0.9. The default is a good value for most use cases. It leaves a small amount of training at the end for fine-tuning on all examples in the dataset.
keep: Default: 0.5. We found a value of 0.5 to represent a good tradeoff that greatly improves speed at limited cost. This is likely the hyperparameter most worth tuning.
interrupt: Default: 2. We found that including some unpruned batches is worth the tradeoff in speed, though a value of 0 is worth considering.
scale_factor: Default: 0.5. If you are using a data type and model that can tolerate processing a downsampled input, this is definitely worthwhile. The default value of 0.5 yields good results with much less computation.
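For reference, here is a sketch of enabling Selective Backprop with these defaults through the Composer Trainer; the model, dataloader, and training duration below are placeholders:

```python
from composer import Trainer
from composer.algorithms import SelectiveBackprop

trainer = Trainer(
    model=model,                        # your ComposerModel
    train_dataloader=train_dataloader,  # your training DataLoader
    max_duration='90ep',                # placeholder duration
    algorithms=[SelectiveBackprop(start=0.5, end=0.9, keep=0.5,
                                  scale_factor=0.5, interrupt=2)],
)
trainer.fit()
```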
Considerations#
Selective Backprop trades accuracy for speed: the more data you eliminate, and the longer the period of training over which you eliminate it, the larger the potential impact on model performance. The default values we provide have worked well for us and strike a good balance.
Composability#
This method should be performed before data augmentation so that eliminated examples do not need to be augmented.
Detailed Results#
We have explored Selective Backprop primarily on image recognition tasks such as ImageNet and CIFAR-10. For both of these, we see large improvements in training time with little degradation in accuracy. The table below shows some examples using the default hyperparameters from above. For CIFAR-10, ResNet-56 was trained on 1x NVIDIA 3080 GPU for 160 epochs. For ImageNet, ResNet-50 was trained on 8x NVIDIA 3090 GPUs for 90 epochs.
| Dataset | Run | Validation Acc. | Time to Train |
|---|---|---|---|
| ImageNet | Baseline | 76.46% | 5h 43m 8s |
| ImageNet | +Selective Backprop | 76.46% | 5h 22m 14s |
| CIFAR-10 | Baseline | 93.16% | 35m 33s |
| CIFAR-10 | +Selective Backprop | 93.32% | 32m 36s |
Code#
- class composer.algorithms.selective_backprop.SelectiveBackprop(start=0.5, end=0.9, keep=0.5, scale_factor=0.5, interrupt=2)[source]
Selectively backpropagate gradients from a subset of each batch (Jiang et al., 2019).
Selective Backprop (SB) prunes minibatches according to the difficulty of the individual training examples, and only computes weight gradients over the pruned subset, reducing iteration time and speeding up training. The fraction of the minibatch that is kept for gradient computation is specified by the argument keep, where 0 <= keep <= 1.
To speed up SB's selection forward pass, the argument scale_factor can be used to spatially downsample input image tensors. The full-sized inputs will still be used for the weight gradient computation.
To preserve convergence, SB can be interrupted with vanilla minibatch gradient steps every interrupt steps. When interrupt=0, SB will be used at every step during the SB interval. When interrupt=2, SB will alternate with vanilla minibatch steps.
- Parameters
start – SB interval start, as a fraction of training duration
end – SB interval end, as a fraction of training duration
keep – fraction of the minibatch to select and keep for gradient computation
scale_factor – scale for downsampling the input for the selection forward pass
interrupt – interrupt SB with a vanilla minibatch step every interrupt batches
- apply(event, state, logger=None)[source]
Apply selective backprop to the current batch.
- match(event, state)[source]
Matches Event.INIT and Event.AFTER_DATALOADER.
Uses Event.INIT to get the loss function before the model is wrapped.
Uses Event.AFTER_DATALOADER to apply selective backprop if the time is between self.start and self.end.
- composer.algorithms.selective_backprop.select_using_loss(X, y, model, loss_fun, keep, scale_factor=1)[source]
Selectively backpropagate gradients from a subset of each batch (Jiang et al., 2019).
Selective Backprop (SB) prunes minibatches according to the difficulty of the individual training examples and only computes weight gradients over the selected subset. This reduces iteration time and speeds up training. The fraction of the minibatch that is kept for gradient computation is specified by the argument keep, where 0 <= keep <= 1.
To speed up SB's selection forward pass, the argument scale_factor can be used to spatially downsample input tensors. The full-sized inputs will still be used for the weight gradient computation.
- Parameters
X – Input tensor to prune
y – Target tensor to prune
model – Model with which to predict outputs
loss_fun – Loss function of the form loss(outputs, targets, reduction='none'). The function must take the keyword argument reduction='none' to ensure that per-sample losses are returned.
keep – Fraction of examples in the batch to keep
scale_factor – Multiplier between 0 and 1 for the spatial size. Downsampling requires the input tensor to be at least 3D.
- Returns
(torch.Tensor, torch.Tensor) – The pruned batch of inputs and targets
- Raises
ValueError – If scale_factor > 1
TypeError – If loss_fun has the wrong signature or is not callable
Note: This function runs an extra forward pass through the model on the batch of data. If you are using a non-default precision, ensure that this forward pass runs in your desired precision. For example:
with torch.cuda.amp.autocast(True):
    X_new, y_new = select_using_loss(X, y, model, loss_fun, keep, scale_factor)
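Beyond the precision note above, a hedged sketch of calling select_using_loss inside a hand-written training loop might look like the following; the dataloader, model, optimizer, and loss_fun are placeholders you supply:

```python
from composer.algorithms.selective_backprop import select_using_loss

for X, y in train_dataloader:   # placeholder dataloader
    # Extra forward pass picks the high-loss subset of the batch.
    X_new, y_new = select_using_loss(X, y, model, loss_fun,
                                     keep=0.5, scale_factor=0.5)
    # Standard step, but only over the pruned subset.
    optimizer.zero_grad()
    loss = loss_fun(model(X_new), y_new, reduction='none').mean()
    loss.backward()
    optimizer.step()
```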