MixUp

[Figure: mix_up.png]

Image from mixup: Beyond Empirical Risk Minimization by Zhang et al., 2018

Tags: Vision, Increased Accuracy, Increased GPU Usage, Method, Augmentation, Regularization

TL;DR

MixUp trains the network on convex combinations of examples and targets rather than individual examples and targets. Training in this fashion improves generalization performance.

Attribution

mixup: Beyond Empirical Risk Minimization by Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Published in ICLR 2018.

Hyperparameters

  • alpha - The parameter that controls the distribution of interpolation values sampled when performing MixUp. Our implementation samples these interpolation values from a symmetric Beta distribution, meaning that alpha serves as both parameters of the Beta distribution. The sketch below illustrates how alpha shapes the sampled values.
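
A minimal illustration (not part of the library API) of how alpha shapes the sampled interpolation values, using torch.distributions:

import torch
from torch.distributions import Beta

# Small alpha concentrates samples near 0 and 1 (each mixed example is
# dominated by one member of the pair); large alpha concentrates samples
# near 0.5 (roughly equal mixing).
for alpha in (0.2, 1.0, 10.0):
    dist = Beta(alpha, alpha)  # symmetric: alpha is both concentration parameters
    samples = dist.sample((100_000,))
    frac_mid = ((samples - 0.5).abs() < 0.1).float().mean().item()
    print(f"alpha={alpha}: mean={samples.mean().item():.2f}, frac near 0.5={frac_mid:.2f}")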

Example Effects

MixUp is intended to improve generalization performance, and we empirically find this to be the case in our image classification settings. The original paper also reports a reduction in memorization and improved adversarial robustness.

Implementation Details

Mixed samples are created from a batch (X, y) of (inputs, targets) together with a version (X', y') in which the ordering of examples has been shuffled. The examples can be mixed by sampling a value t (between 0.0 and 1.0) from the Beta distribution parameterized by alpha and training the network on the interpolation between (X, y) and (X', y') specified by t, i.e., on (t·X + (1 - t)·X', t·y + (1 - t)·y').

Note that the same t is used for each example in the batch. Using the shuffled version of a batch to generate mixed samples allows MixUp to be used without loading additional data.
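
As a rough sketch (a hypothetical standalone helper, not the library's implementation), assuming the targets are already dense one-hot vectors:

import torch

def naive_mixup(X, y_onehot, alpha=0.2):
    # Sample a single interpolation value for the whole batch.
    t = torch.distributions.Beta(alpha, alpha).sample()
    # Shuffle the batch to obtain (X', y') without loading additional data.
    perm = torch.randperm(X.size(0))
    X_mix = t * X + (1 - t) * X[perm]
    y_mix = t * y_onehot + (1 - t) * y_onehot[perm]
    return X_mix, y_mix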

Suggested Hyperparameters

  • alpha = 0.2 is a good default for training on ImageNet.

  • alpha = 1 works well for CIFAR-10.

Considerations

  • MixUp adds a little extra GPU compute and memory to create the mixed samples.

  • MixUp also requires a cost function that can accept dense target vectors, rather than the index of a corresponding one-hot vector, as is a common default (e.g., cross entropy with hard labels). A sketch of such a loss follows this list.
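
For concreteness, a minimal soft-target cross entropy (a hypothetical stand-in, not the loss the library itself provides):

import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, dense_targets):
    # Cross entropy against dense (possibly interpolated) target vectors of
    # shape (B, n_classes), rather than against class indices.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(dense_targets * log_probs).sum(dim=-1).mean()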

Composability

As a general rule, combining regularization-based methods yields sublinear improvements to accuracy. This holds true for MixUp.

MixUp interacts with other methods that alter the inputs (such as CutOut) or the targets (such as label smoothing). While such methods may still compose well with MixUp in terms of improved accuracy, it is important to ensure that their implementations compose correctly.
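
A hedged sketch of enabling MixUp alongside other methods through the trainer's algorithms list (the Trainer arguments shown here are assumptions, not a verbatim API):

from composer import Trainer
from composer.algorithms import MixUp

# Assumed names: model, train_dataloader, and max_duration are placeholders.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="90ep",
    algorithms=[MixUp(alpha=0.2)],  # other methods can be appended here;
                                    # verify that their implementations compose
)
trainer.fit()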


Code

class composer.algorithms.mixup.MixUp(alpha: float)[source]

MixUp trains the network on convex combinations of pairs of examples and targets rather than individual examples and targets.

This is done by taking a convex combination of a given batch X with a randomly permuted copy of X. The mixing coefficient is drawn from a Beta(alpha, alpha) distribution.

Training in this fashion reduces generalization error.

Parameters

alpha – the pseudocount for the Beta distribution used to sample interpolation parameters. As alpha grows, the two samples in each pair tend to be weighted more equally. As alpha approaches 0 from above, the combination approaches using only one element of the pair.

apply(event: composer.core.event.Event, state: composer.core.state.State, logger: composer.core.logging.logger.Logger) → None[source]

Applies MixUp augmentation to the State's input.

Parameters
  • event (Event) – the current event

  • state (State) – the current trainer state

  • logger (Logger) – the training logger

match(event: composer.core.event.Event, state: composer.core.state.State) → bool[source]

Runs on Event.INIT and Event.AFTER_DATALOADER

Parameters
  • event (Event) – The current event.

  • state (State) – The current state.

Returns

bool – True if this algorithm should run now.

composer.algorithms.mixup.mixup.gen_interpolation_lambda(alpha: float) → float[source]

Samples an interpolation parameter from a Beta(alpha, alpha) distribution.

composer.algorithms.mixup.mixup_batch(x: torch.Tensor, y: torch.Tensor, interpolation_lambda: float, n_classes: int, indices: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Create new samples using convex combinations of pairs of samples.

This is done by taking a convex combination of x with a randomly permuted copy of x. The interpolation parameter lambda should be chosen from a Beta(alpha, alpha) distribution for some parameter alpha > 0. Note that the same lambda is used for all examples within the batch.

Both the original and shuffled labels are returned. This is done because for many loss functions (such as cross entropy) the targets are given as indices, so interpolation must be handled separately.

Parameters
  • x – input tensor of shape (B, d1, d2, …, dn), where B is the batch size and d1-dn are feature dimensions.

  • y – target tensor of shape (B, f1, f2, …, fm), where B is the batch size and f1-fm are possible target dimensions.

  • interpolation_lambda – interpolation coefficient, typically sampled from a Beta(alpha, alpha) distribution.

  • n_classes – total number of classes.

  • indices – Permutation of the batch indices 1..B. Used for permuting without randomness.

Returns
  • x_mix – batch of inputs after mixup has been applied

  • y_mix – labels after mixup has been applied

  • perm – the permutation used

Example

from composer import functional as CF

for X, y in dataloader:
    l = CF.gen_interpolation_lambda(alpha=0.2)
    X, y, _ = CF.mixup_batch(X, y, l, nclasses)
    pred = model(X)
    loss = loss_fun(pred, y)  # loss_fun must accept dense labels (i.e. NOT indices)
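
If the loss function only accepts class indices, a common alternative (used in the reference code for the original paper) is to interpolate the loss itself, using the permutation that mixup_batch returns. A hedged sketch, reusing the names from the example above:

import torch.nn.functional as F

for X, y in dataloader:
    l = CF.gen_interpolation_lambda(alpha=0.2)
    X_mix, _, perm = CF.mixup_batch(X, y, l, nclasses)
    pred = model(X_mix)
    # Interpolate between the losses on the original and shuffled index labels.
    loss = l * F.cross_entropy(pred, y) + (1 - l) * F.cross_entropy(pred, y[perm])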