Label Smoothing

Tags: Vision, NLP, Classification, Increased Accuracy, Method, Regularization

TL;DR

Label smoothing modifies the target distribution for a task by interpolating between the target distribution and another distribution that usually has higher entropy. This typically reduces a model’s confidence in its outputs and serves as a form of regularization.

Attribution

The technique was originally introduced in Rethinking the Inception Architecture for Computer Vision by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathan Shlens, and Zbigniew Wojna. Released on arXiv in 2015.

The technique was further evaluated in When Does Label Smoothing Help? by Rafael Müller, Simon Kornblith, and Geoffrey Hinton. Published in NeurIPS 2019.

Applicable Settings

Label smoothing is applicable to any problem where targets are a categorical distribution. This includes classification with softmax cross-entropy and segmentation with a Dice loss.

Hyperparameters

  • alpha - A value between 0.0 and 1.0 that specifies the strength of the interpolation between the target distribution and a uniform distribution. For example, a value of 0.1 specifies that the target values should be multiplied by 0.9 and added to a uniform distribution multiplied by 0.1 (see the worked example below).
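
As a worked example (a minimal sketch, not taken from the library), alpha = 0.1 applied to a 3-class one-hot target looks like this:

    import torch

    alpha = 0.1
    num_classes = 3
    one_hot = torch.tensor([0.0, 1.0, 0.0])                   # target class is index 1
    uniform = torch.full((num_classes,), 1.0 / num_classes)   # [1/3, 1/3, 1/3]
    smoothed = (1 - alpha) * one_hot + alpha * uniform
    # smoothed is approximately [0.0333, 0.9333, 0.0333]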

Example Effects

Label smoothing is intended to act as a regularizer, so its expected effect is a change (ideally an improvement) in generalization performance. We find this to be the case on all of our image classification benchmarks, which show improved accuracy under label smoothing.

Implementation Details

Label smoothing replaces the one-hot encoded label with a combination of the true label and the uniform distribution. Care must be taken to ensure that the loss function can accept the full categorical distribution rather than only the index of the target class.
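
The sketch below illustrates the transform and a loss that consumes dense targets (the helper name and shapes are illustrative assumptions, not the library's API):

    import torch
    import torch.nn.functional as F

    def smooth_one_hot(targets: torch.Tensor, num_classes: int, alpha: float) -> torch.Tensor:
        # Illustrative helper: convert integer labels to smoothed dense targets.
        one_hot = F.one_hot(targets, num_classes).float()
        return (1 - alpha) * one_hot + alpha / num_classes

    logits = torch.randn(8, 10)               # model outputs for 8 examples, 10 classes
    targets = torch.randint(0, 10, (8,))      # integer class labels
    dense_targets = smooth_one_hot(targets, num_classes=10, alpha=0.1)

    # The loss must consume a full categorical distribution, not a class index.
    loss = torch.sum(-dense_targets * F.log_softmax(logits, dim=1), dim=1).mean()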

Suggested Hyperparameters

  • alpha = 0.1 is a standard starting point for label smoothing.

Considerations

In some cases, a small amount of extra memory and compute is needed to convert labels to dense targets. This can produce a (typically negligible) increase in iteration time.

Composability

This method interacts with other methods (such as MixUp) that alter the targets. While such methods may still compose well with label smoothing in terms of improved accuracy, it is important to ensure that the implementations of these methods compose.
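
As a minimal sketch of one possible ordering (an illustrative assumption, not the library's implementation), MixUp can first produce interpolated dense targets, and label smoothing is then applied to those dense targets:

    import torch
    import torch.nn.functional as F

    x = torch.randn(8, 3, 32, 32)             # a batch of images (assumed shapes)
    y = torch.randint(0, 10, (8,))            # integer class labels for 10 classes
    num_classes, smoothing = 10, 0.1

    # MixUp: interpolate examples and their dense targets with lambda ~ Beta(0.2, 0.2).
    lam = torch.distributions.Beta(0.2, 0.2).sample()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_dense = F.one_hot(y, num_classes).float()
    y_mixed = lam * y_dense + (1 - lam) * y_dense[perm]

    # Label smoothing then operates on the already-dense mixed targets.
    y_smoothed = (1 - smoothing) * y_mixed + smoothing / num_classes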


Code

class composer.algorithms.label_smoothing.LabelSmoothing(alpha: float)[source]

Shrinks targets towards a uniform distribution to counteract label noise as in Szegedy et al.

This is computed by (1 - alpha) * targets + alpha * smoothed_targets, where smoothed_targets is a uniform distribution over the classes.

Introduced in Rethinking the Inception Architecture for Computer Vision.

Parameters

alpha – Strength of the label smoothing, in [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored).
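
A minimal usage sketch with the Composer Trainer (model and train_dataloader are assumed to be defined elsewhere):

    from composer import Trainer
    from composer.algorithms import LabelSmoothing

    trainer = Trainer(
        model=model,                        # a ComposerModel, assumed defined elsewhere
        train_dataloader=train_dataloader,  # a torch DataLoader, assumed defined elsewhere
        max_duration='10ep',
        algorithms=[LabelSmoothing(alpha=0.1)],
    )
    trainer.fit()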

apply(event: composer.core.event.Event, state: composer.core.state.State, logger: composer.core.logging.logger.Logger) → Optional[int][source]

Applies the algorithm to make an in-place change to the State.

Can optionally return an exit code to be stored in a Trace.

Parameters
  • event (Event) – The current event.

  • state (State) – The current state.

  • logger (Logger) – A logger to use for logging algorithm-specific metrics.

Returns
  • ``int`` or ``None`` – exit code that is stored in Trace and made accessible for debugging.

match(event: composer.core.event.Event, state: composer.core.state.State) → bool[source]

Determines whether this algorithm should run, given the current Event and State.

Examples:

To only run on a specific event:

>>> return event == Event.BEFORE_LOSS

Switching based on state attributes:

>>> return state.epoch > 30 and state.world_size == 1

See State for accessible attributes.

Parameters
  • event (Event) – The current event.

  • state (State) – The current state.

Returns

bool – True if this algorithm should run now.

composer.algorithms.label_smoothing.smooth_labels(logits: torch.Tensor, targets: torch.Tensor, alpha: float)[source]

Shrinks targets towards a uniform distribution to counteract label noise as in Szegedy et al.

This is computed by (1 - alpha) * targets + alpha * smoothed_targets, where smoothed_targets is a uniform distribution over the classes.

Parameters
  • logits – Output of the model. Tensor of shape (N, C, d1, …, dn) for N examples and C classes, and d1, …, dn extra dimensions.

  • targets – Tensor of shape (N) containing integers 0 <= i <= C-1 specifying the target labels for each example.

  • alpha – Strength of the label smoothing, in [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored).
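
A minimal sketch of the functional form inside a training step, assuming smooth_labels returns the smoothed dense targets of shape (N, C) and that the loss accepts dense targets:

    import torch
    import torch.nn.functional as F
    from composer.algorithms.label_smoothing import smooth_labels

    logits = torch.randn(8, 10, requires_grad=True)   # stand-in for model outputs (N=8, C=10)
    targets = torch.randint(0, 10, (8,))              # integer class labels
    dense_targets = smooth_labels(logits, targets, alpha=0.1)
    loss = torch.sum(-dense_targets * F.log_softmax(logits, dim=1), dim=1).mean()
    loss.backward()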