Label Smoothing
Tags: Vision, NLP, Classification, Increased Accuracy, Method, Regularization
TL;DR
Label smoothing modifies the target distribution for a task by interpolating between the original targets and another distribution, usually one with higher entropy (such as the uniform distribution). This typically reduces a model’s confidence in its outputs and serves as a form of regularization.
Attribution
The technique was originally introduced in Rethinking the Inception Architecture for Computer Vision by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathan Shlens, and Zbigniew Wojna. Released on arXiv in 2015.
The technique was further evaluated in When Does Label Smoothing Help? by Rafael Müller, Simon Kornblith, and Geoffrey Hinton. Published in NeurIPS 2019.
Applicable Settings
Label smoothing is applicable to any problem where targets are a categorical distribution. This includes classification with softmax cross-entropy and segmentation with a Dice loss.
Hyperparameters
alpha
- A value between 0.0 and 1.0 that specifies the strength of the interpolation toward a uniform distribution. For example, a value of 0.1 specifies that the target values should be multiplied by 0.9 and added to a uniform distribution multiplied by 0.1.
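The interpolation can be sketched numerically as follows (a minimal NumPy example; the function name `smooth` is our own, not part of any library):

```python
import numpy as np

def smooth(one_hot: np.ndarray, alpha: float) -> np.ndarray:
    """Interpolate a one-hot target with a uniform distribution over classes."""
    num_classes = one_hot.shape[-1]
    uniform = np.full_like(one_hot, 1.0 / num_classes)
    return (1 - alpha) * one_hot + alpha * uniform

target = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot label for class 2
smoothed = smooth(target, 0.1)           # ≈ [0.025, 0.025, 0.925, 0.025]
```

Note that the result still sums to 1, so it remains a valid categorical distribution.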
Example Effects
Label smoothing is intended to act as regularization, and so possible effects are changes (ideally improvement) in generalization performance. We find this to be the case on all of our image classification benchmarks, which see improved accuracy under label smoothing.
Implementation Details
Label smoothing replaces the one-hot encoded label with a combination of the true label and the uniform distribution. Care must be taken in ensuring the cost function used can accept the full categorical distribution instead of the index of the target value.
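One way to satisfy this requirement is to compute the cross-entropy manually from `log_softmax`, so the loss accepts a full distribution per example rather than a class index. The sketch below is our own illustration, not Composer's implementation:

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits: torch.Tensor, dense_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy that accepts a full categorical distribution per example."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(dense_targets * log_probs).sum(dim=-1).mean()

logits = torch.randn(8, 10)           # batch of 8 examples, 10 classes
labels = torch.randint(0, 10, (8,))   # integer class indices
one_hot = F.one_hot(labels, num_classes=10).float()
alpha = 0.1
smoothed = (1 - alpha) * one_hot + alpha / 10  # interpolate toward uniform
loss = soft_cross_entropy(logits, smoothed)
```

With `alpha = 0` the targets are the original one-hot labels and this reduces to standard cross-entropy.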
Suggested Hyperparameters
alpha = 0.1 is a standard starting point for label smoothing.
Considerations
In some cases, a small amount of extra memory and compute is needed to convert labels to dense targets. This can produce a (typically negligible) increase in iteration time.
Composability
This method interacts with other methods (such as MixUp) that alter the targets. While such methods may still compose well with label smoothing in terms of improved accuracy, it is important to ensure that the implementations of these methods compose.
Code
- class composer.algorithms.label_smoothing.LabelSmoothing(alpha: float)[source]
Shrinks targets towards a uniform distribution to counteract label noise as in Szegedy et al.
This is computed by
(1 - alpha) * targets + alpha * smoothed_targets
where smoothed_targets is a uniform distribution over the classes. Introduced in Rethinking the Inception Architecture for Computer Vision.
- Parameters
alpha – Strength of the label smoothing, in [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored).
- apply(event: composer.core.event.Event, state: composer.core.state.State, logger: composer.core.logging.logger.Logger) Optional[int] [source]
Applies the algorithm to make an in-place change to the State.
Can optionally return an exit code to be stored in a Trace.
- Parameters
event (Event) – The current event.
state (State) – The current state.
logger (Logger) – A logger to use for logging algorithm-specific metrics.
- Returns
int or None – exit code that is stored in Trace and made accessible for debugging.
- match(event: composer.core.event.Event, state: composer.core.state.State) bool [source]
Determines whether this algorithm should run, given the current Event and State.
Examples:
To only run on a specific event:
>>> return event == Event.BEFORE_LOSS
Switching based on state attributes:
>>> return state.epoch > 30 and state.world_size == 1
See State for accessible attributes.
- Parameters
event (Event) – The current event.
state (State) – The current state.
- Returns
bool – True if this algorithm should run now.
- composer.algorithms.label_smoothing.smooth_labels(logits: torch.Tensor, targets: torch.Tensor, alpha: float)[source]
Shrinks targets towards a uniform distribution to counteract label noise as in Szegedy et al.
This is computed by
(1 - alpha) * targets + alpha * smoothed_targets
where smoothed_targets is a uniform distribution over the classes.
- Parameters
logits – Output of the model. Tensor of shape (N, C, d1, …, dn) for N examples and C classes, and d1, …, dn extra dimensions.
targets – Tensor of shape (N) containing integers 0 <= i <= C-1 specifying the target labels for each example.
alpha – Strength of the label smoothing, in [0, 1]. alpha=0 means no label smoothing, and alpha=1 means maximal smoothing (targets are ignored).
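A plausible implementation consistent with the documented signature might look like the following. This is our own sketch, not the library source: the logits supply the class dimension, and integer targets are one-hot encoded before interpolating toward uniform.

```python
import torch
import torch.nn.functional as F

def smooth_labels(logits: torch.Tensor, targets: torch.Tensor, alpha: float) -> torch.Tensor:
    """Sketch: one-hot encode integer targets using the class count from
    logits, then interpolate toward a uniform distribution with weight alpha."""
    num_classes = logits.shape[1]
    one_hot = F.one_hot(targets, num_classes).float()
    # F.one_hot puts the class dim last; move it to position 1 to match
    # the (N, C, d1, ..., dn) layout when targets have extra dimensions.
    if one_hot.dim() > 2:
        one_hot = one_hot.permute(0, -1, *range(1, one_hot.dim() - 1))
    return (1 - alpha) * one_hot + alpha / num_classes

logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
smoothed = smooth_labels(logits, targets, 0.1)
```

Each row of the result sums to 1, and with alpha=0 the function returns the plain one-hot targets.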