Layer Freezing
Tags: Vision, Decreased Accuracy, Increased GPU Throughput, Method, Backprop, Speedup
TL;DR
Layer Freezing gradually makes early modules not trainable (“freezing” them), saving the cost of backpropagating to and updating frozen modules.
Attribution
Freezing layers is an old and common practice, but our precise freezing scheme most closely resembles:
FreezeOut: Accelerate Training by Progressively Freezing Layers, by Brock et al. Posted to arXiv in 2017.
and the Freeze Training method of:
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, by Raghu et al. Presented at NIPS in 2017.
Hyperparameters
freeze_start: The fraction of epochs to run before freezing begins
freeze_level: The fraction of the modules in the network to freeze by the end of training
Applicable Settings
Layer freezing is in principle applicable to any model with many layers, but the MosaicML implementation currently only supports vision models.
Example Effects
We’ve observed that layer freezing can increase throughput by ~5% for ResNet-50 on ImageNet, but decreases accuracy by 0.5-1%. This is not an especially good speed vs accuracy tradeoff. Existing papers have generally also not found effective tradeoffs on large-scale problems.
For ResNet-56 on CIFAR-100, we have observed an accuracy lift from 75.82% to 76.22% with a similar ~5% speed increase. However, these results used specific hyperparameters without replicates, and so should be interpreted with caution.
Implementation Details
At the end of each epoch after freeze_start, the algorithm traverses the ownership tree of torch.nn.Module objects within the model in depth-first order to obtain a list of all modules. Note that this ordering may differ from the order in which modules are actually used in the forward pass.
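To illustrate why the traversal order can differ from the execution order, consider a toy model (a hypothetical example; PyTorch's named_modules performs this kind of depth-first walk over the ownership tree):

```python
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Attribute definition order determines traversal order,
        # not the order in which modules are called in forward().
        self.head = nn.Linear(8, 2)
        self.body = nn.Linear(8, 8)

    def forward(self, x):
        # body runs first in the forward pass, head second
        return self.head(self.body(x))

model = ToyModel()
# Depth-first traversal follows definition order: head before body,
# even though forward() applies body before head.
names = [name for name, _ in model.named_modules() if name]
print(names)  # ['head', 'body']
```

A freezing scheme based on this traversal therefore freezes modules in definition order, which for most models written top-to-bottom matches the forward order, but is not guaranteed to.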
Given this list of modules, the algorithm computes how many modules to freeze. This number increases linearly over time such that no modules are frozen at freeze_start and a fraction equal to freeze_level is frozen at the end of training.
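The linear schedule described above can be sketched as follows (a hypothetical helper written for illustration, not the library's actual function):

```python
def fraction_to_freeze(current_epoch: int, max_epochs: int,
                       freeze_start: float, freeze_level: float) -> float:
    """Fraction of modules to freeze at the end of current_epoch.

    Interpolates linearly from 0 at freeze_start * max_epochs
    up to freeze_level at max_epochs.
    """
    start_epoch = freeze_start * max_epochs
    if current_epoch < start_epoch:
        return 0.0  # still in the warmup period; nothing is frozen
    progress = (current_epoch - start_epoch) / (max_epochs - start_epoch)
    return freeze_level * progress

# With freeze_start=0.5, freeze_level=1.0, and 100 epochs:
print(fraction_to_freeze(50, 100, 0.5, 1.0))   # 0.0 (freezing just begins)
print(fraction_to_freeze(75, 100, 0.5, 1.0))   # 0.5 (half the modules frozen)
print(fraction_to_freeze(100, 100, 0.5, 1.0))  # 1.0 (all modules frozen)
```

The number of modules to freeze is then this fraction multiplied by the total module count, counted from the start of the traversal order.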
Modules are frozen by removing their parameters from the optimizer's param_groups. However, their associated state_dict entries are not removed.
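A minimal sketch of this freezing mechanism (hypothetical code for illustration; the actual Composer implementation differs in its details):

```python
import torch
import torch.nn as nn

# A toy two-layer model and an optimizer over all of its parameters.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def freeze_module(module: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Stop updating a module by removing its parameters from the
    optimizer's param_groups. The module itself is untouched, so its
    weights still appear in the model's state_dict."""
    frozen_ids = {id(p) for p in module.parameters()}
    for group in optimizer.param_groups:
        group['params'] = [p for p in group['params']
                           if id(p) not in frozen_ids]

freeze_module(model[0], optimizer)

# Only the second Linear layer's weight and bias remain trainable.
remaining = sum(len(g['params']) for g in optimizer.param_groups)
print(remaining)  # 2
# The frozen layer's weights are still present in checkpoints.
print('0.weight' in model.state_dict())  # True
```

Because the parameters leave the param_groups entirely, the optimizer also stops maintaining momentum or other per-parameter state updates for them, which is where the throughput savings beyond skipped gradient computation come from.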
Suggested Hyperparameters
freeze_start should be at least 0.1 to allow the network a warmup period.
Considerations
We have yet to observe a significant improvement in the tradeoff between speed and accuracy using this method. However, there may remain other tasks for which the technique works well. Moreover, freezing layers can be useful for understanding a network.
Composability
Layer freezing is a relaxed version of early stopping that stops training the model gradually, rather than all at once. It can therefore be understood as a form of regularization. Combining multiple regularization methods often yields diminishing improvements to accuracy.
Code
- class composer.algorithms.layer_freezing.LayerFreezing(freeze_start: float = 0.5, freeze_level: float = 1.0)
Progressively freeze the layers of the network during training, starting with the earlier layers.
Freezing starts after the fraction of epochs specified by freeze_start have run. The fraction of layers frozen increases linearly until it reaches freeze_level at the final epoch. This freezing schedule is most similar to FreezeOut and Freeze Training.
Runs on Event.EPOCH_END.
- Parameters
freeze_start – the fraction of epochs to run before freezing begins
freeze_level – the maximum fraction of layers to freeze
- composer.algorithms.layer_freezing.freeze_layers(model: torch.nn.modules.module.Module, optimizers: Union[torch.optim.optimizer.Optimizer, Tuple[torch.optim.optimizer.Optimizer, ...]], current_epoch: int, max_epochs: int, freeze_start: float, freeze_level: float, logger: Optional[Logger] = None) → torch.nn.modules.module.Module
Progressively freeze the layers of the network during training, starting with the earlier layers.
- Parameters
model – an instance of the model being trained
optimizers – the optimizers used during training
current_epoch – integer specifying the current epoch
max_epochs – the max number of epochs training will run for
freeze_start – the fraction of epochs to run before freezing begins
freeze_level – the maximum fraction of layers to freeze