Layer Freezing
Tags: Vision, Decreased Accuracy, Increased GPU Throughput, Method, Backprop, Speedup
TL;DR
Layer Freezing gradually makes early modules not trainable (“freezing” them), saving the cost of backpropagating to and updating frozen modules.
Attribution
Freezing layers is an old and common practice, but our precise freezing scheme most closely resembles:
FreezeOut: Accelerate Training by Progressively Freezing Layers, by Brock et al. Posted to arXiv in 2017.
and the Freeze Training method of:
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, by Raghu et al. Presented at NIPS in 2017.
Hyperparameters
freeze_start: The fraction of epochs to run before freezing begins
freeze_level: The fraction of the modules in the network to freeze by the end of training
Applicable Settings
Layer freezing is in principle applicable to any model with many layers, but the MosaicML implementation currently only supports vision models.
Example Effects
We’ve observed that layer freezing can increase throughput by ~5% for ResNet-50 on ImageNet, but decreases accuracy by 0.5-1%. This is not an especially good speed vs accuracy tradeoff. Existing papers have generally also not found effective tradeoffs on large-scale problems.
For ResNet-56 on CIFAR-100, we have observed an accuracy lift from 75.82% to 76.22% with a similar ~5% speed increase. However, these results used specific hyperparameters without replicates, and so should be interpreted with caution.
Implementation Details
At the end of each epoch after freeze_start, the algorithm traverses the ownership tree of torch.nn.Module objects within the model in depth-first order to obtain a list of all modules. Note that this ordering may differ from the order in which modules are actually used in the forward pass.
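To illustrate why the traversal order can differ from the execution order, consider a toy model (a hypothetical example; PyTorch's named_modules performs this kind of depth-first walk over the ownership tree):

```python
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Attribute definition order determines traversal order,
        # not the order in which modules are called in forward().
        self.head = nn.Linear(8, 2)
        self.body = nn.Linear(8, 8)

    def forward(self, x):
        # body runs first in the forward pass, head second
        return self.head(self.body(x))

model = ToyModel()
# Depth-first traversal follows definition order: head before body,
# even though forward() applies body before head.
names = [name for name, _ in model.named_modules() if name]
print(names)  # ['head', 'body']
```

A freezing scheme based on this traversal therefore freezes modules in definition order, which for most models written top-to-bottom matches the forward order, but is not guaranteed to.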
Given this list of modules, the algorithm computes how many modules to freeze. This number increases linearly over time such that no modules are frozen at freeze_start and a fraction equal to freeze_level is frozen at the end of training.
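The linear schedule described above can be sketched as follows (a hypothetical helper written for illustration, not the library's actual function):

```python
def fraction_to_freeze(current_epoch: int, max_epochs: int,
                       freeze_start: float, freeze_level: float) -> float:
    """Fraction of modules to freeze at the end of current_epoch.

    Interpolates linearly from 0 at freeze_start * max_epochs
    up to freeze_level at max_epochs.
    """
    start_epoch = freeze_start * max_epochs
    if current_epoch < start_epoch:
        return 0.0  # still in the warmup period; nothing is frozen
    progress = (current_epoch - start_epoch) / (max_epochs - start_epoch)
    return freeze_level * progress

# With freeze_start=0.5, freeze_level=1.0, and 100 epochs:
print(fraction_to_freeze(50, 100, 0.5, 1.0))   # 0.0 (freezing just begins)
print(fraction_to_freeze(75, 100, 0.5, 1.0))   # 0.5 (half the modules frozen)
print(fraction_to_freeze(100, 100, 0.5, 1.0))  # 1.0 (all modules frozen)
```

The number of modules to freeze is then this fraction multiplied by the total module count, counted from the start of the traversal order.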
Modules are frozen by removing their parameters from the optimizer's param_groups. However, their associated state_dict entries are not removed.
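A minimal sketch of this freezing mechanism (hypothetical code for illustration; the actual Composer implementation differs in its details):

```python
import torch
import torch.nn as nn

# A toy two-layer model and an optimizer over all of its parameters.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def freeze_module(module: nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Stop updating a module by removing its parameters from the
    optimizer's param_groups. The module itself is untouched, so its
    weights still appear in the model's state_dict."""
    frozen_ids = {id(p) for p in module.parameters()}
    for group in optimizer.param_groups:
        group['params'] = [p for p in group['params']
                           if id(p) not in frozen_ids]

freeze_module(model[0], optimizer)

# Only the second Linear layer's weight and bias remain trainable.
remaining = sum(len(g['params']) for g in optimizer.param_groups)
print(remaining)  # 2
# The frozen layer's weights are still present in checkpoints.
print('0.weight' in model.state_dict())  # True
```

Because the parameters leave the param_groups entirely, the optimizer also stops maintaining momentum or other per-parameter state updates for them, which is where the throughput savings beyond skipped gradient computation come from.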
Suggested Hyperparameters
freeze_start should be at least 0.1 to allow the network a warmup period.
Considerations
We have yet to observe a significant improvement in the tradeoff between speed and accuracy using this method. However, there may remain other tasks for which the technique works well. Moreover, freezing layers can be useful for understanding a network.
Composability
Layer freezing is a relaxed version of early stopping that stops training the model gradually, rather than all at once. It can therefore be understood as a form of regularization. Combining multiple regularization methods often yields diminishing improvements to accuracy.
Code
- class composer.algorithms.layer_freezing.LayerFreezing(freeze_start: float = 0.5, freeze_level: float = 1.0)
Progressively freeze the layers of the network during training, starting with the earlier layers.
Freezing starts after the fraction of epochs specified by freeze_start have run. The fraction of layers frozen increases linearly until it reaches freeze_level at the final epoch. This freezing schedule is most similar to FreezeOut and Freeze Training.
Runs on Event.EPOCH_END.
- Parameters
freeze_start – the fraction of epochs to run before freezing begins
freeze_level – the maximum fraction of layers to freeze
- composer.algorithms.layer_freezing.freeze_layers(model: torch.nn.modules.module.Module, optimizers: Union[torch.optim.optimizer.Optimizer, Tuple[torch.optim.optimizer.Optimizer, ...]], current_epoch: int, max_epochs: int, freeze_start: float, freeze_level: float, logger: Optional[Logger] = None) → torch.nn.modules.module.Module
Progressively freeze the layers of the network during training, starting with the earlier layers.
- Parameters
model – an instance of the model being trained
optimizers – the optimizers used during training
current_epoch – integer specifying the current epoch
max_epochs – the max number of epochs training will run for
freeze_start – the fraction of epochs to run before freezing begins
freeze_level – the maximum fraction of layers to freeze