Stochastic Depth (Sample-Wise)
Tags: Method, Networks with Residual Connections, Regularization, Increased Accuracy
TL;DR
Sample-wise stochastic depth is a regularization technique for networks with residual connections that probabilistically drops samples after the transformation function in each residual block. This means that different samples go through different combinations of blocks.
Attribution
EfficientNet model in the TPU GitHub repository from Google
EfficientNet model in the gen-efficientnet-pytorch GitHub repository by Ross Wightman
Hyperparameters
stochastic_method - Specifies the version of the stochastic depth method to use. stochastic_method=sample applies stochastic dropping to samples; stochastic_method=block applies block-wise stochastic depth, which we address in a separate method card.
target_layer_name - The reference name for the module that will be replaced with a functionally equivalent sample-wise stochastic block. For example, target_layer_name=ResNetBottleneck will replace modules in the model named Bottleneck.
drop_rate - The probability of dropping a sample within a residual block.
drop_distribution - How the drop_rate is distributed across the model’s blocks. The two possible values are uniform and linear. uniform assigns a single drop_rate across all blocks; linear linearly increases the drop rate with block depth, starting from 0 at the first block and ending with drop_rate at the last block.
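To make the two drop_distribution options concrete, the short sketch below (a hypothetical helper, not part of the library) computes the per-block drop rates that each option assigns:

# Hypothetical helper illustrating how drop_rate is spread across blocks.
def block_drop_rates(num_blocks: int, drop_rate: float, drop_distribution: str):
    if drop_distribution == 'uniform':
        # Every block is assigned the same drop rate.
        return [drop_rate] * num_blocks
    # 'linear': 0 at the first block, rising to drop_rate at the last block.
    denom = max(num_blocks - 1, 1)
    return [drop_rate * i / denom for i in range(num_blocks)]

print(block_drop_rates(4, 0.1, 'uniform'))  # [0.1, 0.1, 0.1, 0.1]
print(block_drop_rates(4, 0.1, 'linear'))   # [0.0, 0.0333..., 0.0666..., 0.1]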
Applicable Settings
Sample-wise stochastic depth requires models to have residual blocks since the method relies on skip connections to allow samples to skip blocks of the network.
Example Effects
For both ResNet-50 and ResNet-101 on ImageNet, we measure a +0.4% absolute accuracy improvement when using drop_rate=0.1 and drop_distribution=linear. Training wall-clock time is approximately 5% longer when using sample-wise stochastic depth.
Implementation Details
When training, samples are dropped after the transformation function in a residual block by multiplying the batch by a binary vector. The binary vector is generated by sampling from independent Bernoulli distributions, each keeping a sample with probability (1 - drop_rate). After the samples are dropped, the skip connection is added as usual. During inference, no samples are dropped; instead, the batch of samples is scaled by (1 - drop_rate) to compensate for the drop frequency during training.
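The sketch below is a minimal PyTorch illustration of this behavior; the module and its names are hypothetical, not Composer’s actual implementation, which instead swaps a stochastic replacement in for the targeted block.

import torch
import torch.nn as nn

class SampleWiseStochasticBlock(nn.Module):
    """Hypothetical residual block wrapper for sample-wise stochastic depth."""

    def __init__(self, transform: nn.Module, drop_rate: float):
        super().__init__()
        self.transform = transform  # the block's transformation function F(x)
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.transform(x)
        if self.training:
            # Keep each sample independently with probability (1 - drop_rate);
            # the binary vector broadcasts across channel and spatial dims.
            keep = torch.bernoulli(
                torch.full((x.shape[0], 1, 1, 1), 1.0 - self.drop_rate,
                           device=x.device, dtype=out.dtype))
            out = out * keep
        else:
            # No samples are dropped at inference; scale by (1 - drop_rate)
            # to compensate for the drop frequency used during training.
            out = out * (1.0 - self.drop_rate)
        return x + out  # skip connection added as usual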
Suggested Hyperparameters
We observed that drop_rate=0.1 and drop_distribution=linear yielded the maximum accuracy improvements on both ResNet-50 and ResNet-101.
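These suggested values map directly onto the StochasticDepth constructor documented in the Code section below; a minimal sketch:

from composer.algorithms import StochasticDepth

# Suggested settings: sample-wise dropping with a linearly increasing
# drop rate that peaks at 0.1 in the final targeted block.
algorithm = StochasticDepth(
    stochastic_method='sample',
    target_layer_name='ResNetBottleneck',
    drop_rate=0.1,
    drop_distribution='linear',
)

The resulting object can then be passed to the trainer’s list of algorithms.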
Considerations
Because sample-wise stochastic depth randomly drops samples within each residual block, a shallow model may exhibit instability because some samples receive too little transformation. When using a shallow model, it is best to use a small drop rate or to avoid sample-wise stochastic depth entirely.
In addition, training may be unstable at smaller batch sizes, since random chance can cause a significant proportion of a small batch to be dropped within a block even at low drop rates.
Composability
Combining several regularization methods may have diminishing returns, and can even degrade accuracy. This may hold true when combining sample-wise stochastic depth with other regularization methods.
Code
- class composer.algorithms.stochastic_depth.StochasticDepth(stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', drop_warmup=0.0, use_same_gpu_seed=True)[source]
Algorithm to replace a specified block with a stochastic version of the block.
The stochastic block will randomly drop either samples or the layer itself, depending on the stochastic method specified. The layer-wise version follows the original paper (https://arxiv.org/abs/1603.09382). The sample-wise version follows the implementation used for EfficientNet in the TensorFlow/TPU repo (https://github.com/tensorflow/tpu).
- Parameters
stochastic_method – The version of stochastic depth to use. “block” randomly drops blocks during training. “sample” randomly drops samples within a block during training.
target_layer_name (str) – Which block to replace with a stochastic block equivalent. The name must be registered in the STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, must be one of [‘ResNetBottleneck’].
drop_rate (float) – The base probability of dropping a layer or a sample. Must be between 0.0 and 1.0.
drop_distribution (str) – How drop_rate is distributed across layers. Value must be either ‘uniform’ or ‘linear’. ‘uniform’ assigns the same drop_rate across all layers. ‘linear’ linearly increases the drop rate across layer depth starting with 0 drop rate and ending with drop_rate.
drop_warmup (float) – Percentage of training epochs over which to linearly increase the drop probability to drop_rate. Must be between 0.0 and 1.0.
use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set to false to have each GPU drop a different set of layers. Only used with the “block” stochastic method.
- apply(event, state, logger)[source]
Applies the algorithm to make an in-place change to the State. Can optionally return an exit code to be stored in a Trace.
- Parameters
event (Event) – The current event.
state (State) – The current state.
logger (Logger) – A logger to use for logging algorithm-specific metrics.
- Returns
int or None – exit code that is stored in Trace and made accessible for debugging.
- Return type
Optional[int]
- match(event, state)[source]
Apply on Event.INIT, and on Event.BATCH_START if drop_warmup > 0.0.
- Parameters
event (composer.core.event.Event) – The current event.
state (composer.core.state.State) – The current state.
- Return type
bool
- composer.algorithms.stochastic_depth.apply_stochastic_depth(model, stochastic_method, target_layer_name, drop_rate=0.2, drop_distribution='linear', use_same_gpu_seed=True)[source]
Applies Stochastic Depth algorithm to the specified model.
The algorithm replaces the specified target layer with a stochastic version of the layer. The stochastic layer will randomly drop either samples or the layer itself, depending on the stochastic method specified. The layer-wise version follows the original paper (https://arxiv.org/abs/1603.09382). The sample-wise version follows the implementation used for EfficientNet in the TensorFlow/TPU repo (https://github.com/tensorflow/tpu).
- Parameters
model (torch.nn.modules.module.Module) – The model containing the blocks to replace.
stochastic_method (str) – The version of stochastic depth to use. “block” randomly drops blocks during training. “sample” randomly drops samples within a block during training.
target_layer_name (str) – Block to replace with a stochastic block equivalent. The name must be registered in the STOCHASTIC_LAYER_MAPPING dictionary with the target layer class and the stochastic layer class. Currently, must be one of [‘ResNetBottleneck’].
drop_rate (float) – The base probability of dropping a layer or sample. Must be between 0.0 and 1.0.
drop_distribution (str) – How drop_rate is distributed across layers. Value must be one of ‘uniform’ or ‘linear’. ‘uniform’ assigns the same drop_rate across all layers. ‘linear’ linearly increases the drop rate across layer depth, starting with 0 drop rate and ending with drop_rate.
use_same_gpu_seed (bool) – Set to true to have the same layers dropped across GPUs when using multi-GPU training. Set to false to have each GPU drop a different set of layers. Only used with the “block” stochastic method.
- Return type
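As a usage sketch for the functional API, the example below applies sample-wise stochastic depth to a torchvision ResNet-50; it assumes that the registered ‘ResNetBottleneck’ mapping targets the model’s Bottleneck modules, as described in the Hyperparameters section above.

import torchvision.models as models
from composer.algorithms.stochastic_depth import apply_stochastic_depth

model = models.resnet50()

# Replace each Bottleneck block with a sample-wise stochastic equivalent
# before handing the model to a training loop.
apply_stochastic_depth(
    model,
    stochastic_method='sample',
    target_layer_name='ResNetBottleneck',
    drop_rate=0.1,
    drop_distribution='linear',
)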