composer.trainer.ddp#

Helpers for running distributed data parallel training.

Functions

ddp_sync_context

A context manager for handling the DDPSyncStrategy.

prepare_ddp_module

Wraps the module in a torch.nn.parallel.DistributedDataParallel object if running distributed training.

Classes

DDPSyncStrategy

How and when DDP gradient synchronization should happen.

class composer.trainer.ddp.DDPSyncStrategy(value)[source]#

Bases: composer.utils.string_enum.StringEnum

How and when DDP gradient synchronization should happen.

SINGLE_AUTO_SYNC#

The default behavior for DDP. Gradients are synchronized as they are computed, for only the final microbatch of a batch. This is the most efficient strategy, but can lead to errors when find_unused_parameters is set, since it is possible that different microbatches use different sets of parameters, leading to an incomplete sync (see the sketch following these attribute descriptions).

MULTI_AUTO_SYNC#

The default behavior for DDP when find_unused_parameters is set. Gradients are synchronized as they are computed for all microbatches. This ensures complete synchronization, but is less efficient than SINGLE_AUTO_SYNC. This efficiency gap is usually small, as long as either DDP syncs are a small portion of the trainer's overall runtime, or the number of microbatches per batch is relatively small.

FORCED_SYNC#

Gradients are manually synchronized only after all gradients have been computed for the final microbatch of a batch. Like MULTI_AUTO_SYNC, this strategy ensures complete gradient synchronization, but it tends to be slower than MULTI_AUTO_SYNC. This is because syncs can ordinarily happen in parallel with the loss.backward() computation, meaning they can be mostly complete by the time that function finishes. However, in certain circumstances syncs may take a very long time to complete; if there are also a lot of microbatches per batch, this strategy may be optimal.
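
For illustration, the snippet below is a minimal PyTorch-only sketch of the pattern that SINGLE_AUTO_SYNC describes: gradient synchronization is suppressed with DistributedDataParallel.no_sync() for every microbatch except the last. This is not Composer's internal implementation; the model and microbatch losses are hypothetical placeholders.

import contextlib

def backward_microbatches(ddp_model, microbatch_losses):
    # ddp_model: a torch.nn.parallel.DistributedDataParallel instance (assumed)
    # microbatch_losses: per-microbatch scalar losses (placeholder)
    for i, loss in enumerate(microbatch_losses):
        is_final = i == len(microbatch_losses) - 1
        # no_sync() disables the gradient all-reduce inside the context, so the
        # sync happens only on the final microbatch's backward pass.
        context = contextlib.nullcontext() if is_final else ddp_model.no_sync()
        with context:
            loss.backward()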

composer.trainer.ddp.ddp_sync_context(state, is_final_microbatch, sync_strategy)[source]#

A context manager for handling the DDPSyncStrategy.

Parameters
  • state (State) – The state of the Trainer.

  • is_final_microbatch (bool) – Whether or not the context is being used during the final microbatch of the gradient accumulation steps.

  • sync_strategy (str | DDPSyncStrategy) – The DDP sync strategy to use. If a string is provided, the string must be one of the values in DDPSyncStrategy.
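
A hedged usage sketch is shown below. It assumes a Trainer State object and a list of per-microbatch losses are already available; both are placeholders, and the exact training-loop wiring inside Composer may differ.

from composer.trainer.ddp import DDPSyncStrategy, ddp_sync_context

def backward_with_sync_strategy(state, microbatch_losses):
    # state: the Trainer's State (assumed to already hold the DDP-wrapped model)
    # microbatch_losses: per-microbatch scalar losses (placeholder)
    for i, loss in enumerate(microbatch_losses):
        is_final_microbatch = i == len(microbatch_losses) - 1
        # The context manager applies the chosen DDPSyncStrategy around each
        # backward pass, syncing only when the strategy calls for it.
        with ddp_sync_context(state, is_final_microbatch, DDPSyncStrategy.SINGLE_AUTO_SYNC):
            loss.backward()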

composer.trainer.ddp.prepare_ddp_module(module, find_unused_parameters)[source]#

Wraps the module in a torch.nn.parallel.DistributedDataParallel object if running distributed training.

Parameters
  • module (Module) – The module to wrap.

  • find_unused_parameters (bool) – Whether or not to do a pass over the autograd graph to find parameters for which no gradients are expected. This is useful if there are some parameters in the model that are not being trained.
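
A minimal sketch of wrapping a model is shown below. It assumes the distributed process group has already been initialized (normally handled by the Trainer); the model itself is a placeholder.

import torch.nn as nn
from composer.trainer.ddp import prepare_ddp_module

model = nn.Linear(16, 4)  # placeholder model
# Returns a torch.nn.parallel.DistributedDataParallel wrapper when running
# distributed training, per the description above.
model = prepare_ddp_module(model, find_unused_parameters=False)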