composer.trainer.ddp#

Helpers for running distributed data parallel training.

Classes

DDPSyncStrategy

How and when DDP gradient synchronization should happen.

class composer.trainer.ddp.DDPSyncStrategy(value)[source]#

Bases: composer.utils.string_enum.StringEnum

How and when DDP gradient synchronization should happen.
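
As a small usage sketch: because the class is a StringEnum, a member can be referenced directly or (in typical StringEnum fashion) constructed from a string value. The string value shown below is an assumption based on the member name; verify it against the source for your Composer version.

from composer.trainer.ddp import DDPSyncStrategy

# Reference a member directly...
strategy = DDPSyncStrategy.SINGLE_AUTO_SYNC

# ...or construct it from its string value
# (assumed here to be the lowercase member name).
strategy = DDPSyncStrategy("single_auto_sync")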

SINGLE_AUTO_SYNC#

The default behavior for DDP. Gradients are synchronized as they are computed, but only for the final microbatch of a batch. This is the most efficient strategy, but it can lead to errors when find_unused_parameters is set, since different microbatches may use different sets of parameters, resulting in an incomplete sync.
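
The behavior can be pictured with plain PyTorch DDP primitives. The sketch below is illustrative only, not the Composer trainer's implementation; it assumes model is a torch.nn.parallel.DistributedDataParallel instance, microbatches is a list of (input, target) pairs forming one batch, and loss_fn is the loss function. Automatic gradient all-reduce is suppressed with no_sync() for every microbatch except the last.

import contextlib

def single_auto_sync_backward(model, microbatches, loss_fn):
    # Sync gradients only on the final microbatch's backward pass.
    for i, (x, y) in enumerate(microbatches):
        is_last = (i == len(microbatches) - 1)
        ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with ctx:
            loss_fn(model(x), y).backward()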

MULTI_AUTO_SYNC#

The default behavior for DDP when find_unused_parameters is set. Gradients are synchronized as they are computed for all microbatches. This ensures complete synchronization, but is less efficient than SINGLE_AUTO_SYNC. This efficiency gap is usually small, as long as either DDP syncs are a small portion of the trainer's overall runtime, or the number of microbatches per batch is relatively small.
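
Continuing the illustrative sketch above (same assumed model, microbatches, and loss_fn), this strategy simply lets DDP's automatic all-reduce fire on every microbatch's backward pass.

def multi_auto_sync_backward(model, microbatches, loss_fn):
    # Every backward() call triggers DDP's automatic gradient all-reduce.
    for x, y in microbatches:
        loss_fn(model(x), y).backward()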

FORCED_SYNC#

Gradients are manually synchronized only after all gradients have been computed for the final microbatch of a batch. Like MULTI_AUTO_SYNC, this strategy ensures complete gradient synchronization, but it tends to be slower than MULTI_AUTO_SYNC. This is because syncs can ordinarily happen in parallel with the loss.backward() computation, meaning they can be mostly complete by the time that function finishes. However, in certain circumstances syncs may take a very long time to complete; in those cases, especially when there are also many microbatches per batch, deferring everything to a single manual sync at the end may be optimal.
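
As a final illustrative sketch under the same assumptions, automatic syncing is suppressed for every microbatch and gradients are manually all-reduced once at the end. model.no_sync() and dist.all_reduce are standard PyTorch calls; the sum-then-average convention shown (matching DDP's default gradient averaging) is an assumption, not Composer's exact implementation.

import torch.distributed as dist

def forced_sync_backward(model, microbatches, loss_fn):
    # Suppress DDP's automatic gradient all-reduce for every microbatch...
    with model.no_sync():
        for x, y in microbatches:
            loss_fn(model(x), y).backward()
    # ...then synchronize gradients manually once, averaging across ranks.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad)
            p.grad.div_(world_size)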