๐Ÿ‘จโ€๐Ÿ‘ฉโ€๐Ÿ‘งโ€๐Ÿ‘ฆ Distributed Training#

Composer supports distributed training on multiple devices, whether on multiple GPUs within a single node or across multiple nodes.

Data Parallelism#

Composer distributes work across devices via data parallelism only. We choose this approach to give the most flexibility to algorithms, which can modify the training loop in complex ways. Data parallelism greatly simplifies model building and memory management: every GPU performs the same work, so inspecting rank zero is sufficient to reason about memory, performance, and other properties.

Within Composer, we have two options for data-parallelism-only execution: PyTorch DDP and DeepSpeed ZeRO. We currently default to PyTorch DDP, though DeepSpeed ZeRO can provide better performance and lower memory utilization when configured correctly.

Usage#

To launch a multi-GPU training job, we provide the composer launcher:

# run training on 8 GPUs
$ composer -n 8 my_training_script.py

Under the hood, this script (source code here) sets the required torch.distributed environment variables, launches the processes, and then runs the script on each process.

Note

The batch_size passed to your dataloader should be the per-device minibatch size. We further split this into smaller microbatches with gradient accumulation.
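
As a minimal sketch of how these sizes relate (the variable names and values below are illustrative assumptions, not Trainer parameters), consider:

per_device_batch_size = 256  # the batch_size given to your dataloader
world_size = 8               # total number of GPUs across all nodes
grad_accum = 2               # microbatches per device per step (assumed value)

# Samples consumed per optimizer step across all devices:
global_batch_size = per_device_batch_size * world_size    # 2048
# Samples per forward/backward pass on each device:
microbatch_size = per_device_batch_size // grad_accum     # 128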

For additional configurations of our launcher script, run composer --help.

usage: composer [-h] -n NPROC [--world_size WORLD_SIZE]
                [--base_rank BASE_RANK] [--node_rank NODE_RANK]
                [--master_addr MASTER_ADDR] [--master_port MASTER_PORT]
                [--run_directory RUN_DIRECTORY] [-m] [-v]
                training_script ...

Positional Arguments#

training_script

The path to the training script used to initialize a single training process. Should be followed by any command-line arguments the script should be launched with.

training_script_args

Any arguments for the training script, given in the expected order.

Named Arguments#

-n, --nproc

The number of processes to launch on this node. Required.

--world_size

The total number of processes to launch on all nodes. Set to -1 to default to nproc (single-node operation). Defaults to -1.

Default: -1

--base_rank

The rank of the lowest ranked process to launch on this node. Specifying a base_rank B and an nproc N will spawn processes with global ranks [B, B+1, โ€ฆ B+N-1]. Defaults to 0 (single-node operation).

Default: 0

--node_rank

The rank of this node. Set to -1 to assume that all nodes have the same number of processes, and calculate accordingly. Defaults to -1.

Default: -1

--master_addr

The FQDN of the node running the rank 0 worker. Defaults to 127.0.0.1 (single-node operation).

Default: โ€œ127.0.0.1โ€

--master_port

The port on the master hosting the C10d TCP store. If you are running multiple trainers on a single node, this generally needs to be unique for each one. Defaults to a random free port.

--run_directory

Directory to store run artifacts. Defaults to runs/{timestamp}/

-m, --module_mode

If set, run the training script as a module instead of as a script.

Default: False

-v, --verbose

If set, print verbose messages.

Default: False
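
As an example of combining these flags, a two-node, 16-GPU job could be launched with something like the following sketch (hostname and port are placeholder assumptions; run one command on each node):

# On node 0, which hosts the rank 0 worker (spawns global ranks 0-7)
$ composer -n 8 --world_size 16 --base_rank 0 --master_addr node0.example.com --master_port 29500 my_training_script.py

# On node 1 (spawns global ranks 8-15)
$ composer -n 8 --world_size 16 --base_rank 8 --master_addr node0.example.com --master_port 29500 my_training_script.py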

Distributed Properties#

Developers may need to access the current rank or world size in a distributed setting. For example, a callback may only want to log something for rank zero. Use our composer.utils.dist module to retrieve this information. The methods are similar to those in torch.distributed, but they also return sensible defaults in a non-distributed setting.

from composer.utils import dist

dist.get_world_size()   # same as torch.distributed.get_world_size()
dist.get_local_rank()   # rank of this process within its node
dist.get_global_rank()  # same as torch.distributed.get_rank()

For all retrievable properties, see composer.utils.dist.
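
For instance, the rank-zero-only logging pattern mentioned above might look like the following sketch:

from composer.utils import dist

# Only the global rank zero process performs the logging side effect;
# all other ranks skip it.
if dist.get_global_rank() == 0:
    print(f'Training with {dist.get_world_size()} processes')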

Space-time Equivalence#

We consider an equivalence principle between distributed training and gradient accumulation. That is, batches can be parallelized either across space (e.g. devices) or across time (e.g. gradient accumulation). Furthermore, the two dimensions are interchangeable: more devices means less gradient accumulation, and vice versa. Our trainer strives to respect this equivalence and ensure identical behavior regardless of the combination of space and time parallelization used.
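
As a concrete illustration with assumed numbers, several space/time combinations produce the same global batch:

# Each (devices, accumulation steps) pair below yields the same 2048-sample
# global batch when the per-pass microbatch size is held at 128:
microbatch_size = 128
for n_devices, grad_accum in [(16, 1), (8, 2), (4, 4)]:
    assert n_devices * grad_accum * microbatch_size == 2048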

DeepSpeed#

Composer comes with DeepSpeed support, allowing you to leverage its full set of features, which make it easier to train large models across (1) any type of GPU and (2) multiple nodes. For more details on DeepSpeed, see their website.

We support optimizer and gradient sharding via DeepSpeed ZeRO stages 1 and 2, respectively. In the future, we'll support model sharding via ZeRO-3. These methods reduce model state memory by a factor of (1 / the number of data-parallel devices).

To enable DeepSpeed, simply pass in a config as specified in the DeepSpeed docs here.

# run_trainer.py

from composer import Trainer

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  eval_dataloader=eval_dataloader,
                  max_duration='160ep',
                  device='gpu',
                  deepspeed_config={
                      "train_batch_size": 2048,
                      "fp16": {"enabled": True},
                  })

Providing an empty dictionary for deepspeed_config is also valid. The DeepSpeed defaults will be used, and other fields (such as precision) will be inferred from the trainer.
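
For example, a minimal sketch reusing the model and dataloaders from the example above:

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  eval_dataloader=eval_dataloader,
                  max_duration='160ep',
                  device='gpu',
                  deepspeed_config={})  # empty dict: DeepSpeed defaults, precision inferred from the trainer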

Warning

The deepspeed_config must not conflict with any other parameters passed to the trainer.

Warning

Not all algorithms have been tested with DeepSpeed; please proceed with caution.