Distributed Training (DDP)
Distributed data parallel, or DDP, is an essential technique for training large models. It is among the most common ways of parallelizing machine learning models. However, as are many things, parallelization in Python is not well supported. To get around these limitations, we offer a composer
launch script to set up multiprocessing and synchronization. This script is directly analogous to the torch.distributed.run
, torchrun
, and deepspeed
scripts that users may be familiar with.
The composer
script is highly recommended for setting up DDP with the Mosaic trainer, and will be necessary to access advanced functionality in the future.
The composer
script wraps your typical setup script. The wrapped script is responsible for setting up a single process’s trainer.
Single-Node Example
In most cases, you will likely be training on multiple CPUs or GPUs on a single machine. Typically in this case, the only argument you need to worry about is the -n
/--nproc
argument to control how many processes should be spawned. When training with GPUs, this value should usually just be the same as the number of GPUs on your system. If using the Mosaic trainer, the trainer will handle the rest of the work.
For example, to train ResNet-50 efficiently with DDP on an 8-GPU system, you can use the following command:
>>> composer -n 8 examples/run_mosaic_trainer.py -f composer/yamls/models/resnet50.yaml
Detailed Usage
A summary of command line arguments can be obtained by typing composer -h
into a command line.
Utility for launching distributed machine learning jobs.
usage: composer [-h] -n NPROC [--world_size WORLD_SIZE]
[--base_rank BASE_RANK] [--master_addr MASTER_ADDR]
[--master_port MASTER_PORT] [--run_directory RUN_DIRECTORY]
[-m]
training_script ...
Positional Arguments
- training_script
The path to the training script used to initialize a single training process. Should be followed by any command-line arguments the script should be launched with.
- training_script_args
Any arguments for the training script, given in the expected order.
Named Arguments
- -n, --nproc
The number of processes to launch on this node. Required.
- --world_size
The total number of processes to launch on all nodes. Set to -1 to default to nproc (single-node operation). Defaults to -1.
Default: -1
- --base_rank
The rank of the lowest ranked process to launch on this node. Specifying a base_rank B and an nproc N will spawn processes with global ranks [B, B+1, … B+N-1]. Defaults to 0 (single-node operation).
Default: 0
- --master_addr
The FQDN of the node running the rank 0 worker. Defaults to 127.0.0.1 (single-node operation).
Default: “127.0.0.1”
- --master_port
The port on the master hosting the C10d TCP store. If you are running multiple trainers on a single node, this generally needs to be unique for each one. Defaults to a random free port.
- --run_directory
- Directory to store run artifcats.
Defaults to runs/{datetime.datetime.now().isoformat()}/
- -m, --module_mode
If set, run the training script as a module instead of as a script.
Default: False