checkpoint

Utilities for working with training checkpoints.

Functions

`download_checkpoint`	Download the checkpoint stored at `path`, potentially in `object_store`, to `node_checkpoint_folder`.
`load_checkpoint`	Load a checkpoint from a local file, URI, or cloud object store into `state`.
`save_checkpoint`	Checkpoint the training `state`.

composer.utils.checkpoint.download_checkpoint(path, node_checkpoint_folder, object_store, progress_bar)

Download the checkpoint stored at path, potentially in object_store, to node_checkpoint_folder.

Returns a tuple of (composer_states_filepath, extracted_checkpoint_folder, extracted_rank_n).

The composer_states_filepath is the path to the composer states, which can be passed into torch.load().
The extracted_checkpoint_folder is the path to the checkpoint folder, which can be passed into deepspeed.DeepSpeedEngine.load_checkpoint().
The extracted_rank_n is a boolean flag indicating whether a tarball was extracted on a global rank greater than 0.

composer.utils.checkpoint.glob_filter(exclude_globs)

Provides a function which deletes all subparts of a dictionary based on a list of paths.

composer.utils.checkpoint.load_checkpoint(path, state, object_store=None, load_weights_only=False, strict_model_weights=False, progress_bar=True, ignore_keys=None)

Load a checkpoint from a local file, URI, or cloud object store into state.

Parameters

path (str) –

The path format string to an existing checkpoint file.

It can be a path to a file on the local disk, a URL, or if object_store is set, the object name for a checkpoint in a cloud bucket.

When using DeepSpeed ZeRO, checkpoints are sharded by rank. Instead of hard-coding the rank in the path, use the following format variables:

Variable	Description
`{rank}`	The global rank, as returned by `get_global_rank()`.
`{local_rank}`	The local rank of the process, as returned by `get_local_rank()`.
`{node_rank}`	The node rank, as returned by `get_node_rank()`.

For example, suppose that checkpoints are stored in the following structure:

my_model/ep1-rank0.tar
my_model/ep1-rank1.tar
my_model/ep1-rank2.tar
...

Then, path should be set to my_model/ep1-rank{rank}.tar, and all ranks will load the correct state.
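The substitution described above can be sketched with plain string formatting. This is only an illustration: resolve_checkpoint_path is a hypothetical helper, not part of Composer, and Composer's own substitution may differ in detail.

```python
# Hypothetical helper for illustration only; Composer performs the
# substitution internally when it resolves `path` on each rank.
def resolve_checkpoint_path(path_format: str, rank: int, local_rank: int, node_rank: int) -> str:
    """Fill the rank-related format variables into a checkpoint path."""
    return path_format.format(rank=rank, local_rank=local_rank, node_rank=node_rank)


path_format = "my_model/ep1-rank{rank}.tar"

# Simulate three processes on a single node (node_rank 0).
resolved = [
    resolve_checkpoint_path(path_format, rank=r, local_rank=r, node_rank=0)
    for r in range(3)
]
for p in resolved:
    print(p)
```

Each process ends up with a different concrete path from the same format string, which is why the rank does not need to be hard-coded.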

state (State) – The State to load the checkpoint into.
object_store (Union[ObjectStore, LoggerDestination], optional) – If the path is in an object store (e.g. AWS S3 or Google Cloud Storage), an instance of ObjectStore or LoggerDestination which will be used to retrieve the checkpoint. Otherwise, if the checkpoint is a local filepath, set to None. (default: None)
load_weights_only (bool, optional) – Whether or not to only restore the model weights from the checkpoint without restoring the associated state. (default: False)
strict_model_weights (bool, optional) – Whether to require that the checkpointed weights exactly match the model weights. (default: False)
progress_bar (bool, optional) – Whether or not to show a progress bar when downloading checkpoints. Ignored if the checkpoint is a local file path. (default: True)
ignore_keys (List[str] | (Dict) -> None, optional) –
A list of paths into the state_dict of the checkpoint which, when provided, will be removed from the state_dict before the checkpoint is loaded. Each path is the sequence of keys used to index into state_dict, joined together with / as a separator (as PyTorch uses . in parameter names). If a prefix is provided, all children are also ignored (see Example 2). See composer.core.state for the structure of state_dict.

Example 1: ignore_keys = ["state/model/layer1.weights", "state/model/layer1.bias"] would ignore layer 1 weights and bias.

Example 2: ignore_keys = ["state/model/*"] would ignore the entire model, which would have the same effect as the previous example if there was only 1 layer.

Example 3: ignore_keys = ["state/model/layer*.weights"] would ignore all weights in the model.

Example 4: ignore_keys = ["state/rank_zero_seed", "rng"] would reset all randomness when loading the checkpoint.

If a callable, it should take one argument which is the state_dict. The callable is free to arbitrarily modify the state_dict before it is loaded.

(default: None)
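To make the path and glob semantics of ignore_keys concrete, here is a minimal sketch of glob-based key removal on a nested dictionary. It is an assumption-laden illustration, not Composer's implementation: iter_paths, remove_path, drop_ignored, and reset_randomness are hypothetical names, the sample state_dict is made up, and keys are assumed not to contain /.

```python
import fnmatch
from typing import Any, Dict, Iterator, List


def iter_paths(d: Dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield every '/'-joined key path in a nested dict, including inner nodes."""
    for key, value in d.items():
        path = f"{prefix}/{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from iter_paths(value, path)


def remove_path(d: Dict[str, Any], path: str) -> None:
    """Delete the entry at a '/'-joined path, if it still exists."""
    *parents, leaf = path.split("/")
    node: Any = d
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return  # an ancestor was already removed
    node.pop(leaf, None)


def drop_ignored(state_dict: Dict[str, Any], ignore_keys: List[str]) -> None:
    """Remove every path matching one of the glob patterns (in place)."""
    paths = list(iter_paths(state_dict))  # snapshot before mutating
    for pattern in ignore_keys:
        for path in paths:
            if fnmatch.fnmatchcase(path, pattern):
                remove_path(state_dict, path)


# Made-up state_dict shaped like the examples above.
state_dict = {
    "state": {
        "model": {"layer1.weights": 1, "layer1.bias": 2, "layer2.weights": 3},
        "rank_zero_seed": 17,
    },
    "rng": [{"python": 0}],
}
drop_ignored(state_dict, ["state/model/layer*.weights"])
print(state_dict["state"]["model"])


# The callable form receives the state_dict and may modify it freely.
def reset_randomness(sd: Dict[str, Any]) -> None:
    drop_ignored(sd, ["state/rank_zero_seed", "rng"])


reset_randomness(state_dict)
print(sorted(state_dict))
```

Removing a prefix such as rng drops everything beneath it, which mirrors the "all children are also ignored" rule described above.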

Returns

Optional[List[Dict[str, Any]]] – The RNG state dicts, indexed by global rank, if load_weights_only is False. Otherwise, None.

composer.utils.checkpoint.save_checkpoint(state, filename='ep{epoch}-ba{batch}-rank{rank}', *, weights_only=False)

Checkpoint the training state.

Parameters

state (State) – The training state.
logger (Logger) – The logger.

filename (str) –

A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}-rank{rank}')

The following format variables are available:

Variable	Description
`{run_name}`	The name of the training run. See `Logger.run_name`.
`{rank}`	The global rank, as returned by `get_global_rank()`.
`{local_rank}`	The local rank of the process, as returned by `get_local_rank()`.
`{world_size}`	The world size, as returned by `get_world_size()`.
`{local_world_size}`	The local world size, as returned by `get_local_world_size()`.
`{node_rank}`	The node rank, as returned by `get_node_rank()`.
`{epoch}`	The total epoch count, as returned by `epoch()`.
`{batch}`	The total batch count, as returned by `batch()`.
`{batch_in_epoch}`	The batch count in the current epoch, as returned by `batch_in_epoch()`.
`{sample}`	The total sample count, as returned by `sample()`.
`{sample_in_epoch}`	The sample count in the current epoch, as returned by `sample_in_epoch()`.
`{token}`	The total token count, as returned by `token()`.
`{token_in_epoch}`	The token count in the current epoch, as returned by `token_in_epoch()`.
`{total_wct}`	The total training duration in seconds, as returned by `total_wct()`.
`{epoch_wct}`	The epoch duration in seconds, as returned by `epoch_wct()`.
`{batch_wct}`	The batch duration in seconds, as returned by `batch_wct()`.

Note

By default, only the rank zero process will save a checkpoint file.
When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within the filename. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, .tar will be used.
To use compression (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bzip', or '.tar.lzma' (depending on the desired compression algorithm).
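As a rough illustration of how a file extension can select a compression algorithm, the sketch below maps the extensions named above to write modes of Python's standard tarfile module. The mapping itself is an assumption made for this example (in particular the modes chosen for '.tar.bzip' and '.tar.lzma'), tar_write_mode is a hypothetical helper, and weights.bin is a placeholder file; none of this is taken from Composer's source.

```python
import os
import tarfile
import tempfile

# Assumed mapping from file extension to tarfile write mode (illustrative only).
WRITE_MODES = {
    ".tar.gz": "w:gz",
    ".tgz": "w:gz",
    ".tar.bzip": "w:bz2",
    ".tar.lzma": "w:xz",
    ".tar": "w",  # uncompressed
}


def tar_write_mode(filename: str) -> str:
    """Pick a tarfile write mode from the checkpoint file extension."""
    for extension, mode in WRITE_MODES.items():
        if filename.endswith(extension):
            return mode
    raise ValueError(f"{filename!r} does not have a tarball extension")


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder payload standing in for checkpoint contents.
    payload = os.path.join(tmp, "weights.bin")
    with open(payload, "wb") as f:
        f.write(b"\x00" * 1024)

    archive = os.path.join(tmp, "ep1-ba42-rank0.tar.gz")
    with tarfile.open(archive, tar_write_mode(archive)) as tar:
        tar.add(payload, arcname="weights.bin")

    # "r:*" lets tarfile detect the compression when reading back.
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()

print(names)
```

The compression work happens synchronously inside tar.add and on close, so the calling code waits until the archive is fully written.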

Warning

Using compression will block the training loop while checkpoints are being compressed. As such, we recommend saving checkpoints without compression.

Consider the following scenario, where:

The default filename='ep{epoch}-ba{batch}-rank{rank}' is used.
The current epoch count is 1.
The current batch count is 42.

When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'ep1-ba42-rank0'. When DeepSpeed is being used, each rank (process) will save checkpoints to:

ep1-ba42-rank0.tar
ep1-ba42-rank1.tar
ep1-ba42-rank2.tar
...
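The scenario above can be reproduced with a short sketch. It assumes the format variables are filled with ordinary string formatting and that '.tar' is appended only when DeepSpeed is in use and no tarball extension is present; checkpoint_filename and its use_deepspeed flag are hypothetical and not part of Composer.

```python
# Tarball extensions mentioned in the note above.
TARBALL_EXTENSIONS = (".tar", ".tar.gz", ".tgz", ".tar.bzip", ".tar.lzma")


def checkpoint_filename(filename_format: str, *, epoch: int, batch: int, rank: int,
                        use_deepspeed: bool) -> str:
    """Format a checkpoint name; default to '.tar' for DeepSpeed (illustrative)."""
    name = filename_format.format(epoch=epoch, batch=batch, rank=rank)
    if use_deepspeed and not name.endswith(TARBALL_EXTENSIONS):
        name += ".tar"
    return name


default_format = "ep{epoch}-ba{batch}-rank{rank}"

# Without DeepSpeed, only rank zero writes a file.
single = checkpoint_filename(default_format, epoch=1, batch=42, rank=0, use_deepspeed=False)
print(single)

# With DeepSpeed, every rank writes its own tarball.
sharded = [
    checkpoint_filename(default_format, epoch=1, batch=42, rank=r, use_deepspeed=True)
    for r in range(3)
]
print(sharded)
```

Because '{rank}' is part of the format string, each rank gets a distinct file name and no two processes write to the same path.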

weights_only (bool, optional) –
If True, save only the model weights instead of the entire training state. (default: False)

Note

When using DeepSpeed, this parameter must be False. Weights-only checkpointing is not currently compatible with DeepSpeed.
Returns

List[pathlib.Path] – The list of checkpoint files saved, indexed by the rank of the process.

Note

When using DeepSpeed, each process (rank) saves its own checkpoint file. When doing multi-node training, the filepaths are valid only on each process’s node; Composer does not move checkpoint files between nodes.

Otherwise, when not using DeepSpeed, the list will contain only one filepath, since only the rank zero process saves checkpoints.