composer.callbacks.checkpoint_saver#

Callback to save checkpoints during training.

Functions

checkpoint_periodically

Helper function to create a checkpoint scheduler according to a specified interval.

Classes

CheckpointSaver

Callback to save checkpoints.

class composer.callbacks.checkpoint_saver.CheckpointSaver(folder='{run_name}/checkpoints', filename='ep{epoch}-ba{batch}-rank{rank}', artifact_name='{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}', latest_filename='latest-rank{rank}', latest_artifact_name='{run_name}/checkpoints/latest-rank{rank}', save_interval='1ep', *, overwrite=False, num_checkpoints_to_keep=-1, weights_only=False)[source]#

Bases: composer.core.callback.Callback

Callback to save checkpoints.

Note

If the folder argument is specified when constructing the Trainer, then the CheckpointSaver callback need not be constructed manually. However, for advanced checkpointing use cases (such as saving a weights-only checkpoint at one interval and the full training state at another interval), one or more instances of this CheckpointSaver callback can be passed in the callbacks argument of the Trainer, as shown in the examples below.

Example

>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(
...         folder='{run_name}/checkpoints',
...         filename="ep{epoch}-ba{batch}-rank{rank}",
...         latest_filename="latest-rank{rank}",
...         save_interval="1ep",
...         weights_only=False,
...     )
... ])
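
For the advanced use case described in the note above (a weights-only checkpoint at one interval and the full training state at another), two instances can be combined. This is a sketch, not a default configuration: the folder paths and intervals are illustrative, and distinct artifact_name values may also be needed so the two savers' uploaded artifacts do not collide.

>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(
...         folder='{run_name}/checkpoints/full',
...         save_interval='1ep',
...         weights_only=False,
...     ),
...     CheckpointSaver(
...         folder='{run_name}/checkpoints/weights',
...         save_interval='500ba',
...         weights_only=True,
...     ),
... ])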
Parameters
  • folder (str, optional) –

    Format string for the folder where checkpoints will be saved. (default: '{run_name}/checkpoints')

    The following format variables are available:

    • {run_name}: The name of the training run. See run_name.

    • {rank}: The global rank, as returned by get_global_rank().

    • {local_rank}: The local rank of the process, as returned by get_local_rank().

    • {world_size}: The world size, as returned by get_world_size().

    • {local_world_size}: The local world size, as returned by get_local_world_size().

    • {node_rank}: The node rank, as returned by get_node_rank().

    Note

    When training with multiple devices (i.e. GPUs), ensure that '{rank}' appears in the format. Otherwise, multiple processes may attempt to write to the same file.

  • filename (str, optional) –

    A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}-rank{rank}')

    Checkpoints will be saved approximately to {folder}/{filename.format(...)}.

    The following format variables are available:

    • {run_name}: The name of the training run. See run_name.

    • {rank}: The global rank, as returned by get_global_rank().

    • {local_rank}: The local rank of the process, as returned by get_local_rank().

    • {world_size}: The world size, as returned by get_world_size().

    • {local_world_size}: The local world size, as returned by get_local_world_size().

    • {node_rank}: The node rank, as returned by get_node_rank().

    • {epoch}: The total epoch count, as returned by epoch().

    • {batch}: The total batch count, as returned by batch().

    • {batch_in_epoch}: The batch count in the current epoch, as returned by batch_in_epoch().

    • {sample}: The total sample count, as returned by sample().

    • {sample_in_epoch}: The sample count in the current epoch, as returned by sample_in_epoch().

    • {token}: The total token count, as returned by token().

    • {token_in_epoch}: The token count in the current epoch, as returned by token_in_epoch().

    Note

    • By default, only the rank zero process will save a checkpoint file.

    • When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within the filename. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, '.tar' will be used.

    • To use compression (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bzip', or '.tar.lzma' (depending on the desired compression algorithm).

    Warning

    Using compression will block the training loop while checkpoints are being compressed. As such, we recommend saving checkpoints without compression.

    Consider the following scenario, where:

    • The run_name is 'awesome-training-run'

    • The default folder='{run_name}/checkpoints' is used.

    • The default filename='ep{epoch}-ba{batch}-rank{rank}' is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    When DeepSpeed is not being used, the rank zero process will save the checkpoint to "awesome-training-run/checkpoints/ep1-ba42-rank0".

    When DeepSpeed is being used, each rank (process) will save checkpoints to:

    awesome-training-run/checkpoints/ep1-ba42-rank0.tar
    awesome-training-run/checkpoints/ep1-ba42-rank1.tar
    awesome-training-run/checkpoints/ep1-ba42-rank2.tar
    ...
    

  • artifact_name (str, optional) –

    Format string for the checkpoint's artifact name. (default: '{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}')

    After the checkpoint is saved, it will be periodically logged as a file artifact. The artifact name will be determined by this format string.

    See also

    log_file_artifact() for file artifact logging.

    The same format variables as for filename are available.

    Leading slashes ('/') will be stripped.

    To disable logging checkpoints as file artifacts, set this parameter to None.

  • latest_filename (str, optional) –

    A format string for a symlink which points to the last saved checkpoint. (default: 'latest-rank{rank}')

    Symlinks will be created approximately at {folder}/{latest_filename.format(...)}.

    The same format variables as for filename are available.

    To disable symlinks, set this parameter to None.

    Consider the following scenario, where:

    • The run_name is 'awesome-training-run'

    • The default folder='{run_name}/checkpoints' is used.

    • The default filename='ep{epoch}-ba{batch}-rank{rank}' is used.

    • The default latest_filename='latest-rank{rank}' is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'awesome-training-run/checkpoints/ep1-ba42-rank0', and a symlink will be created at 'awesome-training-run/checkpoints/latest-rank0' -> 'awesome-training-run/checkpoints/ep1-ba42-rank0'

    When DeepSpeed is being used, each rank (process) will save checkpoints to:

    awesome-training-run/checkpoints/ep1-ba42-rank0.tar
    awesome-training-run/checkpoints/ep1-ba42-rank1.tar
    awesome-training-run/checkpoints/ep1-ba42-rank2.tar
    ...
    

    Corresponding symlinks will be created at:

    awesome-training-run/checkpoints/latest-rank0.tar -> awesome-training-run/checkpoints/ep1-ba42-rank0.tar
    awesome-training-run/checkpoints/latest-rank1.tar -> awesome-training-run/checkpoints/ep1-ba42-rank1.tar
    awesome-training-run/checkpoints/latest-rank2.tar -> awesome-training-run/checkpoints/ep1-ba42-rank2.tar
    ...
    

  • latest_artifact_name (str, optional) –

    Format string for the checkpoint's latest symlink artifact name. (default: '{run_name}/checkpoints/latest-rank{rank}')

    Whenever a new checkpoint is saved, a symlink artifact is created or updated to point to the latest checkpoint's artifact_name. The artifact name will be determined by this format string. This parameter has no effect if latest_filename or artifact_name is None.

    See also

    log_symlink_artifact() for symlink artifact logging.

    The same format variables as for filename are available.

    Leading slashes ('/') will be stripped.

    To disable symlinks in the logger, set this parameter to None.

  • overwrite (bool, optional) – Whether existing checkpoints should be overwritten. If False (the default), then the folder must not exist or be empty. (default: False)

  • save_interval (Time | str | int | (State, Event) -> bool) –

    A Time, time-string, integer (in epochs), or a function that takes (state, event) and returns a boolean indicating whether a checkpoint should be saved.

    If an integer, checkpoints will be saved every n epochs. If Time or a time-string, checkpoints will be saved according to this interval.

    If a function, it should take two arguments (State, Event). The first argument will be the current state of the trainer, and the second argument will be Event.BATCH_CHECKPOINT or Event.EPOCH_CHECKPOINT (depending on the current training progress). It should return True if a checkpoint should be saved given the current state and event. A sketch of such a function appears after this parameter list.

  • weights_only (bool) – If True, save only the model weights instead of the entire training state. This parameter must be False when using DeepSpeed. (default: False)

  • num_checkpoints_to_keep (int, optional) –

    The number of checkpoints to keep locally. The oldest checkpoints are removed first. Set to -1 to keep all checkpoints locally. (default: -1)

    Checkpoints will be removed after they have been logged as a file artifact. For example, when this callback is used in conjunction with the ObjectStoreLogger, set this parameter to 0 to immediately delete checkpoints from the local disk after they have been uploaded to the object store.

    This parameter only controls how many checkpoints are kept locally; checkpoints are not deleted from artifact stores.
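
As referenced under save_interval above, the following is a minimal sketch of the callable form. It saves at every epoch boundary and on every 100th batch. The attribute access state.timestamp.batch and its .value property are assumptions based on recent Composer versions; adapt them to the version in use.

    from composer.callbacks import CheckpointSaver
    from composer.core import Event, State

    def save_every_100_batches(state: State, event: Event) -> bool:
        # Always save at epoch boundaries.
        if event == Event.EPOCH_CHECKPOINT:
            return True
        # Save every 100th batch; assumes ``state.timestamp.batch`` tracks the
        # total batch count (adjust the attribute for your Composer version).
        if event == Event.BATCH_CHECKPOINT:
            return state.timestamp.batch.value % 100 == 0
        return False

    checkpoint_saver = CheckpointSaver(save_interval=save_every_100_batches)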

saved_checkpoints#

The checkpoint timestamps and filepaths.

This list contains tuples of the save timestamp and the checkpoint filepaths. This list will have at most num_checkpoints_to_keep entries. The latest checkpoint will be at the end.

Note

When using DeepSpeed, the index of a filepath in each list corresponds to the global rank of the process that wrote that file. Each filepath is valid only on the process's (rank's) node.

Otherwise, when not using DeepSpeed, each sub-list will contain only one filepath since only rank zero saves checkpoints.

Type

List[Tuple[Timestamp, List[Path]]]
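
For example, the tracked checkpoints can be inspected after training through this attribute. A minimal sketch: the num_checkpoints_to_keep value is illustrative, the Trainer arguments are elided as in the example above, and the Timestamp fields printed are assumed from recent Composer versions.

    from composer import Trainer
    from composer.callbacks import CheckpointSaver

    ckpt_saver = CheckpointSaver(num_checkpoints_to_keep=5)
    trainer = Trainer(..., callbacks=[ckpt_saver])
    trainer.fit()

    # Each entry pairs the save-time Timestamp with the filepath(s) written then.
    for timestamp, paths in ckpt_saver.saved_checkpoints:
        print(timestamp.batch, [str(p) for p in paths])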

composer.callbacks.checkpoint_saver.checkpoint_periodically(interval)[source]#

Helper function to create a checkpoint scheduler according to a specified interval.

Parameters

interval (Union[str, int, Time]) –

The interval describing how often checkpoints should be saved. If an integer, it will be assumed to be in epochs. Otherwise, the unit must be either TimeUnit.EPOCH or TimeUnit.BATCH.

Checkpoints will be saved every n batches or epochs (depending on the unit), and at the end of training.

Returns

Callable[[State, Event], bool] – A function that can be passed as the save_interval argument to the CheckpointSaver constructor.
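
A minimal usage sketch (the '500ba' interval is illustrative; per the description above, it saves every 500 batches and at the end of training):

    from composer.callbacks import CheckpointSaver
    from composer.callbacks.checkpoint_saver import checkpoint_periodically

    # Build the save_interval callable from a batch-based interval and pass it
    # to the CheckpointSaver callback.
    saver = CheckpointSaver(save_interval=checkpoint_periodically('500ba'))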