composer.callbacks.checkpoint_saver#

Callback to save checkpoints during training.

Functions

checkpoint_periodically

Helper function to create a checkpoint scheduler according to a specified interval.

Classes

CheckpointSaver

Callback to save checkpoints.

class composer.callbacks.checkpoint_saver.CheckpointSaver(save_folder='checkpoints', name_format='ep{epoch}-ba{batch}/rank_{rank}', save_latest_format='latest/rank_{rank}', overwrite=False, save_interval='1ep', weights_only=False)[source]#

Bases: composer.core.callback.Callback

Callback to save checkpoints.

Note

If the save_folder argument is specified when constructing the Trainer, then the CheckpointSaver callback need not be constructed manually. However, for advanced checkpointing use cases (such as saving a weights-only checkpoint at one interval and the full training state at another interval), instance(s) of this CheckpointSaver callback can be specified in the callbacks argument of the Trainer, as shown in the example below.

Example

>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(
...         save_folder='checkpoints',
...         name_format="ep{epoch}-ba{batch}/rank_{rank}",
...         save_latest_format="latest/rank_{rank}",
...         save_interval="1ep",
...         weights_only=False,
...     )
... ])
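
For the advanced use case mentioned in the note above (a weights-only checkpoint saved frequently alongside the full training state saved less often), a minimal sketch could register two CheckpointSaver instances; the folder names and intervals below are illustrative, not prescribed:

>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(
...         save_folder='checkpoints/full',       # full training state, once per epoch
...         save_interval="1ep",
...         weights_only=False,
...     ),
...     CheckpointSaver(
...         save_folder='checkpoints/weights',    # weights only, more frequently
...         save_interval="100ba",
...         weights_only=True,
...     ),
... ])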
Parameters
  • save_folder (str) –

    Folder where checkpoints are saved.

    If an absolute path is specified, then that path will be used. Otherwise, the save_folder will be relative to the folder returned by get_run_directory(). If the save_folder does not exist, it will be created.

  • name_format (str, optional) –

    A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}/rank_{rank}')

    Checkpoints will be saved approximately to {save_folder}/{name_format.format(...)}.

    See format_name() for the available format variables.

    Note

    • By default, only the rank zero process will save a checkpoint file.

    • When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within the name_format string. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, '.tar' will be used.

    • To use compression (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bzip', or '.tar.lzma' (depending on the desired compression algorithm).

    Warning

    Using compression will block the training loop while checkpoints are being compressed. As such, we recommend saving checkpoints without compression.

    Consider the following scenario, where:

    • The default save_folder='checkpoints' is used.

    • The default name_format='ep{epoch}-ba{batch}/rank_{rank}' is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    When DeepSpeed is not being used, the rank zero process will save the checkpoint to "checkpoints/ep1-ba42/rank_0".

    When DeepSpeed is being used, each rank (process) will save checkpoints to:

    checkpoints/ep1-ba42/rank_0.tar
    checkpoints/ep1-ba42/rank_1.tar
    checkpoints/ep1-ba42/rank_2.tar
    ...
    

  • save_latest_format (str, optional) –

    A format string for a symlink which points to the last saved checkpoint. (default: 'latest/rank_{rank}')

    Symlinks will be created approximately at {save_folder}/{save_latest_format.format(...)}.

    See format_name() for the available format variables.

    To disable symlinks, set this parameter to None.

    Consider the following scenario, where:

    • The default save_folder='checkpoints' is used.

    • The default name_format='ep{epoch}-ba{batch}/rank_{rank}' is used.

    • The default save_latest_format='latest/rank_{rank}' is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'checkpoints/ep1-ba42/rank_0', and a symlink will be created at 'checkpoints/latest/rank_0' -> 'checkpoints/ep1-ba42/rank_0'.

    When DeepSpeed is being used, each rank (process) will save checkpoints to:

    checkpoints/ep1-ba42/rank_0.tar
    checkpoints/ep1-ba42/rank_1.tar
    checkpoints/ep1-ba42/rank_2.tar
    ...
    

    Corresponding symlinks will be created at:

    checkpoints/latest/rank_0.tar -> checkpoints/ep1-ba42/rank_0.tar
    checkpoints/latest/rank_1.tar -> checkpoints/ep1-ba42/rank_1.tar
    checkpoints/latest/rank_2.tar -> checkpoints/ep1-ba42/rank_2.tar
    ...
    

  • overwrite (bool, optional) – Whether existing checkpoints should be overwritten. If False (the default), then the checkpoint_folder must not exist or must be empty. (default: False)

  • save_interval (Time | str | int | (State, Event) -> bool) –

    A Time, time-string, integer (in epochs), or a function that takes (state, event) and returns a boolean indicating whether a checkpoint should be saved.

    If an integer, checkpoints will be saved every n epochs. If Time or a time-string, checkpoints will be saved according to this interval.

    If a function, then it should take two arguments (State, Event). The first argument will be the current state of the trainer, and the second argument will be Event.BATCH_CHECKPOINT or Event.EPOCH_CHECKPOINT (depending on the current training progress). It should return True if a checkpoint should be saved given the current state and event; a minimal sketch of such a function is shown after this parameter list.

  • weights_only (bool) – If True, save only the model weights instead of the entire training state. This parameter must be False when using DeepSpeed. (default: False)
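
A minimal sketch of the callable form of save_interval is shown below; it assumes Event can be imported from composer.core, and the predicate itself (saving only at epoch boundaries) is purely illustrative:

>>> from composer.core import Event
>>> def save_on_epoch_end(state, event):
...     # Return True only for epoch-boundary checkpoint events; ignore
...     # Event.BATCH_CHECKPOINT so checkpoints are written once per epoch.
...     return event == Event.EPOCH_CHECKPOINT
>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(save_interval=save_on_epoch_end),
... ])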

checkpoint_folder#

The folder in which checkpoints are stored. If an absolute path was specified for save_folder upon instantiation, then that path will be used. Otherwise, this folder is relative to the run directory of the training run (e.g. {run_directory}/{save_folder}). If no run directory is provided, then by default, it is of the form runs/<timestamp>/rank_<GLOBAL_RANK>/<save_folder>, where timestamp is the start time of the run in ISO format, GLOBAL_RANK is the global rank of the process, and save_folder is the save_folder argument provided upon construction.

See also

run_directory for details on the format of the run directory and how to customize it.

Type

str

saved_checkpoints#

A dictionary mapping a save timestamp to a list of filepaths corresponding to the checkpoints saved at that time.

Note

When using DeepSpeed, the index of a filepath in each list corresponds to the global rank of the process that wrote that file. These filepaths are valid only on that rank's node. Otherwise, when not using DeepSpeed, each list will contain only one filepath since only rank zero saves checkpoints.

Type

Dict[Timestamp, List[str]]
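
As a brief sketch of how this attribute might be inspected after training (assuming a trainer.fit() call as in typical Composer usage):

>>> saver = CheckpointSaver(save_folder='checkpoints')
>>> trainer = Trainer(..., callbacks=[saver])
>>> trainer.fit()
>>> for timestamp, filepaths in saver.saved_checkpoints.items():
...     # Without DeepSpeed, each list contains the single filepath written by rank zero.
...     print(timestamp, filepaths)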

composer.callbacks.checkpoint_saver.checkpoint_periodically(interval)[source]#

Helper function to create a checkpoint scheduler according to a specified interval.

Parameters

interval (Union[str, int, Time]) –

The interval describing how often checkpoints should be saved. If an integer, it is assumed to be in epochs. Otherwise, the unit must be either TimeUnit.EPOCH or TimeUnit.BATCH.

Checkpoints will be saved every n batches or epochs (depending on the unit), and at the end of training.

Returns

Callable[[State, Event], bool] – A function that can be passed as the save_interval argument into the CheckpointSaver.
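
A brief usage sketch follows; the "100ba" interval is illustrative, and any TimeUnit.EPOCH or TimeUnit.BATCH interval is accepted per the description above:

>>> from composer.callbacks.checkpoint_saver import CheckpointSaver, checkpoint_periodically
>>> save_fn = checkpoint_periodically("100ba")  # save every 100 batches, plus at the end of training
>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(save_interval=save_fn),
... ])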