CheckpointSaver

class composer.callbacks.CheckpointSaver(folder='{run_name}/checkpoints', filename='ep{epoch}-ba{batch}-rank{rank}.pt', remote_file_name='{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}.pt', latest_filename='latest-rank{rank}.pt', latest_remote_file_name='{run_name}/checkpoints/latest-rank{rank}.pt', save_interval='1ep', *, overwrite=False, num_checkpoints_to_keep=-1, weights_only=False, ignore_keys=None)

Callback to save checkpoints.

Note

If the folder argument is specified when constructing the Trainer, then the CheckpointSaver callback need not be constructed manually. However, for advanced checkpointing use cases (such as saving a weights-only checkpoint at one interval and the full training state at another interval), instance(s) of this CheckpointSaver callback can be specified in the callbacks argument of the Trainer, as shown in the example below.

Example

>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(
...         folder='{run_name}/checkpoints',
...         filename="ep{epoch}-ba{batch}-rank{rank}",
...         latest_filename="latest-rank{rank}",
...         save_interval="1ep",
...         weights_only=False,
...     )
... ])
Parameters
  • folder (str, optional) –

    Format string for the save_folder where checkpoints will be saved. Default: '{run_name}/checkpoints'.

    The following format variables are available:

    Variable             Description
    {run_name}           The name of the training run. See Logger.run_name.
    {rank}               The global rank, as returned by get_global_rank().
    {local_rank}         The local rank of the process, as returned by get_local_rank().
    {world_size}         The world size, as returned by get_world_size().
    {local_world_size}   The local world size, as returned by get_local_world_size().
    {node_rank}          The node rank, as returned by get_node_rank().

    Note

    When training with multiple devices (i.e. GPUs), ensure that '{rank}' appears in the format. Otherwise, multiple processes may attempt to write to the same file.

  • filename (str, optional) –

    A format string describing how to name checkpoints. Default: 'ep{epoch}-ba{batch}-rank{rank}.pt'.

    Checkpoints will be saved approximately to {folder}/{filename.format(...)}.

    The following format variables are available:

    Variable             Description
    {run_name}           The name of the training run. See Logger.run_name.
    {rank}               The global rank, as returned by get_global_rank().
    {local_rank}         The local rank of the process, as returned by get_local_rank().
    {world_size}         The world size, as returned by get_world_size().
    {local_world_size}   The local world size, as returned by get_local_world_size().
    {node_rank}          The node rank, as returned by get_node_rank().
    {epoch}              The total epoch count, as returned by epoch().
    {batch}              The total batch count, as returned by batch().
    {batch_in_epoch}     The batch count in the current epoch, as returned by batch_in_epoch().
    {sample}             The total sample count, as returned by sample().
    {sample_in_epoch}    The sample count in the current epoch, as returned by sample_in_epoch().
    {token}              The total token count, as returned by token().
    {token_in_epoch}     The token count in the current epoch, as returned by token_in_epoch().
    {total_wct}          The total training duration in seconds, as returned by total_wct().
    {epoch_wct}          The epoch duration in seconds, as returned by epoch_wct().
    {batch_wct}          The batch duration in seconds, as returned by batch_wct().

    Note

    • By default, only the rank zero process will save a checkpoint file.

    • When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within the filename. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, '.tar' will be used.

    • To write to compressed tar files (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bz2', or '.tar.lzma' (depending on the desired compression algorithm).

    • To write to compressed pt files (when DeepSpeed is disabled), set the file extension to '.pt.bz2', '.pt.gz', '.pt.lz4', '.pt.lzma', '.pt.lzo', '.pt.xz', or '.pt.zst' (depending on the desired algorithm). You must have the corresponding CLI tool installed. lz4 is a good choice: it gives a modest space saving and is very fast to compress.

    Warning

    Compression blocks the training loop while checkpoints are being compressed, and the compressibility of checkpoints can vary significantly depending on your setup. We therefore recommend saving checkpoints without compression by default.

    If you have the lz4 command available on your system, you may want to try saving as .pt.lz4 as the overhead is minimal (usually less than a second) and the saved space can sometimes be significant (1% - 40%).
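The extension rules above can be sketched as a small lookup. This is an illustration only, not Composer's implementation: the `classify` helper and the two tuples are hypothetical names, and the extensions are simply those listed in the notes above. The `shutil.which` check reflects the requirement that the matching CLI tool be installed.

```python
import shutil

# Extensions copied from the notes above; the grouping is this sketch's own.
TAR_EXTENSIONS = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.lzma")
COMPRESSED_PT_EXTENSIONS = (
    ".pt.bz2", ".pt.gz", ".pt.lz4", ".pt.lzma", ".pt.lzo", ".pt.xz", ".pt.zst",
)


def classify(filename: str) -> str:
    """Hypothetical helper: describe the format implied by a file extension."""
    if filename.endswith(TAR_EXTENSIONS):
        return "tarball"
    if filename.endswith(COMPRESSED_PT_EXTENSIONS):
        return "compressed pt"
    return "uncompressed"


# Check whether the lz4 command is on PATH before choosing '.pt.lz4'.
lz4_available = shutil.which("lz4") is not None
extension = ".pt.lz4" if lz4_available else ".pt"
print(classify("ep1-ba42-rank0" + extension))
```

Falling back to plain '.pt' when the tool is missing follows the recommendation above to save without compression by default.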

    Consider the following scenario where:

    • The run_name is 'awesome-training-run'

    • The default folder='{run_name}/checkpoints' is used.

    • filename='ep{epoch}-ba{batch}-rank{rank}' (without a file extension) is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    When DeepSpeed is not being used, the rank zero process will save the checkpoint to "awesome-training-run/checkpoints/ep1-ba42-rank0".

    When DeepSpeed is being used, each rank (process) will save checkpoints to:

    awesome-training-run/checkpoints/ep1-ba42-rank0.tar
    awesome-training-run/checkpoints/ep1-ba42-rank1.tar
    awesome-training-run/checkpoints/ep1-ba42-rank2.tar
    ...
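The scenario above can be reproduced with plain Python string formatting. This is a minimal sketch: it calls `str.format` directly rather than Composer's own formatting code, the `expand` helper is a hypothetical name, and three ranks are assumed.

```python
import posixpath


def expand(folder: str, filename: str, **variables) -> str:
    """Hypothetical helper: fill the format variables, then join folder and filename."""
    return posixpath.join(folder.format(**variables), filename.format(**variables))


variables = {"run_name": "awesome-training-run", "epoch": 1, "batch": 42}

# One path per rank, as in the DeepSpeed case above (three ranks assumed).
paths = [
    expand("{run_name}/checkpoints", "ep{epoch}-ba{batch}-rank{rank}.tar", rank=rank, **variables)
    for rank in range(3)
]

# Without {rank}, every rank resolves to the same path and the writes would collide.
collisions = {
    expand("{run_name}/checkpoints", "ep{epoch}-ba{batch}.tar", rank=rank, **variables)
    for rank in range(3)
}

print(paths)
print(len(collisions))
```

The second expression shows why '{rank}' should appear in the filename when more than one process writes checkpoints.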
    

  • remote_file_name (str, optional) –

    Format string for the checkpoint's remote file name. Default: '{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}.pt'.

    After the checkpoint is saved, it will be periodically uploaded. The remote file name will be determined by this format string.

    See also

    Uploading Files for notes on file uploading.

    The same format variables as for filename are available.

    Leading slashes ('/') will be stripped.

    To disable uploading checkpoints, set this parameter to None.

  • latest_filename (str, optional) –

    A format string for a symlink which points to the last saved checkpoint. Default: 'latest-rank{rank}.pt'.

    Symlinks will be created approximately at {folder}/{latest_filename.format(...)}.

    The same format variables as for filename are available.

    To disable symlinks, set this parameter to None.

    Consider the following scenario, where:

    • The run_name is 'awesome-training-run'

    • The default folder='{run_name}/checkpoints' is used.

    • filename='ep{epoch}-ba{batch}-rank{rank}' (without a file extension) is used.

    • latest_filename='latest-rank{rank}' (without a file extension) is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'awesome-training-run/checkpoints/ep1-ba42-rank0', and a symlink will be created at 'awesome-training-run/checkpoints/latest-rank0' -> 'awesome-training-run/checkpoints/ep1-ba42-rank0'.

    When DeepSpeed is being used, each rank (process) will save checkpoints to:

    awesome-training-run/checkpoints/ep1-ba42-rank0.tar
    awesome-training-run/checkpoints/ep1-ba42-rank1.tar
    awesome-training-run/checkpoints/ep1-ba42-rank2.tar
    ...
    

    Corresponding symlinks will be created at:

    awesome-training-run/checkpoints/latest-rank0.tar -> awesome-training-run/checkpoints/ep1-ba42-rank0.tar
    awesome-training-run/checkpoints/latest-rank1.tar -> awesome-training-run/checkpoints/ep1-ba42-rank1.tar
    awesome-training-run/checkpoints/latest-rank2.tar -> awesome-training-run/checkpoints/ep1-ba42-rank2.tar
    ...
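The symlink behaviour described above can be sketched with the standard library. This is an illustration on a POSIX system, not Composer's code: `update_latest_symlink` is a hypothetical helper, the checkpoint files are placeholders, and the use of a relative link target is a choice made for this sketch.

```python
import os
import tempfile


def update_latest_symlink(checkpoint_path: str, symlink_path: str) -> None:
    """Hypothetical helper: point symlink_path at checkpoint_path, replacing any old link."""
    if os.path.lexists(symlink_path):
        os.remove(symlink_path)
    # A relative target keeps the link valid if the whole folder is moved.
    target = os.path.relpath(checkpoint_path, os.path.dirname(symlink_path))
    os.symlink(target, symlink_path)


with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "awesome-training-run", "checkpoints")
    os.makedirs(folder)
    latest = os.path.join(folder, "latest-rank0")

    # Save two placeholder checkpoints; the link is updated after each save.
    for name in ("ep1-ba42-rank0", "ep2-ba84-rank0"):
        checkpoint = os.path.join(folder, name)
        with open(checkpoint, "wb") as f:
            f.write(b"placeholder")
        update_latest_symlink(checkpoint, latest)

    is_link = os.path.islink(latest)
    resolved = os.path.basename(os.path.realpath(latest))

print(is_link, resolved)
```

After the second save, the link resolves to the most recent checkpoint, which is what makes a fixed "latest" path convenient for resuming.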
    

  • latest_remote_file_name (str, optional) –

    Format string for the remote file name of the checkpoint's latest symlink. Default: '{run_name}/checkpoints/latest-rank{rank}.pt'.

    Whenever a new checkpoint is saved, a symlink is created or updated to point to the latest checkpoint's remote_file_name. The remote file name will be determined by this format string. This parameter has no effect if latest_filename or remote_file_name is None.

    See also

    Uploading Files for notes on file uploading.

    The same format variables as for filename are available.

    Leading slashes ('/') will be stripped.

    To disable symlinks in the logger, set this parameter to None.

  • overwrite (bool, optional) – Whether existing checkpoints should be overwritten. If False (the default), then the folder must not exist or must not contain checkpoints which may conflict with the current run. Default: False.

  • save_interval (Time | str | int | (State, Event) -> bool) –

    A Time, time-string, integer (in epochs), or a function that takes (state, event) and returns a boolean indicating whether a checkpoint should be saved.

    If an integer, checkpoints will be saved every n epochs. If Time or a time-string, checkpoints will be saved according to this interval.

    See also

    checkpoint_periodically()

    If a function, then this function should take two arguments (State, Event). The first argument will be the current state of the trainer, and the second argument will be Event.BATCH_CHECKPOINT or Event.EPOCH_CHECKPOINT (depending on the current training progress). It should return True if a checkpoint should be saved given the current state and event.
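The shape of such a function can be sketched without Composer installed. Everything below uses stand-ins and is an assumption of this sketch: a `SimpleNamespace` with a hypothetical `batch` attribute plays the role of the State, and plain strings play the role of the Event members. In real use the callable would receive Composer's own State and Event objects.

```python
from types import SimpleNamespace

# Stand-ins for the two events named above (real code would compare Event members).
BATCH_CHECKPOINT = "BATCH_CHECKPOINT"
EPOCH_CHECKPOINT = "EPOCH_CHECKPOINT"


def save_every_100_batches(state, event) -> bool:
    """Return True only on batch checkpoints whose batch count is a positive multiple of 100."""
    if event != BATCH_CHECKPOINT:
        return False
    return state.batch > 0 and state.batch % 100 == 0


# `batch` is a hypothetical attribute used only for this illustration.
decisions = [
    save_every_100_batches(SimpleNamespace(batch=200), BATCH_CHECKPOINT),
    save_every_100_batches(SimpleNamespace(batch=150), BATCH_CHECKPOINT),
    save_every_100_batches(SimpleNamespace(batch=200), EPOCH_CHECKPOINT),
]
print(decisions)
```

Checking the event first keeps the function from triggering twice when a batch boundary and an epoch boundary coincide.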

  • num_checkpoints_to_keep (int, optional) –

    The number of checkpoints to keep locally. The oldest checkpoints are removed first. Set to -1 to keep all checkpoints locally. Default: -1.

    Checkpoints will be removed after they have been uploaded. For example, when this callback is used in conjunction with the RemoteUploaderDownloader, set this parameter to 0 to immediately delete checkpoints from the local disk after they have been uploaded to the object store.

    This parameter only controls how many checkpoints are kept locally; checkpoints are not deleted from remote file systems.
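The retention rule above (oldest first, -1 keeps everything, 0 keeps nothing) can be sketched as follows. The `prune` helper is a hypothetical name and works on plain strings; it only illustrates the bookkeeping, not Composer's actual deletion or upload logic.

```python
def prune(saved: list, num_checkpoints_to_keep: int) -> list:
    """Hypothetical helper: drop the oldest entries in place and return what was dropped."""
    if num_checkpoints_to_keep < 0:
        # -1 means keep every checkpoint locally.
        return []
    removed = []
    while len(saved) > num_checkpoints_to_keep:
        # The oldest checkpoint sits at the front of the list.
        removed.append(saved.pop(0))
    return removed


saved = ["ep1-rank0.pt", "ep2-rank0.pt", "ep3-rank0.pt"]
removed = prune(saved, 2)
print(removed, saved)
```

In a real run the removed paths would be deleted from local disk only; as noted above, remote copies are left untouched.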

  • weights_only (bool) – If True, save only the model weights instead of the entire training state. This parameter must be False when using DeepSpeed. Default: False.

  • ignore_keys (List[str] | (Dict) -> None, optional) –

    A list of paths into the checkpoint's state_dict. When provided, the matching entries are removed from the state_dict before the checkpoint is saved. Each path is a string of the keys used to index into state_dict, joined with / as the separator (because PyTorch uses . in parameter names). If a prefix is provided, all of its children are also ignored (see Example 2). See composer.core.state for the structure of state_dict.

    Example 1: ignore_keys = ["state/model/layer1.weights", "state/model/layer1.bias"] would ignore the layer 1 weights and bias.

    Example 2: ignore_keys = ["state/model/*"] would ignore the entire model, which would have the same effect as the previous example if there were only one layer.

    Example 3: ignore_keys = ["state/model/layer*.weights"] would ignore all weights in the model.

    Example 4: ignore_keys = ["state/rank_zero_seed", "rng"] would reset all randomness when saving the checkpoint.

    If a callable, it should take one argument, which is the state_dict. The callable is free to arbitrarily modify the state_dict before it is saved.

    (default: None)
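The path matching described above can be sketched with `fnmatch` over a nested dictionary. This is an assumption-laden illustration, not Composer's implementation: `remove_ignored` and `drop_rng` are hypothetical names, and the toy `checkpoint` dictionary only loosely mirrors a real state_dict.

```python
import fnmatch


def remove_ignored(state_dict: dict, patterns: list, prefix: str = "") -> None:
    """Hypothetical helper: delete every entry whose '/'-joined path matches a pattern."""
    for key in list(state_dict):
        path = f"{prefix}{key}"
        if any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns):
            # Deleting a matched prefix also removes all of its children.
            del state_dict[key]
        elif isinstance(state_dict[key], dict):
            remove_ignored(state_dict[key], patterns, prefix=f"{path}/")


def drop_rng(state_dict: dict) -> None:
    """Callable form: receives the state_dict and modifies it in place."""
    state_dict.pop("rng", None)


checkpoint = {
    "state": {
        "model": {"layer1.weights": 1, "layer1.bias": 2, "layer2.weights": 3},
        "rank_zero_seed": 0,
    },
    "rng": [0, 1],
}

# List form, as in Example 3 combined with part of Example 4.
remove_ignored(checkpoint, ["state/model/layer*.weights", "rng"])
print(checkpoint)

# Callable form applied to a second toy dictionary.
other = {"state": {}, "rng": [0, 1]}
drop_rng(other)
print(other)
```

Using / as the path separator leaves the dots inside parameter names such as layer1.weights unambiguous.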

saved_checkpoints

The checkpoint timestamps and filepaths.

This list contains tuples of the save timestamp and the checkpoint filepaths. When num_checkpoints_to_keep is non-negative, the list has at most that many entries. The latest checkpoint is at the end.

Note

When using DeepSpeed, the index of a filepath in each list corresponds to the global rank of the process that wrote that file. Each filepath is valid only on the node of the process (rank) that wrote it.

When not using DeepSpeed, each sub-list contains only one filepath, since only rank zero saves checkpoints.

Type

List[Tuple[Timestamp, List[Path]]]
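Reading this attribute can be sketched with stand-in data. The values below are invented for illustration: plain strings take the place of Timestamp objects, the paths are placeholders, and two DeepSpeed ranks are assumed.

```python
from pathlib import PurePosixPath

# Stand-in for saved_checkpoints: (timestamp, [one path per global rank]).
saved_checkpoints = [
    ("ep1-ba42", [PurePosixPath("run/checkpoints/ep1-ba42-rank0.tar"),
                  PurePosixPath("run/checkpoints/ep1-ba42-rank1.tar")]),
    ("ep2-ba84", [PurePosixPath("run/checkpoints/ep2-ba84-rank0.tar"),
                  PurePosixPath("run/checkpoints/ep2-ba84-rank1.tar")]),
]

# The latest checkpoint is the last entry.
latest_timestamp, latest_paths = saved_checkpoints[-1]

# With DeepSpeed, the list index is the global rank that wrote the file.
rank = 1
my_path = latest_paths[rank]
print(latest_timestamp, my_path)
```

Without DeepSpeed, each inner list would hold a single path, so index 0 is the only valid one.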