composer.callbacks.run_directory_uploader

Periodically upload run_directory to a blob store during training.

Classes

RunDirectoryUploader

Callback to upload the run directory to a blob store.

class composer.callbacks.run_directory_uploader.RunDirectoryUploader(object_store_provider_hparams, object_name_prefix=None, num_concurrent_uploads=4, upload_staging_folder=None, use_procs=True, upload_every_n_batches=100)

Bases: composer.core.callback.Callback

Callback to upload the run directory to a blob store.

This callback checks the run directory for new or modified files at the end of every epoch and after every upload_every_n_batches batches. It detects new or modified files by their modification timestamps: only files modified since the last upload are uploaded.
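The timestamp comparison described above can be sketched with the standard library. This is a minimal illustration, not the callback's actual implementation; the function name `find_modified_files` and the `last_upload_timestamp` argument are hypothetical.

```python
import os


def find_modified_files(run_directory, last_upload_timestamp):
    """Return paths (relative to run_directory) modified after the timestamp.

    Hypothetical helper: it only illustrates mtime-based change detection.
    """
    modified = []
    for root, _dirs, files in os.walk(run_directory):
        for name in files:
            path = os.path.join(root, name)
            # A file qualifies only if it changed after the previous upload.
            if os.path.getmtime(path) > last_upload_timestamp:
                modified.append(os.path.relpath(path, run_directory))
    return sorted(modified)
```

A caller would record the time of each scan and pass it as `last_upload_timestamp` on the next one, so unchanged files are skipped.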

Example
>>> osphparams = ObjectStoreProviderHparams(
...     provider="s3",
...     container="run-dir-test",
...     key_environ="OBJECT_STORE_KEY",
...     secret_environ="OBJECT_STORE_SECRET",
...     region="us-west-2",
...     )
>>> # construct trainer object with this callback
>>> run_directory_uploader = RunDirectoryUploader(osphparams)
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[run_directory_uploader],
... )
>>> # the trainer runs this callback whenever the EPOCH_END event
>>> # is triggered, like this:
>>> _ = trainer.engine.run_event(Event.EPOCH_END)

Note

This callback blocks the training loop while it copies files from the run_directory to the upload_staging_folder and adds them to the workers' upload queues. The actual uploads happen in the background. Even so, the following tips help minimize the performance impact:

  • Ensure that upload_every_n_batches is large enough to limit how often the blocking scans of the run directory and copies of modified files occur. However, do not make it too large: if the training process dies unexpectedly, data written after the last upload may be lost.

  • Set use_procs=True (the default) to use background processes, instead of threads, to perform the file uploads. Processes are recommended to ensure that the GIL does not block the training loop when performing CPU operations on uploaded files (e.g. computing and comparing checksums). Network I/O always occurs in the background.

  • Provide a RAM disk path for the upload_staging_folder parameter. Copying files to stage on RAM will be faster than writing to disk. However, you must have sufficient excess RAM on your system, or you may experience OutOfMemory errors.
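The split between the blocking stage-and-queue step and the background upload step can be sketched as follows. This is an illustration under stated assumptions, not the callback's code: `stage_and_queue`, `upload_worker`, and `upload_fn` are hypothetical names, and threads are used for brevity even though the callback defaults to processes.

```python
import queue
import shutil
import threading
from pathlib import Path


def stage_and_queue(run_directory, staging_folder, relative_paths, upload_queue):
    """Blocking step: copy each file into the staging folder, then enqueue it."""
    for rel in relative_paths:
        staged = Path(staging_folder) / rel
        staged.parent.mkdir(parents=True, exist_ok=True)
        # Copying first means training can keep writing to the original file.
        shutil.copy2(Path(run_directory) / rel, staged)
        upload_queue.put((str(staged), rel))


def upload_worker(upload_queue, upload_fn):
    """Background step: pull staged files off the queue and upload them."""
    while True:
        item = upload_queue.get()
        if item is None:  # sentinel: no more work for this worker
            upload_queue.task_done()
            break
        staged_path, object_name = item
        try:
            # upload_fn is a placeholder for the real object store call.
            upload_fn(staged_path, object_name)
        finally:
            upload_queue.task_done()
```

Only `stage_and_queue` runs on the training thread, so its cost (directory copies) is what upload_every_n_batches and a RAM-backed staging folder help reduce.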

Parameters
  • object_store_provider_hparams (ObjectStoreProviderHparams) –

    ObjectStoreProvider hyperparameters object

    See ObjectStoreProviderHparams for documentation.

  • object_name_prefix (str, optional) –

    A prefix to prepend to all object keys. An object's key is this prefix combined with its path relative to the run directory. If the prefix is non-empty, a trailing slash ('/') will be added if necessary. If not specified, then the prefix defaults to the run directory. To disable prefixing, set to the empty string.

    For example, if object_name_prefix = 'foo' and there is a file in the run directory named bar, then that file would be uploaded to foo/bar in the container.

  • num_concurrent_uploads (int, optional) – Maximum number of concurrent uploads. Defaults to 4.

  • upload_staging_folder (str, optional) – A folder to use for staging uploads. If not specified, defaults to using a TemporaryDirectory().

  • use_procs (bool, optional) – Whether to perform file uploads in background processes (as opposed to threads). Defaults to True.

  • upload_every_n_batches (int, optional) – Interval at which to scan the run directory for changes and to queue uploads of files. In addition, uploads are always queued at the end of the epoch. Defaults to every 100 batches.
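The object_name_prefix rule above (prefix, optional trailing slash, then the path relative to the run directory) can be sketched as below. The helper `object_key` is hypothetical and covers only an explicit string prefix; the default used when the prefix is not specified is not modeled here.

```python
from pathlib import PurePath


def object_key(object_name_prefix, run_directory, local_filepath):
    """Combine a prefix with a file's path relative to the run directory.

    Hypothetical helper illustrating the documented rule, not library code.
    """
    relative = PurePath(local_filepath).relative_to(run_directory).as_posix()
    prefix = object_name_prefix
    # A non-empty prefix gets a trailing slash if it lacks one.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return prefix + relative


print(object_key("foo", "/runs/exp1", "/runs/exp1/bar"))
```

With the empty string as the prefix, the key is just the relative path, which matches the "to disable prefixing" case.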

get_uri_for_uploaded_file(local_filepath)

Get the object store provider uri for a specific local filepath.

Parameters

local_filepath (Union[Path, str]) – The local file for which to get the uploaded uri.

Returns

str – The uri corresponding to the upload location of the file.