composer.callbacks.run_directory_uploader
Periodically upload the run_directory to a blob store during training.
Classes
RunDirectoryUploader: Callback to upload the run directory to a blob store.
- class composer.callbacks.run_directory_uploader.RunDirectoryUploader(object_store_provider_hparams, object_name_prefix=None, num_concurrent_uploads=4, upload_staging_folder=None, use_procs=True, upload_every_n_batches=100)
Bases: composer.core.callback.Callback
Callback to upload the run directory to a blob store.
This callback checks the run directory for new or modified files at the end of every epoch and after every upload_every_n_batches batches. It detects new or modified files by their modification timestamp: only files whose last-modified timestamp is newer than the last upload are uploaded.
- Example
>>> osphparams = ObjectStoreProviderHparams(
...     provider="s3",
...     container="run-dir-test",
...     key_environ="OBJECT_STORE_KEY",
...     secret_environ="OBJECT_STORE_SECRET",
...     region="us-west-2",
... )
>>> # construct trainer object with this callback
>>> run_directory_uploader = RunDirectoryUploader(osphparams)
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[run_directory_uploader],
... )
>>> # trainer will run this callback whenever the EPOCH_END
>>> # is triggered, like this:
>>> _ = trainer.engine.run_event(Event.EPOCH_END)
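The timestamp-based change detection described above can be illustrated with a minimal sketch. This is not Composer's implementation; find_modified_files and last_upload_time are hypothetical names, and the sketch assumes a plain recursive scan that compares each file's last-modified time with the time of the previous upload.

```python
import os
from typing import List


def find_modified_files(root: str, last_upload_time: float) -> List[str]:
    """Return paths (relative to ``root``) of files modified after ``last_upload_time``.

    Hypothetical helper for illustration only; not part of Composer.
    """
    modified = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # A file qualifies only if its last-modified timestamp is
            # strictly newer than the previous upload time.
            if os.path.getmtime(path) > last_upload_time:
                modified.append(os.path.relpath(path, root))
    return sorted(modified)
```

A caller would record the current time before each scan and pass it as last_upload_time on the next one, so that files written during a scan are picked up the following time.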
Note
This callback blocks the training loop to copy files from the run_directory to the upload_staging_folder and to queue these files to the upload queues of the workers. The actual upload happens in the background. Even so, here are some additional tips for minimizing the performance impact:
- Ensure that upload_every_n_batches is sufficiently infrequent to limit how often the blocking scans of the run directory and copies of modified files occur. However, do not make it too infrequent: if the training process dies unexpectedly, data written after the last upload may be lost.
- Set use_procs=True (the default) to use background processes, instead of threads, to perform the file uploads. Processes are recommended to ensure that the GIL does not block the training loop when performing CPU operations on uploaded files (e.g. computing and comparing checksums). Network I/O always occurs in the background.
- Provide a RAM disk path for the upload_staging_folder parameter. Copying files to a staging folder in RAM is faster than writing to disk. However, you must have sufficient excess RAM on your system, or you may experience OutOfMemory errors.
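The split between a blocking copy and a background upload, as described in the note above, might look roughly like the following sketch. It is an illustration under stated assumptions, not Composer's code: stage_file and upload_worker are hypothetical names, the "upload" is a placeholder that only records the file name, a thread is used instead of a process for brevity, and deleting the staged copy afterwards is an assumption.

```python
import os
import queue
import shutil
import threading


def stage_file(src_path: str, staging_dir: str, pending: queue.Queue) -> str:
    """Blocking step: copy a file into the staging folder, then enqueue it."""
    # Simplification: files are staged under their base name only.
    staged_path = os.path.join(staging_dir, os.path.basename(src_path))
    shutil.copy2(src_path, staged_path)  # runs in the caller's (training) thread
    pending.put(staged_path)
    return staged_path


def upload_worker(pending: queue.Queue, uploaded: list) -> None:
    """Background step: take staged files off the queue and 'upload' them."""
    while True:
        staged_path = pending.get()
        if staged_path is None:  # sentinel value: stop the worker
            break
        # Placeholder for a real blob-store upload; here we only record the name.
        uploaded.append(os.path.basename(staged_path))
        # Assumption: the staged copy is deleted once the upload is done.
        os.remove(staged_path)
```

In this sketch only stage_file runs on the training thread, so its cost is bounded by the copy speed of the staging folder, which is why a RAM-backed path can help.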
- Parameters
object_store_provider_hparams (ObjectStoreProviderHparams) – ObjectStoreProvider hyperparameters object. See ObjectStoreProviderHparams for documentation.
object_name_prefix (str, optional) – A prefix to prepend to all object keys. An object's key is this prefix combined with its path relative to the run directory. If the container prefix is non-empty, a trailing slash ('/') will be added if necessary. If not specified, then the prefix defaults to the run directory. To disable prefixing, set to the empty string.
For example, if object_name_prefix = 'foo' and there is a file in the run directory named bar, then that file would be uploaded to foo/bar in the container.
num_concurrent_uploads (int, optional) – Maximum number of concurrent uploads. Defaults to 4.
upload_staging_folder (str, optional) – A folder to use for staging uploads. If not specified, defaults to using a TemporaryDirectory().
use_procs (bool, optional) – Whether to perform file uploads in background processes (as opposed to threads). Defaults to True.
upload_every_n_batches (int, optional) – Interval at which to scan the run directory for changes and to queue uploads of files. In addition, uploads are always queued at the end of the epoch. Defaults to every 100 batches.
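The prefixing rule documented for object_name_prefix can be sketched as a small function. object_key is a hypothetical helper written for illustration, not a Composer API; it assumes the prefix has already been resolved (the default of using the run directory is not modeled here).

```python
def object_key(prefix: str, relative_path: str) -> str:
    """Build an object key from a prefix and a path relative to the run directory.

    Hypothetical helper for illustration only; not part of Composer.
    """
    # A non-empty prefix gets a trailing slash if it does not already have one.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    # An empty prefix disables prefixing, so the key is the relative path itself.
    return prefix + relative_path
```

With prefix 'foo' and a file named bar, this yields foo/bar, matching the example in the parameter description; with an empty prefix the key is just bar.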