composer.profiler

Modules

composer.profiler.dataloader_profiler

Profiler to measure the time it takes the data loader to return a batch.

composer.profiler.json_trace

Outputs profiling data in JSON trace format.

composer.profiler.json_trace_merger

Merge trace files together.

composer.profiler.profiler_hparams

Example usage and definition of hparams.

composer.profiler.system_profiler

Profiler to record system level metrics.

composer.profiler.torch_profiler

Profiler to collect torch performance metrics during training.

Performance profiling tools.

The profiler gathers performance metrics during a training run that can be used to diagnose bottlenecks and facilitate model development.

The metrics gathered include:

  • Duration of each Event during training

  • Time taken by the data loader to return a batch

  • Host metrics such as CPU, system memory, disk and network utilization over time

  • Execution order, latency and attributes of PyTorch operators and GPU kernels (see torch.profiler)

The following example demonstrates how to set up and perform profiling on a simple training run.

# Copyright 2021 MosaicML. All Rights Reserved.

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from composer import Trainer
from composer.models import MNIST_Classifier

# Specify Dataset and Instantiate DataLoader
batch_size = 2048
data_directory = "../data"

mnist_transforms = transforms.Compose([transforms.ToTensor()])

train_dataset = datasets.MNIST(data_directory, train=True, download=True, transform=mnist_transforms)
train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=False,
                              drop_last=True,
                              pin_memory=True,
                              persistent_workers=True,
                              num_workers=8)

# Instantiate Model
model = MNIST_Classifier(num_classes=10)

# Instantiate the trainer
profiler_trace_file = "profiler_traces.json"
torch_trace_dir = "torch_profiler"

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  eval_dataloader=train_dataloader,
                  max_duration=2,
                  device="gpu",
                  validate_every_n_batches=-1,
                  validate_every_n_epochs=-1,
                  precision="amp",
                  train_subset_num_batches=16,
                  profiler_trace_file=profiler_trace_file,
                  prof_skip_first=0,
                  prof_wait=0,
                  prof_warmup=1,
                  prof_active=4,
                  prof_repeat=1,
                  torch_profiler_trace_dir=torch_trace_dir)

# Run training
trainer.fit()

To enable profiling, specify an output profiler_trace_file when initializing the Trainer. Once the profiling run completes, the profiler_trace_file contains the profiling trace data. By default, the Profiler, DataLoaderProfiler and SystemProfiler are active; the TorchProfiler is disabled.

To activate the TorchProfiler, the torch_profiler_trace_dir must be specified in addition to the profiler_trace_file argument. The torch_profiler_trace_dir will contain the Torch Profiler traces once the profiling run completes. The Profiler will automatically merge the Torch traces in the torch_profiler_trace_dir into the profiler_trace_file, allowing users to view a unified trace.

The complete traces can be viewed in a Google Chrome browser by navigating to chrome://tracing and loading the profiler_trace_file. Here is an example trace file:

Example Profiler Trace File

Additional details can be found in the Profiler Guide.
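
A trace can also be inspected programmatically rather than in the browser. The sketch below assumes the file follows the Chrome Trace Event Format that chrome://tracing loads, with "B"/"E" pairs (or "X" complete events) and timestamps in microseconds. Whether the events sit in a bare JSON array or under a "traceEvents" key is not stated here, so both are accepted. The function name summarize_durations and the sample events are made up for illustration.

```python
import json
from collections import defaultdict


def summarize_durations(trace):
    """Sum the time spent in each named duration event (microseconds)."""
    # Accept either a bare list of events or {"traceEvents": [...]}.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    # Keep only duration-style events; metadata events may lack "ts".
    events = [e for e in events if e.get("ph") in ("B", "E", "X")]
    open_starts = {}
    totals = defaultdict(float)
    for ev in sorted(events, key=lambda e: e.get("ts", 0)):
        key = (ev.get("pid"), ev.get("tid"), ev["name"])
        if ev["ph"] == "B":
            open_starts[key] = ev["ts"]
        elif ev["ph"] == "E" and key in open_starts:
            totals[ev["name"]] += ev["ts"] - open_starts.pop(key)
        elif ev["ph"] == "X":
            totals[ev["name"]] += ev.get("dur", 0)
    return dict(totals)


# Hand-written sample standing in for json.load(open(profiler_trace_file)).
sample = json.loads("""
[
  {"name": "batch", "ph": "B", "ts": 0,    "pid": 1, "tid": 0},
  {"name": "batch", "ph": "E", "ts": 1500, "pid": 1, "tid": 0},
  {"name": "batch", "ph": "B", "ts": 2000, "pid": 1, "tid": 0},
  {"name": "batch", "ph": "E", "ts": 3000, "pid": 1, "tid": 0}
]
""")
summary = summarize_durations(sample)
```

For a real run, replace the sample with the parsed contents of the profiler_trace_file.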

Classes

Marker

Record when something happens or how long something takes.

Profiler

Records the duration of each Trainer Event using the Marker API.

ProfilerAction

Action states for the Profiler that define whether or not events are being recorded to the trace file.

ProfilerEventHandler

Base class for profiler event handlers.

class composer.profiler.Marker(profiler, name, actions, record_instant_on_start, record_instant_on_finish, categories)

Record when something happens or how long something takes.

Used by the Engine to measure the duration of each Event during training.

Note

Marker should not be instantiated directly; instead use Profiler.marker().

Markers can record the following types of events:

  1. Duration: Records the start and stop time of an event of interest (Marker.start(), Marker.finish()).

  2. Instant: Records the time at which a particular event occurs, but not its full duration (Marker.instant()).

  3. Counter: Records the value of a variable at a given time (Marker.counter()).

A Marker can also be used as a context manager or decorator to record a duration:

  1. Use a Marker with a context manager:

    >>> def something_to_measure():
    ...     print("something_to_measure")
    >>> marker = profiler.marker("foo")
    >>> with marker:
    ...     something_to_measure()
    something_to_measure
    
  2. Use a Marker as a decorator:

    >>> marker = profiler.marker("foo")
    >>> @marker
    ... def something_to_measure():
    ...     print("something_to_measure")
    >>> something_to_measure()
    something_to_measure
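
How one object can serve as both a context manager and a decorator is illustrated below with the standard library's contextlib.ContextDecorator. This is only a sketch of the pattern: ToyMarker is a hypothetical stand-in that records durations in a list, and it is not how Composer's Marker is implemented.

```python
import time
from contextlib import ContextDecorator


class ToyMarker(ContextDecorator):
    """Hypothetical stand-in for a marker; stores durations in memory."""

    def __init__(self, name):
        self.name = name
        self.durations_ns = []
        self._start_ns = None

    def start(self):
        self._start_ns = time.perf_counter_ns()

    def finish(self):
        self.durations_ns.append(time.perf_counter_ns() - self._start_ns)
        self._start_ns = None

    # Context-manager protocol: entering starts, exiting finishes.
    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.finish()
        return False  # do not swallow exceptions


marker = ToyMarker("foo")

with marker:  # used as a context manager
    sum(range(1000))


@marker  # used as a decorator; each call is wrapped in "with marker:"
def something_to_measure():
    return sum(range(1000))


result = something_to_measure()
```

After this runs, marker.durations_ns holds two entries: one from the with block and one from the decorated call.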
    
counter(values)

Record a counter event.

To record a counter event:

>>> marker = profiler.marker("foo")
>>> counter_event = 5
>>> marker.counter({"counter_event": counter_event})
>>> counter_event = 10
>>> marker.counter({"counter_event": counter_event})

finish()

Record the end of a duration event.

See Marker.start() for a usage example.

instant()

Record an instant event.

To record an instant event:

>>> def something_to_measure():
...     print("something_to_measure")
>>> marker = profiler.marker("instant")
>>> marker.instant()
>>> something_to_measure()
something_to_measure

start()

Record the start of a duration event.

To record the duration of an event, invoke Marker.start() followed by Marker.finish():

>>> def something_to_measure():
...     print("something_to_measure")
>>> marker = profiler.marker("foo")
>>> marker.start()
>>> something_to_measure()
something_to_measure
>>> marker.finish()

class composer.profiler.Profiler(state, event_handlers=(), skip_first=0, wait=0, warmup=1, active=4, repeat=1, merged_trace_file='merged_profiler_trace.json')

Records the duration of each Trainer Event using the Marker API.

Specifically, it records:

  1. The duration of each section of the training loop, such as the time it takes to perform a forward pass, backward pass, batch, epoch, etc.

  2. The latency that each algorithm and callback adds when executing on each event.

The event_handlers then record and save this data to a trace file. If no event_handlers are specified, the JSONTraceHandler is used by default.

Note

The Composer Trainer creates an instance of Profiler when merged_trace_file is provided. The user should not create and directly register an instance of Profiler when using the Composer Trainer.

Parameters
  • state (State) – The state.

  • event_handlers (Sequence[ProfilerEventHandler]) – Event handlers which record and save profiling data to traces.

  • skip_first (int, optional) – Number of batches to skip profiling at epoch start. Defaults to 0.

  • wait (int, optional) – For each profiling cycle, number of batches to skip at the beginning of the cycle. Defaults to 0.

  • warmup (int, optional) – For each profiling cycle, number of batches to be in the warmup state after skipping wait batches. Defaults to 1.

  • active (int, optional) – For each profiling cycle, number of batches to record after warming up. Defaults to 4.

  • repeat (int, optional) – Number of profiling cycles to perform per epoch. Set to 0 to record the entire epoch. Defaults to 1.

  • merged_trace_file (str, optional) – Name of the trace file, relative to the run directory. Defaults to merged_profiler_trace.json.

property event_handlers

Profiler event handlers.

get_action(batch_idx)

Get the current ProfilerAction for the profiler, based upon the parameters skip_first, wait, warmup, active, and repeat.

The profiler skips the first skip_first batches in every epoch. Then, it performs a cycle of skipping wait batches, warming up for warmup batches, and recording active batches. It repeats this cycle up to repeat times per epoch (or for the entire epoch, if repeat is 0). This logic repeats every epoch.

Parameters

batch_idx (int) – The index of the current batch.

Returns

ProfilerAction – The current action.
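
The scheduling rule described for get_action() can be sketched in plain Python. This is an illustration written from the description above, not Composer's implementation. The Action enum and the function toy_get_action are hypothetical stand-ins for ProfilerAction and Profiler.get_action().

```python
from enum import Enum


class Action(Enum):
    """Hypothetical stand-in for ProfilerAction."""
    SKIP = "skip"
    WARMUP = "warmup"
    ACTIVE = "active"


def toy_get_action(batch_idx, skip_first=0, wait=0, warmup=1, active=4, repeat=1):
    """Return the action for a batch index within the current epoch."""
    # Batches before skip_first are never profiled.
    if batch_idx < skip_first:
        return Action.SKIP
    cycle_len = wait + warmup + active
    if cycle_len == 0:
        return Action.SKIP
    position = batch_idx - skip_first
    # repeat == 0 means the cycle repeats for the whole epoch.
    if repeat != 0 and position >= cycle_len * repeat:
        return Action.SKIP
    in_cycle = position % cycle_len
    if in_cycle < wait:
        return Action.SKIP
    if in_cycle < wait + warmup:
        return Action.WARMUP
    return Action.ACTIVE


# Schedule for the first seven batches of an epoch with the default settings.
schedule = [toy_get_action(i).value for i in range(7)]
```

With the defaults (wait=0, warmup=1, active=4, repeat=1), batch 0 is a warmup batch, batches 1 through 4 are recorded, and later batches in the epoch are skipped.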

marker(name, actions=(<ProfilerAction.WARMUP: 'warmup'>, <ProfilerAction.ACTIVE: 'active'>), record_instant_on_start=False, record_instant_on_finish=False, categories=())

Create and get an instance of a Marker.

If a Marker with the specified name does not already exist, it will be created. Otherwise, the existing instance will be returned.

For example:

>>> marker = profiler.marker("foo")
>>> marker
<composer.profiler.Marker object at ...>

Note

Profiler.marker() should be used to construct markers. Marker should not be instantiated directly by the user.

See Marker.start() and Marker.finish() for examples of measuring duration events, Marker.instant() for marking instant events, and Marker.counter() for recording counter values.

Parameters
  • name (str) – The name for the Marker.

  • actions (Sequence[ProfilerAction], optional) – ProfilerAction states to record on. Defaults to (ProfilerAction.WARMUP, ProfilerAction.ACTIVE).

  • record_instant_on_start (bool, optional) – Whether to record an instant event whenever the marker is started. Defaults to False.

  • record_instant_on_finish (bool, optional) – Whether to record an instant event whenever the marker is finished. Defaults to False.

  • categories (Union[List[str], Tuple[str, ...]], optional) – Categories for this marker. Defaults to ().

Returns

Marker – Instance of Marker.

class composer.profiler.ProfilerAction(value)

Bases: composer.utils.string_enum.StringEnum

Action states for the Profiler that define whether or not events are being recorded to the trace file.

SKIP

Do not record new events to the trace. Any events started during ACTIVE or WARMUP will be recorded upon finish.

WARMUP

Record all events to the trace except those requiring a warmup period to initialize data structures (e.g., torch.profiler).

ACTIVE

Record all events to the trace.
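
One reading of the SKIP rule is that a duration event is closed whenever its start was recorded, even if the profiler has moved to SKIP by the time it finishes. The sketch below follows that reading. ToyDurationMarker and the plain string action names are hypothetical, and Composer's Marker may differ in detail.

```python
class ToyDurationMarker:
    """Hypothetical marker that records only while in selected actions."""

    def __init__(self, record_on=("warmup", "active")):
        self.record_on = set(record_on)
        self.trace = []
        self._started = False

    def start(self, current_action):
        # New events are recorded only in the selected action states.
        if current_action in self.record_on:
            self.trace.append("start")
            self._started = True

    def finish(self, current_action):
        # A recorded start is always closed, regardless of the current action.
        if self._started:
            self.trace.append("finish")
            self._started = False


toy = ToyDurationMarker()
toy.start("active")
toy.finish("skip")   # recorded: the matching start was recorded
toy.start("skip")
toy.finish("skip")   # not recorded: the start was skipped
```

Under this reading the trace never contains a start without a matching finish.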

class composer.profiler.ProfilerEventHandler

Bases: composer.core.callback.Callback, abc.ABC

Base class for profiler event handlers.

Event handlers are responsible for logging trace Markers and saving them to a file in a given trace format for viewing.

Subclasses should implement process_duration_event(), process_instant_event() and process_counter_event(). These methods are invoked by the Profiler whenever there is an event to record.

Since ProfilerEventHandler subclasses Callback, event handlers can run on Events (such as on Event.INIT to open files or on Event.BATCH_END to periodically dump data to files) and use Callback.close() to perform any cleanup.
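
A handler's three methods might be laid out as in the sketch below, which keeps events in memory instead of writing a trace file. To stay runnable without Composer installed, InMemoryHandler is a hypothetical duck-typed class; a real handler would subclass ProfilerEventHandler. The method signatures are copied from this page, and the sketch relies on the documented guarantee that start and end calls for one event name do not interleave.

```python
from collections import defaultdict


class InMemoryHandler:
    """Hypothetical handler; a real one would subclass ProfilerEventHandler."""

    def __init__(self):
        self._open = {}  # (name, global_rank, pid) -> start time in ns
        self.durations_ns = defaultdict(list)
        self.instants = []
        self.counters = []

    def process_duration_event(self, name, categories, is_start, timestamp,
                               wall_clock_time_ns, global_rank, pid):
        key = (name, global_rank, pid)
        if is_start:
            self._open[key] = wall_clock_time_ns
        else:
            # Start and end calls for one name are never interleaved.
            start_ns = self._open.pop(key)
            self.durations_ns[name].append(wall_clock_time_ns - start_ns)

    def process_instant_event(self, name, categories, timestamp,
                              wall_clock_time_ns, global_rank, pid):
        self.instants.append((name, wall_clock_time_ns))

    def process_counter_event(self, name, categories, wall_clock_time_ns,
                              global_rank, pid, values):
        self.counters.append((name, wall_clock_time_ns, dict(values)))


# Drive the handler by hand; timestamp is None because no Trainer is running.
handler = InMemoryHandler()
handler.process_duration_event("forward", (), True, None, 1_000, 0, 123)
handler.process_duration_event("forward", (), False, None, 4_000, 0, 123)
handler.process_instant_event("checkpoint", (), None, 5_000, 0, 123)
handler.process_counter_event("memory", (), 6_000, 0, 123, {"used_mb": 512})
```

In a training run the Profiler makes these calls itself and passes a Timestamp object for timestamp.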

process_counter_event(name, categories, wall_clock_time_ns, global_rank, pid, values)

Called by the Profiler whenever there is a counter event to record.

Parameters
  • name (str) – The name of the event.

  • categories (List[str] | Tuple[str, ...]) – The categories for the event.

  • wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.

  • global_rank (int) – The global_rank corresponding to the event.

  • pid (int) – The pid corresponding to the event.

  • values (Dict[str, int | float]) – The values corresponding to this counter event.

process_duration_event(name, categories, is_start, timestamp, wall_clock_time_ns, global_rank, pid)

Called by the Profiler whenever there is a duration event to record.

This method is called twice for each duration event: once with is_start = True, and then again with is_start = False. Interleaving events are not permitted. Specifically, for each event (identified by its name), a call with is_start = True will be followed by a call with is_start = False before another call with is_start = True.

Parameters
  • name (str) – The name of the event.

  • categories (Union[List[str], Tuple[str, ...]]) – The categories for the event.

  • is_start (bool) – Whether the event is a start event or end event.

  • timestamp (Timestamp) – Snapshot of the training time.

  • wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.

  • global_rank (int) – The global_rank corresponding to the event.

  • pid (int) – The pid corresponding to the event.

process_instant_event(name, categories, timestamp, wall_clock_time_ns, global_rank, pid)

Called by the Profiler whenever there is an instant event to record.

Parameters
  • name (str) – The name of the event.

  • categories (List[str] | Tuple[str, ...]) – The categories for the event.

  • timestamp (Timestamp) – Snapshot of current training time.

  • wall_clock_time_ns (int) – The time.time_ns() corresponding to the event.

  • global_rank (int) – The global_rank corresponding to the event.

  • pid (int) – The pid corresponding to the event.