📊 Evaluation#

To track training progress, validation datasets can be provided to the Composer Trainer through the eval_dataloader parameter. The trainer will compute evaluation metrics on the evaluation dataset at a frequency specified by the Trainer parameter eval_interval.

from composer import Trainer

trainer = Trainer(
    ...,
    eval_dataloader=my_eval_dataloader,
    eval_interval="1ep",  # Default is every epoch
)

The metrics should be provided by ComposerModel.get_metrics(). For more information, see the “Metrics” section in 🛻 ComposerModel.

To provide a deeper intuition, here’s pseudocode for the evaluation logic that occurs every eval_interval:

metrics = model.get_metrics(train=False)

for batch in eval_dataloader:
    outputs, targets = model.eval_forward(batch)
    metrics.update(outputs, targets)  # implements the torchmetrics interface

metrics.compute()
  • The trainer iterates over eval_dataloader and passes each batch to the model’s ComposerModel.eval_forward() method.

  • Outputs of model.eval_forward are used to update the metrics (a torchmetrics.Metric returned by ComposerModel.get_metrics()).

  • Finally, metrics over the whole validation dataset are computed.

Note that the tuple returned by ComposerModel.eval_forward() provides the positional arguments to metric.update(). Please keep this in mind when using custom models and/or metrics.
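For example, here is a minimal sketch of a custom model wired up this way. The class, its metric, and the exact method signatures are illustrative assumptions that follow the pseudocode above rather than a definitive implementation; the key point is that the tuple returned by eval_forward must match the metric’s update() signature.

import torch
import torch.nn.functional as F
from torchmetrics.classification import MulticlassAccuracy

from composer.models import ComposerModel


class MyClassifier(ComposerModel):
    """Hypothetical model used to illustrate the eval_forward/metric contract."""

    def __init__(self, module: torch.nn.Module, num_classes: int):
        super().__init__()
        self.module = module
        self.val_accuracy = MulticlassAccuracy(num_classes=num_classes)

    def forward(self, batch):
        inputs, _ = batch
        return self.module(inputs)

    def loss(self, outputs, batch):
        _, targets = batch
        return F.cross_entropy(outputs, targets)

    def eval_forward(self, batch):
        inputs, targets = batch
        outputs = self.module(inputs)
        # The returned tuple becomes the positional arguments to metric.update,
        # i.e. metric.update(outputs, targets), matching the pseudocode above.
        return outputs, targets

    def get_metrics(self, train=False):
        # Metrics are keyed by name here (an assumption consistent with the
        # Evaluator.metric_names lookup described in the next section).
        return {'MulticlassAccuracy': self.val_accuracy}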

Multiple Datasets#

If there are multiple validation datasets that may have different metrics, use Evaluator to specify each pair of dataloader and metrics. This class is just a container for a few attributes:

  • label: a user-specified name for the evaluator.

  • dataloader: PyTorch DataLoader or our DataSpec.

    See DataLoaders for more details.

  • metric_names: list of names of metrics to track.

For example, the GLUE tasks for language models can be specified as follows:

from composer.core import Evaluator

glue_mrpc_task = Evaluator(
    label='glue_mrpc',
    dataloader=mrpc_dataloader,
    metric_names=['BinaryF1Score', 'MulticlassAccuracy']
)

glue_mnli_task = Evaluator(
    label='glue_mnli',
    dataloader=mnli_dataloader,
    metric_names=['MulticlassAccuracy']
)

trainer = Trainer(
    ...,
    eval_dataloader=[glue_mrpc_task, glue_mnli_task],
    ...
)

Note that metric_names must be a subset of the metrics provided by the model via ComposerModel.get_metrics().
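For instance, a model meant to back the two Evaluators above would need to expose metrics under exactly those names. The sketch below is hypothetical: the class name, module, and num_classes value are placeholders.

from composer.models import ComposerModel
from torchmetrics.classification import BinaryF1Score, MulticlassAccuracy


class MyGLUEModel(ComposerModel):
    def __init__(self, module):
        super().__init__()
        self.module = module
        # Instantiate metrics once so their state lives with the model.
        self.f1 = BinaryF1Score()
        self.accuracy = MulticlassAccuracy(num_classes=3)  # placeholder label count

    # forward, loss, and eval_forward are omitted here for brevity; see the
    # classifier sketch earlier in this section.

    def get_metrics(self, train=False):
        # Each Evaluator's metric_names may only reference these keys.
        return {'BinaryF1Score': self.f1, 'MulticlassAccuracy': self.accuracy}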