composer.datasets.lm_datasets
Generic dataset class for self-supervised training of autoregressive and masked language models.
Hparams
These classes are used with yahp for YAML-based configuration.
LMDatasetHparams | Defines a generic dataset class for self-supervised training of autoregressive and masked language models.
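As a rough illustration of what YAML-based configuration amounts to, the sketch below uses a hypothetical stand-in dataclass, `LMDatasetConfig`, that copies a few field names and defaults from the `LMDatasetHparams` signature on this page. It is not the yahp API and does not import Composer. The `from_mapping` helper, the path, and the tokenizer name are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Any, List, Mapping, Optional


@dataclass
class LMDatasetConfig:
    """Hypothetical stand-in mirroring some LMDatasetHparams fields and defaults."""

    datadir: List[str] = field(default_factory=list)
    split: Optional[str] = None
    tokenizer_name: Optional[str] = None
    use_masked_lm: Optional[bool] = None
    num_tokens: int = 0
    mlm_probability: float = 0.15
    seed: int = 5
    subsample_ratio: float = 1.0
    max_seq_length: int = 1024

    @classmethod
    def from_mapping(cls, data: Mapping[str, Any]) -> "LMDatasetConfig":
        """Build a config from a mapping, such as one parsed from a YAML file."""
        known = {f.name for f in fields(cls)}
        unknown = set(data) - known
        if unknown:
            # Reject misspelled keys instead of silently ignoring them.
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        return cls(**data)


# The mapping a YAML loader might produce for a masked-LM validation set.
# The path and tokenizer name are illustrative values only.
parsed = {
    "datadir": ["/path/to/dataset"],
    "split": "validation",
    "tokenizer_name": "bert-base-uncased",
    "use_masked_lm": True,
}
config = LMDatasetConfig.from_mapping(parsed)
print(config)
```

Keys omitted from the mapping fall back to the defaults in the signature, which is why a short configuration file can still describe a complete dataset setup.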
- class composer.datasets.lm_datasets.LMDatasetHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=<factory>, split=None, tokenizer_name=None, use_masked_lm=None, num_tokens=0, mlm_probability=0.15, seed=5, subsample_ratio=1.0, max_seq_length=1024)
Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin
Defines a generic dataset class for self-supervised training of autoregressive and masked language models.
- Parameters
  - use_synthetic (bool, optional): Whether to use synthetic data. Default: False.
  - synthetic_num_unique_samples (int, optional): The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.
  - synthetic_device (str, optional): The device to store the sample pool on. Set to 'cuda' to store samples on the GPU and avoid PCI-e transfers by the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.
  - synthetic_memory_format: The MemoryFormat to use. Ignored if use_synthetic is False. Default: 'CONTIGUOUS_FORMAT'.
  - is_train (bool): Whether to load the training data or validation data. Default: True.
  - drop_last (bool): Whether to drop the last batch (True) or pad it with zeros (False) when the number of samples is not divisible by the batch size. Default: True.
  - shuffle (bool): Whether to shuffle the dataset. Default: True.
  - datadir (list): A list containing the path, as a string, to the HuggingFace Datasets directory.
  - split (str): Which split to use: 'train', 'test', or 'validation'.
  - tokenizer_name (str): The name of the HuggingFace tokenizer to preprocess text with. See HuggingFace documentation.
  - use_masked_lm (bool): Whether the dataset should be encoded with masked language modeling or not.
  - num_tokens (int, optional): Number of tokens to train on. 0 will train on all tokens in the dataset. Default: 0.
  - mlm_probability (float, optional): If using masked language modeling, the probability with which tokens will be masked. Default: 0.15.
  - seed (int, optional): Random seed for generating train and validation splits. Default: 5.
  - subsample_ratio (float, optional): Proportion of the dataset to use. Default: 1.0.
  - train_sequence_length (int, optional): Sequence length for training dataset. Default: 1024.
  - val_sequence_length (int, optional): Sequence length for validation dataset. Default: 1024.
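To make the role of mlm_probability concrete, here is a minimal sketch of token masking using only the Python standard library. It is a simplified illustration, not the code Composer or HuggingFace uses: production collators typically do more, such as leaving some selected tokens unchanged. The function name `mask_tokens`, its `rng_seed` argument, and the mask token id in the usage lines are hypothetical.

```python
import random
from typing import List, Tuple

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def mask_tokens(
    token_ids: List[int],
    mask_token_id: int,
    mlm_probability: float = 0.15,
    rng_seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Independently mask each token with probability `mlm_probability`.

    Returns (inputs, labels). Masked positions hold `mask_token_id` in
    `inputs` and the original token in `labels`; every other position keeps
    its token in `inputs` and gets IGNORE_INDEX in `labels`.
    """
    rng = random.Random(rng_seed)  # local generator, so results are repeatable
    inputs: List[int] = []
    labels: List[int] = []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            inputs.append(mask_token_id)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)
    return inputs, labels


# Usage: the token ids and mask id below are arbitrary illustrative values.
tokens = list(range(1000, 1020))
inputs, labels = mask_tokens(tokens, mask_token_id=103, mlm_probability=0.15)
print(inputs)
print(labels)
```

With this scheme, roughly mlm_probability of the positions contribute to the loss on average, which is why the value is usually kept small.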
- initialize_object(batch_size, dataloader_hparams)
Creates a DataLoader or DataSpec for this dataset.
- Parameters
  - batch_size (int): The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
  - dataloader_hparams (DataLoaderHparams): The dataset-independent hparams for the dataloader.
- Returns
  Iterable | DataSpec: An iterable that yields batches, or, if the dataset yields batches that need custom processing, a DataSpec.
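The batch-size arithmetic behind batch_size and drop_last can be sketched as follows. This assumes the global batch size is split evenly across devices, which is one reading of "device-specific and already incorporates the world size". Both helper functions are hypothetical and are not part of Composer.

```python
import math


def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch evenly across `world_size` devices."""
    if global_batch_size % world_size != 0:
        raise ValueError("global_batch_size must be divisible by world_size")
    return global_batch_size // world_size


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Count the batches one pass over `num_samples` samples yields."""
    if drop_last:
        # The final incomplete batch is discarded.
        return num_samples // batch_size
    # The final incomplete batch is kept (for this dataset, padded).
    return math.ceil(num_samples / batch_size)


batch_size = per_device_batch_size(global_batch_size=2048, world_size=8)
print(batch_size)                                      # 256
print(num_batches(10_000, batch_size, drop_last=True))   # 39
print(num_batches(10_000, batch_size, drop_last=False))  # 40
```

In this example, 10,000 samples at 256 per batch leave 16 samples over, so drop_last=True discards that remainder while drop_last=False keeps it as a final, 40th batch.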