c4_hparams#
C4 (Colossal Cleaned Common Crawl) dataset hyperparameters.
Hparams
These classes are used with yahp for YAML-based configuration.
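The YAML-to-hparams idea can be illustrated with a plain dataclass. This is a generic standard-library sketch, not yahp's actual API; `C4Config` and `from_mapping` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass, fields

# Hypothetical stand-in for an hparams class; not yahp's actual API.
@dataclass
class C4Config:
    split: str = 'train'
    max_seq_len: int = 512
    mlm: bool = False
    mlm_probability: float = 0.15

def from_mapping(cls, mapping):
    """Build a config from a dict (e.g. parsed from YAML), rejecting unknown keys."""
    valid = {f.name for f in fields(cls)}
    unknown = set(mapping) - valid
    if unknown:
        raise ValueError(f'unknown hyperparameters: {sorted(unknown)}')
    return cls(**mapping)

# A dict standing in for a parsed YAML document; unset keys keep their defaults.
cfg = from_mapping(C4Config, {'split': 'validation', 'mlm': True})
```

yahp additionally supports CLI overrides and nested hparams; this sketch only shows the core mapping from a configuration document to typed fields with defaults.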
- class composer.datasets.c4_hparams.C4DatasetHparams(drop_last=True, shuffle=True, split=None, num_samples=None, tokenizer_name=None, max_seq_len=None, group_method=None, mlm=False, mlm_probability=0.15, shuffle_buffer_size=10000, seed=5)[source]#
Bases:
composer.datasets.dataset_hparams.DatasetHparams
Builds a DataSpec for the C4 (Colossal Cleaned Common Crawl) dataset.
- Parameters
  - split (str) – What split of the dataset to use. Either 'train' or 'validation'. Default: None.
  - num_samples (int) – The number of post-processed token samples, used to set the epoch size of the torch.utils.data.IterableDataset. Default: None.
  - tokenizer_name (str) – The name of the HuggingFace tokenizer used to preprocess the text. Default: None.
  - max_seq_len (int) – The maximum sequence length of each token sample. Default: None.
  - group_method (str) – How to group text samples into token samples. Either 'truncate' or 'concat'. Default: None.
  - mlm (bool) – Whether or not to use masked language modeling. Default: False.
  - mlm_probability (float) – If mlm==True, the probability that tokens are masked. Default: 0.15.
  - shuffle (bool) – Whether to shuffle the samples in the dataset. Shards are assigned and consumed in a deterministic per-device order, but shuffling changes the order of samples via per-device shuffle buffers. Default: True.
  - shuffle_buffer_size (int) – If shuffle=True, samples are read into a buffer of this size (per device) and drawn from it at random to produce shuffled samples. Default: 10000.
  - seed (int) – If shuffle=True, the seed used for shuffling operations. Default: 5.
  - drop_last (bool) – Whether to drop samples that would form an incomplete final batch. Default: True.
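As an aside, the masking controlled by mlm_probability can be sketched as independent per-token random masking. This is a minimal illustration, not Composer's or HuggingFace's implementation; `mask_tokens` and `MASK_ID` are hypothetical names:

```python
import random

MASK_ID = 103  # hypothetical mask-token id (e.g. BERT's [MASK] id)

def mask_tokens(token_ids, mlm_probability=0.15, seed=5):
    """Replace each token with MASK_ID independently with probability
    mlm_probability; labels keep the original token at masked positions
    and the conventional ignore value (-100) elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            masked.append(MASK_ID)
            labels.append(tok)    # model must predict the original token here
        else:
            masked.append(tok)
            labels.append(-100)   # position is ignored by the MLM loss
    return masked, labels

masked, labels = mask_tokens(list(range(1000, 1020)))
```

Real collators also apply the 80/10/10 mask/random/keep split from BERT; the sketch above shows only the probability-driven selection of positions.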
- Returns
  DataLoader – A PyTorch DataLoader object.
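The per-device shuffle buffer described above can be sketched as follows. This is a self-contained illustration of the buffered-shuffle technique, with hypothetical names, not Composer's implementation:

```python
import random

def buffered_shuffle(samples, buffer_size, seed):
    """Yield samples in approximately shuffled order using a fixed-size buffer.

    Samples are read from the (possibly infinite) stream into a buffer of
    `buffer_size`; each output is drawn uniformly at random from the buffer,
    and its slot is refilled from the stream. The tail of the buffer is
    shuffled and flushed when the stream ends.
    """
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        if len(buffer) < buffer_size:
            buffer.append(sample)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = sample
    rng.shuffle(buffer)
    yield from buffer
```

Note the trade-off this exposes: a larger buffer gives a more uniform shuffle at the cost of memory, which is why shuffle_buffer_size is a tunable hyperparameter.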
- class composer.datasets.c4_hparams.StreamingC4Hparams(drop_last=True, shuffle=True, remote='s3://mosaicml-internal-dataset-c4/mds/1/', local='/tmp/mds-cache/mds-c4/', split='train', tokenizer_name='bert-base-uncased', max_seq_len=512, group_method='truncate', mlm=False, mlm_probability=0.15, max_retries=2, timeout=120)[source]#
Bases:
composer.datasets.dataset_hparams.DatasetHparams
Builds a DataSpec for the StreamingC4 (Colossal Cleaned Common Crawl) dataset.
- Parameters
  - remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored. Default: 's3://mosaicml-internal-dataset-c4/mds/1/'.
  - local (str) – Local filesystem directory where the dataset is cached during operation. Default: '/tmp/mds-cache/mds-c4/'.
  - split (str) – What split of the dataset to use. Either 'train' or 'val'. Default: 'train'.
  - tokenizer_name (str) – The name of the HuggingFace tokenizer used to preprocess the text. Default: 'bert-base-uncased'.
  - max_seq_len (int) – The maximum sequence length of each token sample. Default: 512.
  - group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported.
  - mlm (bool) – Whether or not to use masked language modeling. Default: False.
  - mlm_probability (float) – If mlm==True, the probability that tokens are masked. Default: 0.15.
  - max_retries (int) – Number of download re-attempts before giving up. Default: 2.
  - timeout (float) – How long to wait for a shard to download before raising an exception. Default: 120 seconds.
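The retry behavior that max_retries and timeout describe can be sketched as below. `download_with_retries` and `fetch` are hypothetical names for illustration; this is not Composer's streaming implementation:

```python
def download_with_retries(fetch, max_retries=2, timeout=120.0):
    """Call `fetch(timeout)` until it succeeds, retrying up to `max_retries`
    additional times; raise after the final failure."""
    last_exc = None
    for _ in range(1 + max_retries):
        try:
            return fetch(timeout)
        except Exception as exc:
            last_exc = exc
    raise RuntimeError(
        f'shard download failed after {1 + max_retries} attempts') from last_exc

# A flaky stub standing in for a real shard download: fails twice, then succeeds.
attempts = []
def flaky_fetch(timeout):
    attempts.append(timeout)
    if len(attempts) < 3:
        raise ConnectionError('simulated network failure')
    return b'shard-bytes'

shard = download_with_retries(flaky_fetch, max_retries=2, timeout=120.0)
```

With max_retries=2, a shard download is attempted at most three times in total, each attempt bounded by the timeout, before the exception propagates.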