synthetic_lm#

Synthetic language modeling datasets used for testing, profiling, and debugging.

Functions

generate_synthetic_tokenizer

Generates a synthetic tokenizer based on a tokenizer family.

synthetic_hf_dataset_builder

Creates a synthetic Dataset and passes it to the preprocessing scripts.

Classes

class composer.datasets.synthetic_lm.SyntheticTokenizerParams(tokenizer_model, normalizer, pre_tokenizer, decoder, initial_alphabet, special_tokens, pad_token, trainer_cls, tokenizer_cls)[source]#

Bases: tuple

A NamedTuple bundling the parameters used to construct a synthetic tokenizer (model, normalizer, pre-tokenizer, decoder, alphabet, special tokens, and trainer/tokenizer classes).

composer.datasets.synthetic_lm.generate_synthetic_tokenizer(tokenizer_family, dataset=None, vocab_size=256)[source]#

Generates a synthetic tokenizer based on a tokenizer family.

Parameters
  • tokenizer_family (str) – Which tokenizer family to emulate. One of ['gpt2', 'bert'].

  • dataset (Optional[Dataset]) – Optionally, the dataset to train the tokenizer on. If None, a SyntheticHFDataset will be generated. Default: None.

  • vocab_size (int) – The size of the tokenizer vocabulary. Default: 256.
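To make the vocab_size cap concrete, here is a stdlib-only sketch of the kind of work involved: training a toy character-level vocabulary on random synthetic text, truncated at vocab_size. This is an illustration only, not composer's implementation; `build_toy_vocab` and the random-text generation are hypothetical stand-ins, and the real function trains a Hugging Face tokenizer for the chosen family.

```python
import random
import string

def build_toy_vocab(samples, vocab_size=256):
    """Illustrative only: collect unique characters from the samples,
    in first-seen order, capped at vocab_size entries."""
    vocab = {}
    for text in samples:
        for ch in text:
            if ch not in vocab and len(vocab) < vocab_size:
                vocab[ch] = len(vocab)
    return vocab

# Hypothetical synthetic "dataset": random lowercase text samples.
random.seed(5)
alphabet = string.ascii_lowercase + ' '
samples = [''.join(random.choices(alphabet, k=64)) for _ in range(10)]

vocab = build_toy_vocab(samples, vocab_size=256)
```

Because the alphabet here has only 27 characters, the resulting vocabulary is well under the 256-entry cap; with richer input text the cap is what bounds vocabulary growth.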

composer.datasets.synthetic_lm.synthetic_hf_dataset_builder(num_samples, chars_per_sample, column_names, seed=5)[source]#

Creates a synthetic Dataset and passes it to the preprocessing scripts.

Parameters
  • num_samples (int) – How many samples to use in the synthetic dataset.

  • chars_per_sample (int) – How many characters each synthetic text sample should be.

  • column_names (list) – The column names that the dataset should use.

Returns

datasets.Dataset – The synthetic HF Dataset object.
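The shape of the returned data can be sketched with the standard library alone. The helper below is a hypothetical stand-in (not composer's implementation) that builds a dict-of-columns of random lowercase text, mirroring the columnar layout an HF Dataset exposes; the real function returns a datasets.Dataset rather than a plain dict.

```python
import random
import string

def toy_synthetic_dataset(num_samples, chars_per_sample, column_names, seed=5):
    """Illustrative stand-in: one list of random text samples per column,
    seeded for reproducibility."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + ' '
    return {
        col: [''.join(rng.choices(alphabet, k=chars_per_sample))
              for _ in range(num_samples)]
        for col in column_names
    }

data = toy_synthetic_dataset(num_samples=4, chars_per_sample=16,
                             column_names=['sentence'])
```

Seeding a local random.Random instance (rather than the global generator) keeps repeated calls with the same seed deterministic without disturbing other code's randomness.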