synthetic_lm#

Synthetic language modeling datasets used for testing, profiling, and debugging.

Functions

generate_synthetic_tokenizer

Generates a synthetic tokenizer based on a tokenizer family.

synthetic_hf_dataset_builder

Creates a synthetic Dataset and passes it to the preprocessing scripts.

Classes

class composer.datasets.synthetic_lm.SyntheticTokenizerParams(tokenizer_model, normalizer, pre_tokenizer, decoder, initial_alphabet, special_tokens, pad_token, trainer_cls, tokenizer_cls)[source]#

Bases: tuple

A NamedTuple bundling the parameters used to construct a synthetic tokenizer (model, normalizer, pre-tokenizer, decoder, alphabet, special tokens, and trainer/tokenizer classes).

composer.datasets.synthetic_lm.generate_synthetic_tokenizer(tokenizer_family, dataset=None, vocab_size=256)[source]#

Generates a synthetic tokenizer based on a tokenizer family.

Parameters
  • tokenizer_family (str) – Which tokenizer family to emulate. One of ['gpt2', 'bert'].

  • dataset (Optional[Dataset]) – Optionally, the dataset to train the tokenizer on. If None, a SyntheticHFDataset will be generated. Default: None.

  • vocab_size (int) – The size of the tokenizer vocabulary. Default: 256.
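To make the vocab_size cap concrete, here is a stdlib-only sketch of the kind of work involved: training a toy character-level vocabulary on random synthetic text, truncated at vocab_size. This is an illustration only, not composer's implementation; `build_toy_vocab` and the random-text generation are hypothetical stand-ins, and the real function trains a Hugging Face tokenizer for the chosen family.

```python
import random
import string

def build_toy_vocab(samples, vocab_size=256):
    """Illustrative only: collect unique characters from the samples,
    in first-seen order, capped at vocab_size entries."""
    vocab = {}
    for text in samples:
        for ch in text:
            if ch not in vocab and len(vocab) < vocab_size:
                vocab[ch] = len(vocab)
    return vocab

# Hypothetical synthetic "dataset": random lowercase text samples.
random.seed(5)
alphabet = string.ascii_lowercase + ' '
samples = [''.join(random.choices(alphabet, k=64)) for _ in range(10)]

vocab = build_toy_vocab(samples, vocab_size=256)
```

Because the alphabet here has only 27 characters, the resulting vocabulary is well under the 256-entry cap; with richer input text the cap is what bounds vocabulary growth.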

composer.datasets.synthetic_lm.synthetic_hf_dataset_builder(num_samples, chars_per_sample, column_names, seed=5)[source]#

Creates a synthetic Dataset and passes it to the preprocessing scripts.

Parameters
  • num_samples (int) – How many samples to use in the synthetic dataset.

  • chars_per_sample (int) – How many characters each synthetic text sample should be.

  • column_names (list) – The column names that the dataset should use.

Returns

datasets.Dataset – The synthetic HF Dataset object.
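The shape of the returned data can be sketched with the standard library alone. The helper below is a hypothetical stand-in (not composer's implementation) that builds a dict-of-columns of random lowercase text, mirroring the columnar layout an HF Dataset exposes; the real function returns a datasets.Dataset rather than a plain dict.

```python
import random
import string

def toy_synthetic_dataset(num_samples, chars_per_sample, column_names, seed=5):
    """Illustrative stand-in: one list of random text samples per column,
    seeded for reproducibility."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + ' '
    return {
        col: [''.join(rng.choices(alphabet, k=chars_per_sample))
              for _ in range(num_samples)]
        for col in column_names
    }

data = toy_synthetic_dataset(num_samples=4, chars_per_sample=16,
                             column_names=['sentence'])
```

Seeding a local random.Random instance (rather than the global generator) keeps repeated calls with the same seed deterministic without disturbing other code's randomness.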