metadata
license: mit
language:
- en
tags:
- babylm
- tokenizer
datasets:
- nilq/babylm-100M
Baby Tokenizer (Uncased)
A compact, uncased SentencePiece tokenizer for sample-efficient English language modeling, intended for plain natural-language text.
Usage
Transformers
from transformers import AutoTokenizer
tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")
Tokenizers
from tokenizers import Tokenizer
tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
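Once loaded, the tokenizer can be applied to text. The following is a minimal sketch using the `tokenizers` API; the helper names `load_baby_tokenizer` and `tokens_and_ids` are hypothetical and introduced here only for illustration. Loading requires network access to the Hugging Face Hub.

```python
from tokenizers import Tokenizer


def load_baby_tokenizer() -> Tokenizer:
    # Downloads the tokenizer definition from the Hugging Face Hub
    # (network access required).
    return Tokenizer.from_pretrained("nilq/baby-tokenizer")


def tokens_and_ids(tokenizer: Tokenizer, text: str):
    # Hypothetical helper: encode a string and return both the
    # token strings and their integer ids.
    encoding = tokenizer.encode(text)
    return encoding.tokens, encoding.ids
```

For example, `tokens_and_ids(load_baby_tokenizer(), "the cat sat on the mat")` returns the token pieces and ids for that sentence. The exact pieces depend on the published vocabulary.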
Data
This tokenizer is derived from the BabyLM 100M dataset, a mixed-domain corpus drawn from the following sources:
- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- Children's books (simple written language)
Specifications
- Vocabulary size: 20k
- Alphabet limit: 150
- Minimum token frequency: 100
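To show how these settings could map onto a training configuration, here is a minimal sketch using the `tokenizers` library. It is not the original training script. It assumes a BPE model with SentencePiece-style `▁` word markers, assumes that "uncased" means lowercasing during normalization, and assumes that the alphabet limit and minimum token frequency correspond to the `limit_alphabet` and `min_frequency` parameters of `BpeTrainer`. The function name `train_baby_style_tokenizer` and the toy corpus are hypothetical.

```python
from typing import Iterable

from tokenizers import (
    Tokenizer,
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    trainers,
)


def train_baby_style_tokenizer(
    corpus: Iterable[str],
    vocab_size: int = 20_000,    # "Vocabulary size: 20k"
    limit_alphabet: int = 150,   # "Alphabet limit: 150"
    min_frequency: int = 100,    # "Minimum token frequency: 100"
) -> Tokenizer:
    # Assumption: BPE model with an explicit unknown token.
    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
    # Assumption: "uncased" is implemented by lowercasing the input.
    tokenizer.normalizer = normalizers.Sequence(
        [normalizers.NFKC(), normalizers.Lowercase()]
    )
    # SentencePiece-style handling of spaces via the "▁" marker.
    tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
    tokenizer.decoder = decoders.Metaspace()

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        limit_alphabet=limit_alphabet,
        special_tokens=["<unk>"],
        show_progress=False,
    )
    tokenizer.train_from_iterator(corpus, trainer=trainer)
    return tokenizer


# Toy demonstration with much smaller settings than the model card lists,
# since the corpus here is only two repeated sentences.
toy_corpus = ["The cat sat on the mat.", "The dog sat on the log."] * 50
toy_tokenizer = train_baby_style_tokenizer(
    toy_corpus, vocab_size=60, min_frequency=2
)
print(toy_tokenizer.encode("The cat sat.").tokens)
```

With the defaults left in place and the BabyLM 100M text passed as `corpus`, this would train a tokenizer with the specifications listed above, under the stated assumptions.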