Abstract
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP-controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long-tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models by simultaneously growing both patch and model size.
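The patching rule is simple to sketch. Below is a minimal Python illustration of entropy-threshold patching: a small byte-level model scores the next-byte distribution, and a new patch opens wherever its entropy crosses a threshold. The `next_byte_probs` interface, the toy probability model, and the 2-bit threshold are assumptions for illustration, not the paper's actual entropy model or tuned threshold.

```python
import math
from typing import Callable, List

def entropy_patches(
    data: bytes,
    next_byte_probs: Callable[[bytes], List[float]],  # prefix -> 256-way distribution
    threshold_bits: float = 2.0,                       # illustrative threshold
) -> List[bytes]:
    """Open a new patch wherever the entropy of the predicted next-byte
    distribution exceeds the threshold, so hard-to-predict regions receive
    more (shorter) patches and thus more latent-transformer steps."""
    patches, start = [], 0
    for i in range(1, len(data)):
        probs = next_byte_probs(data[:i])
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        if entropy > threshold_bits:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

def toy_probs(prefix: bytes) -> List[float]:
    # Toy stand-in for the small byte LM: pretend it is confident inside a
    # word and uncertain right after whitespace (hypothetical behaviour).
    if prefix.endswith(b" "):
        return [1 / 256] * 256                 # uniform -> 8 bits of entropy
    p = [0.0001] * 256
    p[ord("x")] = 1 - 0.0001 * 255             # peaked -> ~0.4 bits of entropy
    return p

print(entropy_patches(b"hello world of bytes", toy_probs))
# [b'hello ', b'world ', b'of ', b'bytes']
```

With this toy model, entropy spikes right after whitespace, so patches end up aligned to word-like spans; with a uniform model every position would cross the threshold and patches would collapse to single bytes.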
Community
Introducing the Byte Latent Transformer (BLT) – An LLM architecture that scales better than Llama 3 using byte patches instead of tokens.
BLT encodes bytes into dynamic patches using lightweight local models and processes them with a large latent transformer.
Entropy patching dynamically adjusts patch sizes based on data complexity, allowing BLT to allocate more compute to hard predictions and use larger patches for simpler ones. The result is fewer, larger processing steps to cover the same data (roughly quantified in the sketch below).
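A back-of-the-envelope count of those steps, with purely illustrative numbers rather than measurements from the paper:

```python
import math

# The latent transformer runs once per patch, so the number of expensive
# forward steps over a document scales as n_bytes / average_patch_size.
def latent_steps(n_bytes: int, avg_patch_size: float) -> int:
    return math.ceil(n_bytes / avg_patch_size)

n_bytes = 1_000_000
for avg in (4.0, 6.0, 8.0):  # illustrative average patch sizes, in bytes
    print(f"avg patch {avg} bytes -> {latent_steps(n_bytes, avg):,} latent steps")
```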
BLT unlocks a new scaling dimension by simultaneously growing patch and model size without changing training or inference cost. Patch length scaling quickly overtakes BPE transformer scaling, and the trends look even better at larger scales!
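A rough FLOP-matching sketch of why that works, assuming the common approximation of about 2 × (parameter count) FLOPs per forward step per processed unit and ignoring the lightweight local encoder/decoder (which the paper does account for); the model and patch sizes here are illustrative:

```python
# Hypothetical FLOP-matching sketch: at a fixed FLOPs-per-byte budget for the
# latent transformer, the model can grow in proportion to the average patch
# size (local encoder/decoder costs are ignored in this toy calculation).
def latent_flops_per_byte(n_params: float, avg_patch_size: float) -> float:
    return 2 * n_params / avg_patch_size  # ~2*N FLOPs per patch, amortised over its bytes

budget = latent_flops_per_byte(8e9, 4.0)  # reference: an 8B model on 4-byte patches
for patch in (4.0, 6.0, 8.0):
    max_params = budget * patch / 2       # largest model within the same budget
    print(f"patch {patch} bytes -> up to ~{max_params / 1e9:.0f}B latent params")
```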
Parameter-matched training runs up to 8B parameters and 4T bytes show that BLT performs well on standard benchmarks and can trade minor losses in evaluation metrics for up to 50% reductions in inference FLOPs.
Credit: https://x.com/garrethleee/status/1868702376754135154
Amazing work - I am especially interested in follow-ups on fine-tuning the entropy model, since the robustness probably depends quite heavily on it. Or am I overestimating that?
Incredible work. I wonder if we could add another layer, a patch of patches... and use that for fine-tuning.
Here is an in-depth explanation of this paper: https://ajithp.com/2024/12/15/metas-byte-latent-transformer-revolutionizing-natural-language-processing-with-dynamic-patching/
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Star Attention: Efficient LLM Inference over Long Sequences (2024)
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models (2024)
- p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay (2024)
- Attamba: Attending To Multi-Token States (2024)
- LBPE: Long-token-first Tokenization to Improve Large Language Models (2024)
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction (2024)
- Retrofitting Large Language Models with Dynamic Tokenization (2024)