Abstract
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP-controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long-tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models by simultaneously growing both patch and model size.
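The patching rule is simple to sketch. Below is a minimal Python illustration of entropy-threshold patching: a small byte-level model scores the next-byte distribution, and a new patch opens wherever its entropy crosses a threshold. The `next_byte_probs` interface, the toy probability model, and the 2-bit threshold are assumptions for illustration, not the paper's actual entropy model or tuned threshold.

```python
import math
from typing import Callable, List

def entropy_patches(
    data: bytes,
    next_byte_probs: Callable[[bytes], List[float]],  # prefix -> 256-way distribution
    threshold_bits: float = 2.0,                       # illustrative threshold
) -> List[bytes]:
    """Open a new patch wherever the entropy of the predicted next-byte
    distribution exceeds the threshold, so hard-to-predict regions receive
    more (shorter) patches and thus more latent-transformer steps."""
    patches, start = [], 0
    for i in range(1, len(data)):
        probs = next_byte_probs(data[:i])
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        if entropy > threshold_bits:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

def toy_probs(prefix: bytes) -> List[float]:
    # Toy stand-in for the small byte LM: pretend it is confident inside a
    # word and uncertain right after whitespace (hypothetical behaviour).
    if prefix.endswith(b" "):
        return [1 / 256] * 256                 # uniform -> 8 bits of entropy
    p = [0.0001] * 256
    p[ord("x")] = 1 - 0.0001 * 255             # peaked -> ~0.4 bits of entropy
    return p

print(entropy_patches(b"hello world of bytes", toy_probs))
# [b'hello ', b'world ', b'of ', b'bytes']
```

With this toy model, entropy spikes right after whitespace, so patches end up aligned to word-like spans; with a uniform model every position would cross the threshold and patches would collapse to single bytes.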
Community
Introducing the Byte Latent Transformer (BLT) – An LLM architecture that scales better than Llama 3 using byte patches instead of tokens.
BLT encodes bytes into dynamic patches using lightweight local models and processes them with a large latent transformer.
Entropy patching dynamically adjusts patch sizes based on data complexity, allowing BLT to allocate more compute to hard predictions and use larger patches for simpler ones. The result is fewer, larger processing steps to cover the same data (roughly quantified in the sketch below).
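A back-of-the-envelope count of those steps, with purely illustrative numbers rather than measurements from the paper:

```python
import math

# The latent transformer runs once per patch, so the number of expensive
# forward steps over a document scales as n_bytes / average_patch_size.
def latent_steps(n_bytes: int, avg_patch_size: float) -> int:
    return math.ceil(n_bytes / avg_patch_size)

n_bytes = 1_000_000
for avg in (4.0, 6.0, 8.0):  # illustrative average patch sizes, in bytes
    print(f"avg patch {avg} bytes -> {latent_steps(n_bytes, avg):,} latent steps")
```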
BLT unlocks a new scaling dimension by simultaneously growing patch and model size without changing training or inference cost. Patch length scaling quickly overtakes BPE transformer scaling, and the trends look even better at larger scales!
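A rough FLOP-matching sketch of why that works, assuming the common approximation of about 2 × (parameter count) FLOPs per forward step per processed unit and ignoring the lightweight local encoder/decoder (which the paper does account for); the model and patch sizes here are illustrative:

```python
# Hypothetical FLOP-matching sketch: at a fixed FLOPs-per-byte budget for the
# latent transformer, the model can grow in proportion to the average patch
# size (local encoder/decoder costs are ignored in this toy calculation).
def latent_flops_per_byte(n_params: float, avg_patch_size: float) -> float:
    return 2 * n_params / avg_patch_size  # ~2*N FLOPs per patch, amortised over its bytes

budget = latent_flops_per_byte(8e9, 4.0)  # reference: an 8B model on 4-byte patches
for patch in (4.0, 6.0, 8.0):
    max_params = budget * patch / 2       # largest model within the same budget
    print(f"patch {patch} bytes -> up to ~{max_params / 1e9:.0f}B latent params")
```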
Parameter-matched training runs up to 8B parameters and 4T bytes show that BLT performs well on standard benchmarks and can trade minor losses in evaluation metrics for up to 50% reductions in inference FLOPs.
Credit: https://x.com/garrethleee/status/1868702376754135154
Amazing work - I am especially interested in follow-ups on fine-tuning the entropy model, since the robustness probably depends quite heavily on it. Or am I overestimating that?
Incredible work. I wonder if we could add another layer, a patch of patches... and use that for fine-tuning.
Here is an in-depth explanation of this paper: https://ajithp.com/2024/12/15/metas-byte-latent-transformer-revolutionizing-natural-language-processing-with-dynamic-patching/
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Star Attention: Efficient LLM Inference over Long Sequences (2024)
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models (2024)
- p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay (2024)
- Attamba: Attending To Multi-Token States (2024)
- LBPE: Long-token-first Tokenization to Improve Large Language Models (2024)
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction (2024)
- Retrofitting Large Language Models with Dynamic Tokenization (2024)