Bye tokens, hello patches

Meta announces a better way to scale LLMs

"Output responses from Llama 3 and BLT models for various tasks from CUTE benchmark. BLT model performs better on sequence manipulation tasks compared to the tokenizer-based Llama 3 model. Note that few-shot examples are not shown in the above prompts to maintain clarity."

Do we really need to break text into tokens, or could we work directly with raw bytes?

First, let’s think about how LLMs currently handle text. They chop it up into chunks called tokens using rules about common word pieces. This tokenization step has always been a bit of an odd one out: while the rest of the model learns and adapts during training, tokenization stays fixed, based on those initial rules. This can cause problems, especially for languages that aren’t well-represented in the training data or when handling unusual text formats.

Meta’s new BLT architecture (paper, code) takes a different approach. Instead of pre-defining tokens, it looks at the raw bytes of text and dynamically groups them based on how predictable they are. When the next byte is very predictable (like finishing a common word), it groups more bytes together. When the next byte is unpredictable (like starting a new sentence), it processes bytes in smaller groups.
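To make the contrast concrete, here’s a toy sketch (my own illustration, not code from the BLT repo) comparing fixed-size byte chunks with a hypothetical dynamic grouping of the same sentence; the dynamic boundaries are hand-picked, whereas BLT learns them:

```python
# Toy comparison of fixed-size byte chunks versus dynamic patches.
# The dynamic boundaries below are hand-picked for illustration only;
# BLT derives them from a small learned entropy model.

text = "The cat sat on the mat."
raw = text.encode("utf-8")

# Static patching: cut every 4 bytes regardless of content.
static_patches = [raw[i:i + 4] for i in range(0, len(raw), 4)]

# Dynamic patching (illustrative): longer patches over predictable spans
# (the rest of a word), shorter ones at unpredictable positions
# (word starts, punctuation).
boundaries = [0, 4, 8, 12, 15, 19, 22, len(raw)]
dynamic_patches = [raw[a:b] for a, b in zip(boundaries, boundaries[1:])]

print([p.decode() for p in static_patches])
print([p.decode() for p in dynamic_patches])
```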

Scaling trends for models trained with fixed inference budgets. Traditional token-based models, like Llama 2 and 3, scale model size based on the inference budget. In contrast, the BLT architecture allows scaling both model size and patch size (ps) together while maintaining the same budget. BLT models with patch sizes 6 and 8 quickly surpass Llama 2 and 3. Larger patch sizes, like 8, become more effective earlier when using higher inference budgets. Vertical lines show key points for compute efficiency and performance crossover.

This dynamic approach leads to three key benefits:

First, it can match the performance of state-of-the-art tokenizer-based models like Llama 3 while offering the option to trade minor performance losses for up to a 50% reduction in inference FLOPs. The model saves resources by processing predictable sections more efficiently.
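As a rough back-of-envelope (my own illustrative numbers, not figures from the paper): most inference compute lives in the large latent transformer, which runs once per patch rather than once per byte, so growing the average patch size directly cuts how often it is invoked.

```python
# Back-of-envelope FLOPs comparison with made-up numbers.
# Assumption: the big latent transformer dominates cost and runs once per
# patch; the small byte-level encoder/decoder run once per byte.

def flops_per_byte(avg_patch_bytes, latent_flops_per_step, local_flops_per_byte):
    """Approximate inference FLOPs per generated byte."""
    return latent_flops_per_step / avg_patch_bytes + local_flops_per_byte

latent = 16e9  # assumed FLOPs per latent-transformer forward pass
local = 0.5e9  # assumed FLOPs per byte in the local modules

short_patches = flops_per_byte(4.4, latent, local)  # roughly BPE-token-sized groups
long_patches = flops_per_byte(8.0, latent, local)   # larger dynamic patches

print(f"relative cost with larger patches: {long_patches / short_patches:.0%}")
```

The exact savings depend on how the latent and local modules are sized, but the mechanism is the same: fewer large-model steps per byte of text.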

Second, it handles edge cases much better. Consider tasks that require character-level understanding, like correcting misspellings or working with noisy text. BLT significantly outperforms token-based models on these tasks because it can directly access and manipulate individual characters.

Third, it introduces a new way to scale language models. With traditional tokenizer-based models, you’re somewhat constrained in how you can grow them. But BLT lets you simultaneously increase both the model size and the average size of byte groups while keeping the same compute budget. This opens up new possibilities for building more efficient models.

To understand how BLT works in practice, let’s look at its three main components:

  1. A lightweight local encoder that processes raw bytes and groups them based on predictability
  2. A large transformer that processes these groups (called “patches”)
  3. A lightweight local decoder that converts patch representations back into bytes

The BLT architecture has three main modules: a lightweight Local Encoder to convert input bytes into patch representations, a Latent Transformer for processing these patches, and a lightweight Local Decoder to generate the next patch of bytes. BLT uses byte n-gram embeddings and cross-attention to enhance information flow between the Latent Transformer and byte-level modules. Unlike fixed-vocabulary tokenization, BLT dynamically groups bytes into patches, maintaining access to detailed byte-level information.
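Here’s a heavily simplified sketch of that three-module layout, assuming PyTorch and ignoring the n-gram embeddings, cross-attention, and learned entropy patcher from the real system (see the paper and code for the actual implementation):

```python
import torch
import torch.nn as nn

class TinyBLT(nn.Module):
    """Minimal sketch of BLT's three modules; not the official implementation."""

    def __init__(self, d_local=128, d_latent=512, n_latent_layers=4):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_local)  # one embedding per byte value
        self.local_encoder = nn.TransformerEncoder(   # lightweight byte-level encoder
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True), num_layers=1)
        self.to_latent = nn.Linear(d_local, d_latent)
        self.latent = nn.TransformerEncoder(          # the big model, operating on patches
            nn.TransformerEncoderLayer(d_latent, nhead=8, batch_first=True),
            num_layers=n_latent_layers)
        self.from_latent = nn.Linear(d_latent, d_local)
        self.local_decoder = nn.TransformerEncoder(   # lightweight byte-level decoder
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True), num_layers=1)
        self.byte_head = nn.Linear(d_local, 256)      # next-byte logits

    def forward(self, byte_ids, patch_bounds):
        # byte_ids: (seq,) tensor of byte values; patch_bounds: patch end indices.
        h = self.local_encoder(self.byte_embed(byte_ids)[None])  # (1, seq, d_local)
        # Pool each patch's byte states into a single patch representation.
        patches, start = [], 0
        for end in patch_bounds:
            patches.append(h[:, start:end].mean(dim=1))
            start = end
        p = self.to_latent(torch.stack(patches, dim=1))  # (1, n_patches, d_latent)
        p = self.latent(p)                               # global processing over patches
        # Broadcast each patch representation back to its bytes and decode.
        expanded, start = [], 0
        for i, end in enumerate(patch_bounds):
            expanded.append(self.from_latent(p[:, i:i + 1]).expand(-1, end - start, -1))
            start = end
        out = self.local_decoder(h + torch.cat(expanded, dim=1))
        return self.byte_head(out)                       # (1, seq, 256) next-byte logits

# Usage: bytes of "hello world" with hand-picked patch boundaries.
ids = torch.tensor(list(b"hello world"))
logits = TinyBLT()(ids, patch_bounds=[5, 6, 11])
print(logits.shape)  # torch.Size([1, 11, 256])
```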

The entropy-based grouping is particularly clever. BLT uses a small language model to predict how surprising each next byte will be. When it encounters a highly unpredictable byte (like the start of a new word), it creates a boundary and begins a new patch. This way, it dedicates more computational resources to the challenging parts of the text while efficiently handling the easier parts.
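Here’s a minimal sketch of that idea, using byte-level Shannon entropy from a stand-in probability model (a bigram count model rather than BLT’s small transformer) and a hand-picked threshold:

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus: bytes):
    """Count next-byte frequencies conditioned on the previous byte.
    A stand-in for BLT's small byte-level LM, which is a transformer."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def next_byte_entropy(prev_byte, counts):
    """Shannon entropy (bits) of the predicted next-byte distribution."""
    dist = counts.get(prev_byte)
    if not dist:
        return 8.0  # unseen context: treat as maximally uncertain
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())

def entropy_patches(data: bytes, counts, threshold=2.0):
    """Start a new patch whenever the next byte is hard to predict."""
    patches, current = [], bytearray([data[0]])
    for prev, nxt in zip(data, data[1:]):
        if next_byte_entropy(prev, counts) > threshold:
            patches.append(bytes(current))  # high uncertainty -> patch boundary
            current = bytearray()
        current.append(nxt)
    patches.append(bytes(current))
    return patches

corpus = b"the cat sat on the mat and the cat sat on the hat " * 50
model = train_bigram(corpus)
print([p.decode() for p in entropy_patches(b"the cat sat on the mat", model)])
```

The threshold and the bigram model here are placeholders; the real system trains a small byte-level language model and tunes the boundary rule so that patches average a target number of bytes.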

I like the results. On standard benchmarks, BLT matches or exceeds Llama 3’s performance. But where it really shines is on tasks requiring character-level understanding. For instance, on the CUTE benchmark testing character manipulation, BLT outperforms token-based models by more than 25 points — and this is despite being trained on 16x less data than the latest Llama model.

The 8B BLT model is compared to the 8B BPE Llama 3, both trained on 1T tokens, using tasks that test robustness to noise and language structure awareness. Best results are in bold, and the overall best results (including Llama 3.1) are underlined. BLT significantly outperforms Llama 3 and even surpasses Llama 3.1 on many tasks, demonstrating that byte-level awareness offers unique advantages not easily achieved with more data.

This points to a future where language models might no longer need the crutch of fixed tokenization. By working directly with bytes in a dynamic way, we could build models that are both more efficient and more capable of handling the full complexity of human language.

What do you think about this approach? Does removing the tokenization step seem like the right direction for language models to evolve? Let me know in the comments or on the AImodels.fyi community Discord. I’d love to hear what you have to say.