
Scaling test-time compute with open models

Over the last few years, the scaling of train-time compute has dominated the progress of large language models (LLMs). Although this paradigm has proven to be remarkably effective, the resources needed to pretrain ever larger models are becoming prohibitively expensive, with billion-dollar clusters already on the horizon. This trend has sparked significant interest in a complementary approach: test-time compute scaling. Rather than relying on ever-larger pretraining budgets, test-time methods use dynamic inference strategies that allow models to “think longer” on harder problems. A prominent example is OpenAI’s o1 model, which shows consistent improvement on difficult math problems as one increases the amount of test-time compute:

Although we don’t know how o1 was trained, recent research from DeepMind shows that test-time compute can be scaled optimally through strategies like iterative self-refinement or using a reward model to perform search over the space of solutions. By adaptively allocating test-time compute per prompt, smaller models can rival—and sometimes even outperform—their larger, more resource-intensive counterparts. Scaling test-time compute is especially advantageous when memory is constrained and the available hardware is not sufficient to run a larger model. However, this promising approach was demonstrated with closed-source models, and no implementation details or code were released 😢.

Over the past months we’ve been digging deep into reverse engineering and reproducing several of these results, and we’re finally happy to share some of what we’ve learned. More precisely, in this blog post we’ll cover how these test-time compute strategies work, how we evaluated them with open models, and what it takes to reproduce results like the one below.

So how well does compute-optimal scaling work in practice? Check out this plot where the tiny 1B and 3B Llama Instruct models outperform their much larger 8B and 70B siblings on the challenging MATH-500 benchmark if you give them enough “time to think” 🤯:

In the rest of this blog post, we’ll dive deep into the ingredients behind results like this one and walk you through practical strategies for implementing test-time compute scaling.

Strategies for test-time compute scaling

There are two main strategies for scaling test-time compute:

  1. Self-refinement, where the model iteratively refines its own outputs or “thoughts”, identifying and correcting errors over successive rounds.
  2. Search against a verifier, where many candidate solutions or intermediate steps are generated and a verifier, such as a reward model, is used to select the best ones or to guide the search (a minimal sketch of this variant follows below).
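To make the second strategy concrete, here is a minimal Python sketch of one common instance of verifier-guided search, best-of-N sampling: sample several complete candidate solutions, score each with a verifier, and keep the highest-scoring one. The `generate` and `score` callables are hypothetical placeholders standing in for the policy model and the reward model; this is an illustrative sketch rather than the exact implementation evaluated in this post.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # placeholder: samples one candidate solution from the policy model
    score: Callable[[str, str], float],  # placeholder: verifier/reward model score for (prompt, solution)
    n: int = 16,
) -> str:
    """Sample n candidate solutions and return the one the verifier scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, candidate) for candidate in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```

Here `n` is the knob that trades extra inference compute for accuracy: more samples give the verifier more chances to find a correct solution.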

In this blog post, we’ll concentrate on search-based methods as they represent a practical and scalable solution for test-time compute optimization. In particular, we’ll examine the three strategies illustrated below:

With an understanding of the key search strategies, let’s move on to how we evaluated them in practice.

Experimental setup

As illustrated in the diagram above, our experimental setup involves a pipeline with the following steps:

  1. We begin by feeding a math problem to an LLM, which generates N partial solutions (e.g. an intermediate step in a derivation).
  2. Each step is scored by a process reward model (PRM), which estimates the probability that each step will eventually reach the correct final answer.
  3. The steps and PRM scores are then used by a given search strategy to select which partial solutions should be further explored to generate the next round of intermediate steps.
  4. Once the search strategy terminates, the final candidate solutions are ranked by the PRM to produce the final answer.
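To show how these four steps fit together, here is a hedged Python sketch of a step-level, PRM-guided search loop. The helpers `extend_step` (the LLM proposing the next intermediate step for a partial solution), `prm_score` (the PRM’s estimate that a partial solution will lead to a correct final answer), and `is_complete` are hypothetical placeholders, and the pruning rule is a simplification of the search strategies compared in this post.

```python
from typing import Callable, List

def prm_guided_search(
    problem: str,
    extend_step: Callable[[str, str], str],  # placeholder: LLM proposes the next step given (problem, partial solution)
    prm_score: Callable[[str, str], float],  # placeholder: PRM score for (problem, partial solution)
    is_complete: Callable[[str], bool],      # placeholder: does this partial solution contain a final answer?
    n: int = 8,                              # candidates expanded per round (step 1)
    beam_width: int = 4,                     # partial solutions kept between rounds (step 3)
    max_rounds: int = 10,
) -> str:
    beams: List[str] = [""] * beam_width  # start from empty partial solutions
    for _ in range(max_rounds):
        # Step 1: extend the current partial solutions with N new intermediate steps.
        candidates = []
        for i in range(n):
            base = beams[i % beam_width]
            candidates.append(base + extend_step(problem, base))
        # Step 2: score every candidate with the PRM.
        ranked = sorted(candidates, key=lambda c: prm_score(problem, c), reverse=True)
        # Step 3: the search strategy keeps only the most promising partial solutions.
        beams = ranked[:beam_width]
        if all(is_complete(b) for b in beams):
            break
    # Step 4: rank the final candidates with the PRM and return the best one.
    return max(beams, key=lambda c: prm_score(problem, c))
```

Different search strategies mostly differ in how step 3 selects and diversifies which partial solutions to keep; the PRM scores are used both to steer the search and to rank the final candidates.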

To compare various search strategies, we used the following open models and datasets: