Google paper: Scaling up inference compute beats 14x larger models
Remember scaling laws? These are empirical laws that say "the bigger your model, the better it gets". More precisely, "as your compute increases exponentially, loss decreases in a roughly linear fashion" (strictly speaking, loss follows a power law in compute, which is a straight line on a log-log plot). They have wild implications, suggesting that spending 100x more training compute would get you super-LLMs. That's why companies are racing to build the biggest AI superclusters ever, and Meta bought 350k H100 GPUs, which at roughly $25k or more apiece probably cost on the order of $10B.
But think of this: we're building huge reasoning machines, yet we only ask them to do one pass through the model for each token of the final answer, i.e., we spend minimal effort on inference. That's like building a Caterpillar truck and running it on a lawnmower's motor. Couldn't we optimize this?
So instead of scaling up training with even bigger models on many more trillions of tokens, Google researchers explored a neglected avenue: scaling up inference compute.
They study two ways to spend more compute at inference: either a reviser that iteratively refines the model's answer distribution, or generating N different completions (for instance through beam search) and keeping only the best one according to an additional verifier model.
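To make those two strategies concrete, here is a minimal toy sketch, not the paper's implementation. Everything in it is a stand-in I made up for illustration: `toy_generate` plays the LLM sampler, `toy_verifier` plays the verifier (here an oracle that knows the answer, whereas a real verifier is a learned model), and `toy_revise` plays the reviser. Only the two control loops, `best_of_n` and `sequential_revision`, reflect the idea described above.

```python
import random
from typing import Callable

TARGET = 17 * 24  # toy task: the "correct answer" the stand-ins secretly know


def toy_generate(prompt: str, rng: random.Random) -> int:
    """Hypothetical stand-in for sampling one completion: a noisy guess."""
    return TARGET + rng.randint(-20, 20)


def toy_verifier(prompt: str, answer: int) -> float:
    """Hypothetical stand-in for a verifier: higher score = better answer."""
    return float(-abs(answer - TARGET))


def toy_revise(prompt: str, answer: int) -> int:
    """Hypothetical stand-in for a reviser: halves the previous error."""
    error = answer - TARGET
    return TARGET + int(error / 2)  # int() truncates toward zero


def best_of_n(
    prompt: str,
    generate: Callable[[str], int],
    score: Callable[[str, int], float],
    n: int,
) -> int:
    """Parallel strategy: sample n candidates, keep the top-scored one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


def sequential_revision(
    prompt: str,
    generate: Callable[[str], int],
    revise: Callable[[str, int], int],
    steps: int,
) -> int:
    """Sequential strategy: draft once, then refine the draft step by step."""
    answer = generate(prompt)
    for _ in range(steps):
        answer = revise(prompt, answer)
    return answer


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    prompt = "What is 17 * 24?"

    def sample(p: str) -> int:
        return toy_generate(p, rng)

    print("single sample:", sample(prompt))
    print("best of 16:   ", best_of_n(prompt, sample, toy_verifier, n=16))
    print("4 revisions:  ", sequential_revision(prompt, sample, toy_revise, steps=4))
```

Both loops spend extra inference compute on the same frozen "model": `n` controls how many parallel samples you pay for, `steps` how many sequential refinements.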
They use a PaLM 2 model (released in May 2023) on the MATH dataset: PaLM 2 has the advantage of scoring low on MATH, but not zero, so improvements are noticeable.
And the results show that for the same fixed amount of inference compute:
a smaller model with more effort on decoding beats a 14x bigger model using naive greedy sampling.
That means you could train a roughly 14x smaller model, cutting training costs accordingly, and still get the same perf for the same inference cost!
Take that, scaling laws. Mark Zuckerberg, you're welcome, hope I can get some of these H100s.
Read the paper here: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (2408.03314)