Jaward posted an update Sep 13
The breakthrough in OpenAI’s release goes way beyond just another family of capable models: it’s a monumental leap in LLM reasoning capabilities, one in which the limits of pre-training stop being the bottleneck and the dream of scaling at inference time becomes a reality.

Once again, reinforcement learning (when done right) proves to be the ultimate “tool” driving reasoning in AI models. OpenAI o1 (aka Strawberry 🍓) can think, and learn while thinking, before giving a response. This is how we humans approach solving difficult problems.

In technical terms, o1 is trained with an RL algorithm to think productively using its chain of thought. In other words, “the longer it thinks, the better it does on reasoning tasks”, similar to how AlphaGo was able to beat the world champion at Go.

Read more: https://openai.com/index/learning-to-reason-with-llms/
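OpenAI hasn’t published how o1’s chain-of-thought RL actually works, so the sketch below is not their method; it’s just a toy self-consistency loop in Python to illustrate the test-time-scaling intuition that sampling more chains of thought tends to give better answers. The `generate_cot_answer` function is a hypothetical stand-in for any chat-model call.

```python
import random
from collections import Counter

def generate_cot_answer(question: str) -> str:
    """Hypothetical stand-in for a chat-model call: in a real setup this would
    sample a chain of thought and return the final answer extracted from it."""
    # Fake a noisy solver that gets "2 + 2" right ~70% of the time.
    return "4" if random.random() < 0.7 else str(random.randint(0, 9))

def self_consistency(question: str, n_samples: int) -> str:
    """Sample several chains of thought and majority-vote on the final answer."""
    answers = [generate_cot_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # "Thinking more" (drawing more samples) makes the vote more reliable.
    for n in (1, 5, 25):
        print(n, "samples ->", self_consistency("What is 2 + 2?", n))
```

Self-consistency voting is only one way to spend test-time compute; o1 reportedly spends it on a single, longer RL-trained chain of thought instead.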

I'm really against the mindset of "LLMs can reason", because at inference time the model is just sampling from a probability distribution learned over its training data. What we should do instead is more sampling at high temperature when coding, to explore new approaches and get multiple implementations. I think that will be more effective than one expensive call. You can find more here: https://arxiv.org/abs/2407.21787
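A minimal sketch of that repeated-sampling idea, assuming a hypothetical `generate_code` model call and a `run_tests` verifier (neither is from the linked paper; plug in whatever model and checker you use):

```python
from typing import Callable, Optional

def best_of_n(
    prompt: str,
    generate_code: Callable[[str, float], str],  # hypothetical model call
    run_tests: Callable[[str], bool],            # hypothetical verifier
    n: int = 32,
    temperature: float = 1.0,
) -> Optional[str]:
    """Draw n independent high-temperature samples and return the first
    candidate that passes the tests; coverage grows with n."""
    for _ in range(n):
        candidate = generate_code(prompt, temperature)
        if run_tests(candidate):
            return candidate
    return None  # nothing passed; raise n or fall back to a stronger model
```

The trade-off is that you need a cheap, reliable verifier (unit tests here); without one, picking among the n samples is the hard part.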


I used to think this way, but as it turns out these models don't just model a probability distribution: they actually learn features of the underlying data, and using those features at inference time requires some "reasoning". Capable models prior to OpenAI o1 (GPT-3, GPT-4, Claude 3) could barely reason through tasks; o1 now utilizes RL to boost reasoning during inference. Scaling at inference has been a huge challenge, but somehow OpenAI figured it out with RL. Obviously we are at an early stage of this breakthrough, and proof of reasoning will become clearer in subsequent versions of o1.

Geoffrey Hinton gave a talk on this topic: https://www.youtube.com/watch?v=N1TEjTeQeg0
