Once again, reinforcement learning (when done right) proves to be the ultimate “tool” for driving reasoning in AI models. OpenAI o1 (aka Strawberry 🍓) can think, and refine its approach as it thinks, before giving a response. This is much like how we humans approach difficult problems.
In technical terms, o1 is trained with an RL algorithm to think productively using its chain of thought. In other words, “the longer it thinks, the better it does on reasoning tasks”. This is similar to how AlphaGo, also trained with reinforcement learning, was able to beat the world champion at Go.
Read more: https://openai.com/index/learning-to-reason-with-llms/