In Decoder-type LLMs like GPT4 or Mistral-Large, the output is generated one token (=word part) at a time. That's why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.
๐ Given its input sentence like "๐๐ฉ๐ข๐ต ๐ช๐ด ๐ต๐ฉ๐ฆ 7๐ต๐ฉ ๐๐ช๐ฃ๐ฐ๐ฏ๐ข๐ค๐ค๐ช ๐ฏ๐ถ๐ฎ๐ฃ๐ฆ๐ณ? ๐๐ฉ๐ฆ 7๐ต๐ฉ ๐๐ช๐ฃ๐ฐ๐ฏ๐ข๐ค๐ค๐ช ๐ฏ๐ถ๐ฎ๐ฃ๐ฆ๐ณ", the Decoder LLM generates, for each token in its vocabulary, a score that represents this token's probability of coming next. For instance: "๐๐จ" gets score 0.56, and "๐๐๐ฃ" gets score 0.35.
๐ค ๐๐ซ๐๐๐๐ฒ ๐๐๐๐จ๐๐ข๐ง๐ is the naive option where you simply take the next most probable token at each step. But this creates paths that maximize very short-term rewards, thus may overlook better paths for the long term (like this time when you played FIFA all evening and arrived unprepared to your school exam on the next day). In our example, the next highest score token might be "๐๐จ", but this will strongly bias the LLM towards giving an hasty response. On the opposite, starting with "๐๐๐ฃ" could have been completed with "๐ฃ๐ฆ ๐ฐ๐ฃ๐ต๐ข๐ช๐ฏ๐ฆ๐ฅ ๐ง๐ณ๐ฐ๐ฎ ๐ค๐ฐ๐ฎ๐ฑ๐ถ๐ต๐ช๐ฏ๐จ ๐ฑ๐ณ๐ฆ๐ท๐ช๐ฐ๐ถ๐ด ๐๐ช๐ฃ๐ฐ๐ฏ๐ข๐ค๐ค๐ช ๐ฏ๐ถ๐ฎ๐ฃ๐ฆ๐ณ๐ด ๐ง๐ช๐ณ๐ด๐ต", which steers the LLM towards a correct reasoning!
๐บ๏ธ ๐๐๐๐ฆ ๐ฌ๐๐๐ซ๐๐ก improves on greedy decoding by generating at each step several paths - called beams - instead of one. This allows the generation to explore a much larger space, thus find better completions. In our example, both the "๐๐จ" and the "๐๐๐ฃ" completion could be tested. โ
๐ I've created a tool to let you visualize it, thank you @joaogante for your great help! ๐๐ง๐ฎ ๐๐ฉ ๐๐๐ง๐: m-ric/beam_search_visualizer