K2 Thinking BrowseComp/HLE Reproducibility | Reproducing the Results

#5 opened by pandemo

Hi Moonshot team, thank you so much for open-sourcing such impressive models and sharing your research!
Just a question regarding reproducibility of the K2 Thinking benchmarks with tool-use: How can the BrowseComp and HLE evaluation results be replicated? Is the search-agent ("with tools") framework you used for BrowseComp/HLE evaluation open-source, or do you plan to open-source it?
Also, if I may ask, what agent framework was used to achieve the SWE-Bench Verified results?
Thanks again for your great work! 🙏


Moonshot AI org

Thanks for the question.

Basically, the results can be reproduced by integrating three main types of tools: search, browsing, and code tools. As long as the tool descriptions and interfaces are clearly defined, the results should be consistent. In our internal experiments, using different implementations or descriptions of these tools led to only minor variations in performance.

Tool quality does have an impact, though. For example, the relevance of content retrieved by the search tool and the completeness of page parsing by the browsing tool can both affect the final results. Ensuring that the tools perform reliably and stably is important for faithful reproduction of the benchmark results.

Finally, please make sure to implement appropriate context management. Since K2 Thinking can perform 200–300 steps of tool use and reasoning, the accumulated tool outputs can easily exceed the context window limit. Applying effective context management is essential for stable reproduction.
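
For concreteness, here is a minimal sketch of what such context management could look like in an OpenAI-style message loop. The character caps and placeholder text below are illustrative assumptions, not the exact values or method behind the reported numbers.

# Minimal illustrative sketch (not the exact method behind the reported
# numbers): cap each tool output before it enters the history, and track the
# running size so the agent can clean up before the context window overflows.
MAX_TOOL_OUTPUT_CHARS = 20_000   # per-call cap (assumed value)
MAX_HISTORY_CHARS = 400_000      # rough stand-in for the context window

def append_tool_result(messages, tool_call_id, output):
    """Append a tool result, truncated so no single call floods the context."""
    if len(output) > MAX_TOOL_OUTPUT_CHARS:
        output = output[:MAX_TOOL_OUTPUT_CHARS] + "\n[output truncated]"
    messages.append({"role": "tool", "tool_call_id": tool_call_id,
                     "content": output})
    # Tell the caller to run its cleanup policy (e.g. dropping or compressing
    # the oldest tool outputs) once the accumulated history nears the budget.
    total = sum(len(m.get("content") or "") for m in messages)
    return total > MAX_HISTORY_CHARS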

What exactly is the browsing tool? Is it a summary-style browse like Qwen Deep Research or MiniMax M2, or does it drive an actual browser the way OpenAI does?

Hi Moonshot team, thank you so much for open-sourcing such impressive models and sharing your research!
I also have a question regarding the reproducibility of K2 Thinking benchmarks:

I've tried testing the model using the official Kimi API's web-search tool, as well as the web version of Kimi with search capabilities, but have found it nearly impossible to correctly solve randomly sampled examples from browsecomp.

Could you please provide some guidance on how to reproduce the results? Thank you.


First, thanks a lot for your awesome work on this project!

I bought the K2-Thinking service from KIMI’s official website and tried to reproduce the BrowseComp benchmark results.
I followed your setup pretty closely — using search, browsing, and code tools — but I could only get about 40% of the reported score.

Here’s my setup in detail:

  • Search: using Serper.dev
  • Browsing: using Jina
  • Code: a cloud-hosted Python sandbox

I’ve tested all three and they seem stable and working fine.

Below are the tool definitions I used:

{
  "type": "function",
  "function": {
    "name": "search",
    "description": "Perform Google web searches then returns a string of the top search results. Accepts multiple queries.",
    "parameters": {
      "type": "object",
      "required": ["query"],
      "properties": {
        "query": {
          "type": "array",
          "items": {"type": "string"},
          "minItems": 1,
          "description": "The list of search queries."
        }
      }
    }
  }
}
{
  "type": "function",
  "function": {
    "name": "browsing",
    "description": "Visit webpage(s) and return the summary of the content.",
    "parameters": {
      "type": "object",
      "properties": {
        "url": {
          "type": "array",
          "items": {"type": "string"},
          "description": "The URL(s) of the webpage(s) to visit."
        },
        "goal": {
          "type": "string",
          "description": "The specific information goal for visiting webpage(s)."
        }
      },
      "required": ["url", "goal"]
    }
  }
}
{
  "type": "function",
  "function": {
    "name": "PythonInterpreter",
    "description": "Executes Python code in a sandboxed environment. Code goes inside <code></code> tags after the JSON block. Output must be printed to stdout with print().",
    "parameters": {
      "type": "object",
      "properties": {},
      "required": []
    }
  }
}
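
For anyone trying a similar setup, here is a minimal sketch of how the search tool defined above might be backed by Serper.dev. It assumes Serper's google.serper.dev/search endpoint, its X-API-KEY header, and the "organic" result fields (title / link / snippet); it is a rough illustration, not the exact backend used for this run.

# Hedged sketch of a backend for the "search" tool defined above. It assumes
# Serper.dev's POST https://google.serper.dev/search endpoint, the X-API-KEY
# header, and the "organic" result fields (title / link / snippet).
import os
import requests

SERPER_URL = "https://google.serper.dev/search"

def search(query: list[str], top_k: int = 10) -> str:
    """Run each query through Serper and return one formatted results string."""
    headers = {"X-API-KEY": os.environ["SERPER_API_KEY"],
               "Content-Type": "application/json"}
    blocks = []
    for q in query:
        resp = requests.post(SERPER_URL, json={"q": q}, headers=headers,
                             timeout=30)
        resp.raise_for_status()
        results = resp.json().get("organic", [])[:top_k]
        lines = [f"{i + 1}. {r.get('title', '')}\n   {r.get('link', '')}\n"
                 f"   {r.get('snippet', '')}" for i, r in enumerate(results)]
        blocks.append(f"Results for '{q}':\n" + "\n".join(lines))
    return "\n\n".join(blocks)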

With these tools, TongyiDeepResearcher gets around 44% on BrowseComp.

I’m wondering if there’s anything specific I might have missed — e.g. different search/browse behavior, tool format, or prompt details — that could explain the gap?

Would really appreciate any hints or suggestions

Using the search and browse tools from https://github.com/MiniMax-AI/minimax_search, I also only got a 40+ score on BrowseComp. I noticed the footnote says "we employ a simple context management strategy". Could you share how that context management is done in practice, and how many points it adds? Thanks 🙏

@CherryDurian

Was https://github.com/Alibaba-NLP/DeepResearch the search agent framework you used to try to reproduce K2's BrowseComp results? And did you use only 3 of its tools, or the full implementation in your testing?

Yep, exactly — I used the DeepResearch framework and just the three tools for my run.

Moonshot AI org

Hi, some differences I’m aware of:

  1. We don't return a summary of the target pages; instead, we return the original page text to the model and let the model extract the relevant information itself (a sketch of such a tool follows after this list).
  2. For search, we use an internal search engine that ranks items based on relevance.
  3. What is the maximum number of tool calls you allow? Context management is necessary.
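
To make point 1 concrete, here is a minimal sketch of a browsing tool that returns raw page text instead of a model-written summary. It assumes the Jina Reader endpoint (https://r.jina.ai/<url>), which serves a readable text rendering of a page; the per-page character cap is an arbitrary illustrative value, and this is not Moonshot's internal tool.

# Hedged sketch of a "browsing" tool that returns raw page text rather than a
# model-written summary, as in point 1 above. It assumes the Jina Reader
# endpoint (https://r.jina.ai/<url>); the per-page cap is illustrative.
import requests

MAX_CHARS_PER_PAGE = 40_000  # crude cap so a single page cannot flood the context

def browsing(url: list[str], goal: str) -> str:
    """Fetch each URL's full text and let the model extract what it needs."""
    pages = []
    for u in url:
        resp = requests.get(f"https://r.jina.ai/{u}", timeout=60)
        resp.raise_for_status()
        pages.append(f"### {u}\n{resp.text[:MAX_CHARS_PER_PAGE]}")
    # The goal is echoed back for the model; no summarisation happens here.
    return f"Goal: {goal}\n\n" + "\n\n".join(pages)
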
Moonshot AI org

By the way, are you using the official API with temp=1.0? We've noticed that benchmark outcomes can vary across providers; some third-party endpoints show substantial accuracy drops. Using the official API is recommended for reproducing the results. Thanks.

We currently allow up to 300 tool calls in total during a single task.

I have a few follow-up questions:

  1. Regarding your internal search engine — how much advantage does it provide compared to standard Google Search on the BrowseComp benchmark?
    If it contributes significantly, does that mean the reported results are difficult to reproduce without access to your internal search system?

  2. About context management — could you share a simple description or example of the strategy you use?
    We’d like to reproduce the setup using Kimi-K2-Thinking, and understanding how you handle context (e.g., memory trimming, summarization, or selection logic) would help a lot.

If context management is indeed the main source of performance gain, how should we interpret this in terms of the model’s reasoning or planning ability?

Thanks again for taking the time to explain — your insights are super helpful for those of us trying to replicate your framework. 🙏

Yes, I'm using temperature = 1.0 and the official API, exactly as specified in your Hugging Face model card:
https://huggingface.co/moonshotai/Kimi-K2-Thinking

I wrote an (unofficial) implementation that includes the 'simple context management strategy' trick; it might be helpful:

https://github.com/prnake/kimi-deepresearch

With your implementation, what BrowseComp score did you get in testing?

Keeping only 3 rounds of tool-call records, I can get a score of roughly 50+, with about 70 conversation turns on average.

That is impressive. By "3 rounds" and "70 turns", does this mean that the browsecomp score of 50+ is still considered pass@1? Or is it considered pass@3?



Only the most recent 3 rounds of search results are kept (see https://github.com/prnake/kimi-deepresearch/blob/cadb4f61b52ae3bba5063fe5eaebe8c3d7954083/kimi_deepresearch.py#L58); the search tool is called about 70 times on average (3–5 queries per call, returning full text); lacking the resources and patience, I only measured pass@1.
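
In code, that retention policy might look roughly like the following (an illustrative sketch; the exact logic at the linked kimi_deepresearch.py line may differ).

# Illustrative sketch of the "keep only the most recent 3 rounds of search
# results" idea described above; the linked file's exact logic may differ.
PLACEHOLDER = "[older search results removed to save context]"

def keep_recent_search_results(messages, keep=3):
    """Blank out all but the last `keep` tool-result messages in the history."""
    tool_idx = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    to_blank = set(tool_idx[:-keep]) if len(tool_idx) > keep else set()
    return [{**m, "content": PLACEHOLDER} if i in to_blank else m
            for i, m in enumerate(messages)]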

@Papersnake

Understood, thank you for the clarification and for sharing your impressive implementation 🙏. By the way, regarding your "50+" result on BrowseComp, could you share the specific number you measured in your testing?

