Any plan to open source the search agent framework?

#22
opened by CherryDurian

I’ve been trying to reproduce the BrowseComp and related results based on your description.
However, even with the same tool setup (search, browsing, and code tools), our in-house implementation scores much lower than what’s reported in your benchmarks.

May I ask if there are any plans to open source the search agent framework (or at least a minimal reference version)?
It would be super helpful for the community to better understand and reproduce the results.

I am wondering the same.

@CherryDurian

Was https://github.com/Alibaba-NLP/DeepResearch the search agent framework you used to try to reproduce K2's BrowseComp results? And did you use only 3 of its tools, or the full implementation, in your testing?

Yep, exactly — I used the DeepResearch framework and just the three tools for my run.

@CherryDurian

Ah, understood, thank you for clarifying. So if I understood correctly, the changes you made from the Tongyi repo involved limiting it to the 3 tools you described earlier, and swapping the default model for K2 Thinking? Were there any other changes you made compared to the default Tongyi implementation?

Also, did you use the official Kimi API or the OpenRouter API (which was having issues)? And, if I may ask, what were the BrowseComp results that you got?

And, from what @dawnmsg mentioned in the discussion I started ( https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/5 ), effective context management seems to be essential. From my observation, the default Tongyi inference implementation in their repo doesn’t currently include context management, though their open-sourced WebResummer ( https://github.com/Alibaba-NLP/DeepResearch/tree/main/WebAgent/WebResummer ) and the not-yet open-sourced AgentFold appear to be Tongyi’s approaches for handling it. But then again, looking at their blog ( https://moonshotai.github.io/Kimi-K2/thinking.html ), Moonshot mentions what seems to be a much simpler approach: "When tool execution results cause the accumulated input to exceed the model's context limit (256k), we employ a simple context management strategy that hides all previous tool outputs."
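
For what it's worth, here is a rough sketch of what that "hide all previous tool outputs" strategy could look like over an OpenAI-style message list. The 256k limit is the number quoted from the blog; the token estimate, the placeholder text, and the choice to keep only the most recent tool output are my own guesses, not Moonshot's actual implementation:

```python
# Rough sketch (not Moonshot's code) of the context-management strategy quoted above:
# when the accumulated input exceeds the context limit, hide earlier tool outputs.

CONTEXT_LIMIT_TOKENS = 256_000  # limit mentioned in the blog post
HIDDEN = "[previous tool output hidden to stay within the context limit]"  # assumed placeholder


def estimate_tokens(messages):
    # Very crude proxy: ~4 characters per token. A real agent loop would use
    # the model's tokenizer instead of this heuristic.
    return sum(len(m.get("content") or "") for m in messages) // 4


def manage_context(messages):
    # If the accumulated input exceeds the limit, blank out every tool message
    # except the most recent one, leaving user/assistant turns untouched.
    # (Keeping the latest tool output is an assumption; the blog only says
    # "hides all previous tool outputs".)
    if estimate_tokens(messages) <= CONTEXT_LIMIT_TOKENS:
        return messages
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    keep = tool_indices[-1] if tool_indices else -1
    return [
        {**m, "content": HIDDEN} if m["role"] == "tool" and i != keep else m
        for i, m in enumerate(messages)
    ]
```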

Yeah, that’s possible.

Interestingly, from the Kimi trace, most BrowseComp runs didn’t go over 70 tool calls, and we rarely saw context-length errors. So if context isn’t the issue, I suspect the summarizer in the visit implementation might be introducing some errors. I’ll check this further.
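
One quick check I have in mind is to compare the raw page text the visit tool fetched with the summary it returned, and flag cases where key entities from the question appear in the page but get dropped by the summary. A minimal sketch (the helper name and the toy data are just placeholders, not anything from the DeepResearch repo):

```python
# Hypothetical spot-check: which question entities survive in the fetched page
# but are missing from the summarizer's output?

def dropped_entities(question_entities, raw_page_text, summary_text):
    raw, summary = raw_page_text.lower(), summary_text.lower()
    return [e for e in question_entities if e.lower() in raw and e.lower() not in summary]


# Toy usage: the page mentions both facts, but the summary keeps neither,
# so the agent loses the evidence it needed for the final answer.
print(dropped_entities(
    ["Cannes 1997", "Palme d'Or"],
    raw_page_text="... won the Palme d'Or at Cannes 1997 ...",
    summary_text="The film won a major festival award.",
))  # -> ['Cannes 1997', "Palme d'Or"]
```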

Of course, it could also be something with the Kimi model itself — the real reason needs more digging.

Yes, very good point that Tongyi's summarization within the visit tool might be a source of errors. Please do let the community know if you find anything that more closely reproduces the official results 🙏. Also, you might have seen this already, but https://github.com/prnake/kimi-deepresearch was shared in the other discussion thread and, according to the author, scores 50+ on BrowseComp. Might be of interest.

Also, @CherryDurian, in the other discussion thread, when you said that Tongyi's implementation "could only get about 40% of the reported score", did you mean that on BrowseComp it could only achieve about:

  1. 40% of 60% (K2's official BrowseComp score) = 24%, or
  2. 40% on BrowseComp?
