CodeRankLLM is a 7B LLM fine-tuned for listwise code reranking. When combined with performant code retrievers like CodeRankEmbed, it significantly enhances the quality of retrieved results for various code retrieval tasks.
We release the scripts to evaluate our model's performance here.
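The retrieve-then-rerank flow described above can be sketched as follows. The function names and toy scoring functions below are illustrative stand-ins, not the actual CodeRankEmbed/CodeRankLLM APIs: a first-stage retriever scores every passage independently, and a listwise reranker then sees the top candidates together and reorders them.

```python
# Hypothetical two-stage retrieve-then-rerank pipeline (illustrative only).

def retrieve(query, corpus, embed, top_k=3):
    """Stage 1: score every passage independently with an embedding model,
    keep the top_k highest-scoring ones."""
    scored = sorted(corpus, key=lambda p: embed(query, p), reverse=True)
    return scored[:top_k]

def rerank(query, candidates, listwise_rank):
    """Stage 2: a listwise reranker sees all candidates at once and
    returns a permutation of their indices, best first."""
    order = listwise_rank(query, candidates)
    return [candidates[i] for i in order]

# Toy scoring functions standing in for the real models.
def toy_embed(query, passage):
    # Word-overlap score as a stand-in for embedding similarity.
    return len(set(query.split()) & set(passage.split()))

def toy_listwise_rank(query, candidates):
    # Stand-in for an LLM ranking: prefer shorter passages.
    return sorted(range(len(candidates)), key=lambda i: len(candidates[i]))

corpus = [
    "def add(a, b): return a + b",
    "def sort list in python using sorted",
    "sort a list in python",
    "binary search tree insertion",
]
hits = retrieve("sort a list in python", corpus, toy_embed, top_k=2)
final = rerank("sort a list in python", hits, toy_listwise_rank)
```

The key design point is that the reranker scores candidates jointly rather than one at a time, which is what "listwise" refers to in the training section below.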
## Training
Our code reranker uses LLM-based listwise reranking, which has gained prominence for its ability to score multiple passages simultaneously. To generate training data, we selected 50,000 <query, positive, negatives> tuples from our high-quality dataset CoRNStack, filtered to ensure higher similarity scores and better ranks for the positives. Since CoRNStack doesn't contain the ranked orderings required for training listwise rerankers, we use ranked orderings produced by Qwen-2.5-32B-Instruct for each example as ranking supervision. We initialize our reranker with Qwen2.5-Coder-7B-Instruct and fine-tune it with a language modeling objective that minimizes the prediction error of the next token in the sequence.
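Under a language modeling objective, each training instance reduces to a (prompt, target) text pair: the prompt lists the query and candidate passages, and the target is the teacher-provided ranking serialized as tokens. The exact prompt template and ranking format below are assumptions in the common style of listwise rerankers, not the exact CoRNStack recipe:

```python
# Sketch of building one listwise training example. The prompt template and
# "[2] > [1]"-style target format are assumptions, not the exact recipe.

def build_listwise_example(query, passages, teacher_order):
    """teacher_order: indices into `passages` as ranked by the teacher LLM
    (e.g. Qwen-2.5-32B-Instruct), best first."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Rank the following code passages by relevance to the query.\n"
        f"Query: {query}\n{numbered}\nRanking:"
    )
    # Target completion the model is trained to generate token by token.
    target = " > ".join(f"[{i + 1}]" for i in teacher_order)
    return prompt, target

prompt, target = build_listwise_example(
    "reverse a string",
    ["def rev(s): return s[::-1]", "def add(a, b): return a + b"],
    teacher_order=[0, 1],
)
```

Fine-tuning then simply maximizes the likelihood of `target` given `prompt`, so no specialized ranking loss is needed.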