Classifica RAG

#9
by anakin87 - opened

Hey guys, thanks for your efforts, which I appreciate much. 🙏

I have some doubts about "Classifica RAG".

If I'm not mistaken, https://huggingface.co/datasets/squad_it is used for evaluation.
If so, it may not be the most appropriate benchmark as it evaluates extractive QA.

In fact, the best-performing model is worse than some older and smaller non-generative models.

If you can confirm that squad_it is used for evaluation, this part of the leaderboard sounds confusing to me.
This opens the door to some possibilities:

  • drop this tab
  • rename the tab (something like "Classifica QA estrattivo"), but this would also require including older models that are not in the same league
  • find a better dataset to evaluate RAG (might be worth checking benchmarks for instruction/chat-models?). This is a big topic and I don't know Italian datasets well. In any case, this recent article by Omar Sanseviero may be helpful: https://osanseviero.github.io/hackerllama/blog/posts/llm_evals/

Sorry for the intrusion. I may be wrong, but I am curious about your opinions!

mii-llm org

You are 100% right, and I agree that the current state of the RAG leaderboard is suboptimal and not a good proxy for a model's ability to perform RAG.
But I am working on it; I have some ideas, and I'm going to improve it and re-test every model.

A short-term fix is to include a new metric called "partial match", which is true if any of the squad_it gold answers appears in the model output. This avoids penalizing generative models that produce a full sentence instead of just the right keyword.
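A minimal sketch of what such a "partial match" metric could look like (the function name and normalization are my assumptions, not the leaderboard's actual implementation):

```python
def partial_match(model_output: str, gold_answers: list[str]) -> bool:
    """True if any gold answer appears as a substring of the model output.

    Hypothetical sketch: a case-insensitive substring check, so a verbose
    generative answer still counts as correct when it contains the keyword.
    """
    normalized = model_output.lower()
    return any(ans.lower().strip() in normalized for ans in gold_answers)


# A generative model's full-sentence answer is not penalized:
verbose = "La capitale d'Italia è Roma, naturalmente."
print(partial_match(verbose, ["Roma"]))   # True
print(partial_match("Non lo so.", ["Roma"]))  # False
```

In practice you would likely also want to strip accents and punctuation before matching, as exact-match scorers for SQuAD-style datasets usually do.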

Thanks for the feedback btw! I'm open to everything that could improve the leaderboard!

Thanks!

Evaluating how good models are at RAG in a generic setting is difficult, I would say.

Thinking more about the leaderboard and referring back to that article, I would like to have another tab that evaluates instruction-following/chat behaviors (see https://osanseviero.github.io/hackerllama/blog/posts/llm_evals/#chat-models-evaluation).
In fact, I noticed that "Classifica Generale" refers to benchmarks commonly used to rank base models, so some good fine-tuned models may not emerge from this leaderboard.

In any case, thanks for the work, which I will continue to follow!

mii-llm org

The aim of "Classifica Generale" is to compare Italian models with Mixtral and LLama2-70b on Italian tasks!
We will surely expand the scope in the future once we get more GPUs, and the article you linked is a very good start!

FinancialSupport changed discussion status to closed
