For best results we recommend the following settings:

* Deterministic generation (temp = 0) and no repetition penalty (which is unsurprisingly detrimental to the accuracy of citations).
* Standardized hashes of 16 characters. While the model has been trained on many other patterns (including full bibliographic entries), this has proven the most convenient for systematic citation parsing.
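
The standardized hashes make systematic citation parsing straightforward with a regular expression. A minimal sketch, assuming the hashes are 16 lowercase hexadecimal characters; the `extract_citations` helper, the hash alphabet, and the `<ref>` wrapper in the sample answer are illustrative assumptions, not part of the model's output contract:

```python
import re

# Assumption: citation hashes are 16 lowercase hex characters; adjust the
# character class if the model emits a different alphabet.
HASH_RE = re.compile(r"\b[0-9a-f]{16}\b")

def extract_citations(text: str) -> list[str]:
    """Return every 16-character citation hash found in a generated answer."""
    return HASH_RE.findall(text)

# Illustrative answer format; the <ref> wrapper is a hypothetical convention.
answer = "The treaty was signed in 1648 <ref>1a2b3c4d5e6f7a8b</ref>."
print(extract_citations(answer))  # ['1a2b3c4d5e6f7a8b']
```

The `\b` word boundaries keep the pattern from matching inside longer alphanumeric runs, so full bibliographic entries or other hash lengths are ignored.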
### RAG Evaluation
We evaluated the Pico and Nano models on a RAG task. As existing benchmarks are largely limited to English, we developed a custom multilingual RAG benchmark, synthetically generating queries and small sets of documents. To evaluate, we prompted each model with the query and documents, then ran a head-to-head ELO-based tournament with GPT-4o as judge. We [release the prompts and generations for all models we compared](https://huggingface.co/datasets/PleIAs/Pleias-1.0-eval/tree/main/RAGarena). Our nano (1.2B) model outperforms Llama 3.2 1B and EuroLLM 1.7B. Our pico (350M) model outperforms other models in its weight class, such as SmolLM 360M and Qwen2.5 0.5B, in addition to much larger models such as Llama 3.2 1B and EuroLLM 1.7B.

| **Rank** | **Model**                | **ELO**    |
|----------|--------------------------|------------|
| 1        | Qwen2.5-Instruct-7B      | 1294.6     |
| 2        | Llama-3.2-Instruct-8B    | 1269.8     |
| 3        | **Pleias-nano-1.2B-RAG** | **1137.5** |
| 4        | Llama-3.2-Instruct-3B    | 1118.1     |
| 5        | Qwen2.5-Instruct-3B      | 1078.1     |
| 6        | **Pleias-pico-350M-RAG** | **1051.2** |
| 7        | Llama-3.2-1B-Instruct    | 872.3      |
| 8        | EuroLLM-1.7B-Instruct    | 860.0      |
| 9        | SmolLM-360M-Instruct     | 728.6      |
| 10       | Qwen2.5-0.5B-Instruct    | 722.2      |
| 11       | SmolLM-1.7B-Instruct     | 706.3      |
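
The rankings come from aggregating pairwise judge decisions into ratings. A minimal sketch of a standard Elo update over head-to-head results; the K-factor, starting rating, and sample judgments are illustrative assumptions, not the benchmark's actual parameters:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update; score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Tiny tournament over hypothetical pairwise judgments: (model A, model B, winner).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
judgments = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_b", "model_b"),
]
for a, b, winner in judgments:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], 1.0 if winner == a else 0.0)
```

Because Elo is sensitive to match order, tournament evaluations often average ratings over shuffled orderings of the judgments.
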
## Ethical Considerations
The pleias-pico model, like all large language models, carries inherent ethical risks that require careful consideration. Our approach to mitigating these risks begins at the data level, where we exclusively use vetted sources, deliberately excluding CommonCrawl. The primary challenge comes from our public domain dataset component, which contains historical texts that may reflect outdated social norms and potentially harmful language, particularly regarding minoritized groups.