from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str

# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("custom|snli-acc|0", "snli_acc", "SNLI Accuracy")
    task1 = Task("custom|heq-qa-tlnls|0", "heq_tlnls", "QA TLNLS (HeQ)")
    task2 = Task("custom|sentiment-acc|0", "sentiment_acc", "Sentiment Acc (Mafat)")
    task3 = Task("custom|winograd-acc|0", "winograd_acc", "Winograd (Binary) Acc (V. Schwartz)")
    task4 = Task("custom|he-en-trans-bleu|0", "sentence_bleu", "Translation BLEU")
    task5 = Task("custom|ilfacts-acc|0", "ilfacts_acc", "Israeli Trivia")

NUM_FEWSHOT = 0  # Change with your few-shot count
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Hebrew LLM Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
<div style="display: flex; justify-content: center;">
<div style="max-width: 70vw;">
Welcome to the leaderboard for open Hebrew LLMs. The leaderboard ranks the different models according to their success on various tasks in Hebrew.
The leaderboard was created and is operated by a collaboration of [Mafat / The Israeli National Program for NLP in Hebrew and Arabic](https://nnlp-il.mafat.ai/) and [DICTA: The Israel Center for Text Analysis](https://dicta.org.il/).
<div dir="rtl" style="text-align: right">
讘专讜讻讬诐 讛讘讗讬诐 诇诇讜讞 讛转讜爪讗讜转 砖诇 诪讜讚诇讬 LLM 驻转讜讞讬诐 讘注讘专讬转. 诇讜讞 讛转讜爪讗讜转 诪讚专讙 讗转 讛诪讜讚诇讬诐 讛砖讜谞讬诐 诇驻讬 讛爪诇讞转诐 讘诪砖讬诪讜转 砖讜谞讜转 讘注讘专讬转.
诇讜讞 讛转讜爪讗讜转 谞讜爪专 讜诪转讜驻注诇 注诇 讬讚讬 砖讬转讜祝 驻注讜诇讛 讘讬谉 [诪驻讗"转 / 讛转讜讻谞讬转 讛诇讗讜诪讬转 讛讬砖专讗诇讬转 诇-NLP 讘注讘专讬转 讜讘注专讘讬转](https://nnlp-il.mafat.ai/) 讜[讚讬拽讟讛: 讛诪专讻讝 讛讬砖专讗诇讬 诇谞讬转讜讞 讟拽住讟讬诐](https://dicta.org.il/)
</div>
<div style="display: flex; flex-direction: row; justify-content: space-around; align-items: center" dir="ltr">
<a href="https://dicta.org.il/">
<img src="file/logos/dicta-logo.jpg" alt="Dicta Logo" style="max-height: 65px">
</a>
<a href="https://nnlp-il.mafat.ai/">
<img src="file/logos/mafat-logo.jpg" alt="Mafat Logo" style="max-height: 100px">
</a>
</div>
</div>
</div>
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How it works
We have curated six datasets for benchmarking the quality of LLMs in Hebrew. All of the benchmarks test the base model using a few-shot prompt. Note that the tests specifically evaluate the model's abilities in Hebrew, without regard for its capabilities in other languages.
1. QA TLNLS (HeQ)
    - **Source**: We use the test subset of the HeQ dataset, released by Amir Cohen [here](https://aclanthology.org/2023.findings-emnlp.915/). Data can be found [here](https://github.com/NNLP-IL/Hebrew-Question-Answering-Dataset).
    - **Scoring**: We score the results using the `tlnls` scoring method proposed in the HeQ paper, which accounts for the linguistic properties of the Hebrew language.
    - **Number of examples**: 1,436 prompts.
    - **Few-Shot Format**: For every context paragraph in the dataset, the few-shot prompt is formatted with the context paragraph, followed by 3 questions and answers about that paragraph, and finally the desired question, unanswered.
For example:
<blockquote dir="rtl" style='text-align: right; background-color: #f0f0f0;'>
<p>讘砖谞转 2012, 讛转诪讜讚讚讛 诇专讗砖讜谞讛 讘驻专讬讬诪专讬讝 砖诇 诪驻诇讙转 讛注讘讜讚讛 诇拽专讗转 讛讘讞讬专讜转 诇讻谞住转 讛转砖注 注砖专讛 讜讛讙讬注讛 诇诪拽讜诐 讛志36 讘专砖讬诪讛 讛讗专爪讬转 (讛讘讟讞转 讬讬爪讜讙 诇讗讬砖讛). 讘志2015 诇拽专讗转 讛讘讞讬专讜转 诇讻谞住转 讛注砖专讬诐, 讛转诪讜讚讚讛 讜专讘讬谉 讘驻专讬讬诪专讬讝 砖诇 诪驻诇讙转 讛注讘讜讚讛 讜讛讜爪讘讛 讘诪拽讜诐 讛-22 讘专砖讬诪转 讛诪讞谞讛 讛爪讬讜谞讬 诇讻谞住转, 讗砖专 砖讜专讬讬谉 诇讗讬砖讛 讜谞讘讞专讛 诇讻谞住转. 讘砖谞转 讛讻讛讜谞讛 讛专讗砖讜谞讛 砖诇讛 讘讻谞住转, 讛注谞讬拽 诇讛 讛诪讻讜谉 讛讬砖专讗诇讬 诇讚诪讜拽专讟讬讛 讗转 讗讜转 讛驻专诇诪谞讟专 讛诪爪讟讬讬谉 诇砖谞转 2016. 讞讘专讛 讘讜讜注讚转 讛讞讜抓 讜讘讬讟讞讜谉, 砖诐 讛讬讗 讞讘专讛 讘讜讜注讚转 讛诪砖谞讛 诇讻讜讞 讗讚诐. 讬讝诪讛 讜讬砖讘讛 讘专讗砖 讜讜注讚转 讛诪砖谞讛 诇讘讞讬谞转 诪砖拽 讛讗砖专讗讬 讘讬砖专讗诇. 讬讝诪讛 讜讞讘专讛 讘讜讜注讚转 讛讞拽讬专讛 讛驻专诇诪谞讟专讬转 诇讘讞讬谞转 诪砖拽 讛讗砖专讗讬 讘讬砖专讗诇, 讜讻谉 讞讘专讛 讘讜讜注讚转 讛讻诇讻诇讛, 讜讜注讚转 讛讻谞住转 讜讛讜讜注讚讛 讛诪讬讜讞讚转 诇讝讻讜讬讜转 讛讬诇讚, 讜讘讜讜注讚转 讛诪砖谞讛 诇拽讬讚讜诐 注住拽讬诐 拽讟谞讬诐 讜讘讬谞讜谞讬讬诐</p>
砖讗诇讛: 讘讗讬讝讛 驻专住 讝讻转讛 讜专讘讬谉? <br/>
转砖讜讘讛: 讗讜转 讛驻专诇诪谞讟专 讛诪爪讟讬讬谉 诇砖谞转 2016
砖讗诇讛: 诪讬 诪注谞讬拽 讗转 讗讜转 讛驻专诇诪谞讟专 讛诪爪讟讬讬谉?<br/>
转砖讜讘讛: 讛诪讻讜谉 讛讬砖专讗诇讬 诇讚诪讜拽专讟讬讛
砖讗诇讛: 诪转讬 讛转拽讬讬诪讜 讛讘讞讬专讜转 诇讻谞住转 讛注砖专讬诐? <br/>
转砖讜讘讛: 讘志2015
砖讗诇讛: 诇讗讬讝讜 讻谞住转 谞讻谞住讛 讜专讘讬谉 诇专讗砖讜谞讛? <br/>
转砖讜讘讛:
</blockquote>
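For illustration, assembling such a prompt might look like the following sketch (the helper name and exact formatting are illustrative, not the evaluation harness's actual code):

```python
def build_qa_prompt(context, shots, question):
    # Context paragraph first, then the solved question/answer pairs,
    # then the target question left unanswered for the model to complete.
    lines = [context]
    for q, a in shots:
        lines.append("砖讗诇讛: " + q)
        lines.append("转砖讜讘讛: " + a)
    lines.append("砖讗诇讛: " + question)
    lines.append("转砖讜讘讛:")
    return chr(10).join(lines)  # chr(10) is the newline character
```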
2. Sentiment Acc (Mafat)
    - **Source**: We use a test subset of an early version of the Hebrew Sentiment dataset, released by Mafat & NNLP-IL [here](https://www.facebook.com/groups/MDLI1/permalink/2681774131986618/). The latest version of the data can be found [here](https://github.com/NNLP-IL/Hebrew-Question-Answering-Dataset), though it differs from the data we used.
    - **Scoring**: We compute the accuracy score on the predictions, expecting either "讞讬讜讘讬", "砖诇讬诇讬", or "谞讟专诇讬".
    - **Number of examples**: 3,000 examples, 1,000 from each category. These examples were selected by a linguist tagger.
    - **Few-Shot Format**: For every prompt, we provide 9 few-shot examples, 3 from each category, randomly shuffled.
For example:
<blockquote dir="rtl" style='text-align: right; background-color: #f0f0f0'>
<p>
诪砖驻讟: 诪砖驻讟 讞讬讜讘讬 <br/>
转砖讜讘讛: 讞讬讜讘讬
诪砖驻讟: 诪砖驻讟 砖诇讬诇讬 <br/>
转砖讜讘讛: 砖诇讬诇讬
诪砖驻讟: 诪砖驻讟 谞讟专诇讬 <br/>
转砖讜讘讛: 谞讟专诇讬
...
诪砖驻讟: 诪砖驻讟 讻诇砖讛讜 <br/>
转砖讜讘讛:
</p>
</blockquote>
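The accuracy scoring used by the label-based tasks (sentiment, and similarly SNLI and Winograd) can be sketched as follows; this is a simplified illustration, not the harness's actual code:

```python
# The three labels the sentiment task expects the model to emit.
SENTIMENT_LABELS = ("讞讬讜讘讬", "砖诇讬诇讬", "谞讟专诇讬")  # positive / negative / neutral

def label_accuracy(predictions, gold):
    # Exact-match accuracy; predictions are stripped first, since model
    # completions often carry surrounding whitespace.
    assert len(predictions) == len(gold)
    hits = sum(1 for p, g in zip(predictions, gold) if p.strip() == g)
    return hits / len(gold)
```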
3. Winograd (Binary) Acc
    - **Source**: We use `A Translation of the Winograd Schema Challenge to Hebrew`, translated by Dr. Vered Shwartz. The data can be found [here](https://www.cs.ubc.ca/~vshwartz/resources/winograd_he.jsonl).
    - **Scoring**: We provide the two possible answers in the prompt, and compute the accuracy score.
    - **Number of examples**: 278 examples.
    - **Few-Shot Format**: For every prompt, we provide 5 few-shot examples, and then the question at hand. Each example is formatted with the input sentence and its question, the possible answers, and the expected answer.
For example:
<blockquote dir="rtl" style='text-align: right; background-color: #f0f0f0'>
<p>
砖讗诇讛: 讛砖讜讟专讬诐 注爪专讜 讗转 讞讘专讬 讛讻谞讜驻讬讛. 讛诐 谞讬讛诇讜 讗专讙讜谉 砖诇 住讞专 讘住诪讬诐. 诪讬 谞讬讛诇讜? <br/>
讗驻砖专讜讬讜转: "讞讘专讬 讛讻谞讜驻讬讛" 讗讜 "讛砖讜讟专讬诐"<br/>
转砖讜讘讛: 讞讘专讬 讛讻谞讜驻讬讛
...
砖讗诇讛: 讛砖讜注诇讬诐 讛讬讜 诪讙讬注讬诐 讘诇讬诇讜转 诇转拽讜祝 讗转 讛转专谞讙讜诇讬诐, 讗讝 讛讬讬转讬 爪专讬讱 诇砖诪讜专 注诇讬讛诐. 注诇 诪讬 讛讬讬转讬 爪专讬讱 诇砖诪讜专?<br/>
讗驻砖专讜讬讜转: "讛转专谞讙讜诇讬诐" 讗讜 "讛砖讜注诇讬诐"<br/>
转砖讜讘讛:
</p>
</blockquote>
4. Translation BLEU
    - **Source**: We use the aligned translation corpus `NeuLab-TedTalks`, which can be found [here](https://opus.nlpl.eu/NeuLab-TedTalks/en&he/v1/NeuLab-TedTalks).
    - **Scoring**: We use the `sacrebleu.sentence_bleu` scoring function.
    - **Number of examples**: We took 1,000 random examples of 30-40 words in length from the aligned corpus, and compute the mean score for translating them from English to Hebrew and from Hebrew to English (a total of 2,000 examples).
    - **Few-Shot Format**: For every prompt, we provide 3 few-shot examples consisting of an English sentence and its Hebrew equivalent. The order depends on the direction we are translating.
For example:
<blockquote style="background-color: #f0f0f0;">
<p>
English: Some sentence in English<br/>
Hebrew: 诪砖驻讟 讘注讘专讬转.
...
English: Some sentence to translate to Hebrew <br/>
Hebrew:
</p>
</blockquote>
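Averaging the sentence-level scores over both directions can be sketched like this (the helper is illustrative; in the real pipeline the scorer would be sacrebleu's `sentence_bleu`, whose result object exposes the BLEU value as `.score`):

```python
def mean_sentence_score(pairs, score_fn):
    # pairs: (hypothesis, reference) tuples pooled from both translation
    # directions (2,000 in total in our setup). score_fn is any
    # sentence-level metric, e.g. with sacrebleu:
    #   score_fn = lambda hyp, ref: sacrebleu.sentence_bleu(hyp, [ref]).score
    total = 0.0
    for hyp, ref in pairs:
        total += score_fn(hyp, ref)
    return total / len(pairs)
```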
5. SNLI Accuracy
    - **Source**: We took a sample of documents from the test subset of the official SNLI corpus.
    - **Scoring**: We compute the accuracy score on the predictions, expecting either "住转讬专讛", "讛转讗诪讛", or "讻诇讜诐".
    - **Number of examples**: There are a total of 210 examples - 70 from each class - where each example was translated using [Dicta's translation engine](https://translate.dicta.org.il), and then manually reviewed and corrected as needed.
    - **Few-Shot Format**: For every prompt, we provide 12 few-shot examples, 4 from each category.
For example:
<blockquote dir="rtl" style='text-align: right; background-color: #f0f0f0'>
<p>
讛谞讞转 讬住讜讚: 谞注专 诪谞讙谉 讘讞爪讜爪专转讜 讘诪讛诇讱 讛讜驻注讛 注诐 诇讛拽转讜.<br/>
讛砖注专讛: 诇讗祝 讗讞讚 讗讬谉 讞爪讜爪专讛.<br/>
转砖讜讘讛: 住转讬专讛<br/>
...
讛谞讞转 讬住讜讚: 讛谞注专讛 诇讘讜砖讛 讘诪注讬诇 讞讜诐, 讘注讜讚讛 驻讜住注转 讘砖诇讙.<br/>
讛砖注专讛: 讛讙讘专转 讛诇讜讘砖转 诪注讬诇 诪讞驻砖转 讗转 讻诇讘讛 讛讗讜讘讚.<br/>
转砖讜讘讛: 讻诇讜诐<br/>
...
讛谞讞转 讬住讜讚: 住驻讬谞转志驻讗专 讘讛 讗谞砖讬诐 注讜诇讬诐 讜讬讜专讚讬诐.<br/>
讛砖注专讛: 讗谞砖讬诐 注讜诇讬诐 讜讬讜专讚讬诐 诪住驻讬谞讜转.<br/>
转砖讜讘讛: 讛转讗诪讛<br/>
...
讛谞讞转 讬住讜讚: 讛谞讞讛 讞讚砖讛<br/>
讛砖注专讛: 讛砖注专讛 讞讚砖讛<br/>
转砖讜讘讛:
</p>
</blockquote>
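The stratified few-shot sampling used by the classification tasks (N examples per category, shuffled so the labels are interleaved) can be sketched as follows; the helper and its signature are illustrative, not the harness's actual code:

```python
import random

def sample_fewshot(examples_by_label, per_label, seed=0):
    # examples_by_label: list of (label, examples) pairs.
    # Draw per_label examples from each label's pool, then shuffle the
    # pooled shots so no label ordering leaks into the prompt.
    rng = random.Random(seed)  # seeded for reproducible prompts
    shots = []
    for label, pool in examples_by_label:
        for ex in rng.sample(pool, per_label):
            shots.append((label, ex))
    rng.shuffle(shots)
    return shots
```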
6. Israeli Trivia
    - **Source**: We use a corpus of several hundred trivia questions about general world knowledge, with a focus on Israel-related facts. We extend immense gratitude to Avraham Elitzur for helping us curate this dataset.
    - **Scoring**: We provide multiple-choice answers in the prompt, and compute the accuracy score.
    - **Number of examples**: 300 examples.
    - **Few-Shot Format**: For every prompt, we provide 5 few-shot examples, and then the question at hand. Each example is formatted with the question, the possible answers, and the expected answer.
For example:
<blockquote dir="rtl" style="background-color: #f0f0f0;">
<p>
砖讗诇讛: 讘讗讬讝讜 注讬专 谞注专讻讛 讛讻专讝转 讛诪讚讬谞讛 讛专砖诪讬转 注诇 讬讚讬 讚讜讚 讘谉 讙讜专讬讜谉?<br/>
讗驻砖专讜讬讜转: "讬专讜砖诇讬诐" / "讗砖拽诇讜谉" / "转诇 讗讘讬讘" / "爪驻转"<br/>
转砖讜讘讛: 转诇 讗讘讬讘<br/>
...
砖讗诇讛: 砖讗诇讛 讞讚砖讛<br/>
讗驻砖专讜讬讜转: "讗驻砖专讜转 讗" / "讗驻砖专讜转 讘" ...<br/>
转砖讜讘讛:
</p>
</blockquote>
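Formatting one such multiple-choice block might look like this sketch (helper name and formatting are illustrative, not the harness's actual code):

```python
def format_trivia_example(question, options, answer=None):
    # One block of the prompt: the question, the quoted options joined
    # with " / ", then the answer line (left blank for the target question).
    quoted = " / ".join('"' + opt + '"' for opt in options)
    lines = ["砖讗诇讛: " + question, "讗驻砖专讜讬讜转: " + quoted]
    if answer is not None:
        lines.append("转砖讜讘讛: " + answer)
    else:
        lines.append("转砖讜讘讛:")
    return chr(10).join(lines)  # chr(10) is the newline character
```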
""" | |
EVALUATION_QUEUE_TEXT = """
## Important Note
Due to budget restrictions, we have a cap on the number of models that can be tested per month. Please only submit your model when you are ready for testing. We also have limits on the number of models that can be submitted per user.
## Some good practices before submitting a model
### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
Note: make sure your model is public!
Note: we do not automatically support models that require `trust_remote_code=True`; please reach out to the email listed below if you would like to run such a model.
### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 馃
### 4) Fill up your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done and the model still won't run, please reach out to `shaltiel@dicta.org.il` with the details.
Note: larger models aren't automatically approved to run; if you wish to evaluate a larger model, please reach out to the email above.
"""