Spaces:
Running
Running
File size: 5,463 Bytes
2c07158 cc9a14a 2c07158 b5e73da 2c07158 cc9a14a 2c07158 2f21d72 2c07158 2f21d72 2c07158 b5e73da cc9a14a 2f21d72 2c07158 b5e73da 2c07158 b5e73da 2c07158 b5e73da 2c07158 2f21d72 2c07158 cc9a14a b5e73da 2f21d72 cc9a14a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
from dataclasses import dataclass
from enum import Enum
NUM_FEWSHOT = 0 # Change with your few shot
# ---------------------------------------------------
TITLE = """<h1>🇹🇠Thai LLM Leaderboard</h1>"""
# <a href="url"></a>
INTRODUCTION_TEXT = """
The Thai LLM Leaderboard 🇹🇠aims to standardize evaluation methods for large language models (LLMs) in the Thai language, building on <a href="https://github.com/SEACrowd">SEACrowd</a>.
As part of an open community project, we welcome you to submit new evaluation tasks or models.
This leaderboard is developed in collaboration with <a href="https://www.scb10x.com">SCB 10X</a>, <a href="https://www.vistec.ac.th/">Vistec</a>, and <a href="https://github.com/SEACrowd">SEACrowd</a>. Read more on <a href="https://blog.opentyphoon.ai/introducing-the-thaillm-leaderboard-thaillm-evaluation-ecosystem-508e789d06bf">Introduction Blog</a>
"""
LLM_BENCHMARKS_TEXT = f"""
The leaderboard currently consists of the following benchmarks:
- <b>Exam</b>
- <a href="https://huggingface.co/datasets/scb10x/thai_exam">ThaiExam</a>: ThaiExam is a Thai language benchmark based on examinations for high-school students and investment professionals in Thailand.
- <a href="https://arxiv.org/abs/2306.05179">M3Exam</a>: M3Exam is a novel benchmark sourced from authentic and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. This leaderboard uses the Thai subset of M3Exam.
- <b>LLM-as-a-Judge</b>
- <a href="https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai">Thai MT-Bench</a>: A Thai version of <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a> developed specially by VISTEC for probing Thai generative skills using the LLM-as-a-judge method.
- <b>NLU</b>
- <a href="https://huggingface.co/datasets/facebook/belebele">Belebele</a>: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants, where the Thai subset is used in this leaderboard.
- <a href="https://huggingface.co/datasets/facebook/xnli">XNLI</a>: XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages. This leaderboard uses the Thai subset of this corpus.
- <a href="https://huggingface.co/datasets/cambridgeltl/xcopa">XCOPA</a>: XCOPA is a corpus of translated and re-annotated English COPA, covers 11 languages. This is designed to measure the commonsense reasoning ability in non-English languages. This leaderboard uses the Thai subset of this corpus.
- <a href="https://huggingface.co/datasets/pythainlp/wisesight_sentiment">Wisesight</a>: Wisesight sentiment analysis corpus contains social media messages in the Thai language with sentiment labels.
- <b>NLG</b>
- <a href="https://huggingface.co/datasets/csebuetnlp/xlsum">XLSum</a>: XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from the BBC. This corpus evaluates the summarization performance in non-English languages, and this leaderboard uses the Thai subset.
- <a href="https://huggingface.co/datasets/SEACrowd/flores200">Flores200</a>: FLORES is a machine translation benchmark dataset used to evaluate translation quality between English and low-resource languages. This leaderboard uses the Thai subset of Flores200.
- <a href="https://huggingface.co/datasets/iapp/iapp_wiki_qa_squad">iapp Wiki QA Squad</a>: iapp Wiki QA Squad is an extractive question-answering dataset derived from Thai Wikipedia articles.
<b>Metric Implementation Details</b>:
- Multiple-choice accuracy is calculated using the <a href="https://github.com/SEACrowd/seacrowd-experiments/blob/048536fc0d4614734d479b298ea00a1f520da42b/evaluation/main_nlu_prompt_batch.py#L71">SEACrowd implementation</a> of logits comparison, similar to the method used by the <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard</a> (<a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI Harness</a>). <a href="https://huggingface.co/blog/open-llm-leaderboard-mmlu">explain</a>
- BLEU is calculated using flores200's tokenizer using HuggingFace `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/sacrebleu">implementation</a>.
- ROUGEL is calculated using PyThaiNLP newmm tokenizer and HuggingFace `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/rouge">implementation</a>.
- LLM-as-a-judge rating is based on OpenAI's gpt-4o-2024-05-13 using the prompt defined in <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/judge_prompts.jsonl">lmsys MT-Bench</a>.
<b>Reproducibility</b>:
- For the reproducibility of results, we have open-sourced the evaluation pipeline. Please check out the repository <a href="https://github.com/scb-10x/seacrowd-eval">seacrowd-experiments</a>.
<b>Acknowledgements</b>:
- We are grateful to previous open-source projects that released datasets, tools, and knowledge. We thank community members for tasks and model submissions. To contribute, please see the submit tab.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{thaillm-leaderboard,
author={SCB 10X and VISTEC and SEACrowd},
title={Thai LLM Leaderboard},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard}
}"""
|