RaymondAISG committed
Commit: 844e79e
Parent(s): 70a8b9f
Update README.md
README.md CHANGED
@@ -56,8 +56,7 @@ IFEval evaluates a model's ability to adhere to constraints provided in the prompt
 MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the weighted win rate against the baseline model (i.e. average win rate across each category (Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction)). A tie is given a score of 0.5.
 
 
-For more details on Llama3 8B CPT SEA-LIONv2 Instruct benchmark performance, please refer to the
-https://leaderboard.sea-lion.ai/
+For more details on Llama3 8B CPT SEA-LIONv2 Instruct benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/
 
 
 ### Usage
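As context for the MT-Bench metric described in the hunk above, a minimal sketch of the scoring it implies: a per-category win rate against the baseline (win = 1, tie = 0.5, loss = 0), averaged across the seven listed categories. The `judgements` input format is an illustrative assumption, not the SEA-LION evaluation code.

```python
# Minimal sketch of the MT-Bench weighted win rate described above.
# Assumption: `judgements` maps each category to a list of per-question
# outcomes ("win", "tie", "loss") for the candidate model vs. the baseline.

CATEGORIES = ["Math", "Reasoning", "STEM", "Humanities",
              "Roleplay", "Writing", "Extraction"]

def weighted_win_rate(judgements: dict[str, list[str]]) -> float:
    scores = {"win": 1.0, "tie": 0.5, "loss": 0.0}  # a tie scores 0.5
    per_category = [
        sum(scores[o] for o in judgements[cat]) / len(judgements[cat])
        for cat in CATEGORIES
    ]
    # Average the per-category win rates so every category contributes
    # equally, regardless of how many questions it contains.
    return sum(per_category) / len(per_category)
```

Averaging per-category rates rather than pooling all questions keeps question-heavy categories from dominating the overall score.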