Update README.md
README.md
CHANGED
@@ -33,6 +33,7 @@ We evaluated LLaMA3 8B SEA-LIONv2 Instruct on both general language capabilities
#### General Language Capabilities

For the evaluation of general language capabilities, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).

Note: For Sentiment Analysis and Toxicity Detection scores, we use a modified F1 score that takes all model generations for those tasks into consideration. The original F1 calculation strictly excludes any generation that does not exactly match one of the pre-defined labels (for instance, a generation such as "The sentiment of this text is positive because..." is excluded from the score). Our modified F1 metric includes such generations in the score.
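
For illustration, here is a minimal sketch of how such a modified F1 could be computed, assuming scikit-learn's `f1_score` and a simple keyword match to map free-form generations back to labels. The function names, label set, and matching rule here are hypothetical; the actual BHASA scoring code may differ.

```python
from sklearn.metrics import f1_score

LABELS = ["positive", "negative", "neutral"]

def normalize(generation: str) -> str:
    """Map a free-form generation to a label instead of discarding it."""
    text = generation.lower()
    for label in LABELS:
        if label in text:
            return label
    return "invalid"  # counted as a wrong prediction, not excluded

def modified_f1(generations, references):
    predictions = [normalize(g) for g in generations]
    # Macro-average over the pre-defined label set; "invalid" predictions
    # count against the score instead of being dropped from it.
    return f1_score(references, predictions, labels=LABELS, average="macro")

# A verbose generation still contributes to the score here, whereas the
# strict F1 calculation would have excluded it outright.
print(modified_f1(
    ["The sentiment of this text is positive because...", "negative", "It reads as neutral."],
    ["positive", "negative", "neutral"],
))
```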

The evaluation was done zero-shot with native prompts, using a sample of 100-1000 instances per dataset, as per the setting described in the paper.