sagorsarker
committed on
Update README.md
README.md
CHANGED
@@ -11,11 +11,13 @@ tags:
|
|
11 |
- llama-3
|
12 |
- llama-factory
|
13 |
license: llama3.2
|
14 |
---
|
15 |
|
16 |
## Model Information
|
17 |
|
18 |
-
This model is a continually
|
19 |
|
20 |
**Model Architecture:** Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture.
|
21 |
|
@@ -29,9 +31,9 @@ This model is a continually pretrained version of the [meta-llama/Llama-3.2-1B](
|
|
29 |
|
30 |
**Model Release Date:** October 24, 2024
|
31 |
|
32 |
-
**Status:** This is a static model trained on an offline dataset. Future versions may be released
|
33 |
|
34 |
-
**License:** We are using
|
35 |
|
36 |
|
37 |
## How to use
|
@@ -67,14 +69,14 @@ pipe("আমাদের দেশের নাম")
|
|
67 |
**Overview:** We have collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered using several filtering criteria to ensure data quality. The collected data totals roughly 268 GB. We separated __22 GB__ of data from it, sampled according to the ratio of the actual data sizes. The model was trained on a total of __6B__ tokens.
|
68 |
|
69 |
Data sources summary:
|
70 |
-
- Web documents: Extract, clean, filter common crawl data
|
71 |
-
- Books: Extract, clean, filter
|
72 |
- Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
|
73 |
- Translation data: We trained a Bangla-English translation LLM model and used it to translate English data to Bangla
|
74 |
- Code-mixed data: We trained a Bangla-English code-mixed LLM model and used it to generate code-mixed data
|
75 |
- Transliteration data: We trained a Bangla-English transliteration LLM model and used it to generate transliterated data
|
76 |
- Synthetic data: We generated synthetic data using a Bangla LLM model
|
77 |
-
- Others: We scrap some selected
|
78 |
|
79 |
|
80 |
## Benchmarks
|
@@ -82,11 +84,11 @@ Data sources summary:
|
|
82 |
In this section, we report the results for the __titulm-llama-3.2-1b-v1.0__ model on standard automatic benchmarks. For all these evaluations, we used the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation library.
|
83 |
|
84 |
### Evaluation Datasets
|
85 |
-
We evaluated our
|
86 |
|
87 |
#### Bangla Benchmark datasets
|
88 |
We evaluated the models on the following datasets:
|
89 |
-
- [Bangla MMLU](): A
|
90 |
- [CommonsenseQa Bangla](https://huggingface.co/datasets/hishab/commonsenseqa-bn): A Bangla translation of the CommonsenseQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
|
91 |
- [OpenbookQA Bangla](https://huggingface.co/datasets/hishab/openbookqa-bn): A Bangla translation of the OpenbookQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
|
92 |
- [Piqa Bangla](https://huggingface.co/datasets/hishab/piqa-bn): A Bangla translation of the Piqa dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
|
@@ -94,14 +96,17 @@ We evaluated the models on the following datasets:
|
|
94 |
|
95 |
#### English Benchmark datasets
|
96 |
- [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
|
97 |
-
- [CommonseQa](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a new multiple-choice question
|
98 |
- [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
|
99 |
- [Piqa](https://huggingface.co/datasets/ybisk/piqa): The PIQA dataset focuses on physical commonsense reasoning, challenging AI to handle everyday situations requiring practical knowledge and unconventional solutions. Inspired by instructables.com, it aims to enhance AI's ability to understand and reason about physical interactions.
|
100 |
- [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
|
101 |
|
102 |
### Evaluation Results
|
103 |
|
104 |
-
#### Evaluation
|
105 |
| Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
|
106 |
|---------------------------------|---------|-------------|----------|-------------------|----------------|---------|
|
107 |
| llama-3.2-1b | 0-shot | **0.29** | **0.55** | 0.22 | 0.33 | 0.53 |
|
@@ -109,7 +114,16 @@ We evaluated the models on the following datasets:
|
|
109 |
| hishab/titulm-llama-3.2-1b-v1.0 | 0-shot | 0.28 | 0.56 | **0.28** | **0.33** | **0.55**|
|
110 |
| | 5-shot | 0.28 | - | **0.31** | **0.34** | **0.57**|
|
111 |
|
112 |
-
#### Evaluation
|
113 |
|
114 |
|
115 |
### Instruction Tuned Models
|
@@ -118,5 +132,4 @@ We evaluated the models on the following datasets:
|
|
118 |
### Intended Use
|
119 |
- Bangla text generation
|
120 |
- Bangla language understanding tasks
|
121 |
-
- Bangla instruction fine-tuning tasks
|
122 |
-
|
|
|
11 |
- llama-3
|
12 |
- llama-factory
|
13 |
license: llama3.2
|
14 |
+
base_model:
|
15 |
+
- meta-llama/Llama-3.2-1B
|
16 |
---
|
17 |
|
18 |
## Model Information
|
19 |
|
20 |
+
This model is a continually pre-trained version of [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B), further trained on extensive Bangla datasets. The primary goal of the continual pretraining was to improve the model's ability to generate high-quality Bangla text. After this additional pretraining on Bangla data, the model outperforms the base model on several Bangla language understanding benchmarks and in Bangla text generation.
|
21 |
|
22 |
**Model Architecture:** Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture.
|
23 |
|
|
|
31 |
|
32 |
**Model Release Date:** October 24, 2024
|
33 |
|
34 |
+
**Status:** This is a static model trained on an offline dataset. Future versions may be released to improve model capabilities.
|
35 |
|
36 |
+
**License:** We use a license similar to that of Llama 3.2. Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).
|
37 |
|
38 |
|
39 |
## How to use
|
|
|
69 |
**Overview:** We have collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered using several filtering criteria to ensure data quality. The collected data totals roughly 268 GB. We separated __22 GB__ of data from it, sampled according to the ratio of the actual data sizes. The model was trained on a total of __6B__ tokens.
|
70 |
|
71 |
Data sources summary:
|
72 |
+
- Web documents: Extract, clean, and filter Common Crawl data
|
73 |
+
- Books: Extract, clean, and filter book data
|
74 |
- Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
|
75 |
- Translation data: We trained a Bangla-English translation LLM model and used it to translate English data to Bangla
|
76 |
- Code-mixed data: We trained a Bangla-English code-mixed LLM model and used it to generate code-mixed data
|
77 |
- Transliteration data: We trained a Bangla-English transliteration LLM model and used it to generate transliterated data
|
78 |
- Synthetic data: We generated synthetic data using a Bangla LLM model
|
79 |
+
- Others: We scraped data from selected websites, used open-source data, and drew on some other data sources
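
The overview says 22 GB was separated from roughly 268 GB using a ratio of the actual data sizes. One possible reading is a proportional allocation, where each source contributes to the 22 GB subset according to its share of the full collection. The sketch below illustrates that reading only; the per-source sizes and the function name `proportional_budget` are hypothetical placeholders, since the README does not report per-source sizes or the exact sampling procedure.

```python
def proportional_budget(source_sizes_gb, budget_gb):
    """Split budget_gb across sources in proportion to each source's size."""
    total = sum(source_sizes_gb.values())
    return {
        name: budget_gb * size / total
        for name, size in source_sizes_gb.items()
    }


# Hypothetical per-source sizes in GB (NOT from the README); chosen to sum to 268.
example_sizes = {
    "web": 150.0,
    "books": 40.0,
    "translated": 30.0,
    "synthetic": 28.0,
    "other": 20.0,
}

# 22 GB training subset out of roughly 268 GB collected (figures from the overview).
allocation = proportional_budget(example_sizes, budget_gb=22.0)
fraction = 22.0 / 268.0  # share of the collected data used for training

for name, size_gb in allocation.items():
    print(f"{name}: {size_gb:.2f} GB")
print(f"training subset is {fraction:.1%} of the collected data")
```

Under this reading, every source keeps the same relative weight in the training subset that it has in the full collection.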
|
80 |
|
81 |
|
82 |
## Benchmarks
|
|
|
84 |
In this section, we report the results for the __titulm-llama-3.2-1b-v1.0__ model on standard automatic benchmarks. For all these evaluations, we used the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation library.
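
As a rough sketch of how such an evaluation might be launched from Python, the snippet below assembles arguments for `lm_eval.simple_evaluate`. The task names are assumptions based on the harness's standard English tasks, and the helper names `build_eval_kwargs` and `run_eval` are illustrative; the exact tasks, harness version, and settings used for the reported numbers are not stated in this README.

```python
def build_eval_kwargs(model_id, tasks, num_fewshot):
    """Assemble keyword arguments for lm_eval.simple_evaluate."""
    return {
        "model": "hf",  # Hugging Face transformers backend
        "model_args": f"pretrained={model_id}",
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,
    }


def run_eval(model_id, tasks, num_fewshot):
    """Run the evaluation; downloads the model and datasets when called."""
    # Imported lazily so build_eval_kwargs works without lm-eval installed.
    import lm_eval

    return lm_eval.simple_evaluate(**build_eval_kwargs(model_id, tasks, num_fewshot))


# Assumed harness task names for the English benchmarks (not confirmed by the README).
ENGLISH_TASKS = ["mmlu", "boolq", "commonsense_qa", "openbookqa", "piqa"]

zero_shot = build_eval_kwargs("hishab/titulm-llama-3.2-1b-v1.0", ENGLISH_TASKS, 0)
five_shot = build_eval_kwargs("hishab/titulm-llama-3.2-1b-v1.0", ENGLISH_TASKS, 5)
```

Calling `run_eval("hishab/titulm-llama-3.2-1b-v1.0", ENGLISH_TASKS, 0)` would then start an actual 0-shot run, assuming `lm-eval` and `transformers` are installed.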
|
85 |
|
86 |
### Evaluation Datasets
|
87 |
+
We evaluated our pre-trained models on both Bangla and English benchmark datasets. Although the model is trained on Bangla data, its English capability is also evaluated on English benchmark datasets. The evaluation datasets are as follows:
|
88 |
|
89 |
#### Bangla Benchmark datasets
|
90 |
We evaluated the models on the following datasets:
|
91 |
+
- Bangla MMLU: A private multiple-choice question dataset developed by Hishab and curated from various sources.
|
92 |
- [CommonsenseQa Bangla](https://huggingface.co/datasets/hishab/commonsenseqa-bn): A Bangla translation of the CommonsenseQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
|
93 |
- [OpenbookQA Bangla](https://huggingface.co/datasets/hishab/openbookqa-bn): A Bangla translation of the OpenbookQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
|
94 |
- [Piqa Bangla](https://huggingface.co/datasets/hishab/piqa-bn): A Bangla translation of the Piqa dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
|
|
|
96 |
|
97 |
#### English Benchmark datasets
|
98 |
- [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
|
99 |
+
- [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a new multiple-choice question-answering dataset that requires different types of commonsense knowledge to predict the correct answers.
|
100 |
- [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
|
101 |
- [Piqa](https://huggingface.co/datasets/ybisk/piqa): The PIQA dataset focuses on physical commonsense reasoning, challenging AI to handle everyday situations requiring practical knowledge and unconventional solutions. Inspired by instructables.com, it aims to enhance AI's ability to understand and reason about physical interactions.
|
102 |
- [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
|
103 |
|
104 |
### Evaluation Results
|
105 |
|
106 |
+
#### Evaluation of Bangla Benchmark datasets
|
107 |
+
- **llama-3.2-1b** scores higher on **Bangla MMLU** in the 0-shot setting (**0.29** vs. 0.28); on **BoolQ BN** the 0-shot scores are nearly tied (0.55 vs. 0.56).
|
108 |
+
- **hishab/titulm-llama-3.2-1b-v1.0** performs better in **Commonsense QA BN**, **OpenBook QA BN**, and **PIQA BN**, achieving the highest scores in both 0-shot and 5-shot settings, with a maximum score of **0.57** in PIQA BN.
|
109 |
+
|
110 |
| Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
|
111 |
|---------------------------------|---------|-------------|----------|-------------------|----------------|---------|
|
112 |
| llama-3.2-1b | 0-shot | **0.29** | **0.55** | 0.22 | 0.33 | 0.53 |
|
|
|
114 |
| hishab/titulm-llama-3.2-1b-v1.0 | 0-shot | 0.28 | 0.56 | **0.28** | **0.33** | **0.55**|
|
115 |
| | 5-shot | 0.28 | - | **0.31** | **0.34** | **0.57**|
|
116 |
|
117 |
+
#### Evaluation of English Benchmark datasets
|
118 |
+
- **llama-3.2-1b** leads on every task in both the 0-shot and 5-shot settings, with 0-shot scores of **0.75** on **PIQA** and **0.64** on **BoolQ**.
|
119 |
+
- **hishab/titulm-llama-3.2-1b-v1.0** stays close on BoolQ, OpenBook QA, and PIQA but scores lower than **llama-3.2-1b** on every task, with the largest 0-shot gaps on MMLU (0.26 vs. 0.38) and Commonsense QA (0.34 vs. 0.47).
|
120 |
+
|
121 |
+
| Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
|
122 |
+
|--------------------------------------|--------|--------------|------------|--------------------|-----------------|-----------|
|
123 |
+
| llama-3.2-1b | 0-shot | **0.38** | **0.64** | **0.47** | **0.37** | **0.75** |
|
124 |
+
| | 5-shot | **0.309** | **0.662** | **0.317** | **0.396** | **0.759** |
|
125 |
+
| titulm-llama-3.2-1b-v1.0 | 0-shot | 0.26 | 0.63 | 0.34 | 0.35 | 0.73 |
|
126 |
+
| | 5-shot | 0.25 | 0.60 | 0.25 | 0.37 | 0.74 |
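
To make the comparison in the English table easier to check, the following sketch computes the per-task difference between the two models. The scores are copied from the table rows above; the helper names `score_gaps` and `mean_gap` are illustrative and not part of any evaluation library.

```python
# Scores copied from the English benchmark table (0-shot and 5-shot rows).
TASKS = ["MMLU", "BoolQ", "Commonsense QA", "OpenBook QA", "PIQA"]

SCORES = {
    ("llama-3.2-1b", 0): [0.38, 0.64, 0.47, 0.37, 0.75],
    ("llama-3.2-1b", 5): [0.309, 0.662, 0.317, 0.396, 0.759],
    ("titulm-llama-3.2-1b-v1.0", 0): [0.26, 0.63, 0.34, 0.35, 0.73],
    ("titulm-llama-3.2-1b-v1.0", 5): [0.25, 0.60, 0.25, 0.37, 0.74],
}


def score_gaps(base, tuned, shots):
    """Return {task: tuned score - base score}, rounded to 3 decimals."""
    base_row = SCORES[(base, shots)]
    tuned_row = SCORES[(tuned, shots)]
    return {
        task: round(tuned_score - base_score, 3)
        for task, base_score, tuned_score in zip(TASKS, base_row, tuned_row)
    }


def mean_gap(gaps):
    """Average gap across tasks, rounded to 3 decimals."""
    return round(sum(gaps.values()) / len(gaps), 3)


gaps_0shot = score_gaps("llama-3.2-1b", "titulm-llama-3.2-1b-v1.0", shots=0)
gaps_5shot = score_gaps("llama-3.2-1b", "titulm-llama-3.2-1b-v1.0", shots=5)

for task in TASKS:
    print(f"{task}: 0-shot {gaps_0shot[task]:+.3f}, 5-shot {gaps_5shot[task]:+.3f}")
```

A negative value means the Bangla-adapted model scores below the base model on that English task.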
|
127 |
|
128 |
|
129 |
### Instruction Tuned Models
|
|
|
132 |
### Intended Use
|
133 |
- Bangla text generation
|
134 |
- Bangla language understanding tasks
|
135 |
+
- Bangla instruction fine-tuning tasks
|
|