Update README.md
#3 by Taekyoon - opened

README.md CHANGED
@@ -20,7 +20,7 @@ tags:
 **Original Gemma Model Page**: [Gemma](https://ai.google.dev/gemma/docs)
 
 This model card corresponds to the 7B base version of the **Gemma-Mling** model,
-continual pretrained on Korean/English/Chinese/Japanese corpus.
+continually pretrained mainly on Korean/English/Chinese/Japanese corpora plus a 500-language multilingual corpus.
 
 **Resources and Technical Documentation**:
 
@@ -96,6 +96,20 @@ Details about the model internals.
 
 Training was done using [beomi/Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM).
 
+### Dataset
+
+We trained on a mixture of multilingual datasets for up to 100B tokens.
+The released model is the best-performing checkpoint according to the Evaluation section below.
+
+For the Korean and English portion, we used a sampled llama2ko training dataset that combines the two languages at a 1:1 ratio.
+
+| Dataset                  | Jsonl (GB) | Sampled |
+|--------------------------|------------|---------|
+| range3/cc100-ja          | 96.39      | No      |
+| Skywork/SkyPile-150B     | 100.57     | Yes     |
+| llama2ko dataset (ko/en) | 108.5      | Yes     |
+| cis-lmu/Glot500          | 181.24     | No      |
+| Total                    | 486.7      |         |
 
 ## Evaluation
 
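For illustration, a minimal sketch of the 1:1 Korean/English mixing described in the added Dataset section, using the Hugging Face `datasets` library. The repository IDs below are placeholders, not the actual corpora, and the real training pipeline was [beomi/Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM).

```python
# Minimal sketch (not the actual Gemma-EasyLM pipeline): interleave a Korean and an
# English stream at equal probability to approximate the 1:1 llama2ko-style mix.
from datasets import load_dataset, interleave_datasets

# Placeholder repository IDs; substitute the real Korean/English corpora.
ko_stream = load_dataset("placeholder/korean-corpus", split="train", streaming=True)
en_stream = load_dataset("placeholder/english-corpus", split="train", streaming=True)

# Sample from each stream with probability 0.5 so the mixture is roughly 1:1.
mixed = interleave_datasets(
    [ko_stream, en_stream],
    probabilities=[0.5, 0.5],
    seed=42,
)

# Inspect a few mixed examples.
for example in mixed.take(3):
    print(example)
```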