---
license: apache-2.0
language:
- ko
- en
tags:
- h2o-danube2
- korean
- sLLM
- llm
datasets:
- uonlp/CulturaX
base_model:
- h2oai/h2o-danube2-1.8b-base
---
## Model Details
danube-ko-1.8b-base is a continually pre-trained Korean language model based on [h2oai/h2o-danube2-1.8b-base](https://huggingface.co/h2oai/h2o-danube2-1.8b-base).
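A minimal loading and generation sketch with the `transformers` library is shown below. The repository ID is a hypothetical placeholder, not confirmed by this card; substitute the actual Hub path of this model.

```python
# Minimal usage sketch for a causal LM on the Hugging Face Hub.
# "danube-ko-1.8b-base" below is a placeholder repository ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "danube-ko-1.8b-base"  # placeholder; replace with the actual Hub repo ID

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# This is a base (completion) model, so plain text continuation is the expected usage.
inputs = tokenizer("대한민국의 수도는", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```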
## Model Developers
Jinhong Jeong, Ungsang Yoon
## Model Architecture
The vocabulary size was expanded from the original 32,000 to 40,000 to add Korean tokens efficiently. We used the [EEVE](https://arxiv.org/abs/2402.14714) technique for training. The model has a sequence length of 2048. Everything else is the same as the original model.
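The sketch below illustrates the vocabulary-expansion step in outline only: new Korean tokens are added to the base tokenizer and the embedding matrices are resized to match. The token list is a made-up example, and the actual EEVE recipe additionally involves careful initialization of the new embeddings and multi-stage training, which is not shown here.

```python
# Illustrative sketch of expanding the vocabulary of the base model (not the exact training code).
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "h2oai/h2o-danube2-1.8b-base"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Hypothetical examples; the real token list grows the vocabulary from 32,000 to 40,000.
new_korean_tokens = ["안녕하세요", "대한민국"]
num_added = tokenizer.add_tokens(new_korean_tokens)

# Resize input/output embeddings so they cover the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```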
## Training Datasets
We used CulturaX, Common Crawl CC-MAIN-2024-10, AI Hub data, Korean wikis, corpora from the National Institute of the Korean Language, the Standard Korean Dictionary, and similar sources. About 42 GB of data was used for training.
## Model Benchmark
This model is ranked #1 in Ko-MMLU on the [Open Ko-LLM Leaderboard](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard) among pretrained Korean models of size 2B or smaller as of July 5, 2024.
| Task | Value |
| --- | --- |
| Ko-ARC | 31.74 |
| Ko-HellaSwag | 44.44 |
| Ko-MMLU | 28.06 |
| Ko-TruthfulQA | 41.63 |
| Ko-CommonGen V2 | 32.7 |
| kmmlu_direct | 29.05 |
| kobest | 59.13 |
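The `kmmlu_direct` and `kobest` rows correspond to task names in EleutherAI's lm-evaluation-harness; whether the reported scores were produced exactly this way is an assumption. A hedged reproduction sketch using the harness's Python API (the repository ID is again a placeholder):

```python
# Hedged sketch: evaluating the model on the kmmlu_direct and kobest task groups
# with lm-evaluation-harness (pip install lm-eval). Exact settings used for the
# reported numbers are not stated in this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=danube-ko-1.8b-base,dtype=bfloat16",  # placeholder repo ID
    tasks=["kmmlu_direct", "kobest"],
    batch_size=8,
)
print(results["results"])
```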
## Disclaimer
The Model can generate information that is biased, discriminatory, socially inappropriate, etc. The Model can also generate information that is not accurate. The Model is used at your own risk, and the developers are not responsible for the information generated by the model. |