beomi committed
Commit a21614e • 2 parents: 68eb804, 5a0fcef

Merge branch 'main' of https://huggingface.co/beomi/open-llama-2-ko

Files changed (4)
  1. LICENSE +21 -0
  2. README.md +126 -0
  3. corpus/AI_HUB +50 -0
  4. corpus/MODU_CORPUS +6 -0
LICENSE ADDED
MIT License

Copyright (c) 2023 Junbum Lee (Beomi)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md ADDED
---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---

**Update Log**

- 2023.12.14: First Release of Open-Llama-2-Ko

# **Open-Llama-2-Ko** 🦙🇰🇷

Open-Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining.
Just like its predecessor, Open-Llama-2-Ko operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters.
This repository focuses on the 7B pretrained version, which is tailored to fit the Hugging Face Transformers format.

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series used only publicly accessible Korean corpora,
including [AI Hub](https://www.aihub.or.kr), [Modu Corpus (모두의 말뭉치)](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).

Since training used only publicly available corpora, this model is open to everyone without any restrictions. (This model follows the MIT License.)

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes (7B and 13B) as well as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Open-Llama-2-Ko|*A new mix of publicly accessible Korean corpora*|7B|2k|✗|>15B*|5e-5|
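Since the checkpoint is distributed in the Hugging Face Transformers format, it should load with the standard `Auto*` classes. Below is a minimal usage sketch, not an official example: it assumes this repository's model id (`beomi/open-llama-2-ko`) and an environment with `torch`, `transformers`, and `accelerate` installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "beomi/open-llama-2-ko"  # assumed model id: this repository

# use_fast=True matches the tokenizer note later in this card: the tokenizer
# ships as a HF `tokenizers` fast tokenizer, not a sentencepiece model.
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # BF16 on NVIDIA GPUs; see the Apple Silicon note below
    device_map="auto",           # requires `accelerate`
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```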
**Training Corpus**

Trained with selected corpora from AI Hub and Modu Corpus. The detailed list of datasets used to train this model is available below:

- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
  - Used only the `Training` part of the data.
  - Explicitly dropped the `Validation`/`Test` parts of the data.
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)

The final JSONL dataset used to train this model is 61GB.

Total number of tokens: approx. 15B tokens (*measured with the expanded tokenizer; with the original Llama tokenizer, >60B tokens)
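For reference, a count like this can be reproduced by streaming the JSONL files through the tokenizer and summing sequence lengths. A minimal sketch, assuming one-document-per-line JSONL with a `text` field (the file path and field name here are hypothetical, not the actual training pipeline):

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko", use_fast=True)

total = 0
with open("corpus_shard.jsonl", encoding="utf-8") as f:  # hypothetical shard path
    for line in f:
        doc = json.loads(line)
        # assumes each JSON line stores its raw text under a "text" key
        total += len(tokenizer(doc["text"], add_special_tokens=False).input_ids)

print(f"{total:,} tokens")
```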
**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | SentencePiece BPE |
| **Expanded Llama-2-Ko** | 46336 | SentencePiece BPE. Added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
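The two tables above can be reproduced with `AutoTokenizer`; a sketch, assuming access to the gated `meta-llama/Llama-2-7b-hf` tokenizer for the baseline:

```python
from transformers import AutoTokenizer

llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo; requires access
llama2_ko = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko", use_fast=True)

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Llama 2: Open Foundation and Fine-Tuned Chat Models",
]:
    print("Llama-2   :", llama2.tokenize(text))
    print("Llama-2-Ko:", llama2_ko.tokenize(text))
```

On Korean input the expanded tokenizer emits roughly a quarter as many tokens, which is why the same corpus measures approx. 15B tokens here versus >60B with the original tokenizer.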
# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) (polyglot branch)

TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` catch in the `load_tokenizer` function (line 109 or nearby) in `modules/models.py`:

```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```
Since Llama-2-Ko uses the fast tokenizer provided by the HF `tokenizers` library, NOT the sentencepiece package,
the `use_fast=True` option is required when initializing the tokenizer.
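For example, a direct initialization sketch:

```python
from transformers import AutoTokenizer

# Loads the fast tokenizer (tokenizer.json) shipped with this repo.
tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko", use_fast=True)

# By contrast, use_fast=False tries to load a sentencepiece model, which this
# repo does not rely on; that is the failure patched in the diff above.
```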
Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)
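A small device/dtype selection sketch reflecting this:

```python
import torch

# Prefer BF16 on NVIDIA GPUs that support it; otherwise (e.g. Apple Silicon)
# fall back to CPU with FP32.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    device, dtype = "cuda", torch.bfloat16
else:
    device, dtype = "cpu", torch.float32
print(device, dtype)
```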
## Citation

TBD

## Acknowledgement

- The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
- The training corpus is from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
corpus/AI_HUB ADDED
754M ./001.문서요약.jsonl
397M ./006.전문분야한영.jsonl
486M ./016.행정_문서_대상_기계독해_데이터.jsonl
563M ./017.뉴스_기사_기계독해_데이터.jsonl
1.2G ./018.논문자료_요약_데이터.jsonl
88M ./019.법률,_규정_(판결서,_약관_등)_텍스트_분석_데이터.jsonl
75M ./020.주제별_텍스트_일상_대화_데이터.jsonl
265M ./021.도서자료_기계독해.jsonl
30M ./021.용도별_목적대화_데이터.jsonl
566M ./022.요약문_및_레포트_생성_데이터.jsonl
19G ./023.전문분야_말뭉치_데이터(분야별_개체명_인식_포함).jsonl
253M ./023.방송_콘텐츠_대본_요약_데이터.jsonl
918M ./025.일상생활_및_구어체_한-영_번역_병렬_말뭉치_데이터.jsonl
307M ./026.한국어-영어_번역_말뭉치_1.jsonl
1.3G ./026.기술과학_분야_한-영_번역_병렬_말뭉치_데이터.jsonl
309M ./027.한국어-중국어_번역_말뭉치_1.jsonl
347M ./027.한국어-영어_번역_말뭉치_2.jsonl
538M ./027.일상생활_및_구어체_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
276M ./028.한국어-중국어_번역_말뭉치_2.jsonl
300M ./028.다국어_구어체_번역_병렬_말뭉치_데이터.jsonl
410M ./029.한국어-일본어_번역_말뭉치.jsonl
542K ./029.대규모_구매도서_기반_한국어_말뭉치_데이터.jsonl
9.9G ./030.웹데이터_기반_한국어_말뭉치_데이터.jsonl
1.4G ./031.온라인_구어체_말뭉치_데이터.jsonl
258M ./032.방송콘텐츠_한국어-영어_번역_말뭉치.jsonl
84M ./032.특허_분야_자동분류_데이터.jsonl
239M ./034.방송콘텐츠_한국어-유럽어_번역_말뭉치.jsonl
65M ./044.페르소나_대화.jsonl
56M ./045.지식검색_대화.jsonl
67M ./046.공감형_대화.jsonl
85M ./049.일반상식_문장_생성_평가_데이터.jsonl
13M ./050.발화유형(문어,구어,채팅)별_기계번역_병렬_말뭉치.jsonl
193K ./052.기계번역_품질_검증_데이터.jsonl
118M ./053.한국어-다국어(영어_제외)_번역_말뭉치(기술과학).jsonl
127M ./054.한국어-다국어_번역_말뭉치(기초과학).jsonl
67M ./055.한국어-다국어_번역_말뭉치(인문학).jsonl
205M ./11.기계독해.jsonl
259M ./141.한국어_멀티세션_대화.jsonl
248M ./142.한국어_지식기반_관계_데이터.jsonl
108M ./143.민원_업무_효율,_자동화를_위한_언어_AI_학습데이터.jsonl
2.4G ./146.낚시성_기사_탐지_데이터.jsonl
23M ./147.텍스트_윤리검증_데이터.jsonl
632M ./153.기술과학_요약_데이터.jsonl
962M ./155.산업정보_연계_주요국_특허_영-한_데이터.jsonl
1.1G ./156.전문분야_영-한,_중-한_번역_말뭉치(식품).jsonl
236M ./157.방송_콘텐츠_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
418M ./157.추상_요약_사실성_검증_데이터.jsonl
12M ./158.시간_표현_탐지_데이터.jsonl
17M ./159.문장_유형(추론,_예측_등)_판단_데이터.jsonl
1.4G ./297.SNS_데이터_고도화.jsonl
corpus/MODU_CORPUS ADDED
일상대화말뭉치 (Everyday Conversation Corpus) 2020, 2021
신문 말뭉치 (Newspaper Corpus) 2020, 2021, 2022
유사 문장 말뭉치 (Similar Sentence Corpus)
문서 요약 말뭉치 (Document Summarization Corpus)
문어 말뭉치 (Written Language Corpus)
의미역 분석 말뭉치 (Semantic Role Analysis Corpus)