---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---
Update Log
- 2023.12.14: First Release of Open-Llama-2-Ko
Open-Llama-2-Ko 🦙🇰🇷
Open-Llama-2-Ko is an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Just like its predecessor, it operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version, tailored to fit the Hugging Face Transformers format.
The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series used only publicly accessible Korean corpora, including AI Hub, Modu Corpus (모두의 말뭉치), and Korean Wikipedia.
Since training was done using only publicly available corpora, this model is open to everyone without any restrictions. (This model is released under the MIT License.)
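Below is a minimal usage sketch with the Hugging Face Transformers text-generation pipeline. The repository id beomi/open-llama-2-ko-7b and the generation settings are illustrative assumptions, not specified by this card; adjust them to your environment.
```python
# Minimal generation sketch (repo id and settings are assumptions, not from this card).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="beomi/open-llama-2-ko-7b",  # assumed repo id
    torch_dtype=torch.bfloat16,        # use float32 where BF16 is unavailable (e.g. Apple Silicon)
    device_map="auto",                 # requires the accelerate package
)

# Prompt in Korean: "The capital of Korea is"
print(pipe("한국의 수도는", max_new_tokens=32, do_sample=True, top_p=0.9)[0]["generated_text"])
```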
Model Details
Model Developers Junbum Lee (Beomi)
Variations Open-Llama-2-Ko will come in a range of parameter sizes (7B and 13B) as well as pretrained variations.
Input Models input text only.
Output Models generate text only.
Model Architecture
Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.
Model | Training Data | Params | Content Length | GQA | Tokens | LR |
---|---|---|---|---|---|---|
Llama 2 | A new mix of publicly accessible Korean corpora | 7B | 2k | ✗ | >15B* | 5e-5 |
Train Corpus
Trained with a selected corpus from AI Hub and Modu Corpus. The detailed list of datasets used to train this model is available below:
- AI Hub: corpus/AI_HUB
  - Used only the Training part of the data.
  - Explicitly dropped the Validation/Test part of the data.
- Modu Corpus: corpus/MODU_CORPUS
  - Used only the Training part of the data.
The final JSONL dataset used to train this model is 61GB.
Total number of tokens: approximately 15B (measured with the expanded tokenizer; with the original Llama tokenizer, >60B tokens).
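For reference, a corpus-level token count like the one above can be reproduced with a short script. This is only a sketch under stated assumptions: the repo id, the JSONL filename, and the "text" field name are illustrative and not taken from this card.
```python
# Sketch: count tokens in a JSONL corpus with the expanded tokenizer.
# Assumptions: repo id, file name, and the "text" field are illustrative only.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b", use_fast=True)

total_tokens = 0
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        # add_special_tokens=False counts corpus tokens only (no BOS/EOS)
        total_tokens += len(tokenizer.encode(doc["text"], add_special_tokens=False))

print(f"Total tokens: {total_tokens:,}")
```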
Vocab Expansion
Model Name | Vocabulary Size | Description |
---|---|---|
Original Llama-2 | 32000 | Sentencepiece BPE |
Expanded Llama-2-Ko | 46336 | Sentencepiece BPE. Added Korean vocab and merges |
Tokenizing "์๋ ํ์ธ์, ์ค๋์ ๋ ์จ๊ฐ ์ข๋ค์."
Model | Tokens |
---|---|
Llama-2 | ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요'] |
Llama-2-Ko | ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요'] |
Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"
Model | Tokens |
---|---|
Llama-2 | ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els'] |
Llama-2-Ko | ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els'] |
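The tables above can be reproduced by tokenizing the same sentence with both tokenizers. A minimal sketch, assuming the original tokenizer is loaded from the gated meta-llama/Llama-2-7b-hf repository and this model's tokenizer from beomi/open-llama-2-ko-7b (both repo ids are assumptions):
```python
# Sketch: compare the original and expanded tokenizers on a Korean sentence.
# Assumptions: both repo ids are illustrative; the Llama-2 repo is gated and needs access approval.
from transformers import AutoTokenizer

llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ko_tok = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b", use_fast=True)

text = "안녕하세요, 오늘은 날씨가 좋네요."
for name, tok in [("Llama-2", llama2_tok), ("Llama-2-Ko", ko_tok)]:
    tokens = tok.tokenize(text)
    # The expanded tokenizer should produce far fewer tokens and no byte-fallback pieces.
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```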
Model Benchmark
LM Eval Harness - Korean (polyglot branch)
- Used EleutherAI's lm-evaluation-harness https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot
TBD
Note for oobabooga/text-generation-webui
Remove the ValueError at the load_tokenizer function (line 109 or near) in modules/models.py.
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
trust_remote_code=shared.args.trust_remote_code,
use_fast=False
)
- except ValueError:
+ except:
tokenizer = AutoTokenizer.from_pretrained(
path_to_model,
trust_remote_code=shared.args.trust_remote_code,
Since Llama-2-Ko uses the FastTokenizer provided by the HF tokenizers package, NOT the sentencepiece package, you must pass the use_fast=True option when initializing the tokenizer.
Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)
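For use outside text-generation-webui, the same two constraints can be handled directly when loading the model. A minimal sketch, assuming the repo id beomi/open-llama-2-ko-7b:
```python
# Sketch: enforce the fast tokenizer and pick a dtype based on available hardware.
# Assumption: the repo id is illustrative, not taken from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/open-llama-2-ko-7b"

# The HF fast tokenizer must be used; there is no sentencepiece model to fall back to.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# BF16 only where supported (recent NVIDIA GPUs); fall back to FP32 on Apple Silicon / CPU.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype, device_map = torch.bfloat16, "auto"  # device_map="auto" requires accelerate
else:
    dtype, device_map = torch.float32, None

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device_map)
```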
Citation
TBD
Acknowledgement
- The training was supported by the TPU Research Cloud program.
- The training corpus comes from AI Hub, Modu Corpus, and Korean Wikipedia.