---
language:
  - ko
  - en
pipeline_tag: text-generation
inference: false
tags:
  - facebook
  - meta
  - pytorch
  - llama
  - llama-2
  - kollama
  - llama-2-ko
license: mit
library_name: transformers
---

## Update Log

- 2023.12.14: First Release of Open-Llama-2-Ko

# Open-Llama-2-Ko 🦙🇰🇷

Open-Llama-2-Ko is an advanced iteration of Llama 2, benefiting from an expanded vocabulary and further pretraining on a Korean corpus. Like its predecessor, it belongs to the family of generative text models that range from 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version, packaged in the Hugging Face Transformers format.

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series uses only publicly accessible Korean corpora, including AI Hub, Modu Corpus (모두의 말뭉치), and Korean Wikipedia.

Since training was done using only publicly available corpora, this model is open to everyone without any restrictions. (*This model follows the MIT License.)
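A minimal usage sketch with Hugging Face Transformers is shown below. The repo id `beomi/open-llama-2-ko-7b` is assumed from the title of this card, and the prompt and generation settings are illustrative only.

```python
# Hedged usage sketch (not part of the original card): load the model and
# generate a short continuation. Adjust dtype/device settings to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/open-llama-2-ko-7b"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # see the hardware notes near the end of this card
    device_map="auto",           # requires the `accelerate` package
)

prompt = "한국의 수도는"  # "The capital of Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```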

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes, 7B and 13B, as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

| | Training Data | Params | Content Length | GQA | Tokens | LR |
| --- | --- | --- | --- | --- | --- | --- |
| Open-Llama-2-Ko (7B) | A new mix of publicly accessible Korean corpora | 7B | 2k | ✗ | >15B* | 5e-5 |

## Train Corpus

Trained with a selection of corpora from AI Hub and Modu Corpus. Details of the dataset used to train this model are given below:

The final JSONL dataset used to train this model is 61GB.

Total number of tokens: approximately 15B (*using the expanded tokenizer; with the original Llama tokenizer, >60B tokens).

## Vocab Expansion

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | Sentencepiece BPE |
| Expanded Llama-2-Ko | 46336 | Sentencepiece BPE. Added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."** (English: "Hello, the weather is nice today.")

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
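The comparison above can be reproduced with a short script such as the sketch below. The repo ids are assumptions: `meta-llama/Llama-2-7b-hf` for the original tokenizer (a gated repo that requires access approval) and `beomi/open-llama-2-ko-7b` for the expanded Korean tokenizer.

```python
# Hedged sketch: compare the original Llama-2 tokenizer with the expanded
# Korean tokenizer on the two example strings from the tables above.
from transformers import AutoTokenizer

original = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed repo id (gated)
expanded = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b")  # assumed repo id

for text in ["안녕하세요, 오늘은 날씨가 좋네요.",
             "Llama 2: Open Foundation and Fine-Tuned Chat Models"]:
    print(original.tokenize(text))  # Korean mostly falls back to byte-level tokens
    print(expanded.tokenize(text))  # Korean is covered by the added vocab and merges
```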

## Model Benchmark

### LM Eval Harness - Korean (polyglot branch)

TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` from the `except` clause in the `load_tokenizer` function (around line 109) of `modules/models.py`, so that any tokenizer-loading error falls through to the second loader:

```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
                 trust_remote_code=shared.args.trust_remote_code,
                 use_fast=False
             )
-        except ValueError:
+        except:
             tokenizer = AutoTokenizer.from_pretrained(
                 path_to_model,
                 trust_remote_code=shared.args.trust_remote_code,
```

Since Llama-2-Ko uses the FastTokenizer provided by the HF `tokenizers` library, NOT the `sentencepiece` package, you need to pass the `use_fast=True` option when initializing the tokenizer.
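For example, when loading the tokenizer directly (repo id assumed as elsewhere in this card), keep the fast tokenizer enabled:

```python
from transformers import AutoTokenizer

# use_fast=True loads the fast tokenizer built on the HF `tokenizers` library;
# per the note above, a slow sentencepiece tokenizer is not provided for this model.
tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b", use_fast=True)
```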

Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)
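A hedged sketch for choosing a dtype based on the note above (assumed repo id, with the usual `torch.cuda.is_available()` check as a rough stand-in for "NVIDIA GPU present"):

```python
import torch
from transformers import AutoModelForCausalLM

# Use BF16 on an NVIDIA GPU; otherwise fall back to float32 on CPU
# (e.g. on Apple Silicon, which does not support BF16 per the note above).
if torch.cuda.is_available():
    kwargs = dict(torch_dtype=torch.bfloat16, device_map="auto")  # device_map needs `accelerate`
else:
    kwargs = dict(torch_dtype=torch.float32)

model = AutoModelForCausalLM.from_pretrained("beomi/open-llama-2-ko-7b", **kwargs)
```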

## Citation

TBD

## Acknowledgement