beomi committed
Commit a21614e • 2 parents: 68eb804, 5a0fcef

Merge branch 'main' of https://huggingface.co/beomi/open-llama-2-ko

Files changed (4)
  1. LICENSE +21 -0
  2. README.md +126 -0
  3. corpus/AI_HUB +50 -0
  4. corpus/MODU_CORPUS +6 -0
LICENSE ADDED
MIT License

Copyright (c) 2023 Junbum Lee (Beomi)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md ADDED
---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---

**Update Log**

- 2023.12.14: First Release of Open-Llama-2-Ko

# **Open-Llama-2-Ko** 🦙🇰🇷

Open-Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining.
Just like its predecessor, Open-Llama-2-Ko operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters.
This repository focuses on the 7B pretrained version, which is tailored to fit the Hugging Face Transformers format.

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series used only publicly accessible Korean corpora,
including [AI Hub](https://www.aihub.or.kr), [Modu Corpus (모두의 말뭉치)](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).

Since training used only publicly available corpora, this model is open to everyone without any restrictions. (This model follows the MIT License.)

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes (7B and 13B) as well as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Open-Llama-2-Ko|*A new mix of publicly accessible Korean corpora*|7B|2k|✗|>15B*|5e-5|
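Since the checkpoint is distributed in the Hugging Face Transformers format, it should load with the standard `Auto*` classes. Below is a minimal usage sketch, not an official example: it assumes this repository's model id (`beomi/open-llama-2-ko`) and an environment with `torch`, `transformers`, and `accelerate` installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "beomi/open-llama-2-ko"  # assumed model id: this repository

# use_fast=True matches the tokenizer note later in this card: the tokenizer
# ships as a HF `tokenizers` fast tokenizer, not a sentencepiece model.
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # BF16 on NVIDIA GPUs; see the Apple Silicon note below
    device_map="auto",           # requires `accelerate`
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```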
**Training Corpus**

Trained with selected corpora from AI Hub and Modu Corpus. The detailed list of datasets used to train this model is available below:

- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
  - Used only the `Training` part of the data.
  - Explicitly dropped the `Validation`/`Test` parts of the data.
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)

The final JSONL dataset used to train this model is 61GB.

Total number of tokens: approx. 15B tokens (*measured with the expanded tokenizer; with the original Llama tokenizer, >60B tokens)
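For reference, a count like this can be reproduced by streaming the JSONL files through the tokenizer and summing sequence lengths. A minimal sketch, assuming one-document-per-line JSONL with a `text` field (the file path and field name here are hypothetical, not the actual training pipeline):

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko", use_fast=True)

total = 0
with open("corpus_shard.jsonl", encoding="utf-8") as f:  # hypothetical shard path
    for line in f:
        doc = json.loads(line)
        # assumes each JSON line stores its raw text under a "text" key
        total += len(tokenizer(doc["text"], add_special_tokens=False).input_ids)

print(f"{total:,} tokens")
```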
**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | SentencePiece BPE |
| **Expanded Llama-2-Ko** | 46336 | SentencePiece BPE. Added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
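The two tables above can be reproduced with `AutoTokenizer`; a sketch, assuming access to the gated `meta-llama/Llama-2-7b-hf` tokenizer for the baseline:

```python
from transformers import AutoTokenizer

llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo; requires access
llama2_ko = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko", use_fast=True)

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Llama 2: Open Foundation and Fine-Tuned Chat Models",
]:
    print("Llama-2   :", llama2.tokenize(text))
    print("Llama-2-Ko:", llama2_ko.tokenize(text))
```

On Korean input the expanded tokenizer emits roughly a quarter as many tokens, which is why the same corpus measures approx. 15B tokens here versus >60B with the original tokenizer.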
# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) (polyglot branch)

TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` catch in the `load_tokenizer` function (line 109 or nearby) in `modules/models.py`:

```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```
Since Llama-2-Ko uses the fast tokenizer provided by the HF `tokenizers` library, NOT the sentencepiece package,
the `use_fast=True` option is required when initializing the tokenizer.
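For example, a direct initialization sketch:

```python
from transformers import AutoTokenizer

# Loads the fast tokenizer (tokenizer.json) shipped with this repo.
tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko", use_fast=True)

# By contrast, use_fast=False tries to load a sentencepiece model, which this
# repo does not rely on; that is the failure patched in the diff above.
```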
Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)
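A small device/dtype selection sketch reflecting this:

```python
import torch

# Prefer BF16 on NVIDIA GPUs that support it; otherwise (e.g. Apple Silicon)
# fall back to CPU with FP32.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    device, dtype = "cuda", torch.bfloat16
else:
    device, dtype = "cpu", torch.float32
print(device, dtype)
```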
## Citation

TBD

## Acknowledgement

- The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
- The training corpus is from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
corpus/AI_HUB ADDED
754M ./001.문서요약.jsonl
397M ./006.전문분야한영.jsonl
486M ./016.행정_문서_대상_기계독해_데이터.jsonl
563M ./017.뉴스_기사_기계독해_데이터.jsonl
1.2G ./018.논문자료_요약_데이터.jsonl
88M ./019.법률,_규정_(판결서,_약관_등)_텍스트_분석_데이터.jsonl
75M ./020.주제별_텍스트_일상_대화_데이터.jsonl
265M ./021.도서자료_기계독해.jsonl
30M ./021.용도별_목적대화_데이터.jsonl
566M ./022.요약문_및_레포트_생성_데이터.jsonl
19G ./023.전문분야_말뭉치_데이터(분야별_개체명_인식_포함).jsonl
253M ./023.방송_콘텐츠_대본_요약_데이터.jsonl
918M ./025.일상생활_및_구어체_한-영_번역_병렬_말뭉치_데이터.jsonl
307M ./026.한국어-영어_번역_말뭉치_1.jsonl
1.3G ./026.기술과학_분야_한-영_번역_병렬_말뭉치_데이터.jsonl
309M ./027.한국어-중국어_번역_말뭉치_1.jsonl
347M ./027.한국어-영어_번역_말뭉치_2.jsonl
538M ./027.일상생활_및_구어체_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
276M ./028.한국어-중국어_번역_말뭉치_2.jsonl
300M ./028.다국어_구어체_번역_병렬_말뭉치_데이터.jsonl
410M ./029.한국어-일본어_번역_말뭉치.jsonl
542K ./029.대규모_구매도서_기반_한국어_말뭉치_데이터.jsonl
9.9G ./030.웹데이터_기반_한국어_말뭉치_데이터.jsonl
1.4G ./031.온라인_구어체_말뭉치_데이터.jsonl
258M ./032.방송콘텐츠_한국어-영어_번역_말뭉치.jsonl
84M ./032.특허_분야_자동분류_데이터.jsonl
239M ./034.방송콘텐츠_한국어-유럽어_번역_말뭉치.jsonl
65M ./044.페르소나_대화.jsonl
56M ./045.지식검색_대화.jsonl
67M ./046.공감형_대화.jsonl
85M ./049.일반상식_문장_생성_평가_데이터.jsonl
13M ./050.발화유형(문어,구어,채팅)별_기계번역_병렬_말뭉치.jsonl
193K ./052.기계번역_품질_검증_데이터.jsonl
118M ./053.한국어-다국어(영어_제외)_번역_말뭉치(기술과학).jsonl
127M ./054.한국어-다국어_번역_말뭉치(기초과학).jsonl
67M ./055.한국어-다국어_번역_말뭉치(인문학).jsonl
205M ./11.기계독해.jsonl
259M ./141.한국어_멀티세션_대화.jsonl
248M ./142.한국어_지식기반_관계_데이터.jsonl
108M ./143.민원_업무_효율,_자동화를_위한_언어_AI_학습데이터.jsonl
2.4G ./146.낚시성_기사_탐지_데이터.jsonl
23M ./147.텍스트_윤리검증_데이터.jsonl
632M ./153.기술과학_요약_데이터.jsonl
962M ./155.산업정보_연계_주요국_특허_영-한_데이터.jsonl
1.1G ./156.전문분야_영-한,_중-한_번역_말뭉치(식품).jsonl
236M ./157.방송_콘텐츠_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
418M ./157.추상_요약_사실성_검증_데이터.jsonl
12M ./158.시간_표현_탐지_데이터.jsonl
17M ./159.문장_유형(추론,_예측_등)_판단_데이터.jsonl
1.4G ./297.SNS_데이터_고도화.jsonl
corpus/MODU_CORPUS ADDED
일상대화말뭉치 (Everyday Conversation Corpus) 2020, 2021
신문 말뭉치 (Newspaper Corpus) 2020, 2021, 2022
유사 문장 말뭉치 (Similar Sentence Corpus)
문서 요약 말뭉치 (Document Summarization Corpus)
문어 말뭉치 (Written Language Corpus)
의미역 분석 말뭉치 (Semantic Role Analysis Corpus)