metadata
library_name: transformers
license: llama3
language:
  - ko
  - en
pipeline_tag: text-generation

davidkim205/ko-gemma-2-9b-it

davidkim205/ko-gemma-2-9b-it is one of several models being researched to improve the performance of Korean language models.

(to be released soon)

Model Details

  • Model developer: davidkim (Changyeon Kim)
  • Repository: -
  • Base model: google/gemma-2-9b-it
  • SFT dataset: qa_ability_1851.jsonl

Usage

Chat Template

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "davidkim205/ko-gemma-2-9b-it"

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config)

chat = [
    { "role": "system", "content":"๋‹น์‹ ์€ ์งˆ๋ฌธ์— ๋Œ€ํ•ด์„œ ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๋Š” AI์ž…๋‹ˆ๋‹ค."},
    { "role": "user", "content": "๋”ฅ๋Ÿฌ๋‹์„ ์–ด๋–ป๊ฒŒ ๊ณต๋ถ€ํ•ด์•ผํ•˜๋‚˜์š”?" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=1024)
print(tokenizer.decode(outputs[0]))

Output

`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.04s/it]
/home/david/anaconda3/envs/eval/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:426: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
  warnings.warn(
<bos>๋‹น์‹ ์€ ์งˆ๋ฌธ์— ๋Œ€ํ•ด์„œ ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๋Š” AI์ž…๋‹ˆ๋‹ค.<start_of_turn>user
๋”ฅ๋Ÿฌ๋‹์„ ์–ด๋–ป๊ฒŒ ๊ณต๋ถ€ํ•ด์•ผํ•˜๋‚˜์š”?<end_of_turn>
<start_of_turn>model
๋”ฅ๋Ÿฌ๋‹์„ ๊ณต๋ถ€ํ•˜๋Š” ๊ฒƒ์€ ํฅ๋ฏธ๋กญ๊ณ  ๋ณด๋žŒ ์žˆ๋Š” ์—ฌ์ •์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! 

ํ•˜์ง€๋งŒ ์–ด๋””์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด์•ผ ํ• ์ง€ ๋ง‰๋ง‰ํ•˜๊ฒŒ ๋Š๊ปด์งˆ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. 

๋‹ค์Œ์€ ๋”ฅ๋Ÿฌ๋‹์„ ๊ณต๋ถ€ํ•˜๊ธฐ ์œ„ํ•œ ๋‹จ๊ณ„๋ณ„ ๊ฐ€์ด๋“œ์ž…๋‹ˆ๋‹ค.

**1๋‹จ๊ณ„: ๊ธฐ์ดˆ ๋‹ค์ง€๊ธฐ**

* **์ˆ˜ํ•™**: ๋”ฅ๋Ÿฌ๋‹์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ์„ ํ˜•๋Œ€์ˆ˜, ๋ฏธ์ ๋ถ„, ํ™•๋ฅ  ๋ฐ ํ†ต๊ณ„์— ๋Œ€ํ•œ ๊ธฐ๋ณธ ์ง€์‹์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. Khan Academy, Coursera ๋“ฑ ์˜จ๋ผ์ธ ํ”Œ๋žซํผ์—์„œ ์ˆ˜ํ•™ ๊ฐ•์ขŒ๋ฅผ ๋“ฃ๋Š” ๊ฒƒ์„ ์ถ”์ฒœํ•ฉ๋‹ˆ๋‹ค.
* **ํ”„๋กœ๊ทธ๋ž˜๋ฐ**: Python์€ ๋”ฅ๋Ÿฌ๋‹ ๋ถ„์•ผ์—์„œ ๊ฐ€์žฅ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด์ž…๋‹ˆ๋‹ค. Python ๊ธฐ์ดˆ ๋ฌธ๋ฒ•, ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ, ํ•จ์ˆ˜ ๋“ฑ์„ ์ตํžˆ์„ธ์š”. Codecademy, Google's Python Class ๋“ฑ์˜ ํ”Œ๋žซํผ์—์„œ Python์„ ๋ฐฐ์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
* **๊ธฐ๋ณธ ๋จธ์‹ ๋Ÿฌ๋‹**: ๋”ฅ๋Ÿฌ๋‹์„ ์ดํ•ดํ•˜๊ธฐ ์ „์— ๊ธฐ๋ณธ์ ์ธ ๋จธ์‹ ๋Ÿฌ๋‹ ๊ฐœ๋…์„ ์ตํžˆ๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. 
    * ๋ถ„๋ฅ˜, ํšŒ๊ท€, ํด๋Ÿฌ์Šคํ„ฐ๋ง ๋“ฑ์˜ ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ดํ•ดํ•˜๊ณ , Scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์‹ค์Šต์„ ํ•ด๋ณด์„ธ์š”.

**2๋‹จ๊ณ„: ๋”ฅ๋Ÿฌ๋‹ ๊ฐœ๋… ํ•™์Šต**

* **์˜จ๋ผ์ธ ๊ฐ•์ขŒ**: Coursera, edX, Udacity ๋“ฑ์˜ ํ”Œ๋žซํผ์—์„œ ์ œ๊ณตํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ๊ฐ•์ขŒ๋ฅผ ์ˆ˜๊ฐ•ํ•˜์„ธ์š”. Andrew Ng์˜ Deep Learning Specialization์€ ๋”ฅ๋Ÿฌ๋‹ ๋ถ„์•ผ์˜ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ํƒ„ํƒ„ํ•˜๊ฒŒ ๋‹ค์ง€๋Š” ๋ฐ ์ข‹์€ ์„ ํƒ์ž…๋‹ˆ๋‹ค.
* **์ฑ…**: ๋”ฅ๋Ÿฌ๋‹์— ๋Œ€ํ•œ ์ดํ•ด๋ฅผ ์‹ฌํ™”์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์ฑ…์„ ์ฝ๋Š” ๊ฒƒ๋„ ์ข‹์€ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. 
    * "Deep Learning" (Ian Goodfellow, Yoshua Bengio, Aaron Courville)์€ ๋”ฅ๋Ÿฌ๋‹ ๋ถ„์•ผ์˜ ์ „๋ฌธ๊ฐ€๋ฅผ ์œ„ํ•œ ์‹ฌ๋„ ์žˆ๋Š” ์ฑ…์ž…๋‹ˆ๋‹ค. 
    * "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" (Aurรฉlien Gรฉron)์€ ์‹ค์Šต ์ค‘์‹ฌ์œผ๋กœ ๋”ฅ๋Ÿฌ๋‹์„ ๋ฐฐ์šฐ๊ณ  ์‹ถ์€ ์‚ฌ๋žŒ์—๊ฒŒ ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค.
* **๋ธ”๋กœ๊ทธ ๋ฐ ๊ธฐ์‚ฌ**: ๋”ฅ๋Ÿฌ๋‹ ๊ด€๋ จ ์ตœ์‹  ํŠธ๋ Œ๋“œ์™€ ์—ฐ๊ตฌ ๋™ํ–ฅ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ๋ธ”๋กœ๊ทธ ๋ฐ ๊ธฐ์‚ฌ๋ฅผ ์ฝ๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค.

**3๋‹จ๊ณ„: ์‹ค์Šต ๋ฐ ํ”„๋กœ์ ํŠธ ์ง„ํ–‰**

* **๋ฐ์ดํ„ฐ์…‹**: Kaggle, UCI Machine Learning Repository ๋“ฑ์˜ ํ”Œ๋žซํผ์—์„œ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ์ฐพ์•„ ์‹ค์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
* **๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ**: TensorFlow, PyTorch, Keras ๋“ฑ์˜ ๋”ฅ๋Ÿฌ๋‹ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•˜๊ณ  ํ›ˆ๋ จํ•˜์„ธ์š”.
* **ํ”„๋กœ์ ํŠธ**: ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ์ˆ ์„ ์ ์šฉํ•˜์—ฌ ์‹ค์ œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ํ”„๋กœ์ ํŠธ๋ฅผ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. 
    * ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜, ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ, ์˜ˆ์ธก ๋ชจ๋ธ ๊ฐœ๋ฐœ ๋“ฑ ๋‹ค์–‘ํ•œ ํ”„๋กœ์ ํŠธ๋ฅผ ํ†ตํ•ด ๋”ฅ๋Ÿฌ๋‹ ์‹ค๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

**์ถ”๊ฐ€ ํŒ**

* **์ปค๋ฎค๋‹ˆํ‹ฐ ํ™œ๋™**: ๋”ฅ๋Ÿฌ๋‹ ๊ด€๋ จ ์ปค๋ฎค๋‹ˆํ‹ฐ์— ์ฐธ์—ฌํ•˜์—ฌ ๋‹ค๋ฅธ ์‚ฌ๋žŒ๋“ค๊ณผ ๊ต๋ฅ˜ํ•˜๊ณ  ์งˆ๋ฌธ์„ ํ•ด๋ณด์„ธ์š”.
* **๊พธ์ค€ํ•จ**: ๋”ฅ๋Ÿฌ๋‹์€ ๋ณต์žกํ•œ ๋ถ„์•ผ์ด๋ฏ€๋กœ ๊พธ์ค€ํžˆ ๊ณต๋ถ€ํ•˜๊ณ  ์‹ค์Šตํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.


<end_of_turn><eos>
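
The decoded output above contains both the prompt and the reply, separated by turn markers. The following is a minimal sketch of how that layout could be reproduced and how the reply alone could be pulled out of the decoded string. The helpers `build_prompt` and `extract_reply` are hypothetical names introduced here for illustration, and the layout is inferred only from the printed output, not from the tokenizer's actual chat template, so treat it as an approximation.

```python
# Special tokens as they appear in the decoded output above.
BOS = "<bos>"
START = "<start_of_turn>"
END = "<end_of_turn>"
EOS = "<eos>"


def build_prompt(system: str, user: str) -> str:
    """Hypothetical helper: mirror the single-turn layout seen in the output."""
    return f"{BOS}{system}{START}user\n{user}{END}\n{START}model\n"


def extract_reply(decoded: str) -> str:
    """Hypothetical helper: return only the text of the last model turn."""
    # Everything after the final model turn marker is the generated reply.
    reply = decoded.rsplit(f"{START}model\n", 1)[-1]
    # Drop the closing special tokens that decode() keeps by default.
    for token in (EOS, END):
        reply = reply.replace(token, "")
    return reply.strip()


# Stand-in strings; no model is loaded in this sketch.
prompt = build_prompt("You are a helpful AI.", "How should I study deep learning?")
decoded = prompt + "Start with the basics.\n\n\n" + END + EOS
print(extract_reply(decoded))
```

In practice the prompt should still come from `tokenizer.apply_chat_template`, as in the usage code; the sketch is only meant to make the printed structure easier to read.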

Benchmark

kollm_evaluation

https://github.com/davidkim205/kollm_evaluation

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| kobest | N/A | none | 0 | acc | 0.5150 | ± 0.0073 |
| | | none | 0 | f1 | 0.4494 | ± N/A |
| - kobest_boolq | 1 | none | 0 | acc | 0.6154 | ± 0.0130 |
| | | none | 0 | f1 | 0.5595 | ± N/A |
| - kobest_copa | 1 | none | 0 | acc | 0.4710 | ± 0.0158 |
| | | none | 0 | f1 | 0.4700 | ± N/A |
| - kobest_hellaswag | 1 | none | 0 | acc | 0.3880 | ± 0.0218 |
| | | none | 0 | f1 | 0.3832 | ± N/A |
| | | none | 0 | acc_norm | 0.4780 | ± 0.0224 |
| - kobest_sentineg | 1 | none | 0 | acc | 0.5189 | ± 0.0251 |
| | | none | 0 | f1 | 0.4773 | ± N/A |
| - kobest_wic | 1 | none | 0 | acc | 0.4873 | ± 0.0141 |
| | | none | 0 | f1 | 0.3276 | ± N/A |
| ko_truthfulqa | 2 | none | 0 | acc | 0.3390 | ± 0.0166 |
| ko_mmlu | 1 | none | 0 | acc | 0.1469 | ± 0.0019 |
| | | none | 0 | acc_norm | 0.1469 | ± 0.0019 |
| ko_hellaswag | 1 | none | 0 | acc | 0.2955 | ± 0.0046 |
| | | none | 0 | acc_norm | 0.3535 | ± 0.0048 |
| ko_common_gen | 1 | none | 0 | acc | 0.5825 | ± 0.0126 |
| | | none | 0 | acc_norm | 0.5825 | ± 0.0126 |
| ko_arc_easy | 1 | none | 0 | acc | 0.2329 | ± 0.0124 |
| | | none | 0 | acc_norm | 0.2867 | ± 0.0132 |
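
One way to read the Value and Stderr columns together is as an approximate confidence interval. The sketch below assumes the Stderr column is the standard error of the reported accuracy and uses a normal approximation (value plus or minus 1.96 standard errors for roughly 95% coverage). The rows are copied from the table above; the helper name `interval` is hypothetical.

```python
# (task, accuracy, standard error) copied from the table above.
ROWS = [
    ("kobest_boolq", 0.6154, 0.0130),
    ("kobest_copa", 0.4710, 0.0158),
    ("ko_truthfulqa", 0.3390, 0.0166),
    ("ko_common_gen", 0.5825, 0.0126),
]

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def interval(value: float, stderr: float, z: float = Z_95) -> tuple[float, float]:
    """Return approximate (low, high) confidence bounds for one score."""
    return value - z * stderr, value + z * stderr


for task, acc, se in ROWS:
    low, high = interval(acc, se)
    print(f"{task:<15} acc={acc:.4f}  95% CI=({low:.4f}, {high:.4f})")
```

Intervals of this kind help judge whether a difference between two models on the same task is larger than the sampling noise.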

Evaluation of KEval

Among the various methods that use ChatGPT to evaluate models, keval is an evaluation model trained on the prompts and dataset of a benchmark for evaluating Korean language models. It is intended to compensate for the shortcomings of the existing lm-evaluation-harness.

https://huggingface.co/davidkim205/keval-7b

| model | ned | exe_time | evalscore | count |
|---|---|---|---|---|
| claude-3-opus-20240229 | nan | nan | 8.79 | 42 |
| gpt-4-turbo-2024-04-09 | nan | nan | 8.71 | 42 |
| Qwen2-72B-Instruct | nan | 29850.5 | 7.85 | 42 |
| WizardLM-2-8x22B | nan | 133831 | 7.57 | 42 |
| ko-gemma-2-9b-it | nan | 30789.5 | 7.52 | 42 |
| HyperClovaX | nan | nan | 7.44 | 42 |
| gemma-2-9b-it | nan | 23531.7 | 7.4 | 42 |
| glm-4-9b-chat | nan | 24825.6 | 7.31 | 42 |
| Ko-Llama-3-8B-Instruct | nan | 10697.5 | 6.81 | 42 |
| Qwen2-7B-Instruct | nan | 11856.3 | 6.02 | 42 |
| Not-WizardLM-2-7B | nan | 12955.7 | 5.26 | 42 |
| gemma-1.1-7b-it | nan | 6950.5 | 4.99 | 42 |
| Mistral-7B-Instruct-v0.3 | nan | 19631.4 | 4.89 | 42 |
| Phi-3-small-128k-instruct | nan | 26747.5 | 3.52 | 42 |
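
To put the fine-tuned model next to its base model, the short sketch below compares the two rows from the table above. It assumes that the `gemma-2-9b-it` row corresponds to the base model google/gemma-2-9b-it listed under Model Details, and it treats `exe_time` as a relative quantity only, since the table does not state its unit.

```python
# (evalscore, exe_time) copied from the table above.
fine_tuned = {"evalscore": 7.52, "exe_time": 30789.5}  # ko-gemma-2-9b-it
base = {"evalscore": 7.4, "exe_time": 23531.7}         # gemma-2-9b-it

# Absolute and relative change in the keval score.
score_gain = fine_tuned["evalscore"] - base["evalscore"]
relative_gain = score_gain / base["evalscore"] * 100

# Ratio of the reported execution times (unit not given in the table).
time_ratio = fine_tuned["exe_time"] / base["exe_time"]

print(f"evalscore gain: {score_gain:.2f} ({relative_gain:.1f}%)")
print(f"exe_time ratio: {time_ratio:.2f}x")
```

With only 42 evaluated items per model, a gap of this size should be read with caution.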