base_model: MLP-KTLim/llama-3-Korean-Bllossom-8B
library_name: peft
license: llama3
datasets:
  - iknow-lab/ko-genstruct-v1

Ko-genstruct v0.1

Ko-genstruct is a model that generates the instructions needed for instruction tuning from a given document. It can produce two types of instructions: exam questions and writing tasks. The model was inspired by Ada-Instruct and Genstruct.

It can be used for the following purposes:

  • Generating questions from a given text in order to train a retrieval model
  • Building instruction-tuning training data: generate instructions with Ko-genstruct, then generate the answers with another LLM
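The second use case is a two-stage pipeline, and a minimal sketch of it might look like the following. The function `build_instruction_pairs` and the record keys are hypothetical names chosen for illustration. The two callables stand in for Ko-genstruct and for whichever answer-generating LLM you choose. Neither is part of this repository.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_instruction_pairs(
    documents: Iterable[Tuple[str, str]],
    make_instruction: Callable[[str, str], str],
    make_answer: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Turn (title, text) documents into instruction/answer records.

    make_instruction is assumed to wrap Ko-genstruct and make_answer a
    separate LLM; both are placeholders supplied by the caller.
    """
    records = []
    for title, text in documents:
        instruction = make_instruction(title, text)
        if not instruction.strip():
            continue  # skip documents that produced no usable instruction
        records.append(
            {
                "title": title,
                "instruction": instruction,
                "output": make_answer(instruction),
            }
        )
    return records
```

Passing the generators in as callables keeps the sketch independent of any particular model-loading code, so the same loop can be tried with stub functions before running real models.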

Details

Usage

Question generation

Using the example below, you can generate instructions from a given document. There are two prompt types: exam questions and writing tasks.

import transformers
import peft

model_id = "MLP-KTLim/llama-3-Korean-Bllossom-8B"
peft_model_id = "iknow-lab/ko-genstruct-v0.1"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto").eval()

model.load_adapter(peft_model_id, revision="epoch-1")


# Fill in the title and body of the document to generate instructions from.
title = ""
text = ""

# Exam-question prompt. It stays in Korean because it is the text sent to the model.
# Gist: act as an exam writer and return JSON; use a natural question tone; write a
# high-school-level question, then raise it to expert level; include the source
# material needed to answer; choose a fitting subject.
PROMPT_QA = """당신은 시험문제 출제위원입니다. 다음 자료에 기반하여 전문가 수준의 시험문제를 출제할 것입니다. 자료를 바탕으로 지시사항에 맞는 결과물을 json 형식으로 반환해주세요.

1. 생성한 문제는 실생활에서 사용하는 질문의 말투를 사용해야 합니다(~무엇인가요? ~작성해주세요. ~ 어떻게 해야하죠?)
2. 먼저 고등학교 수준의 문제를 생성하고, 이를 전문가 수준으로 고난이도 문제로 향상해주세요. 각 문제는 반드시 제시된 자료를 바탕으로 만들어져야 합니다. 연관성이 적더라도, 창의적인 아이디어로 해당 자료를 활용하세요.
3. 문제에는 답안 작성에 필요한 내용을 주어진 자료에서 추출해서 함께 제공해야합니다.
4. 출제할 문제의 과목 후보는 다음과 같습니다: 글쓰기, 한국어, 영어, 수학, 사회과학, 과학, 역사 문화예술, 법, 도덕, 정치, 종교, 외국어, 경제, 경영, 의료, 공학, 인문학 등 - 후보에 없어도, 적절한 과목을 자유롭게 말할 수 있다.

# 제목: {title}
# 자료:
{text}"""

# Writing-task prompt, also kept in Korean. It follows the same structure as PROMPT_QA,
# but asks the model to choose three fitting topics from a list of writing genres
# (resume, lyrics, essay, email, report, and so on).
PROMPT_WRITING = """당신은 글쓰기 시험문제 출제위원입니다. 다음 자료에 기반하여 전문가 수준의 시험문제를 출제할 것입니다. 자료를 바탕으로 지시사항에 맞는 결과물을 json 형식으로 반환해주세요.

1. 생성한 문제는 실생활에서 사용하는 질문의 말투를 사용해야 합니다(~무엇인가요? ~작성해주세요. ~ 어떻게 해야하죠?)
2. 먼저 고등학교 수준의 문제를 생성하고, 이를 전문가 수준으로 고난이도 문제로 향상해주세요. 각 문제는 반드시 제시된 자료를 바탕으로 만들어져야 합니다. 연관성이 적더라도, 창의적인 아이디어로 해당 자료를 활용하세요.
3. 문제에는 글쓰기 작성에 필요한 내용을 주어진 자료에서 추출해서 함께 제공해야합니다.
4. 출제할 문제의 주제 후보는 다음과 같습니다. 이 중에서 적절한 주제를 3가지 선택하세요: 이력서, 노래가사, 시 혹은 소설, 에세이, 극본, 시나리오, 여행일기, 여행계획서, 요리레시피, 해설, 자기소개서, 편지, 이메일, 리뷰 및 평가, 소셜 미디어 포스트, 일기, 청원서, 항의서, 쇼핑 리스트, 메모, 연구 논문 및 계획서, 비즈니스 보고서 및 게획서, 기술 문서, 발표자료, 계약서 혹은 법률 문서, 편집 및 출판 문서, 광고 카피라이트, 웹 콘텐츠, 뉴스레터, 연설문, 자기계발서, 분석보고서, 기획안, 제안서

# 제목: {title}
# 자료:
{text}"""

def generate_question(title, text, is_writing: bool = False):
    # Pick the writing-task or exam-question template and fill in the document.
    prompt = PROMPT_WRITING if is_writing else PROMPT_QA
    prompt = prompt.format(title=title, text=text)

    messages = [{"content": prompt, "role": "user"}]
    # Render the chat template as plain text so the JSON prefix can be appended.
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    # Prime the assistant turn with the opening of a JSON object.
    inputs = inputs.strip() + "\n\n```json\n{\n  \"topic\":"
    inputs = tokenizer.encode(inputs, add_special_tokens=False, return_tensors="pt").to(model.device)
    # 128009 is the <|eot_id|> token in the Llama 3 tokenizer.
    outputs = model.generate(input_ids=inputs, max_new_tokens=256, do_sample=True, eos_token_id=128009, temperature=1.0)

    # Decode only the newly generated tokens, i.e. the continuation of the JSON object.
    question = tokenizer.decode(outputs[0, inputs.shape[1]:], skip_special_tokens=False)

    return question

print("Question generation test")
for _ in range(5):
    question = generate_question(title, text)
    print(question)

print("Writing generation test")
for _ in range(5):
    question = generate_question(title, text, True)
    print(question)
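
Because generate_question primes the model with the opening of a JSON object, the returned string is only the continuation after `"topic":` and may end with a closing code fence or an `<|eot_id|>` token. The helper below is a minimal sketch of how that completion could be turned back into a dictionary. `parse_completion` is a hypothetical name, and any field other than `topic` (such as `question` in the usage example) is an assumption about the output, not something this card documents.

```python
import json
from typing import Optional

# The JSON opening that generate_question appends to the prompt,
# so the model's completion starts right after it.
JSON_PREFIX = '{\n  "topic":'


def parse_completion(completion: str) -> Optional[dict]:
    """Rebuild the JSON object from a raw completion; return None if it is invalid."""
    body = completion
    # Cut at the first closing code fence or end-of-turn token, whichever comes first.
    for stop in ("```", "<|eot_id|>"):
        idx = body.find(stop)
        if idx != -1:
            body = body[:idx]
    try:
        parsed = json.loads(JSON_PREFIX + body)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Usage with a made-up completion; the "question" key is illustrative only.
sample = ' "history",\n  "question": "..."\n}\n```<|eot_id|>'
print(parse_completion(sample))
```

Returning None for completions that fail to parse makes it easy to drop truncated generations, which can happen when max_new_tokens is reached before the object is closed.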

Citation

BibTeX:

@misc{cui2023adainstructadaptinginstructiongenerators,
      title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning}, 
      author={Wanyun Cui and Qianle Wang},
      year={2023},
      eprint={2310.04484},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2310.04484}, 
}

@misc{Genstruct, 
      url={https://huggingface.co/NousResearch/Genstruct-7B},
      title={Genstruct},
      author={euclaise}
}

Framework versions

  • PEFT 0.11.0