---
base_model: MLP-KTLim/llama-3-Korean-Bllossom-8B
library_name: peft
license: llama3
datasets:
- iknow-lab/ko-genstruct-v1
---

# Ko-genstruct v0.1

Ko-genstruct is a model that generates the instructions needed for instruction tuning from a given document. It can produce two types of instructions: exam questions and writing tasks.
The model was inspired by [Ada-instruct](https://arxiv.org/abs/2310.04484) and [Genstruct](https://huggingface.co/NousResearch/Genstruct-7B).

It can be used for the following purposes:
- Generating questions from a given text to train a retrieval model
- Creating instruction-tuning training data by generating instructions with Ko-genstruct and then producing the answers with another LLM
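
The second use case can be sketched as a small loop. This is only an illustration under stated assumptions: `build_pairs`, `make_instruction`, and `make_answer` are hypothetical names that do not come from the Ko-genstruct repository. `make_instruction` would wrap a call to Ko-genstruct, and `make_answer` would wrap whichever LLM writes the answers.

```python
from typing import Callable, Dict, List


def build_pairs(
    documents: List[Dict[str, str]],
    make_instruction: Callable[[str, str], str],
    make_answer: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Turn documents into instruction/answer pairs for instruction tuning.

    Each document is a dict with "title" and "text" keys (an assumed layout).
    """
    pairs = []
    for doc in documents:
        instruction = make_instruction(doc["title"], doc["text"]).strip()
        if not instruction:
            continue  # skip documents for which nothing usable was generated
        pairs.append({"instruction": instruction, "output": make_answer(instruction)})
    return pairs


if __name__ == "__main__":
    # Stub callables stand in for Ko-genstruct and the answering LLM.
    docs = [{"title": "Photosynthesis", "text": "Plants convert light into energy."}]
    demo = build_pairs(
        docs,
        make_instruction=lambda title, text: f"Explain the main idea of '{title}'.",
        make_answer=lambda instruction: "(answer from another LLM)",
    )
    print(demo)
```

Passing the two models in as callables keeps the loop independent of how either model is loaded or served.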

## Details
- **Developed by:** [iKnow-Lab](https://github.com/iKnowLab-Projects/ko-genstruct)
- **License:** llama3
- **LoRA-tuned from model:** [MLP-KTLim/llama-3-Korean-Bllossom-8B](https://huggingface.co/MLP-KTLim/llama-3-Korean-Bllossom-8B)

## Usage

### Question generation
Use the example below to generate instructions from a given document. There are two prompt types: exam questions and writing tasks.

```python
import transformers
import peft  # not referenced directly, but must be installed for model.load_adapter

model_id = "MLP-KTLim/llama-3-Korean-Bllossom-8B"
peft_model_id = "iknow-lab/ko-genstruct-v0.1"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto").eval()

model.load_adapter(peft_model_id, revision="epoch-1")


# Title and body text of the source document to generate instructions from.
title = ""
text = ""

# The prompt templates are kept in Korean, as in the original card, because they are model inputs.
# PROMPT_QA asks for an expert-level exam question based on the document, returned as JSON.
PROMPT_QA = """당신은 시험문제 출제위원입니다. 다음 자료에 기반하여 전문가 수준의 시험문제를 출제할 것입니다. 자료를 바탕으로 지시사항에 맞는 결과물을 json 형식으로 반환해주세요.

1. 생성한 문제는 실생활에서 사용하는 질문의 말투를 사용해야 합니다(~무엇인가요? ~작성해주세요. ~ 어떻게 해야하죠?)
2. 먼저 고등학교 수준의 문제를 생성하고, 이를 전문가 수준으로 고난이도 문제로 향상해주세요. 각 문제는 반드시 제시된 자료를 바탕으로 만들어져야 합니다. 연관성이 적더라도, 창의적인 아이디어로 해당 자료를 활용하세요.
3. 문제에는 답안 작성에 필요한 내용을 주어진 자료에서 추출해서 함께 제공해야합니다.
4. 출제할 문제의 과목 후보는 다음과 같습니다: 글쓰기, 한국어, 영어, 수학, 사회과학, 과학, 역사 문화예술, 법, 도덕, 정치, 종교, 외국어, 경제, 경영, 의료, 공학, 인문학 등 - 후보에 없어도, 적절한 과목을 자유롭게 말할 수 있다.

# 제목: {title}
# 자료:
{text}"""

# PROMPT_WRITING asks for an expert-level writing task based on the document, returned as JSON.
PROMPT_WRITING = """당신은 글쓰기 시험문제 출제위원입니다. 다음 자료에 기반하여 전문가 수준의 시험문제를 출제할 것입니다. 자료를 바탕으로 지시사항에 맞는 결과물을 json 형식으로 반환해주세요.

1. 생성한 문제는 실생활에서 사용하는 질문의 말투를 사용해야 합니다(~무엇인가요? ~작성해주세요. ~ 어떻게 해야하죠?)
2. 먼저 고등학교 수준의 문제를 생성하고, 이를 전문가 수준으로 고난이도 문제로 향상해주세요. 각 문제는 반드시 제시된 자료를 바탕으로 만들어져야 합니다. 연관성이 적더라도, 창의적인 아이디어로 해당 자료를 활용하세요.
3. 문제에는 글쓰기 작성에 필요한 내용을 주어진 자료에서 추출해서 함께 제공해야합니다.
4. 출제할 문제의 주제 후보는 다음과 같습니다. 이 중에서 적절한 주제를 3가지 선택하세요: 이력서, 노래가사, 시 혹은 소설, 에세이, 극본, 시나리오, 여행일기, 여행계획서, 요리레시피, 해설, 자기소개서, 편지, 이메일, 리뷰 및 평가, 소셜 미디어 포스트, 일기, 청원서, 항의서, 쇼핑 리스트, 메모, 연구 논문 및 계획서, 비즈니스 보고서 및 게획서, 기술 문서, 발표자료, 계약서 혹은 법률 문서, 편집 및 출판 문서, 광고 카피라이트, 웹 콘텐츠, 뉴스레터, 연설문, 자기계발서, 분석보고서, 기획안, 제안서

# 제목: {title}
# 자료:
{text}"""

def generate_question(title, text, is_writing: bool = False):
    """Generate one instruction (the continuation of a JSON object) from a document."""
    template = PROMPT_WRITING if is_writing else PROMPT_QA
    messages = [{"role": "user", "content": template.format(title=title, text=text)}]

    # Render the chat template as a string, then prefill the start of the JSON answer.
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    prompt = prompt.strip() + "\n\n```json\n{\n  \"topic\":"

    # The chat template already inserted the special tokens, so do not add them again.
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    outputs = model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,
        eos_token_id=128009,  # <|eot_id|> in the Llama 3 vocabulary
    )

    # Decode only the newly generated tokens, not the prompt.
    question = tokenizer.decode(outputs[0, input_ids.shape[1]:], skip_special_tokens=False)

    return question

print("Question generation test")
for _ in range(5):
    question = generate_question(title, text)
    print(question)

print("Writing generation test")
for _ in range(5):
    question = generate_question(title, text, True)
    print(question)
```
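
Because `generate_question` prefills the prompt with the opening of a JSON object, the returned string is only the continuation of that object. A minimal sketch of how it could be reassembled and parsed follows. It assumes the model closes the object and then the Markdown fence; `parse_generation` and `PREFILL` are illustrative names, and any key other than `topic` is an assumption, since the card does not document the output schema.

```python
import json
from typing import Optional

# The JSON opening that generate_question() appends to the prompt.
PREFILL = '{\n  "topic":'


def parse_generation(generated: str) -> Optional[dict]:
    """Rebuild and parse the JSON object from the model's continuation.

    Returns None when the continuation is not valid JSON, for example
    when generation was cut off by max_new_tokens.
    """
    body = generated.replace("<|eot_id|>", "")
    # Keep only what precedes the closing Markdown fence, if there is one.
    body = body.split("```", 1)[0]
    try:
        parsed = json.loads(PREFILL + body)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Since sampling is enabled, some generations may be truncated or malformed; dropping the ones that return `None` is one simple way to filter them.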

## Citation
**BibTeX:**

```
@misc{cui2023adainstructadaptinginstructiongenerators,
      title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning}, 
      author={Wanyun Cui and Qianle Wang},
      year={2023},
      eprint={2310.04484},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2310.04484}, 
}

@misc{Genstruct, 
      url={https://huggingface.co/NousResearch/Genstruct-7B}, 
      title={Genstruct}, 
      author={"euclaise"}
}
```

## Model Card Authors
- 김희규 (khk6435@ajou.ac.kr)

### Framework versions
- PEFT 0.11.0