Update README.md

README.md (CHANGED)
---
license: gemma
datasets:
- wisenut-nlp-team/llama_ko_smr
base_model:
- google/gemma-2-2b-it
tags:
- summary
- finetuned
---

# Gemma LLM Model Fine-Tuning for a Technical Summarization Chatbot

This Gemma LLM model is fine-tuned specifically for use in a technical summarization chatbot. The chatbot leverages the model's ability to understand and summarize complex technical content, making it easier for users to engage with technical materials. Fine-tuning focuses on accurately capturing the essential points of dense technical text and producing concise, user-friendly summaries, with the goal of improving the user experience wherever quick, reliable technical insights are needed.

## Table of Contents

1. [Dataset](#dataset)
2. [Model](#model)

## Dataset

The dataset used for this project is sourced from the Hugging Face Hub, specifically the [wisenut-nlp-team/llama_ko_smr](https://huggingface.co/datasets/wisenut-nlp-team/llama_ko_smr) collection. It contains several types of summarization data: document summaries, book summaries, research paper summaries, TV script summaries, Korean dialogue summaries, and technical/scientific summaries. Each entry consists of an instruction, the main text, and the corresponding summary.

Instead of limiting training to the technical and scientific subset, I used the entire dataset to expose the model to a wider variety of content types. This makes the model more well-rounded and better able to handle diverse summarization tasks across domains.
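
For reference, the dataset can be inspected with the `datasets` library. This is a minimal sketch; the `train` split name is an assumption, and the field names follow the example entry shown below.

```python
# Minimal sketch: load the summarization dataset from the Hugging Face Hub.
from datasets import load_dataset

dataset = load_dataset("wisenut-nlp-team/llama_ko_smr")
print(dataset)  # available splits and sizes ("train" is assumed below)

sample = dataset["train"][0]
print(sample["instruction"])  # fields as in the example entry below
print(sample["output"])
```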

Here is an example from the dataset:

```json
{
  "instruction": "์ด ๊ธ์ ์ฃผ์ ๋ด์ฉ์ ์งง๊ฒ ์ค๋ชํด ์ฃผ์ค ์ ์์ต๋๊น?",
  "input": "๋ถํ ์ฐ๊ทน์ ๋ํ ๋์ ํ๊ตฌ๋ ํด๋ฐฉ๊ณต๊ฐ์ ๋ถ์ผ๋ก ์ฌ๋ผ์ ธ ๊ฐ ์๋ง์ ์ฐ๊ทน์ธ๋ค์ ํ์ ์ ์ฐพ์๋ณด๊ณ ์ ํ๋ ๋จ์ํ ํธ๊ธฐ์ฌ์์ ์์๋์๋ค. ํด๋ฐฉ๊ณต๊ฐ์์ ํ๋ํ๋ ์ฐ๊ทน์ธ์ ๋๋ค์๊ฐ ๋ฉโค์๋ถ์ ๊ณผ์ ์ ๊ฑฐ์ณ ๋ถํ ์ฐ๊ทน๊ณ์ ์๋ฆฌ๋ฅผ ์ก์๊ธฐ ๋๋ฌธ์ด๋ค. ๊ทธ ์์๋ ๊ทน์๊ฐ ์ก์, ํจ์ธ๋, ๋ฐ์ํธ, ์กฐ์์ถ, ์ฐ์ถ๊ฐ ์ด์ํฅ, ์์์ผ, ์ ๊ณ ์ก, ๋ฌด๋๋ฏธ์ ๊ฐ ๊น์ผ์, ๊ฐํธ, ๋ฐฐ์ฐ ํฉ์ฒ , ๊น์ ์, ๋ฌธ์๋ด, ๋ง๋ด๊ฐ ์ ๋ถ์ถ ๋ฑ ๊ธฐ๋ผ์ฑ ๊ฐ์ ๋ฉค๋ฒ๋ค์ด ํฌํจ๋์ด ์์๋ค. ๊ทธ ์ซ์๋ก๋ง ๋ณธ๋ค๋ฉด ์ผ์ ๊ฐ์ ๊ธฐ ์์ธ์ ์ฐ๊ทน๊ณ๊ฐ ํต์ผ๋ก ํ์์ผ๋ก ์ฎ๊ฒจ๊ฐ ์์ด์๋ค. ๊ทธ๋ ์ง๋ง ์ด์ ๋ถํ ์ฐ๊ทน์์ ๋ ์ด์ ๊ทธ๋ค์ ์กด์ฌ๋ฅผ ํ์ธํ๊ธฐ ์ด๋ ค์ด ์ํฉ์ด๋ค. ๊ทธ๋ค์ ๋จ์์๋ ๋ถ์์๋ ์๊ณ์์ ์์ํ ์ฌ๋ผ์ ธ๋ฒ๋ฆฐ โ์์ด๋ฒ๋ฆฐ ์ธ๋โ ๊ทธ ์์ฒด์ด๋ค. ๊ทธ๋ค์ ํ์ ์ ์ฐพ๋ ๊ฒ์ ์ฐจ๋ผ๋ฆฌ ๊ณ ๊ณ ํ์ ๊ณผ์ ๊ฐ ๋์๋ค. ๊ทธ๋ค์ด ์ญ์ฌ์ ์ ํธ์ผ๋ก ์ฌ๋ผ์ง ๊ทธ ์๋ฆฌ์ ์ค๋์ ๋ถํ ์ฐ๊ทน์ด ์ฑ์ฑ์ฒ๋ผ ์์ฉ์ ์๋ํ๊ณ ์๋ค. ์ค๋๋ ์ ๋ถํ ์ฐ๊ทน์ ๋ชจ๋๊ฐ ์ฃผ์ฒด์ฌ์ค์ฃผ์์ ์๊ฐํ์ฌ ๋ง๋ค์ด์ง๋ ์ด๋ฅธ๋ฐ โ<์ฑํฉ๋น>์ ํ๋ช์ฐ๊ทนโ ์ผ์์ด๋ค. 1978๋๊ตญ๋ฆฝ์ฐ๊ทน๋จ์ <์ฑํฉ๋น> ๊ณต์ฐ์ ์ฑ๊ณผ๋ฅผ ๋ณธ๋ณด๊ธฐ๋ก ์ผ์ ๋ชจ๋ ์ฐ๊ทน์ด โ๋ฐ๋ผ ๋ฐฐ์ฐ๊ธฐโ๋ฅผ ํ๊ณ ์๊ธฐ ๋๋ฌธ์ด๋ค. ๋ถํ์ ์ฐ๊ทน๊ณผ ํฌ๊ณก์ ์ ์ ์์ ๋ด๋ ค ์๋ ๋จ์ฑ์ (ๅฎ่ฒ็) ๋ฌธํํ๋ก ์์ ๊ฐํ ์๋ค. ํ๋ช์ฐ๊ทน <์ฑํฉ๋น>(1978)์ ๋ณธ๋ณด๊ธฐ๋ ํ๋ช๊ฐ๊ทน <ํผ๋ฐ๋ค>(1971)์ด๋ฉฐ, ๊ทธ ๊ทผ์ ์๋ 1960๋๋๋ถํฐ ์์๋ ๊น์ ์ผ ์ฃผ๋์ ๋ฌธํ์์ ํ๋ช์ด ๊ฐ๋ก๋์ฌ ์๋ค. ๋ถํ ์ฐ๊ทน์ ์ฐฝ์๊ณผ ํฅ์ , ๊ทธ ๋ชจ๋ ๊ณผ์ ์์ ๊น์ ์ผ์ ๊ทธ๋ฆผ์์ ๋ง๋ฅ๋จ๋ฆฌ์ง ์์ ์ ์๋ค. ์ต๊ทผ์ ๋ฐฉ๋ฌธํ ์กฐ์ ์์ ์ํ์ดฌ์์ ์ ์๋ โ๋ฌธํ์ฑํ๋ช์ฌ์ ๊ดโ(๊น์ ์ผ๊ด)์๋ 1960๋๋ ์ค๋ฐ๋ถํฐ 2000๋๋๊น์ง 40๋๋์ ๊น์ ์ผ์ ๋ฌธํ์์ ๋ถ๋ฌธ ์ง๋๊ฐ 11,890๊ฑด์ด๋ฉฐ, ๊ทธ ์ค ๋ฌธํ์์ ๊ธฐ๊ด์ ์ง์ ๋ฐฉ๋ฌธํ์ฌ ์ง๋ํ ์ด๋ฅธ๋ฐ โํ์ง์ง๋โ๊ฐ 1,770๊ฑด์ด๋ผ๋ ์๋ดํ์ด ์์๋ค. ๋ถํ ์ฐ๊ทน์ด ๊น์ ์ผ๊ณผ ์ฃผ์ฒด์ฌ์์ด๋ผ๋ ํค์๋๋ฅผ ๋ ๋ ์กด์ฌํ ์ ์๋ค๋ ๊ฒ์ ๋จ์ ์ผ๋ก ๋งํด ์ค๋ค.",
  "output": "ํด๋ฐฉ๊ณต๊ฐ์์ ํ๋ํ ๋๋ค์์ ์ฐ๊ทน์ธ์ ๋ฉยท์๋ถ์ ํตํด ๋ถํ ์ฐ๊ทน๊ณ์ ์๋ฆฌ ์ก์์ง๋ง ์ด์ ๋ถํ ์ฐ๊ทน์์ ๊ทธ๋ค์ ๋ณด๊ธฐ ์ด๋ ต๋ค. ๊ทธ ์๋ฆฌ์ ์ง๊ธ ๋ถํ ์ฐ๊ทน์ด ์๋ฆฌ ์ก์๋ค. 1978๋๊ตญ๋ฆฝ์ฐ๊ทน๋จ <์ฑํฉ๋น> ๊ณต์ฐ์ ํ ๋๋ก ๋ถํ์ ๋ชจ๋ ์ฐ๊ทน์ โ<์ฑํฉ๋น>์ ํ๋ช์ฐ๊ทนโ ์ผ์์ด๋ค. ๋ถํ ์ฐ๊ทน๊ณผ ํฌ๊ณก์ ๋จ์ฑ์ ๋ฌธํํ๋ก์ ๋ฌถ์ฌ์๊ณ , ๊ทธ ์์์ ๊น์ ์ผ ์ฃผ๋ ๋ฌธํ์์ ํ๋ช์ด ์๊ณ , ๋ถํ ์ฐ๊ทน์ ์ฐฝ์๊ณผ ํฅ์ ๋ฑ ๊น์ ์ผ ํ์ ์ด ์๋ค. ๊น์ ์ผ์ ๋ฌธํ์์ ๋ถ๋ฌธ ์ง๋ ๊ธฐ๋ก์ ๋ถํ ์ฐ๊ทน์ด ๊น์ ์ผ๊ณผ ์ฃผ์ฒด์ฌ์์ ๋ ๋ ์ ์๋ ๊ฒ์ ๋ณด์ฌ์ค๋ค."
}
```

## Model

This model is built on the gemma-2-2b-it base and fine-tuned with BitsAndBytes quantization for memory optimization, LoRA for parameter-efficient adaptation, and TRL's SFTTrainer framework. The fine-tuned weights are published on Hugging Face in this repository.

### Highlights

1. **LoRA configuration for model efficiency**: The model is fine-tuned with Low-Rank Adaptation (LoRA) using r=6, lora_alpha=8, and a dropout of 0.05, which adapts the model through small low-rank update matrices instead of modifying all layers.

2. **Quantization for memory optimization**: BitsAndBytesConfig loads the model in 4-bit precision with nf4 quantization, reducing memory usage enough to fine-tune on memory-constrained hardware.

3. **Fine-tuning parameters**: Training uses SFTTrainer with a batch size of 1, gradient_accumulation_steps=4, and max_steps=3000, together with the paged 8-bit AdamW optimizer (paged_adamw_8bit) for better performance in a memory-constrained environment.

A configuration sketch combining these settings is shown below.

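The following is a minimal, illustrative sketch of how these three pieces might fit together with `peft`, `bitsandbytes`, and `trl`. Only the values called out above (r=6, lora_alpha=8, dropout 0.05, 4-bit nf4 quantization, batch size 1, gradient_accumulation_steps=4, max_steps=3000, paged_adamw_8bit) come from this README; the target modules, compute dtype, learning rate, prompt format, and output directory are assumptions, and the exact SFTTrainer arguments vary across TRL versions.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

# 4-bit nf4 quantization, as described above (compute dtype is an assumption).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it", quantization_config=bnb_config, device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

# LoRA settings from the description above; target_modules is an assumption.
lora_config = LoraConfig(
    r=6,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Trainer settings from the description; learning_rate is an assumption.
args = TrainingArguments(
    output_dir="gemma-2b-it-sum-ko-science",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=3000,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
)

# Flatten each example into a single training string (assumed prompt format).
def to_text(example):
    return {"text": f"{example['instruction']}\n\n{example['input']}\n\n{example['output']}"}

train_ds = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train").map(to_text)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    peft_config=lora_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
)
trainer.train()
```
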
## Inference Example Code

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

FINETUNE_MODEL = "./gemma-2b-it-sum-ko-science"

# Load the fine-tuned model and its tokenizer on the first GPU.
finetune_model = AutoModelForCausalLM.from_pretrained(FINETUNE_MODEL, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(FINETUNE_MODEL)

# Text-generation pipeline around the fine-tuned model.
pipe_finetuned = pipeline("text-generation", model=finetune_model, tokenizer=tokenizer, max_new_tokens=512)

# Document to summarize (a second sample document is left commented out).
doc = r"๊ทธ๋ ๊ฒ ๋ฑ์ฅํ ๊ฒ์ด ์์์๊ณ๋ค. ์์๊ฐ 1์ด ๋์ ์์ง์ด๋ ํ์์ธ โ๊ณ ์ ์ง๋์โ๋ฅผ ์ด์ฉํด ์ ํํ 1์ด๋ฅผ ์ธก์ ํ๋ค. ์์ ์์ ์๋ ์ ์๋ค์ ํน์ ์๋์ง ์ํ๋ก ์๋ค. ์ด ์ํ์์ ๋ค๋ฅธ ์ํ๋ก ๋ณํํ๋ ค๋ฉด ์๋์ง๋ฅผ ๋ ์ํ์ ์ฐจ์ด๋งํผ ํก์ํ๊ฑฐ๋ ๋ฐฉ์ถํด์ผ ํ๋ค. ์ ์๊ฐ ์๋์ง๋ฅผ ์ป๊ธฐ ์ํด(๋ค๋ฅธ ์๋์ง ์ํ๋ก ๋ณํ๊ธฐ ์ํด) ์ ์๊ธฐํ๋ฅผ ํก์ํ ๋ ์ง๋์ด ๋ฐ์ํ๋๋ฐ, ์ด๊ฒ์ด ๋ฐ๋ก ๊ณ ์ ์ง๋์๋ค."
# doc = r"์ฒ๋๋ง๋์ง๋๋ ๋ณํ์ง ์๋ ๊ณณ์ด ์์๊น. ๊ณผํ์๋ค์ ์ฒ๋๋ง๋์ ๋์ด ์์ต ๋์ด ์ง๋๋ 1์ด์ ์ค์ฐจ๋ ์์ด ์ผ์ ํ๊ฒ ํ๋ฅด๋ ์๊ณ๋ฅผ ๊ฐ๋ฐํ๊ณ ์๋ค. ์ง๊ตฌ๊ฐ ํ ๋ฐํด ์์ ํ๋ ์๊ฐ์ 1์ผ์ด๋ผ๊ณ ํ๋ค. ์ด๊ฒ์ ์ชผ๊ฐ ์๊ฐ๊ณผ ๋ถ, ์ด๋ฅผ ์ ํ๋ค. ํ์ง๋ง ์ง๊ตฌ ์์ ์๋๋ ์๊ฐ์ ๋ฐ๋ผ ๋ณํ๋ฏ๋ก ์๊ฐ์ ์ค์ฐจ๊ฐ ์๊ฒผ๋ค. ์๋ก์ด ์๊ฐ์ ์ ์๊ฐ ํ์ํด์ง ์ด์ ๋ค."

# Wrap the document in a summarization request and apply the chat template.
messages = [
    {
        "role": "user",
        "content": "๋ค์ ๊ธ์ ์์ฝํด์ฃผ์ธ์:\n\n{}".format(doc)
    }
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Sample a summary and print only the newly generated part.
outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    add_special_tokens=True
)
print(outputs[0]["generated_text"][len(prompt):])
```