swanii committed on
Commit 51121c3 • 1 Parent(s): c1f11ec

Update README.md

Files changed (1)
  1. README.md +86 -130
README.md CHANGED
@@ -1,130 +1,86 @@
-
- ---
- license: gemma
- datasets:
- - wisenut-nlp-team/llama_ko_smr
- base_model:
- - google/gemma-2-2b-it
- tags:
- - summary
- - finetuned
- ---
-
- # Gemma LLM Model Fine-Tuning for Technical Summarization Chat Bot
-
- The Gemma model is fine-tuned here specifically for use in a technical summarization chatbot. The chatbot leverages the model's ability to understand and summarize complex technical content, making it easier for users to engage with technical material. Fine-tuning aims to improve how accurately the model captures the essential points of dense technical information and how concise and user-friendly its summaries are. The end goal is a better user experience in environments where quick, reliable technical insights are required.
-
- ## Table of Contents
-
- 1. [Dataset](#dataset)
- 2. [Model](#model)
- 3. [Project Structure](#project-structure)
- 4. [App](#app)
- 5. [Demo](#demo)
-
- ## Dataset
-
- The dataset used for this project is sourced from the Hugging Face Hub, specifically the [wisenut-nlp-team/llama_ko_smr](https://huggingface.co/datasets/wisenut-nlp-team/llama_ko_smr) collection. It contains several types of summarization data: document summaries, book summaries, research paper summaries, TV script summaries, Korean dialogue summaries, and technical/scientific summaries. Each entry consists of an instruction, the main text, and its corresponding summary.
-
- Instead of limiting training to just the technical and scientific summarization data, I opted to use the entire dataset to expose the model to a wider variety of content types. This keeps the model well-rounded and able to handle diverse summarization tasks, improving its overall performance across domains.
-
- Here is an example from the dataset:
-
- ```json
- {
-   "instruction": "이 글의 주요 내용을 짧게 설명해 주실 수 있습니까?",
-   "input": "북한 연극에 대한 나의 탐구는 해방공간에 북으로 사라져 간 수많은 연극인들의 행적을 찾아보고자 하는 단순한 호기심에서 시작되었다. 해방공간에서 활동하던 연극인의 대다수가 납·월북의 과정을 거쳐 북한 연극계에 자리를 잡았기 때문이다. 그 안에는 극작가 송영, 함세덕, 박영호, 조영출, 연출가 이서향, 안영일, 신고송, 무대미술가 김일영, 강호, 배우 황철, 김선영, 문예봉, 만담가 신불출 등 기라성 같은 멤버들이 포함되어 있었다. 그 숫자로만 본다면 일제강점기 서울의 연극계가 통으로 평양으로 옮겨간 셈이었다. 그렇지만 이제 북한 연극에서 더 이상 그들의 존재를 확인하기 어려운 상황이다. 그들은 남에서도 북에서도 시계에서 영원히 사라져버린 ‘잃어버린 세대’ 그 자체이다. 그들의 흔적을 찾는 것은 차라리 고고학의 과제가 되었다. 그들이 역사의 저편으로 사라진 그 자리에 오늘의 북한 연극이 성채처럼 위용을 자랑하고 있다. 오늘날의 북한 연극은 모두가 주체사실주의에 입각하여 만들어지는 이른바 ‘<성황당>식 혁명연극’ 일색이다. 1978년 국립연극단의 <성황당> 공연의 성과를 본보기로 삼아 모든 연극이 ‘따라 배우기’를 하고 있기 때문이다. 북한의 연극과 희곡은 정점에서 내려 쏟는 단성적(單聲的) 문화회로 안에 갇혀 있다. 혁명연극 <성황당>(1978)의 본보기는 혁명가극 <피바다>(1971)이며, 그 근저에는 1960년대부터 시작된 김정일 주도의 문화예술혁명이 가로놓여 있다. 북한 연극의 창작과 향유, 그 모든 과정에서 김정일의 그림자에 맞닥뜨리지 않을 수 없다. 최근에 방문한 조선예술영화촬영소에 있는 ‘문화성혁명사적관’(김정일관)에는 1960년대 중반부터 2000년대까지 40년 동안 김정일의 문화예술 부문 지도가 11,890건이며, 그 중 문화예술기관을 직접 방문하여 지도한 이른바 ‘현지지도’가 1,770건이라는 안내판이 있었다. 북한 연극이 김정일과 주체사상이라는 키워드를 떠나 존재할 수 없다는 것을 단적으로 말해 준다.",
-   "output": "해방공간에서 활동한 대다수의 연극인은 납·월북을 통해 북한 연극계에 자리 잡았지만 이제 북한 연극에서 그들을 보기 어렵다. 그 자리에 지금 북한 연극이 자리 잡았다. 1978년 국립연극단 <성황당> 공연을 토대로 북한의 모든 연극은 ‘<성황당>식 혁명연극’ 일색이다. 북한 연극과 희곡은 단성적 문화회로에 묶여있고, 그 시작은 김정일 주도 문화예술혁명이 있고, 북한 연극의 창작과 향유 등 김정일 흔적이 있다. 김정일의 문화예술 부문 지도 기록은 북한 연극이 김정일과 주체사상을 떠날 수 없는 것을 보여준다."
- }
- ```
-
- ## Model
-
- This model is built on the gemma-2-2b-it base and fine-tuned using BitsAndBytes quantization for memory optimization, LoRA for efficient adaptation, and the SFTTrainer framework. The fine-tuned version is the model hosted in this Hugging Face repository.
-
- ### Highlights
-
- 1. **LoRA Configuration for Model Efficiency**: The model is fine-tuned with Low-Rank Adaptation (LoRA) using r=6, lora_alpha=8, and a dropout of 0.05. This adapts the model efficiently without modifying all of its layers.
-
- 2. **Quantization for Memory Optimization**: BitsAndBytesConfig is set to load the model in 4-bit precision with nf4 quantization. This reduces memory usage, making it possible to fine-tune the model on larger datasets.
-
- 3. **Fine-Tuning Parameters**: Fine-tuning uses SFTTrainer with a batch size of 1, gradient_accumulation_steps=4, and max_steps=3000. Training uses the 8-bit AdamW optimizer (paged_adamw_8bit) for better performance in a memory-constrained environment.
-
- ### Example
- **input**
- ```
- 그렇게 등장한 것이 원자시계다. 원자가 1초 동안 움직이는 횟수인 ‘고유진동수’를 이용해 정확한 1초를 측정한다. 원자 속에 있는 전자들은 특정 에너지 상태로 있다. 이 상태에서 다른 상태로 변화하려면 에너지를 두 상태의 차이만큼 흡수하거나 방출해야 한다. 전자가 에너지를 얻기 위해(다른 에너지 상태로 변하기 위해) 전자기파를 흡수할 때 진동이 발생하는데, 이것이 바로 고유진동수다.
- ```
-
- **output**
- ```
- 원자시계는 원자가 1초 동안 움직이는 횟수인 고유진동수를 이용해 정확한 1초를 측정한다.
- ```
-
- ## Project Structure
-
- ```
- gemma_sprint_summary_news_chat_bot/
- ├── app/
- │   ├── assets
- │   ├── components
- │   ├── constants
- │   ├── modules
- │   ├── local_storage.json
- │   ├── main.py
- │   └── requirements.txt
- ├── train/
- │   ├── test
- │   │   └── basic_model_inference.py
- │   └── train.py
- ├── inference.py
- └── README.md
- ```
-
- ## App
-
- ### About The App
-
- TechSum is a user-friendly application that simplifies summarizing articles and text. Given a link to a ScienceTimes article, TechSum automatically retrieves the article and generates a concise summary; users can also paste their own text for a quick summary. Summaries can be saved locally for easy reference. The app streamlines access to key information, making it a convenient tool for anyone seeking quick insights from technical articles or lengthy texts.
-
- ### Getting Started
-
- 1. Clone the repo
-
- ```
- git clone https://github.com/swan-project/gemma_sprint_summary_news_chat_bot.git
- ```
-
- 2. Move to the app directory
-
- ```
- cd app
- ```
-
- 3. Install the required packages
-
- ```
- pip install -r requirements.txt
- ```
-
- 4. Add your Hugging Face token to the env file
-
- ```
- HUGGINGFACE_TOKEN=YourKey
- ```
-
- 5. Run the Flet app
-
- ```
- flet run
- ```
-
- ## Demo
-
- https://github.com/user-attachments/assets/9ab61bcd-4174-4696-a2bd-9799ba0f867d
-
- **Link Mode**
-
- **Text Mode**
 
+
+ ---
+ license: gemma
+ datasets:
+ - wisenut-nlp-team/llama_ko_smr
+ base_model:
+ - google/gemma-2-2b-it
+ tags:
+ - summary
+ - finetuned
+ ---
+
+ # Gemma LLM Model Fine-Tuning for Technical Summarization Chat Bot
+
+ The Gemma model is fine-tuned here specifically for use in a technical summarization chatbot. The chatbot leverages the model's ability to understand and summarize complex technical content, making it easier for users to engage with technical material. Fine-tuning aims to improve how accurately the model captures the essential points of dense technical information and how concise and user-friendly its summaries are. The end goal is a better user experience in environments where quick, reliable technical insights are required.
+
+ ## Table of Contents
+
+ 1. [Dataset](#dataset)
+ 2. [Model](#model)
+ 3. [Inference Example Code](#inference-example-code)
+
+ ## Dataset
+
+ The dataset used for this project is sourced from the Hugging Face Hub, specifically the [wisenut-nlp-team/llama_ko_smr](https://huggingface.co/datasets/wisenut-nlp-team/llama_ko_smr) collection. It contains several types of summarization data: document summaries, book summaries, research paper summaries, TV script summaries, Korean dialogue summaries, and technical/scientific summaries. Each entry consists of an instruction, the main text, and its corresponding summary.
+
+ Instead of limiting training to just the technical and scientific summarization data, I opted to use the entire dataset to expose the model to a wider variety of content types. This keeps the model well-rounded and able to handle diverse summarization tasks, improving its overall performance across domains.
+
+ Here is an example from the dataset:
+
+ ```json
+ {
+   "instruction": "이 글의 주요 내용을 짧게 설명해 주실 수 있습니까?",
+   "input": "북한 연극에 대한 나의 탐구는 해방공간에 북으로 사라져 간 수많은 연극인들의 행적을 찾아보고자 하는 단순한 호기심에서 시작되었다. 해방공간에서 활동하던 연극인의 대다수가 납·월북의 과정을 거쳐 북한 연극계에 자리를 잡았기 때문이다. 그 안에는 극작가 송영, 함세덕, 박영호, 조영출, 연출가 이서향, 안영일, 신고송, 무대미술가 김일영, 강호, 배우 황철, 김선영, 문예봉, 만담가 신불출 등 기라성 같은 멤버들이 포함되어 있었다. 그 숫자로만 본다면 일제강점기 서울의 연극계가 통으로 평양으로 옮겨간 셈이었다. 그렇지만 이제 북한 연극에서 더 이상 그들의 존재를 확인하기 어려운 상황이다. 그들은 남에서도 북에서도 시계에서 영원히 사라져버린 ‘잃어버린 세대’ 그 자체이다. 그들의 흔적을 찾는 것은 차라리 고고학의 과제가 되었다. 그들이 역사의 저편으로 사라진 그 자리에 오늘의 북한 연극이 성채처럼 위용을 자랑하고 있다. 오늘날의 북한 연극은 모두가 주체사실주의에 입각하여 만들어지는 이른바 ‘<성황당>식 혁명연극’ 일색이다. 1978년 국립연극단의 <성황당> 공연의 성과를 본보기로 삼아 모든 연극이 ‘따라 배우기’를 하고 있기 때문이다. 북한의 연극과 희곡은 정점에서 내려 쏟는 단성적(單聲的) 문화회로 안에 갇혀 있다. 혁명연극 <성황당>(1978)의 본보기는 혁명가극 <피바다>(1971)이며, 그 근저에는 1960년대부터 시작된 김정일 주도의 문화예술혁명이 가로놓여 있다. 북한 연극의 창작과 향유, 그 모든 과정에서 김정일의 그림자에 맞닥뜨리지 않을 수 없다. 최근에 방문한 조선예술영화촬영소에 있는 ‘문화성혁명사적관’(김정일관)에는 1960년대 중반부터 2000년대까지 40년 동안 김정일의 문화예술 부문 지도가 11,890건이며, 그 중 문화예술기관을 직접 방문하여 지도한 이른바 ‘현지지도’가 1,770건이라는 안내판이 있었다. 북한 연극이 김정일과 주체사상이라는 키워드를 떠나 존재할 수 없다는 것을 단적으로 말해 준다.",
+   "output": "해방공간에서 활동한 대다수의 연극인은 납·월북을 통해 북한 연극계에 자리 잡았지만 이제 북한 연극에서 그들을 보기 어렵다. 그 자리에 지금 북한 연극이 자리 잡았다. 1978년 국립연극단 <성황당> 공연을 토대로 북한의 모든 연극은 ‘<성황당>식 혁명연극’ 일색이다. 북한 연극과 희곡은 단성적 문화회로에 묶여있고, 그 시작은 김정일 주도 문화예술혁명이 있고, 북한 연극의 창작과 향유 등 김정일 흔적이 있다. 김정일의 문화예술 부문 지도 기록은 북한 연극이 김정일과 주체사상을 떠날 수 없는 것을 보여준다."
+ }
+ ```
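+
+ For a quick look at these fields, the collection can be loaded with the `datasets` library. This is a minimal sketch; the `"train"` split name is an assumption, and the field names follow the example above:
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the full collection rather than only the technical/scientific subset
+ dataset = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train")  # assumption: "train" split
+
+ example = dataset[0]
+ print(example["instruction"])  # the summarization request
+ print(example["input"])        # the source text
+ print(example["output"])       # the reference summary
+ ```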
+
+ ## Model
+
+ This model is built on the gemma-2-2b-it base and fine-tuned using BitsAndBytes quantization for memory optimization, LoRA for efficient adaptation, and the SFTTrainer framework. The fine-tuned version is the model hosted in this Hugging Face repository.
+
+ ### Highlights
+
+ 1. **LoRA Configuration for Model Efficiency**: The model is fine-tuned with Low-Rank Adaptation (LoRA) using r=6, lora_alpha=8, and a dropout of 0.05. This adapts the model efficiently without modifying all of its layers.
+
+ 2. **Quantization for Memory Optimization**: BitsAndBytesConfig is set to load the model in 4-bit precision with nf4 quantization. This reduces memory usage, making it possible to fine-tune the model on larger datasets.
+
+ 3. **Fine-Tuning Parameters**: Fine-tuning uses SFTTrainer with a batch size of 1, gradient_accumulation_steps=4, and max_steps=3000. Training uses the 8-bit AdamW optimizer (paged_adamw_8bit) for better performance in a memory-constrained environment. These settings are put together in the configuration sketch after this list.
+
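+ For reference, the Highlights above can be assembled into a training script roughly as follows. This is a minimal sketch assuming the standard transformers, peft, and trl APIs (argument names vary across trl versions); the compute dtype, LoRA target modules, prompt template, and output path are illustrative assumptions, not values taken from the original training script.
+
+ ```python
+ import torch
+ from datasets import load_dataset
+ from peft import LoraConfig
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
+ from trl import SFTTrainer
+
+ # 4-bit NF4 quantization (Highlight 2)
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: compute dtype is not stated above
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "google/gemma-2-2b-it", quantization_config=bnb_config, device_map={"": 0}
+ )
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
+
+ # LoRA settings (Highlight 1)
+ peft_config = LoraConfig(
+     r=6,
+     lora_alpha=8,
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: not stated above
+     task_type="CAUSAL_LM",
+ )
+
+ dataset = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train")  # assumption: "train" split
+
+ # Turn each (instruction, input, output) triple into one training text;
+ # the exact prompt template is an assumption.
+ def formatting_func(batch):
+     texts = []
+     for i in range(len(batch["instruction"])):
+         texts.append(f"{batch['instruction'][i]}\n\n{batch['input'][i]}\n\n{batch['output'][i]}")
+     return texts
+
+ # Trainer settings (Highlight 3)
+ args = TrainingArguments(
+     output_dir="outputs",  # assumption: illustrative output path
+     per_device_train_batch_size=1,
+     gradient_accumulation_steps=4,
+     max_steps=3000,
+     optim="paged_adamw_8bit",
+ )
+
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     args=args,
+     train_dataset=dataset,
+     peft_config=peft_config,
+     formatting_func=formatting_func,
+ )
+ trainer.train()
+ ```
+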
+ ## Inference Example Code
+
+ The snippet below loads the fine-tuned checkpoint and generates a summary for a sample document:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
+
+ BASE_MODEL = "google/gemma-2-2b-it"  # base checkpoint (for reference; not loaded here)
+ FINETUNE_MODEL = "./gemma-2b-it-sum-ko-science"
+
+ # Load the fine-tuned model and its tokenizer
+ finetune_model = AutoModelForCausalLM.from_pretrained(FINETUNE_MODEL, device_map={"": 0})
+ tokenizer = AutoTokenizer.from_pretrained(FINETUNE_MODEL)
+
+ pipe_finetuned = pipeline("text-generation", model=finetune_model, tokenizer=tokenizer, max_new_tokens=512)
+
+ doc = r"그렇게 등장한 것이 원자시계다. 원자가 1초 동안 움직이는 횟수인 ‘고유진동수’를 이용해 정확한 1초를 측정한다. 원자 속에 있는 전자들은 특정 에너지 상태로 있다. 이 상태에서 다른 상태로 변화하려면 에너지를 두 상태의 차이만큼 흡수하거나 방출해야 한다. 전자가 에너지를 얻기 위해(다른 에너지 상태로 변하기 위해) 전자기파를 흡수할 때 진동이 발생하는데, 이것이 바로 고유진동수다."
+ #doc = r"천년만년 지나도 변하지 않는 곳이 있을까. 과학자들은 천년만년을 넘어 수억 년이 지나도 1초의 오차도 없이 일정하게 흐르는 시계를 개발하고 있다. 지구가 한 바퀴 자전하는 시간을 1일이라고 한다. 이것을 쪼개 시간과 분, 초를 정했다. 하지만 지구 자전 속도는 시간에 따라 변하므로 시간에 오차가 생겼다. 새로운 시간의 정의가 필요해진 이유다."
+
+ # Wrap the document in the chat template the model was tuned with
+ messages = [
+     {
+         "role": "user",
+         "content": "다음 글을 요약해주세요:\n\n{}".format(doc)
+     }
+ ]
+ prompt = pipe_finetuned.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ outputs = pipe_finetuned(
+     prompt,
+     do_sample=True,
+     temperature=0.2,
+     top_k=50,
+     top_p=0.95,
+     add_special_tokens=True
+ )
+ # Strip the prompt, leaving only the generated summary
+ print(outputs[0]["generated_text"][len(prompt):])
+ ```