---
license: gemma
datasets:
- wisenut-nlp-team/llama_ko_smr
base_model:
- google/gemma-2-2b-it
tags:
- summary
- finetuned
---
# Gemma LLM Fine-Tuning for a Technical Summarization Chatbot
The Gemma LLM is fine-tuned specifically for use in a technical summarization chatbot. The chatbot leverages the model's ability to understand and summarize complex technical content, making it easier for users to engage with technical materials. Fine-tuning aims to improve the model's accuracy in capturing the essential points of dense technical information and producing concise, user-friendly summaries. The end goal is a better user experience in environments where quick, reliable technical insights are required.
## Table of Contents
1. [ Dataset ](#dataset)
2. [ Model ](#model)
3. [ Inference Example Code ](#inference-example-code)
## Dataset
The dataset used for this project is sourced from the Hugging Face repository, specifically from the [wisenut-nlp-team/llama_ko_smr](https://huggingface.co/datasets/wisenut-nlp-team/llama_ko_smr) collection. This dataset contains various types of summarization data, including document summaries, book summaries, research paper summaries, TV content script summaries, Korean dialogue summaries, and technical/scientific summaries. Each entry in the dataset consists of the instruction, main text, and its corresponding summary.
Instead of limiting the training to just the technical and scientific summarization data, I opted to use the entire dataset to expose the model to a wider variety of content types. This decision was made to ensure the model is well-rounded and can handle diverse types of summarization tasks, improving its overall performance across different domains.
Here is an example of the dataset:
```json
{
"instruction": "์ด ๊ธ€์˜ ์ฃผ์š” ๋‚ด์šฉ์„ ์งง๊ฒŒ ์„ค๋ช…ํ•ด ์ฃผ์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ?",
"input": "๋ถํ•œ ์—ฐ๊ทน์— ๋Œ€ํ•œ ๋‚˜์˜ ํƒ๊ตฌ๋Š” ํ•ด๋ฐฉ๊ณต๊ฐ„์— ๋ถ์œผ๋กœ ์‚ฌ๋ผ์ ธ ๊ฐ„ ์ˆ˜๋งŽ์€ ์—ฐ๊ทน์ธ๋“ค์˜ ํ–‰์ ์„ ์ฐพ์•„๋ณด๊ณ ์ž ํ•˜๋Š” ๋‹จ์ˆœํ•œ ํ˜ธ๊ธฐ์‹ฌ์—์„œ ์‹œ์ž‘๋˜์—ˆ๋‹ค. ํ•ด๋ฐฉ๊ณต๊ฐ„์—์„œ ํ™œ๋™ํ•˜๋˜ ์—ฐ๊ทน์ธ์˜ ๋Œ€๋‹ค์ˆ˜๊ฐ€ ๋‚ฉโ€ค์›”๋ถ์˜ ๊ณผ์ •์„ ๊ฑฐ์ณ ๋ถํ•œ ์—ฐ๊ทน๊ณ„์— ์ž๋ฆฌ๋ฅผ ์žก์•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ ์•ˆ์—๋Š” ๊ทน์ž‘๊ฐ€ ์†ก์˜, ํ•จ์„ธ๋•, ๋ฐ•์˜ํ˜ธ, ์กฐ์˜์ถœ, ์—ฐ์ถœ๊ฐ€ ์ด์„œํ–ฅ, ์•ˆ์˜์ผ, ์‹ ๊ณ ์†ก, ๋ฌด๋Œ€๋ฏธ์ˆ ๊ฐ€ ๊น€์ผ์˜, ๊ฐ•ํ˜ธ, ๋ฐฐ์šฐ ํ™ฉ์ฒ , ๊น€์„ ์˜, ๋ฌธ์˜ˆ๋ด‰, ๋งŒ๋‹ด๊ฐ€ ์‹ ๋ถˆ์ถœ ๋“ฑ ๊ธฐ๋ผ์„ฑ ๊ฐ™์€ ๋ฉค๋ฒ„๋“ค์ด ํฌํ•จ๋˜์–ด ์žˆ์—ˆ๋‹ค. ๊ทธ ์ˆซ์ž๋กœ๋งŒ ๋ณธ๋‹ค๋ฉด ์ผ์ œ๊ฐ•์ ๊ธฐ ์„œ์šธ์˜ ์—ฐ๊ทน๊ณ„๊ฐ€ ํ†ต์œผ๋กœ ํ‰์–‘์œผ๋กœ ์˜ฎ๊ฒจ๊ฐ„ ์…ˆ์ด์—ˆ๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ ์ด์ œ ๋ถํ•œ ์—ฐ๊ทน์—์„œ ๋” ์ด์ƒ ๊ทธ๋“ค์˜ ์กด์žฌ๋ฅผ ํ™•์ธํ•˜๊ธฐ ์–ด๋ ค์šด ์ƒํ™ฉ์ด๋‹ค. ๊ทธ๋“ค์€ ๋‚จ์—์„œ๋„ ๋ถ์—์„œ๋„ ์‹œ๊ณ„์—์„œ ์˜์›ํžˆ ์‚ฌ๋ผ์ ธ๋ฒ„๋ฆฐ โ€˜์žƒ์–ด๋ฒ„๋ฆฐ ์„ธ๋Œ€โ€™ ๊ทธ ์ž์ฒด์ด๋‹ค. ๊ทธ๋“ค์˜ ํ”์ ์„ ์ฐพ๋Š” ๊ฒƒ์€ ์ฐจ๋ผ๋ฆฌ ๊ณ ๊ณ ํ•™์˜ ๊ณผ์ œ๊ฐ€ ๋˜์—ˆ๋‹ค. ๊ทธ๋“ค์ด ์—ญ์‚ฌ์˜ ์ €ํŽธ์œผ๋กœ ์‚ฌ๋ผ์ง„ ๊ทธ ์ž๋ฆฌ์— ์˜ค๋Š˜์˜ ๋ถํ•œ ์—ฐ๊ทน์ด ์„ฑ์ฑ„์ฒ˜๋Ÿผ ์œ„์šฉ์„ ์ž๋ž‘ํ•˜๊ณ  ์žˆ๋‹ค. ์˜ค๋Š˜๋‚ ์˜ ๋ถํ•œ ์—ฐ๊ทน์€ ๋ชจ๋‘๊ฐ€ ์ฃผ์ฒด์‚ฌ์‹ค์ฃผ์˜์— ์ž…๊ฐํ•˜์—ฌ ๋งŒ๋“ค์–ด์ง€๋Š” ์ด๋ฅธ๋ฐ” โ€˜<์„ฑํ™ฉ๋‹น>์‹ ํ˜๋ช…์—ฐ๊ทนโ€™ ์ผ์ƒ‰์ด๋‹ค. 1978๋…„ ๊ตญ๋ฆฝ์—ฐ๊ทน๋‹จ์˜ <์„ฑํ™ฉ๋‹น> ๊ณต์—ฐ์˜ ์„ฑ๊ณผ๋ฅผ ๋ณธ๋ณด๊ธฐ๋กœ ์‚ผ์•„ ๋ชจ๋“  ์—ฐ๊ทน์ด โ€˜๋”ฐ๋ผ ๋ฐฐ์šฐ๊ธฐโ€™๋ฅผ ํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋ถํ•œ์˜ ์—ฐ๊ทน๊ณผ ํฌ๊ณก์€ ์ •์ ์—์„œ ๋‚ด๋ ค ์Ÿ๋Š” ๋‹จ์„ฑ์ (ๅ–ฎ่ฒ็š„) ๋ฌธํ™”ํšŒ๋กœ ์•ˆ์— ๊ฐ‡ํ˜€ ์žˆ๋‹ค. ํ˜๋ช…์—ฐ๊ทน <์„ฑํ™ฉ๋‹น>(1978)์˜ ๋ณธ๋ณด๊ธฐ๋Š” ํ˜๋ช…๊ฐ€๊ทน <ํ”ผ๋ฐ”๋‹ค>(1971)์ด๋ฉฐ, ๊ทธ ๊ทผ์ €์—๋Š” 1960๋…„๋Œ€๋ถ€ํ„ฐ ์‹œ์ž‘๋œ ๊น€์ •์ผ ์ฃผ๋„์˜ ๋ฌธํ™”์˜ˆ์ˆ ํ˜๋ช…์ด ๊ฐ€๋กœ๋†“์—ฌ ์žˆ๋‹ค. ๋ถํ•œ ์—ฐ๊ทน์˜ ์ฐฝ์ž‘๊ณผ ํ–ฅ์œ , ๊ทธ ๋ชจ๋“  ๊ณผ์ •์—์„œ ๊น€์ •์ผ์˜ ๊ทธ๋ฆผ์ž์— ๋งž๋‹ฅ๋œจ๋ฆฌ์ง€ ์•Š์„ ์ˆ˜ ์—†๋‹ค. ์ตœ๊ทผ์— ๋ฐฉ๋ฌธํ•œ ์กฐ์„ ์˜ˆ์ˆ ์˜ํ™”์ดฌ์˜์†Œ ์— ์žˆ๋Š” โ€˜๋ฌธํ™”์„ฑํ˜๋ช…์‚ฌ์ ๊ด€โ€™(๊น€์ •์ผ๊ด€)์—๋Š” 1960๋…„๋Œ€ ์ค‘๋ฐ˜๋ถ€ํ„ฐ 2000๋…„๋Œ€๊นŒ์ง€ 40๋…„ ๋™์•ˆ ๊น€์ •์ผ์˜ ๋ฌธํ™”์˜ˆ์ˆ  ๋ถ€๋ฌธ ์ง€๋„๊ฐ€ 11,890๊ฑด์ด๋ฉฐ, ๊ทธ ์ค‘ ๋ฌธํ™”์˜ˆ์ˆ ๊ธฐ๊ด€์„ ์ง์ ‘ ๋ฐฉ๋ฌธํ•˜์—ฌ ์ง€๋„ํ•œ ์ด๋ฅธ๋ฐ” โ€˜ํ˜„์ง€์ง€๋„โ€™๊ฐ€ 1,770๊ฑด์ด๋ผ๋Š” ์•ˆ๋‚ดํŒ์ด ์žˆ์—ˆ๋‹ค. ๋ถํ•œ ์—ฐ๊ทน์ด ๊น€์ •์ผ๊ณผ ์ฃผ์ฒด์‚ฌ์ƒ์ด๋ผ๋Š” ํ‚ค์›Œ๋“œ๋ฅผ ๋– ๋‚˜ ์กด์žฌํ•  ์ˆ˜ ์—†๋‹ค๋Š” ๊ฒƒ์„ ๋‹จ์ ์œผ๋กœ ๋งํ•ด ์ค€๋‹ค.",
"output": "ํ•ด๋ฐฉ๊ณต๊ฐ„์—์„œ ํ™œ๋™ํ•œ ๋Œ€๋‹ค์ˆ˜์˜ ์—ฐ๊ทน์ธ์€ ๋‚ฉยท์›”๋ถ์„ ํ†ตํ•ด ๋ถํ•œ ์—ฐ๊ทน๊ณ„์— ์ž๋ฆฌ ์žก์•˜์ง€๋งŒ ์ด์ œ ๋ถํ•œ ์—ฐ๊ทน์—์„œ ๊ทธ๋“ค์„ ๋ณด๊ธฐ ์–ด๋ ต๋‹ค. ๊ทธ ์ž๋ฆฌ์— ์ง€๊ธˆ ๋ถํ•œ ์—ฐ๊ทน์ด ์ž๋ฆฌ ์žก์•˜๋‹ค. 1978๋…„ ๊ตญ๋ฆฝ์—ฐ๊ทน๋‹จ <์„ฑํ™ฉ๋‹น> ๊ณต์—ฐ์„ ํ† ๋Œ€๋กœ ๋ถํ•œ์˜ ๋ชจ๋“  ์—ฐ๊ทน์€ โ€˜<์„ฑํ™ฉ๋‹น>์‹ ํ˜๋ช…์—ฐ๊ทนโ€™ ์ผ์ƒ‰์ด๋‹ค. ๋ถํ•œ ์—ฐ๊ทน๊ณผ ํฌ๊ณก์€ ๋‹จ์„ฑ์  ๋ฌธํ™”ํšŒ๋กœ์— ๋ฌถ์—ฌ์žˆ๊ณ , ๊ทธ ์‹œ์ž‘์€ ๊น€์ •์ผ ์ฃผ๋„ ๋ฌธํ™”์˜ˆ์ˆ ํ˜๋ช…์ด ์žˆ๊ณ , ๋ถํ•œ ์—ฐ๊ทน์˜ ์ฐฝ์ž‘๊ณผ ํ–ฅ์œ  ๋“ฑ ๊น€์ •์ผ ํ”์ ์ด ์žˆ๋‹ค. ๊น€์ •์ผ์˜ ๋ฌธํ™”์˜ˆ์ˆ  ๋ถ€๋ฌธ ์ง€๋„ ๊ธฐ๋ก์€ ๋ถํ•œ ์—ฐ๊ทน์ด ๊น€์ •์ผ๊ณผ ์ฃผ์ฒด์‚ฌ์ƒ์„ ๋– ๋‚  ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค."
}
```
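For quick inspection, the collection can be pulled directly with the `datasets` library. The sketch below is a minimal example under stated assumptions: it assumes the default configuration and a `train` split; the field names (`instruction`, `input`, `output`) are taken from the example entry above.
```python
from datasets import load_dataset

# Load the full summarization collection (all domains, not only technical/scientific).
# Assumption: default configuration with a "train" split.
dataset = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train")

# Each record carries an instruction, the source text ("input"), and its reference summary ("output").
sample = dataset[0]
print(sample["instruction"])
print(sample["output"])
```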
## Model
This model is built on the google/gemma-2-2b-it base and fine-tuned with BitsAndBytes 4-bit quantization for memory efficiency, LoRA for parameter-efficient adaptation, and TRL's SFTTrainer. The fine-tuned weights are the ones hosted in this Hugging Face repository.
### Highlight
1. **LoRA Configuration for Model Efficiency**: The model is fine-tuned with Low-Rank Adaptation (LoRA) using r=6, lora_alpha=8, and a dropout of 0.05, which adapts the model efficiently without updating all of its weights.
2. **Quantization for Memory Optimization**: The BitsAndBytesConfig loads the model in 4-bit precision with nf4 quantization, cutting memory usage so the model can be fine-tuned within limited GPU memory.
3. **Fine-Tuning Parameters**: Fine-tuning uses SFTTrainer with a per-device batch size of 1, gradient_accumulation_steps=4, and max_steps=3000, together with the paged 8-bit AdamW optimizer (paged_adamw_8bit) for better behavior in a memory-constrained environment. A configuration sketch covering these settings follows below.
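To make these highlights concrete, here is a minimal training-setup sketch under stated assumptions. It assumes recent releases of `transformers`, `peft`, `trl`, `bitsandbytes`, and `datasets`; the LoRA `target_modules`, the bfloat16 compute dtype, the plain prompt format in `to_text`, and the output directory are illustrative choices rather than documented settings. Only r, lora_alpha, dropout, the 4-bit nf4 quantization, the batch size, gradient accumulation, max_steps, and the optimizer come from the card.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "google/gemma-2-2b-it"

# 4-bit nf4 quantization (from the card); the compute dtype is an assumption.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# LoRA settings from the card; target_modules are a typical choice, not documented here.
lora_config = LoraConfig(
    r=6,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Flatten each record into a single training string (assumed prompt format).
def to_text(example):
    return {"text": f"{example['instruction']}\n\n{example['input']}\n\n{example['output']}"}

dataset = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train").map(to_text)

# Trainer settings from the card: batch size 1, gradient accumulation 4, 3000 steps, paged 8-bit AdamW.
training_args = SFTConfig(
    output_dir="./gemma-2b-it-sum-ko-science",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=3000,
    optim="paged_adamw_8bit",
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
```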
## Inference Example Code
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

FINETUNE_MODEL = "./gemma-2b-it-sum-ko-science"

# Load the fine-tuned model and its tokenizer onto GPU 0
finetune_model = AutoModelForCausalLM.from_pretrained(FINETUNE_MODEL, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(FINETUNE_MODEL)

# Text-generation pipeline for the fine-tuned model
pipe_finetuned = pipeline("text-generation", model=finetune_model, tokenizer=tokenizer, max_new_tokens=512)

# Document to summarize (a Korean popular-science passage); swap in your own text here
doc = r"๊ทธ๋ ‡๊ฒŒ ๋“ฑ์žฅํ•œ ๊ฒƒ์ด ์›์ž์‹œ๊ณ„๋‹ค. ์›์ž๊ฐ€ 1์ดˆ ๋™์•ˆ ์›€์ง์ด๋Š” ํšŸ์ˆ˜์ธ โ€˜๊ณ ์œ ์ง„๋™์ˆ˜โ€™๋ฅผ ์ด์šฉํ•ด ์ •ํ™•ํ•œ 1์ดˆ๋ฅผ ์ธก์ •ํ•œ๋‹ค. ์›์ž ์†์— ์žˆ๋Š” ์ „์ž๋“ค์€ ํŠน์ • ์—๋„ˆ์ง€ ์ƒํƒœ๋กœ ์žˆ๋‹ค. ์ด ์ƒํƒœ์—์„œ ๋‹ค๋ฅธ ์ƒํƒœ๋กœ ๋ณ€ํ™”ํ•˜๋ ค๋ฉด ์—๋„ˆ์ง€๋ฅผ ๋‘ ์ƒํƒœ์˜ ์ฐจ์ด๋งŒํผ ํก์ˆ˜ํ•˜๊ฑฐ๋‚˜ ๋ฐฉ์ถœํ•ด์•ผ ํ•œ๋‹ค. ์ „์ž๊ฐ€ ์—๋„ˆ์ง€๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด(๋‹ค๋ฅธ ์—๋„ˆ์ง€ ์ƒํƒœ๋กœ ๋ณ€ํ•˜๊ธฐ ์œ„ํ•ด) ์ „์ž๊ธฐํŒŒ๋ฅผ ํก์ˆ˜ํ•  ๋•Œ ์ง„๋™์ด ๋ฐœ์ƒํ•˜๋Š”๋ฐ, ์ด๊ฒƒ์ด ๋ฐ”๋กœ ๊ณ ์œ ์ง„๋™์ˆ˜๋‹ค."
# Alternative sample document:
# doc = r"์ฒœ๋…„๋งŒ๋…„ ์ง€๋‚˜๋„ ๋ณ€ํ•˜์ง€ ์•Š๋Š” ๊ณณ์ด ์žˆ์„๊นŒ. ๊ณผํ•™์ž๋“ค์€ ์ฒœ๋…„๋งŒ๋…„์„ ๋„˜์–ด ์ˆ˜์–ต ๋…„์ด ์ง€๋‚˜๋„ 1์ดˆ์˜ ์˜ค์ฐจ๋„ ์—†์ด ์ผ์ •ํ•˜๊ฒŒ ํ๋ฅด๋Š” ์‹œ๊ณ„๋ฅผ ๊ฐœ๋ฐœํ•˜๊ณ  ์žˆ๋‹ค. ์ง€๊ตฌ๊ฐ€ ํ•œ ๋ฐ”ํ€ด ์ž์ „ํ•˜๋Š” ์‹œ๊ฐ„์„ 1์ผ์ด๋ผ๊ณ  ํ•œ๋‹ค. ์ด๊ฒƒ์„ ์ชผ๊ฐœ ์‹œ๊ฐ„๊ณผ ๋ถ„, ์ดˆ๋ฅผ ์ •ํ–ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ง€๊ตฌ ์ž์ „ ์†๋„๋Š” ์‹œ๊ฐ„์— ๋”ฐ๋ผ ๋ณ€ํ•˜๋ฏ€๋กœ ์‹œ๊ฐ„์— ์˜ค์ฐจ๊ฐ€ ์ƒ๊ฒผ๋‹ค. ์ƒˆ๋กœ์šด ์‹œ๊ฐ„์˜ ์ •์˜๊ฐ€ ํ•„์š”ํ•ด์ง„ ์ด์œ ๋‹ค."

# Build a chat-formatted summarization prompt ("Please summarize the following text")
messages = [
    {
        "role": "user",
        "content": "๋‹ค์Œ ๊ธ€์„ ์š”์•ฝํ•ด์ฃผ์„ธ์š”:\n\n{}".format(doc)
    }
]
prompt = pipe_finetuned.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate and print only the newly generated summary (strip the prompt prefix)
outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    add_special_tokens=True
)
print(outputs[0]["generated_text"][len(prompt):])
```