---
license: gemma
datasets:
  - wisenut-nlp-team/llama_ko_smr
base_model:
  - google/gemma-2-2b-it
tags:
  - summary
  - finetuned
---

# Fine-Tuning the Gemma LLM for a Technical Summarization Chatbot

The Gemma LLM is being fine-tuned specifically for use in a technical summarization chatbot. The chatbot leverages the model's ability to understand and summarize complex technical content, making it easier for users to engage with technical material. Fine-tuning aims to improve how accurately the model captures the essential points of dense technical text and how concise and user-friendly its summaries are. The end goal is a better user experience wherever quick, reliable technical insights are needed.

## Table of Contents

  1. Dataset
  2. Model

## Dataset

The dataset used for this project is the wisenut-nlp-team/llama_ko_smr collection from the Hugging Face Hub. It contains several types of summarization data: document summaries, book summaries, research paper summaries, TV content script summaries, Korean dialogue summaries, and technical/scientific summaries. Each entry consists of an instruction, a main text, and its corresponding summary.

Instead of limiting training to the technical and scientific subset, I used the entire dataset to expose the model to a wider variety of content types. This keeps the model well-rounded and improves its summarization performance across different domains.

Here is an example entry from the dataset:

```json
{
  "instruction": "์ด ๊ธ€์˜ ์ฃผ์š” ๋‚ด์šฉ์„ ์งง๊ฒŒ ์„ค๋ช…ํ•ด ์ฃผ์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ?",
  "input": "๋ถํ•œ ์—ฐ๊ทน์— ๋Œ€ํ•œ ๋‚˜์˜ ํƒ๊ตฌ๋Š” ํ•ด๋ฐฉ๊ณต๊ฐ„์— ๋ถ์œผ๋กœ ์‚ฌ๋ผ์ ธ ๊ฐ„ ์ˆ˜๋งŽ์€ ์—ฐ๊ทน์ธ๋“ค์˜ ํ–‰์ ์„ ์ฐพ์•„๋ณด๊ณ ์ž ํ•˜๋Š” ๋‹จ์ˆœํ•œ ํ˜ธ๊ธฐ์‹ฌ์—์„œ ์‹œ์ž‘๋˜์—ˆ๋‹ค. ํ•ด๋ฐฉ๊ณต๊ฐ„์—์„œ ํ™œ๋™ํ•˜๋˜ ์—ฐ๊ทน์ธ์˜ ๋Œ€๋‹ค์ˆ˜๊ฐ€ ๋‚ฉโ€ค์›”๋ถ์˜ ๊ณผ์ •์„ ๊ฑฐ์ณ ๋ถํ•œ ์—ฐ๊ทน๊ณ„์— ์ž๋ฆฌ๋ฅผ ์žก์•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ ์•ˆ์—๋Š” ๊ทน์ž‘๊ฐ€ ์†ก์˜, ํ•จ์„ธ๋•, ๋ฐ•์˜ํ˜ธ, ์กฐ์˜์ถœ, ์—ฐ์ถœ๊ฐ€ ์ด์„œํ–ฅ, ์•ˆ์˜์ผ, ์‹ ๊ณ ์†ก, ๋ฌด๋Œ€๋ฏธ์ˆ ๊ฐ€ ๊น€์ผ์˜, ๊ฐ•ํ˜ธ, ๋ฐฐ์šฐ ํ™ฉ์ฒ , ๊น€์„ ์˜, ๋ฌธ์˜ˆ๋ด‰, ๋งŒ๋‹ด๊ฐ€ ์‹ ๋ถˆ์ถœ ๋“ฑ ๊ธฐ๋ผ์„ฑ ๊ฐ™์€ ๋ฉค๋ฒ„๋“ค์ด ํฌํ•จ๋˜์–ด ์žˆ์—ˆ๋‹ค. ๊ทธ ์ˆซ์ž๋กœ๋งŒ ๋ณธ๋‹ค๋ฉด ์ผ์ œ๊ฐ•์ ๊ธฐ ์„œ์šธ์˜ ์—ฐ๊ทน๊ณ„๊ฐ€ ํ†ต์œผ๋กœ ํ‰์–‘์œผ๋กœ ์˜ฎ๊ฒจ๊ฐ„ ์…ˆ์ด์—ˆ๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ ์ด์ œ ๋ถํ•œ ์—ฐ๊ทน์—์„œ ๋” ์ด์ƒ ๊ทธ๋“ค์˜ ์กด์žฌ๋ฅผ ํ™•์ธํ•˜๊ธฐ ์–ด๋ ค์šด ์ƒํ™ฉ์ด๋‹ค. ๊ทธ๋“ค์€ ๋‚จ์—์„œ๋„ ๋ถ์—์„œ๋„ ์‹œ๊ณ„์—์„œ ์˜์›ํžˆ ์‚ฌ๋ผ์ ธ๋ฒ„๋ฆฐ โ€˜์žƒ์–ด๋ฒ„๋ฆฐ ์„ธ๋Œ€โ€™ ๊ทธ ์ž์ฒด์ด๋‹ค. ๊ทธ๋“ค์˜ ํ”์ ์„ ์ฐพ๋Š” ๊ฒƒ์€ ์ฐจ๋ผ๋ฆฌ ๊ณ ๊ณ ํ•™์˜ ๊ณผ์ œ๊ฐ€ ๋˜์—ˆ๋‹ค. ๊ทธ๋“ค์ด ์—ญ์‚ฌ์˜ ์ €ํŽธ์œผ๋กœ ์‚ฌ๋ผ์ง„ ๊ทธ ์ž๋ฆฌ์— ์˜ค๋Š˜์˜ ๋ถํ•œ ์—ฐ๊ทน์ด ์„ฑ์ฑ„์ฒ˜๋Ÿผ ์œ„์šฉ์„ ์ž๋ž‘ํ•˜๊ณ  ์žˆ๋‹ค. ์˜ค๋Š˜๋‚ ์˜ ๋ถํ•œ ์—ฐ๊ทน์€ ๋ชจ๋‘๊ฐ€ ์ฃผ์ฒด์‚ฌ์‹ค์ฃผ์˜์— ์ž…๊ฐํ•˜์—ฌ ๋งŒ๋“ค์–ด์ง€๋Š” ์ด๋ฅธ๋ฐ” โ€˜<์„ฑํ™ฉ๋‹น>์‹ ํ˜๋ช…์—ฐ๊ทนโ€™ ์ผ์ƒ‰์ด๋‹ค. 1978๋…„ ๊ตญ๋ฆฝ์—ฐ๊ทน๋‹จ์˜ <์„ฑํ™ฉ๋‹น> ๊ณต์—ฐ์˜ ์„ฑ๊ณผ๋ฅผ ๋ณธ๋ณด๊ธฐ๋กœ ์‚ผ์•„ ๋ชจ๋“  ์—ฐ๊ทน์ด โ€˜๋”ฐ๋ผ ๋ฐฐ์šฐ๊ธฐโ€™๋ฅผ ํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋ถํ•œ์˜ ์—ฐ๊ทน๊ณผ ํฌ๊ณก์€ ์ •์ ์—์„œ ๋‚ด๋ ค ์Ÿ๋Š” ๋‹จ์„ฑ์ (ๅ–ฎ่ฒ็š„) ๋ฌธํ™”ํšŒ๋กœ ์•ˆ์— ๊ฐ‡ํ˜€ ์žˆ๋‹ค. ํ˜๋ช…์—ฐ๊ทน <์„ฑํ™ฉ๋‹น>(1978)์˜ ๋ณธ๋ณด๊ธฐ๋Š” ํ˜๋ช…๊ฐ€๊ทน <ํ”ผ๋ฐ”๋‹ค>(1971)์ด๋ฉฐ, ๊ทธ ๊ทผ์ €์—๋Š” 1960๋…„๋Œ€๋ถ€ํ„ฐ ์‹œ์ž‘๋œ ๊น€์ •์ผ ์ฃผ๋„์˜ ๋ฌธํ™”์˜ˆ์ˆ ํ˜๋ช…์ด ๊ฐ€๋กœ๋†“์—ฌ ์žˆ๋‹ค. ๋ถํ•œ ์—ฐ๊ทน์˜ ์ฐฝ์ž‘๊ณผ ํ–ฅ์œ , ๊ทธ ๋ชจ๋“  ๊ณผ์ •์—์„œ ๊น€์ •์ผ์˜ ๊ทธ๋ฆผ์ž์— ๋งž๋‹ฅ๋œจ๋ฆฌ์ง€ ์•Š์„ ์ˆ˜ ์—†๋‹ค. ์ตœ๊ทผ์— ๋ฐฉ๋ฌธํ•œ ์กฐ์„ ์˜ˆ์ˆ ์˜ํ™”์ดฌ์˜์†Œ ์— ์žˆ๋Š” โ€˜๋ฌธํ™”์„ฑํ˜๋ช…์‚ฌ์ ๊ด€โ€™(๊น€์ •์ผ๊ด€)์—๋Š” 1960๋…„๋Œ€ ์ค‘๋ฐ˜๋ถ€ํ„ฐ 2000๋…„๋Œ€๊นŒ์ง€ 40๋…„ ๋™์•ˆ ๊น€์ •์ผ์˜ ๋ฌธํ™”์˜ˆ์ˆ  ๋ถ€๋ฌธ ์ง€๋„๊ฐ€ 11,890๊ฑด์ด๋ฉฐ, ๊ทธ ์ค‘ ๋ฌธํ™”์˜ˆ์ˆ ๊ธฐ๊ด€์„ ์ง์ ‘ ๋ฐฉ๋ฌธํ•˜์—ฌ ์ง€๋„ํ•œ ์ด๋ฅธ๋ฐ” โ€˜ํ˜„์ง€์ง€๋„โ€™๊ฐ€ 1,770๊ฑด์ด๋ผ๋Š” ์•ˆ๋‚ดํŒ์ด ์žˆ์—ˆ๋‹ค. ๋ถํ•œ ์—ฐ๊ทน์ด ๊น€์ •์ผ๊ณผ ์ฃผ์ฒด์‚ฌ์ƒ์ด๋ผ๋Š” ํ‚ค์›Œ๋“œ๋ฅผ ๋– ๋‚˜ ์กด์žฌํ•  ์ˆ˜ ์—†๋‹ค๋Š” ๊ฒƒ์„ ๋‹จ์ ์œผ๋กœ ๋งํ•ด ์ค€๋‹ค.",
  "output": "ํ•ด๋ฐฉ๊ณต๊ฐ„์—์„œ ํ™œ๋™ํ•œ ๋Œ€๋‹ค์ˆ˜์˜ ์—ฐ๊ทน์ธ์€ ๋‚ฉยท์›”๋ถ์„ ํ†ตํ•ด ๋ถํ•œ ์—ฐ๊ทน๊ณ„์— ์ž๋ฆฌ ์žก์•˜์ง€๋งŒ ์ด์ œ ๋ถํ•œ ์—ฐ๊ทน์—์„œ ๊ทธ๋“ค์„ ๋ณด๊ธฐ ์–ด๋ ต๋‹ค. ๊ทธ ์ž๋ฆฌ์— ์ง€๊ธˆ ๋ถํ•œ ์—ฐ๊ทน์ด ์ž๋ฆฌ ์žก์•˜๋‹ค. 1978๋…„ ๊ตญ๋ฆฝ์—ฐ๊ทน๋‹จ <์„ฑํ™ฉ๋‹น> ๊ณต์—ฐ์„ ํ† ๋Œ€๋กœ ๋ถํ•œ์˜ ๋ชจ๋“  ์—ฐ๊ทน์€ โ€˜<์„ฑํ™ฉ๋‹น>์‹ ํ˜๋ช…์—ฐ๊ทนโ€™ ์ผ์ƒ‰์ด๋‹ค. ๋ถํ•œ ์—ฐ๊ทน๊ณผ ํฌ๊ณก์€ ๋‹จ์„ฑ์  ๋ฌธํ™”ํšŒ๋กœ์— ๋ฌถ์—ฌ์žˆ๊ณ , ๊ทธ ์‹œ์ž‘์€ ๊น€์ •์ผ ์ฃผ๋„ ๋ฌธํ™”์˜ˆ์ˆ ํ˜๋ช…์ด ์žˆ๊ณ , ๋ถํ•œ ์—ฐ๊ทน์˜ ์ฐฝ์ž‘๊ณผ ํ–ฅ์œ  ๋“ฑ ๊น€์ •์ผ ํ”์ ์ด ์žˆ๋‹ค. ๊น€์ •์ผ์˜ ๋ฌธํ™”์˜ˆ์ˆ  ๋ถ€๋ฌธ ์ง€๋„ ๊ธฐ๋ก์€ ๋ถํ•œ ์—ฐ๊ทน์ด ๊น€์ •์ผ๊ณผ ์ฃผ์ฒด์‚ฌ์ƒ์„ ๋– ๋‚  ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค."
}
```
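
To inspect the data yourself, the collection can be loaded with the standard `datasets` library. Below is a minimal sketch; the `train` split name is an assumption, since the card does not list the available splits.

```python
from datasets import load_dataset

# Load the summarization collection from the Hugging Face Hub
# (the "train" split name is an assumption).
dataset = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train")

# Each entry has "instruction", "input", and "output" fields.
print(dataset[0]["instruction"])
```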

## Model

This model is built on the gemma-2-2b-it base and fine-tuned with BitsAndBytes 4-bit quantization for memory efficiency, LoRA for parameter-efficient adaptation, and the SFTTrainer framework from trl. The fine-tuned version is available on Hugging Face.

## Highlights

  1. **LoRA Configuration for Model Efficiency**: The model is fine-tuned with Low-Rank Adaptation (LoRA) using r=6, lora_alpha=8, and a dropout of 0.05, which adapts the model efficiently without updating every layer (see the combined sketch after this list).

  2. **Quantization for Memory Optimization**: BitsAndBytesConfig loads the model in 4-bit precision with nf4 quantization, cutting memory usage enough to make fine-tuning on larger datasets practical.

  3. **Fine-Tuning Parameters**: Training runs through SFTTrainer with a per-device batch size of 1, gradient_accumulation_steps=4, and max_steps=3000, using the paged 8-bit AdamW optimizer (paged_adamw_8bit) for stable performance in a memory-constrained environment.
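
The snippet below is a minimal sketch of how these three pieces fit together in a training script, based on the values listed above. The LoRA target_modules, the compute dtype, the prompt template in formatting_func, and the exact SFTTrainer/SFTConfig signature (which varies across trl versions) are assumptions not specified in this card.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "google/gemma-2-2b-it"

# Highlight 2: load the base model in 4-bit NF4 to cut memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

# Highlight 1: LoRA adapter settings; target_modules are an assumption.
lora_config = LoraConfig(
    r=6,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train")

def formatting_func(example):
    # Hypothetical prompt template joining the dataset fields.
    return f"{example['instruction']}\n\n{example['input']}\n\n{example['output']}"

# Highlight 3: supervised fine-tuning with the parameters listed above.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    formatting_func=formatting_func,
    args=SFTConfig(
        output_dir="./gemma-2b-it-sum-ko-science",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=3000,
        optim="paged_adamw_8bit",
    ),
)
trainer.train()
```

Quantizing the base weights to 4-bit and training only the low-rank adapters is what makes fine-tuning a 2B-parameter model feasible in a memory-constrained environment.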

## Inference Example Code

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

BASE_MODEL = "google/gemma-2-2b-it"  # base model, for reference; not loaded here
FINETUNE_MODEL = "./gemma-2b-it-sum-ko-science"

# Load the fine-tuned model and its tokenizer.
finetune_model = AutoModelForCausalLM.from_pretrained(FINETUNE_MODEL, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(FINETUNE_MODEL)

pipe_finetuned = pipeline(
    "text-generation", model=finetune_model, tokenizer=tokenizer, max_new_tokens=512
)

# Document to summarize (swap in the commented-out doc to try another passage).
doc = r"그렇게 등장한 것이 원자시계다. 원자가 1초 동안 움직이는 횟수인 ‘고유진동수’를 이용해 정확한 1초를 측정한다. 원자 속에 있는 전자들은 특정 에너지 상태로 있다. 이 상태에서 다른 상태로 변화하려면 에너지를 두 상태의 차이만큼 흡수하거나 방출해야 한다. 전자가 에너지를 얻기 위해(다른 에너지 상태로 변하기 위해) 전자기파를 흡수할 때 진동이 발생하는데, 이것이 바로 고유진동수다."
#doc = r"천년만년 지나도 변하지 않는 곳이 있을까. 과학자들은 천년만년을 넘어 수억 년이 지나도 1초의 오차도 없이 일정하게 흐르는 시계를 개발하고 있다. 지구가 한 바퀴 자전하는 시간을 1일이라고 한다. 이것을 쪼개 시간과 분, 초를 정했다. 하지만 지구 자전 속도는 시간에 따라 변하므로 시간에 오차가 생겼다. 새로운 시간의 정의가 필요해진 이유다."

# Wrap the document in a chat-style summarization request.
messages = [
    {
        "role": "user",
        "content": "다음 글을 요약해주세요:\n\n{}".format(doc)
    }
]
prompt = pipe_finetuned.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    add_special_tokens=True,
)
# Strip the prompt and print only the generated summary.
print(outputs[0]["generated_text"][len(prompt):])
```
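
Sampling uses a low temperature (0.2) together with top-k/top-p filtering, a setting that keeps summaries focused on the source text while still allowing slight variation between runs.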