---
license: gemma
datasets:
  - wisenut-nlp-team/llama_ko_smr
base_model:
  - google/gemma-2-2b-it
tags:
  - summary
  - finetuned
---

# Fine-Tuning the Gemma LLM for a Technical Summarization Chatbot

The Gemma LLM is being fine-tuned specifically for use in a technical summarization chatbot. The chatbot leverages the model's ability to understand and summarize complex technical content, making it easier for users to engage with technical material. Fine-tuning aims to improve how accurately the model captures the essential points of dense technical text and how concise and user-friendly its summaries are. The end goal is a better user experience wherever quick, reliable technical insights are needed.

## Table of Contents

  1. Dataset
  2. Model

## Dataset

The dataset used for this project is the wisenut-nlp-team/llama_ko_smr collection from the Hugging Face Hub. It contains several types of summarization data: document summaries, book summaries, research paper summaries, TV content script summaries, Korean dialogue summaries, and technical/scientific summaries. Each entry consists of an instruction, a main text, and its corresponding summary.

Instead of limiting training to the technical and scientific subset, I used the entire dataset to expose the model to a wider variety of content types. This keeps the model well-rounded and improves its summarization performance across different domains.

Here is an example entry from the dataset:

```json
{
  "instruction": "์ด ๊ธ€์˜ ์ฃผ์š” ๋‚ด์šฉ์„ ์งง๊ฒŒ ์„ค๋ช…ํ•ด ์ฃผ์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ?",
  "input": "๋ถํ•œ ์—ฐ๊ทน์— ๋Œ€ํ•œ ๋‚˜์˜ ํƒ๊ตฌ๋Š” ํ•ด๋ฐฉ๊ณต๊ฐ„์— ๋ถ์œผ๋กœ ์‚ฌ๋ผ์ ธ ๊ฐ„ ์ˆ˜๋งŽ์€ ์—ฐ๊ทน์ธ๋“ค์˜ ํ–‰์ ์„ ์ฐพ์•„๋ณด๊ณ ์ž ํ•˜๋Š” ๋‹จ์ˆœํ•œ ํ˜ธ๊ธฐ์‹ฌ์—์„œ ์‹œ์ž‘๋˜์—ˆ๋‹ค. ํ•ด๋ฐฉ๊ณต๊ฐ„์—์„œ ํ™œ๋™ํ•˜๋˜ ์—ฐ๊ทน์ธ์˜ ๋Œ€๋‹ค์ˆ˜๊ฐ€ ๋‚ฉโ€ค์›”๋ถ์˜ ๊ณผ์ •์„ ๊ฑฐ์ณ ๋ถํ•œ ์—ฐ๊ทน๊ณ„์— ์ž๋ฆฌ๋ฅผ ์žก์•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ ์•ˆ์—๋Š” ๊ทน์ž‘๊ฐ€ ์†ก์˜, ํ•จ์„ธ๋•, ๋ฐ•์˜ํ˜ธ, ์กฐ์˜์ถœ, ์—ฐ์ถœ๊ฐ€ ์ด์„œํ–ฅ, ์•ˆ์˜์ผ, ์‹ ๊ณ ์†ก, ๋ฌด๋Œ€๋ฏธ์ˆ ๊ฐ€ ๊น€์ผ์˜, ๊ฐ•ํ˜ธ, ๋ฐฐ์šฐ ํ™ฉ์ฒ , ๊น€์„ ์˜, ๋ฌธ์˜ˆ๋ด‰, ๋งŒ๋‹ด๊ฐ€ ์‹ ๋ถˆ์ถœ ๋“ฑ ๊ธฐ๋ผ์„ฑ ๊ฐ™์€ ๋ฉค๋ฒ„๋“ค์ด ํฌํ•จ๋˜์–ด ์žˆ์—ˆ๋‹ค. ๊ทธ ์ˆซ์ž๋กœ๋งŒ ๋ณธ๋‹ค๋ฉด ์ผ์ œ๊ฐ•์ ๊ธฐ ์„œ์šธ์˜ ์—ฐ๊ทน๊ณ„๊ฐ€ ํ†ต์œผ๋กœ ํ‰์–‘์œผ๋กœ ์˜ฎ๊ฒจ๊ฐ„ ์…ˆ์ด์—ˆ๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ ์ด์ œ ๋ถํ•œ ์—ฐ๊ทน์—์„œ ๋” ์ด์ƒ ๊ทธ๋“ค์˜ ์กด์žฌ๋ฅผ ํ™•์ธํ•˜๊ธฐ ์–ด๋ ค์šด ์ƒํ™ฉ์ด๋‹ค. ๊ทธ๋“ค์€ ๋‚จ์—์„œ๋„ ๋ถ์—์„œ๋„ ์‹œ๊ณ„์—์„œ ์˜์›ํžˆ ์‚ฌ๋ผ์ ธ๋ฒ„๋ฆฐ โ€˜์žƒ์–ด๋ฒ„๋ฆฐ ์„ธ๋Œ€โ€™ ๊ทธ ์ž์ฒด์ด๋‹ค. ๊ทธ๋“ค์˜ ํ”์ ์„ ์ฐพ๋Š” ๊ฒƒ์€ ์ฐจ๋ผ๋ฆฌ ๊ณ ๊ณ ํ•™์˜ ๊ณผ์ œ๊ฐ€ ๋˜์—ˆ๋‹ค. ๊ทธ๋“ค์ด ์—ญ์‚ฌ์˜ ์ €ํŽธ์œผ๋กœ ์‚ฌ๋ผ์ง„ ๊ทธ ์ž๋ฆฌ์— ์˜ค๋Š˜์˜ ๋ถํ•œ ์—ฐ๊ทน์ด ์„ฑ์ฑ„์ฒ˜๋Ÿผ ์œ„์šฉ์„ ์ž๋ž‘ํ•˜๊ณ  ์žˆ๋‹ค. ์˜ค๋Š˜๋‚ ์˜ ๋ถํ•œ ์—ฐ๊ทน์€ ๋ชจ๋‘๊ฐ€ ์ฃผ์ฒด์‚ฌ์‹ค์ฃผ์˜์— ์ž…๊ฐํ•˜์—ฌ ๋งŒ๋“ค์–ด์ง€๋Š” ์ด๋ฅธ๋ฐ” โ€˜<์„ฑํ™ฉ๋‹น>์‹ ํ˜๋ช…์—ฐ๊ทนโ€™ ์ผ์ƒ‰์ด๋‹ค. 1978๋…„ ๊ตญ๋ฆฝ์—ฐ๊ทน๋‹จ์˜ <์„ฑํ™ฉ๋‹น> ๊ณต์—ฐ์˜ ์„ฑ๊ณผ๋ฅผ ๋ณธ๋ณด๊ธฐ๋กœ ์‚ผ์•„ ๋ชจ๋“  ์—ฐ๊ทน์ด โ€˜๋”ฐ๋ผ ๋ฐฐ์šฐ๊ธฐโ€™๋ฅผ ํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋ถํ•œ์˜ ์—ฐ๊ทน๊ณผ ํฌ๊ณก์€ ์ •์ ์—์„œ ๋‚ด๋ ค ์Ÿ๋Š” ๋‹จ์„ฑ์ (ๅ–ฎ่ฒ็š„) ๋ฌธํ™”ํšŒ๋กœ ์•ˆ์— ๊ฐ‡ํ˜€ ์žˆ๋‹ค. ํ˜๋ช…์—ฐ๊ทน <์„ฑํ™ฉ๋‹น>(1978)์˜ ๋ณธ๋ณด๊ธฐ๋Š” ํ˜๋ช…๊ฐ€๊ทน <ํ”ผ๋ฐ”๋‹ค>(1971)์ด๋ฉฐ, ๊ทธ ๊ทผ์ €์—๋Š” 1960๋…„๋Œ€๋ถ€ํ„ฐ ์‹œ์ž‘๋œ ๊น€์ •์ผ ์ฃผ๋„์˜ ๋ฌธํ™”์˜ˆ์ˆ ํ˜๋ช…์ด ๊ฐ€๋กœ๋†“์—ฌ ์žˆ๋‹ค. ๋ถํ•œ ์—ฐ๊ทน์˜ ์ฐฝ์ž‘๊ณผ ํ–ฅ์œ , ๊ทธ ๋ชจ๋“  ๊ณผ์ •์—์„œ ๊น€์ •์ผ์˜ ๊ทธ๋ฆผ์ž์— ๋งž๋‹ฅ๋œจ๋ฆฌ์ง€ ์•Š์„ ์ˆ˜ ์—†๋‹ค. ์ตœ๊ทผ์— ๋ฐฉ๋ฌธํ•œ ์กฐ์„ ์˜ˆ์ˆ ์˜ํ™”์ดฌ์˜์†Œ ์— ์žˆ๋Š” โ€˜๋ฌธํ™”์„ฑํ˜๋ช…์‚ฌ์ ๊ด€โ€™(๊น€์ •์ผ๊ด€)์—๋Š” 1960๋…„๋Œ€ ์ค‘๋ฐ˜๋ถ€ํ„ฐ 2000๋…„๋Œ€๊นŒ์ง€ 40๋…„ ๋™์•ˆ ๊น€์ •์ผ์˜ ๋ฌธํ™”์˜ˆ์ˆ  ๋ถ€๋ฌธ ์ง€๋„๊ฐ€ 11,890๊ฑด์ด๋ฉฐ, ๊ทธ ์ค‘ ๋ฌธํ™”์˜ˆ์ˆ ๊ธฐ๊ด€์„ ์ง์ ‘ ๋ฐฉ๋ฌธํ•˜์—ฌ ์ง€๋„ํ•œ ์ด๋ฅธ๋ฐ” โ€˜ํ˜„์ง€์ง€๋„โ€™๊ฐ€ 1,770๊ฑด์ด๋ผ๋Š” ์•ˆ๋‚ดํŒ์ด ์žˆ์—ˆ๋‹ค. ๋ถํ•œ ์—ฐ๊ทน์ด ๊น€์ •์ผ๊ณผ ์ฃผ์ฒด์‚ฌ์ƒ์ด๋ผ๋Š” ํ‚ค์›Œ๋“œ๋ฅผ ๋– ๋‚˜ ์กด์žฌํ•  ์ˆ˜ ์—†๋‹ค๋Š” ๊ฒƒ์„ ๋‹จ์ ์œผ๋กœ ๋งํ•ด ์ค€๋‹ค.",
  "output": "ํ•ด๋ฐฉ๊ณต๊ฐ„์—์„œ ํ™œ๋™ํ•œ ๋Œ€๋‹ค์ˆ˜์˜ ์—ฐ๊ทน์ธ์€ ๋‚ฉยท์›”๋ถ์„ ํ†ตํ•ด ๋ถํ•œ ์—ฐ๊ทน๊ณ„์— ์ž๋ฆฌ ์žก์•˜์ง€๋งŒ ์ด์ œ ๋ถํ•œ ์—ฐ๊ทน์—์„œ ๊ทธ๋“ค์„ ๋ณด๊ธฐ ์–ด๋ ต๋‹ค. ๊ทธ ์ž๋ฆฌ์— ์ง€๊ธˆ ๋ถํ•œ ์—ฐ๊ทน์ด ์ž๋ฆฌ ์žก์•˜๋‹ค. 1978๋…„ ๊ตญ๋ฆฝ์—ฐ๊ทน๋‹จ <์„ฑํ™ฉ๋‹น> ๊ณต์—ฐ์„ ํ† ๋Œ€๋กœ ๋ถํ•œ์˜ ๋ชจ๋“  ์—ฐ๊ทน์€ โ€˜<์„ฑํ™ฉ๋‹น>์‹ ํ˜๋ช…์—ฐ๊ทนโ€™ ์ผ์ƒ‰์ด๋‹ค. ๋ถํ•œ ์—ฐ๊ทน๊ณผ ํฌ๊ณก์€ ๋‹จ์„ฑ์  ๋ฌธํ™”ํšŒ๋กœ์— ๋ฌถ์—ฌ์žˆ๊ณ , ๊ทธ ์‹œ์ž‘์€ ๊น€์ •์ผ ์ฃผ๋„ ๋ฌธํ™”์˜ˆ์ˆ ํ˜๋ช…์ด ์žˆ๊ณ , ๋ถํ•œ ์—ฐ๊ทน์˜ ์ฐฝ์ž‘๊ณผ ํ–ฅ์œ  ๋“ฑ ๊น€์ •์ผ ํ”์ ์ด ์žˆ๋‹ค. ๊น€์ •์ผ์˜ ๋ฌธํ™”์˜ˆ์ˆ  ๋ถ€๋ฌธ ์ง€๋„ ๊ธฐ๋ก์€ ๋ถํ•œ ์—ฐ๊ทน์ด ๊น€์ •์ผ๊ณผ ์ฃผ์ฒด์‚ฌ์ƒ์„ ๋– ๋‚  ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค."
}
```
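
To inspect the data yourself, the collection can be loaded with the standard `datasets` library. Below is a minimal sketch; the `train` split name is an assumption, since the card does not list the available splits.

```python
from datasets import load_dataset

# Load the summarization collection from the Hugging Face Hub
# (the "train" split name is an assumption).
dataset = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train")

# Each entry has "instruction", "input", and "output" fields.
print(dataset[0]["instruction"])
```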

## Model

This model is built on the gemma-2-2b-it base and fine-tuned with BitsAndBytes 4-bit quantization for memory efficiency, LoRA for parameter-efficient adaptation, and the SFTTrainer framework from trl. The fine-tuned version is available on Hugging Face.

## Highlights

  1. **LoRA Configuration for Model Efficiency**: The model is fine-tuned with Low-Rank Adaptation (LoRA) using r=6, lora_alpha=8, and a dropout of 0.05, which adapts the model efficiently without updating every layer (see the combined sketch after this list).

  2. **Quantization for Memory Optimization**: BitsAndBytesConfig loads the model in 4-bit precision with nf4 quantization, cutting memory usage enough to make fine-tuning on larger datasets practical.

  3. **Fine-Tuning Parameters**: Training runs through SFTTrainer with a per-device batch size of 1, gradient_accumulation_steps=4, and max_steps=3000, using the paged 8-bit AdamW optimizer (paged_adamw_8bit) for stable performance in a memory-constrained environment.
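
The snippet below is a minimal sketch of how these three pieces fit together in a training script, based on the values listed above. The LoRA target_modules, the compute dtype, the prompt template in formatting_func, and the exact SFTTrainer/SFTConfig signature (which varies across trl versions) are assumptions not specified in this card.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "google/gemma-2-2b-it"

# Highlight 2: load the base model in 4-bit NF4 to cut memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

# Highlight 1: LoRA adapter settings; target_modules are an assumption.
lora_config = LoraConfig(
    r=6,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("wisenut-nlp-team/llama_ko_smr", split="train")

def formatting_func(example):
    # Hypothetical prompt template joining the dataset fields.
    return f"{example['instruction']}\n\n{example['input']}\n\n{example['output']}"

# Highlight 3: supervised fine-tuning with the parameters listed above.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    formatting_func=formatting_func,
    args=SFTConfig(
        output_dir="./gemma-2b-it-sum-ko-science",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=3000,
        optim="paged_adamw_8bit",
    ),
)
trainer.train()
```

Quantizing the base weights to 4-bit and training only the low-rank adapters is what makes fine-tuning a 2B-parameter model feasible in a memory-constrained environment.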

## Inference Example Code

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

BASE_MODEL = "google/gemma-2-2b-it"  # base model, for reference; not loaded here
FINETUNE_MODEL = "./gemma-2b-it-sum-ko-science"

# Load the fine-tuned model and its tokenizer.
finetune_model = AutoModelForCausalLM.from_pretrained(FINETUNE_MODEL, device_map={"": 0})
tokenizer = AutoTokenizer.from_pretrained(FINETUNE_MODEL)

pipe_finetuned = pipeline(
    "text-generation", model=finetune_model, tokenizer=tokenizer, max_new_tokens=512
)

# Document to summarize (swap in the commented-out doc to try another passage).
doc = r"그렇게 등장한 것이 원자시계다. 원자가 1초 동안 움직이는 횟수인 ‘고유진동수’를 이용해 정확한 1초를 측정한다. 원자 속에 있는 전자들은 특정 에너지 상태로 있다. 이 상태에서 다른 상태로 변화하려면 에너지를 두 상태의 차이만큼 흡수하거나 방출해야 한다. 전자가 에너지를 얻기 위해(다른 에너지 상태로 변하기 위해) 전자기파를 흡수할 때 진동이 발생하는데, 이것이 바로 고유진동수다."
#doc = r"천년만년 지나도 변하지 않는 곳이 있을까. 과학자들은 천년만년을 넘어 수억 년이 지나도 1초의 오차도 없이 일정하게 흐르는 시계를 개발하고 있다. 지구가 한 바퀴 자전하는 시간을 1일이라고 한다. 이것을 쪼개 시간과 분, 초를 정했다. 하지만 지구 자전 속도는 시간에 따라 변하므로 시간에 오차가 생겼다. 새로운 시간의 정의가 필요해진 이유다."

# Wrap the document in a chat-style summarization request.
messages = [
    {
        "role": "user",
        "content": "다음 글을 요약해주세요:\n\n{}".format(doc)
    }
]
prompt = pipe_finetuned.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    add_special_tokens=True,
)
# Strip the prompt and print only the generated summary.
print(outputs[0]["generated_text"][len(prompt):])
```
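
Sampling uses a low temperature (0.2) together with top-k/top-p filtering, a setting that keeps summaries focused on the source text while still allowing slight variation between runs.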