	---
	license: gemma
	datasets:
	- wisenut-nlp-team/llama_ko_smr
	base_model:
	- google/gemma-2-2b-it
	tags:
	- summary
	- finetuned
	---

	# Gemma Fine-Tuning for a Technical Summarization Chatbot

	Gemma is being fine-tuned here for use in a technical summarization chatbot. The chatbot relies on the model to read complex technical content and summarize it, so that users can engage with technical material more easily. Fine-tuning targets two things: capturing the essential points of dense technical text accurately, and returning concise, readable summaries. The end goal is a better user experience in settings where quick, reliable technical insight is required.

	## Table of Contents

	1. [Dataset](#dataset)
	2. [Model](#model)
	3. [Inference Example Code](#inference-example-code)


	## Dataset

	The dataset comes from the [wisenut-nlp-team/llama_ko_smr](https://huggingface.co/datasets/wisenut-nlp-team/llama_ko_smr) collection on Hugging Face. It contains several kinds of summarization data: document summaries, book summaries, research paper summaries, TV content script summaries, Korean dialogue summaries, and technical/scientific summaries. Each entry consists of an instruction, the main text, and the corresponding summary.

	Rather than training only on the technical and scientific subset, I used the entire dataset to expose the model to a wider variety of content. The intent is a more well-rounded model that can handle diverse summarization tasks across domains.

	Here is an example entry from the dataset:

	```json
	{
	"instruction": "이 글의 주요 내용을 짧게 설명해 주실 수 있습니까?",
	"input": "북한 연극에 대한 나의 탐구는 해방공간에 북으로 사라져 간 수많은 연극인들의 행적을 찾아보고자 하는 단순한 호기심에서 시작되었다. 해방공간에서 활동하던 연극인의 대다수가 납․월북의 과정을 거쳐 북한 연극계에 자리를 잡았기 때문이다. 그 안에는 극작가 송영, 함세덕, 박영호, 조영출, 연출가 이서향, 안영일, 신고송, 무대미술가 김일영, 강호, 배우 황철, 김선영, 문예봉, 만담가 신불출 등 기라성 같은 멤버들이 포함되어 있었다. 그 숫자로만 본다면 일제강점기 서울의 연극계가 통으로 평양으로 옮겨간 셈이었다. 그렇지만 이제 북한 연극에서 더 이상 그들의 존재를 확인하기 어려운 상황이다. 그들은 남에서도 북에서도 시계에서 영원히 사라져버린 ‘잃어버린 세대’ 그 자체이다. 그들의 흔적을 찾는 것은 차라리 고고학의 과제가 되었다. 그들이 역사의 저편으로 사라진 그 자리에 오늘의 북한 연극이 성채처럼 위용을 자랑하고 있다. 오늘날의 북한 연극은 모두가 주체사실주의에 입각하여 만들어지는 이른바 ‘<성황당>식 혁명연극’ 일색이다. 1978년 국립연극단의 <성황당> 공연의 성과를 본보기로 삼아 모든 연극이 ‘따라 배우기’를 하고 있기 때문이다. 북한의 연극과 희곡은 정점에서 내려 쏟는 단성적(單聲的) 문화회로 안에 갇혀 있다. 혁명연극 <성황당>(1978)의 본보기는 혁명가극 <피바다>(1971)이며, 그 근저에는 1960년대부터 시작된 김정일 주도의 문화예술혁명이 가로놓여 있다. 북한 연극의 창작과 향유, 그 모든 과정에서 김정일의 그림자에 맞닥뜨리지 않을 수 없다. 최근에 방문한 조선예술영화촬영소 에 있는 ‘문화성혁명사적관’(김정일관)에는 1960년대 중반부터 2000년대까지 40년 동안 김정일의 문화예술 부문 지도가 11,890건이며, 그 중 문화예술기관을 직접 방문하여 지도한 이른바 ‘현지지도’가 1,770건이라는 안내판이 있었다. 북한 연극이 김정일과 주체사상이라는 키워드를 떠나 존재할 수 없다는 것을 단적으로 말해 준다.",
	"output": "해방공간에서 활동한 대다수의 연극인은 납·월북을 통해 북한 연극계에 자리 잡았지만 이제 북한 연극에서 그들을 보기 어렵다. 그 자리에 지금 북한 연극이 자리 잡았다. 1978년 국립연극단 <성황당> 공연을 토대로 북한의 모든 연극은 ‘<성황당>식 혁명연극’ 일색이다. 북한 연극과 희곡은 단성적 문화회로에 묶여있고, 그 시작은 김정일 주도 문화예술혁명이 있고, 북한 연극의 창작과 향유 등 김정일 흔적이 있다. 김정일의 문화예술 부문 지도 기록은 북한 연극이 김정일과 주체사상을 떠날 수 없는 것을 보여준다."
	}
	```
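For supervised fine-tuning, each record has to be turned into a single training string. The sketch below shows one way to do that with Gemma's chat turn markers. The function name `format_example` and the choice to join `instruction` and `input` with a blank line are assumptions for illustration, not necessarily the formatting used to train this model.

```python
def format_example(example: dict) -> str:
    """Render one dataset record as a Gemma-style user/model exchange.

    Assumption: the instruction and the main text are joined with a blank
    line to form the user turn. The <bos> token is left to the tokenizer.
    """
    user_turn = f"{example['instruction']}\n\n{example['input']}"
    return (
        "<start_of_turn>user\n"
        f"{user_turn}<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{example['output']}<end_of_turn>\n"
    )


# Hypothetical English record with the same three fields as the dataset.
sample = {
    "instruction": "Summarize the following text.",
    "input": "Atomic clocks count the oscillations of atoms to define one second.",
    "output": "Atomic clocks define the second by counting atomic oscillations.",
}
print(format_example(sample))
```

Keeping the formatting in one small function makes it easy to apply to every record (for example with `Dataset.map`) and to change the prompt layout later.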

	## Model

	This model is built on gemma-2-2b-it and fine-tuned with BitsAndBytes quantization for memory savings, LoRA for parameter-efficient adaptation, and the SFTTrainer from the TRL library. The fine-tuned model is available on Hugging Face.

	### Highlights

	1. LoRA Configuration for Model Efficiency: The model is fine-tuned using Low-Rank Adaptation (LoRA) with r=6, lora_alpha=8, and a dropout of 0.05. Only the small low-rank adapter matrices are trained, so the model adapts efficiently without updating all of the base weights.

	2. Quantization for Memory Optimization: The BitsAndBytesConfig loads the model in 4-bit precision using nf4 quantization. This reduces memory usage, making it possible to fine-tune the model on hardware with limited GPU memory.

	3. Fine-Tuning Parameters: Fine-tuning is set up using SFTTrainer, with a batch size of 1, gradient_accumulation_steps=4, and max_steps=3000. Training uses the paged 8-bit AdamW optimizer (paged_adamw_8bit) to keep optimizer memory low in a memory-constrained environment.
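The settings listed above can be collected in one place, as in the following sketch. Only the numeric values and the `nf4` and `paged_adamw_8bit` choices come from this README. The `task_type`, the `output_dir` path, and the helper name `build_configs` are assumptions, and unlisted options such as LoRA target modules are left at library defaults. The library imports are deferred so the hyperparameters can be inspected without the GPU stack installed.

```python
# Hyperparameters stated in the Highlights list.
LORA_KWARGS = dict(
    r=6,
    lora_alpha=8,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",  # assumption: not stated in the README
)
QUANT_KWARGS = dict(load_in_4bit=True, bnb_4bit_quant_type="nf4")
TRAIN_KWARGS = dict(
    output_dir="./outputs",  # hypothetical path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=3000,
    optim="paged_adamw_8bit",
)


def build_configs():
    """Turn the plain dicts into peft / transformers config objects."""
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig, TrainingArguments

    return (
        LoraConfig(**LORA_KWARGS),
        BitsAndBytesConfig(**QUANT_KWARGS),
        TrainingArguments(**TRAIN_KWARGS),
    )


# Quantities implied by the settings (assuming a single GPU).
effective_batch = (
    TRAIN_KWARGS["per_device_train_batch_size"]
    * TRAIN_KWARGS["gradient_accumulation_steps"]
)
examples_seen = effective_batch * TRAIN_KWARGS["max_steps"]
lora_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]  # scaling applied to the adapter update

print(effective_batch, examples_seen, round(lora_scale, 3))
```

The resulting objects would then be handed to the model loader and to TRL's `SFTTrainer`. The exact trainer arguments differ between TRL versions, so that step is not shown here.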

	## Inference Example Code

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

	# Base model named in this card's metadata (kept for reference; not loaded below).
	BASE_MODEL = "google/gemma-2-2b-it"
	# Local directory containing the fine-tuned weights.
	FINETUNE_MODEL = "./gemma-2b-it-sum-ko-science"

	# Load the fine-tuned model onto GPU 0 together with its tokenizer.
	finetune_model = AutoModelForCausalLM.from_pretrained(FINETUNE_MODEL, device_map={"": 0})
	tokenizer = AutoTokenizer.from_pretrained(FINETUNE_MODEL)

	pipe_finetuned = pipeline("text-generation", model=finetune_model, tokenizer=tokenizer, max_new_tokens=512)

	# Korean passage to summarize (an explanation of how atomic clocks work).
	doc = r"그렇게 등장한 것이 원자시계다. 원자가 1초 동안 움직이는 횟수인 ‘고유진동수’를 이용해 정확한 1초를 측정한다. 원자 속에 있는 전자들은 특정 에너지 상태로 있다. 이 상태에서 다른 상태로 변화하려면 에너지를 두 상태의 차이만큼 흡수하거나 방출해야 한다. 전자가 에너지를 얻기 위해(다른 에너지 상태로 변하기 위해) 전자기파를 흡수할 때 진동이 발생하는데, 이것이 바로 고유진동수다."
	# Alternative passage (why a new definition of time was needed):
	#doc = r"천년만년 지나도 변하지 않는 곳이 있을까. 과학자들은 천년만년을 넘어 수억 년이 지나도 1초의 오차도 없이 일정하게 흐르는 시계를 개발하고 있다. 지구가 한 바퀴 자전하는 시간을 1일이라고 한다. 이것을 쪼개 시간과 분, 초를 정했다. 하지만 지구 자전 속도는 시간에 따라 변하므로 시간에 오차가 생겼다. 새로운 시간의 정의가 필요해진 이유다."

	# The Korean user prompt reads "Please summarize the following text:".
	messages = [
	    {
	        "role": "user",
	        "content": "다음 글을 요약해주세요:\n\n{}".format(doc),
	    }
	]
	# Wrap the message in the model's chat template and append the generation prompt.
	prompt = pipe_finetuned.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

	outputs = pipe_finetuned(
	    prompt,
	    do_sample=True,
	    temperature=0.2,
	    top_k=50,
	    top_p=0.95,
	    add_special_tokens=True,
	)
	# The pipeline returns the prompt followed by the completion; print only the completion.
	print(outputs[0]["generated_text"][len(prompt):])
	```
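To summarize several passages, the prompt construction and output slicing from the example above can be wrapped in a small helper. This is a minimal sketch: the name `summarize` is hypothetical, and it assumes `pipe` is a text-generation pipeline that echoes the prompt at the start of `generated_text`, as in the example.

```python
def summarize(pipe, doc: str, **generate_kwargs) -> str:
    """Summarize `doc` with a text-generation pipeline and return only the new text."""
    # Same Korean instruction as above: "Please summarize the following text:".
    messages = [
        {"role": "user", "content": "다음 글을 요약해주세요:\n\n{}".format(doc)}
    ]
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = pipe(prompt, **generate_kwargs)
    # Drop the echoed prompt so that only the generated summary remains.
    return outputs[0]["generated_text"][len(prompt):]
```

With the objects from the example, a call would look like `summarize(pipe_finetuned, doc, do_sample=True, temperature=0.2, top_k=50, top_p=0.95)`.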