|
--- |
|
library_name: transformers |
|
license: cc-by-nc-sa-4.0 |
|
language: |
|
- ja |
|
- en |
|
base_model: |
|
- llm-jp/llm-jp-3-13b |
|
--- |
|
|
|
# Model Card for chocopan/llm-jp-3-13b-finetune-4bit
|
|
|
|
This model is llm-jp-3-13b fine-tuned with SFT on the ichikara-instruction dataset.<br>
Only the LoRA adapter files are uploaded in this repository.<br>
Replace HF_TOKEN and WB_TOKEN with your own tokens.<br>
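
If you run the linked notebook, the tokens can be supplied with the standard huggingface_hub and wandb login calls. A minimal sketch (the variable names mirror the note above; the exact cell in the notebook may differ):

```python
from huggingface_hub import login
import wandb

HF_TOKEN = "hf_..."  # replace with your Hugging Face access token
WB_TOKEN = "..."     # replace with your Weights & Biases API key

login(token=HF_TOKEN)      # authenticate with the Hugging Face Hub
wandb.login(key=WB_TOKEN)  # authenticate with Weights & Biases
```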
|
|
|
## How to Get Started with the Model |
|
|
|
- Jupyter Notebook: [Training-Inference-code.ipynb](https://huggingface.co/chocopan/llm-jp-3-13b-finetune-4bit/blob/main/Training-Inference-code.ipynb)

- Training Dataset: ichikara-instruction-003-merge.json (a loading sketch follows below)

- Test Dataset: ELYZA-tasks-100-TV (not included in this repository)
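
To inspect the training data before running the notebook, the merged file can be loaded with the datasets library. A minimal sketch, assuming the merged file is a plain JSON list of records:

```python
from datasets import load_dataset

# Load the merged ichikara-instruction file for a quick look at its fields
train_ds = load_dataset("json", data_files="ichikara-instruction-003-merge.json", split="train")
print(train_ds)
print(train_ds[0])
```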
|
|
|
### File Tree |
|
```
/workspace
|--Training-Inference-code.ipynb
|--models/models--llm-jp--llm-jp-3-13b/snapshots/cd3823f4c1fcbb0ad2e2af46036ab1b0ca13192a
|--ichikara-instruction-003-merge.json
`--elyza-tasks-100-TV_0.jsonl
```
|
|
|
### Usage |
|
Execute the following code in Google Colab.
|
|
|
```python
!pip install -U pip
!pip install -U transformers
!pip install -U bitsandbytes
!pip install -U accelerate
!pip install -U datasets
!pip install -U peft

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch
import bitsandbytes as bnb  # ensure bitsandbytes is available for 4-bit loading
import json
import re
from collections import Counter
from tqdm import tqdm

# Base model and LoRA adapter IDs
model_id = "llm-jp/llm-jp-3-13b"
adapter_id = "chocopan/llm-jp-3-13b-finetune-4bit"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit quantization settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # enable 4-bit quantization
    bnb_4bit_use_double_quant=True,         # double quantization for extra memory savings
    bnb_4bit_quant_type="nf4",              # NF4 quantization type (recommended)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

# Load the base model (4-bit quantized)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Load the task data.
# Upload the file beforehand.
datasets = []
with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        if item.endswith("}"):
            datasets.append(json.loads(item))
            item = ""

# Run inference
results = []
for dt in tqdm(datasets):
    input_text = dt["input"]
    prompt = f"""### 指示
{input_text}
### 回答
"""
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    # Remove token_type_ids from inputs if present
    if "token_type_ids" in inputs:
        del inputs["token_type_ids"]
    outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True, do_sample=False, repetition_penalty=1.2)
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).split('\n### 回答')[-1]
    prediction = re.sub(r"[*#]", "", prediction).strip()  # drop markup characters and surrounding whitespace
    results.append({
        "task_id": dt.get("task_id", None),  # handle records without a task_id
        "input": input_text,
        "prediction": prediction,
        "expected": dt.get("output", None)   # reference answer, if available
    })

# Evaluation
exact_match_count = 0
total_count = 0
f1_scores = []

for result in results:
    if result["expected"] is None:  # skip records without a reference answer
        continue

    total_count += 1
    expected = result["expected"].strip()
    prediction = result["prediction"].strip()

    if prediction == expected:
        exact_match_count += 1

    # Word-level F1 score
    expected_words = expected.split()
    prediction_words = prediction.split()

    if len(expected_words) == 0 and len(prediction_words) == 0:
        f1 = 1.0  # both empty -> 1.0
    elif len(expected_words) == 0 or len(prediction_words) == 0:
        f1 = 0.0  # only one side empty -> 0.0
    else:
        # Bag-of-words overlap F1 (sklearn's f1_score requires equal-length
        # sequences, which free-form generations generally are not)
        common = Counter(expected_words) & Counter(prediction_words)
        overlap = sum(common.values())
        if overlap == 0:
            f1 = 0.0
        else:
            precision = overlap / len(prediction_words)
            recall = overlap / len(expected_words)
            f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)

# Print the evaluation results
exact_match_rate = exact_match_count / total_count if total_count > 0 else 0
average_f1 = sum(f1_scores) / len(f1_scores) if len(f1_scores) > 0 else 0

print(f"Exact Match Rate: {exact_match_rate:.4f}")
print(f"Average F1 Score: {average_f1:.4f}")

# Save the results as JSONL (including the evaluation fields)
json_file_id = re.sub(".*/", "", adapter_id)
with open(f"/content/{json_file_id}_output.jsonl", 'w', encoding='utf-8') as f:
    for result in results:
        if result["expected"] is None:
            result["exact_match"] = None
        else:
            result["exact_match"] = 1 if result["prediction"].strip() == result["expected"].strip() else 0
        f.write(json.dumps(result, ensure_ascii=False) + '\n')
```
|
|
|
## Training Details |
|
```python
training_arguments = TrainingArguments(
    output_dir=new_model_id,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=2,
    logging_strategy="steps",
    logging_steps=10,
    warmup_steps=10,
    save_steps=100,
    save_total_limit=2,
    max_steps=-1,
    learning_rate=5e-5,
    fp16=False,
    bf16=True,
    seed=1001,
    group_by_length=True,
    report_to="wandb"
)
```
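
The full training script lives in the linked notebook; `new_model_id` above is the output/adapter name defined there. As rough orientation only, the QLoRA side of such a run with peft might look like the sketch below (the LoRA rank, alpha, dropout, and target modules are placeholders, not the values actually used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "llm-jp/llm-jp-3-13b"

# Load the base model in 4-bit, matching the inference setup above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare for QLoRA training and attach a LoRA adapter
# (rank/alpha/dropout/target modules below are placeholders, not the documented values)
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# The PEFT-wrapped model and the TrainingArguments shown above are then passed to
# trl's SFTTrainer together with the formatted ichikara-instruction data;
# see the linked notebook for the exact trainer call.
```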
|
### Training Results |
|
Training time: 5:52:48<br>
Total steps: 6030<br>
Epochs: 2<br>
|
|
|
<div style="width: auto; margin-left: auto; margin-right: auto"> |
|
<img src="train.jpg" alt="Train" style="width: 100%; min-width: 400px; display: block; margin: auto;"> |
|
</div> |
|
|
|
### Training Dataset |
|
|
|
|
- ichikara-instruction: Japanese instruction data for LLMs

  [https://liat-aip.sakura.ne.jp/wp/llmのための日本語インストラクションデータ作成/llmのための日本語インストラクションデータ-公開/](https://liat-aip.sakura.ne.jp/wp/llmのための日本語インストラクションデータ作成/llmのための日本語インストラクションデータ-公開/)

  Satoshi Sekine, Maya Ando, Michiko Goto, Hisami Suzuki, Daisuke Kawahara, Naoya Inoue, Kentaro Inui. ichikara-instruction: Constructing Japanese Instruction Data for LLMs. The 30th Annual Meeting of the Association for Natural Language Processing (2024).<br>

All of the above dataset files were merged, and the IDs were renumbered sequentially (a sketch of this preprocessing follows below).<br>
License: CC-BY-NC-SA<br>
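
The merge script itself is not part of this repository. A minimal sketch of the preprocessing described above, assuming each source file is a JSON list of records; the file name pattern and the ID field/format are hypothetical:

```python
import glob
import json

merged = []
# Hypothetical source file pattern; the actual ichikara-instruction file names may differ
for path in sorted(glob.glob("ichikara-instruction-003-*.json")):
    with open(path, "r", encoding="utf-8") as f:
        merged.extend(json.load(f))

# Renumber IDs sequentially across the merged records (ID key/format is an assumption)
for i, record in enumerate(merged, start=1):
    record["ID"] = f"ichikara-instruction-003-{i:06d}"

with open("ichikara-instruction-003-merge.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)
```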
|
|
|
#### Hardware |
|
|
|
Google Cloud Platform<br> |
|
L4 GPU 24GB<br> |
|
RAM 48GB<br> |
|
|
|
#### Software |
|
|
|
transformers==4.46.3<br> |
|
trl==0.12.2<br> |
|
Others: latest<br>
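
To reproduce this environment in Colab, the pinned versions can be installed explicitly while leaving the remaining packages at their latest releases:

```python
!pip install transformers==4.46.3 trl==0.12.2
!pip install -U bitsandbytes accelerate datasets peft
```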
|
|