Aratako committed on
Commit 0b12719 · verified · 1 Parent(s): 19ab866

Update README.md

Files changed (1):
  1. README.md +120 -46
README.md CHANGED
@@ -1,67 +1,141 @@
  ---
- base_model: Aratako/Llama-Gemma-2-27b-Simpo-trial3-iter1
  library_name: transformers
- model_name: fft-orpo-iterative-iter3
  tags:
- - generated_from_trainer
  - axolotl
  - trl
  - orpo
- licence: license
  ---

- # Model Card for fft-orpo-iterative-iter3

- This model is a fine-tuned version of [Aratako/Llama-Gemma-2-27b-Simpo-trial3-iter1](https://huggingface.co/Aratako/Llama-Gemma-2-27b-Simpo-trial3-iter1).
- It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start

- ```python
- from transformers import pipeline
-
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="Aratako/fft-orpo-iterative-iter3", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```

- ## Training procedure

- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/aratako-lm/27b-fft/runs/5o3squgn)

- This model was trained with ORPO, a method introduced in [ORPO: Monolithic Preference Optimization without Reference Model](https://huggingface.co/papers/2403.07691).

- ### Framework versions

- - TRL: 0.12.0
- - Transformers: 4.46.3
- - Pytorch: 2.3.1+cu121
- - Datasets: 3.1.0
- - Tokenizers: 0.20.3

- ## Citations

- Cite ORPO as:

- ```bibtex
- @article{hong2024orpo,
-     title        = {{ORPO: Monolithic Preference Optimization without Reference Model}},
-     author       = {Jiwoo Hong and Noah Lee and James Thorne},
-     year         = 2024,
-     eprint       = {arXiv:2403.07691}
- }
- ```

- Cite TRL as:

- ```bibtex
- @misc{vonwerra2022trl,
-     title        = {{TRL: Transformer Reinforcement Learning}},
-     author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
-     year         = 2020,
-     journal      = {GitHub repository},
-     publisher    = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
- }
- ```

  ---
+ base_model: Aratako/Llama-Gemma-2-27b-ORPO-iter3
  library_name: transformers
  tags:
  - axolotl
  - trl
  - orpo
+ - exl2
+ license:
+ - llama3.1
+ - gemma
  ---

+ # Llama-Gemma-2-27b-ORPO-iter3-5.8bpw
+
+ ## Overview
+
+ This is [Aratako/Llama-Gemma-2-27b-ORPO-iter3](https://huggingface.co/Aratako/Llama-Gemma-2-27b-ORPO-iter3) quantized to 5.8 bpw with [ExLlamaV2](https://github.com/turboderp/exllamav2).
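For reference, a quantization like this is typically produced with the `convert.py` script in the ExLlamaV2 repository. A minimal sketch with placeholder paths (not the exact command used for this model):

```bash
# Hypothetical example: quantize the FP16 source model to 5.8 bpw.
# -i: input model directory, -o: working directory for measurement files,
# -cf: output directory for the compiled quantized model, -b: target bits per weight.
python convert.py \
    -i ./Llama-Gemma-2-27b-ORPO-iter3 \
    -o ./quant_work_dir \
    -cf ./Llama-Gemma-2-27b-ORPO-iter3-5.8bpw \
    -b 5.8
```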
+
+ This model was created and published as part of preparing a submission model for the competition of the [松尾研大規模言語モデル講座2024](https://weblab.t.u-tokyo.ac.jp/lecture/course-list/large-language-model/) (Matsuo Lab LLM Course 2024).
+
+ This model is built with Llama and Qwen.
+
+ For details such as the training data, see the overview of the original model.
 
+ ## Inference
+
+ The inference procedure for the competition task of the [松尾研大規模言語モデル講座2024](https://weblab.t.u-tokyo.ac.jp/lecture/course-list/large-language-model/) is described below.
+
+ 1. Prepare the inference environment as follows.
+ ```bash
+ git clone https://github.com/turboderp/exllamav2
+ cd exllamav2
+ pip install -r requirements.txt
+
+ # Install an ExLlamaV2 build that matches your PyTorch, CUDA, and Python versions.
+ # Here we assume the initial environment has CUDA 12.2, PyTorch 2.5.1, and Python 3.10.
+ pip install https://github.com/turboderp/exllamav2/releases/download/v0.2.5/exllamav2-0.2.5+cu121.torch2.5.0-cp310-cp310-linux_x86_64.whl
+ pip install torch==2.5.0
+ pip install -U --no-build-isolation flash-attn
+
+ # Download the model
+ huggingface-cli download Aratako/Llama-Gemma-2-27b-ORPO-iter3-5.8bpw --local-dir ./Llama-Gemma-2-27b-ORPO-iter3-5.8bpw
+ ```
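As a quick sanity check of the setup above, the imports below should succeed and report matching versions (a sketch; it assumes the installed exllamav2 build exposes `__version__`):

```bash
# Expect torch 2.5.0 with a cu121 build and exllamav2 0.2.5 if the
# installation above succeeded.
python -c "import torch, exllamav2; print(torch.__version__, torch.version.cuda, exllamav2.__version__)"
```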
+
+ 2. Create a Python file like the following as elyza_tasks_100_tv_exllamav2.py. Also place elyza-tasks-100-TV_0.jsonl in the same directory.
+
+ <details><summary>elyza_tasks_100_tv_exllamav2.py</summary>
+
+ ```python
+ import argparse
+ import json
+
+ from datasets import load_dataset
+ from exllamav2 import ExLlamaV2, ExLlamaV2Cache_Q8, ExLlamaV2Config, ExLlamaV2Tokenizer
+ from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler
+ from transformers import AutoTokenizer
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument("-m", "--model", help="Model to evaluate", required=True)
+ parser.add_argument("-t", "--tokenizer", help="Tokenizer to use")
+ parser.add_argument("-o", "--output", help="Name of the output jsonl file")
+ args = parser.parse_args()
+
+ if args.tokenizer is None:
+     args.tokenizer = args.model
+
+ if args.output is None:
+     args.output = f"answers-{args.model.split('/')[-1]}.jsonl"
+
+ ds = load_dataset("json", data_files="./elyza-tasks-100-TV_0.jsonl", split="train")
+ hf_tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)
+
+ # ExLlamaV2 setup
+ config = ExLlamaV2Config(args.model)
+ config.arch_compat_overrides()
+ model = ExLlamaV2(config)
+ cache = ExLlamaV2Cache_Q8(model, max_seq_len=2304, lazy=True)
+ model.load_autosplit(cache, progress=True)
+ tokenizer = ExLlamaV2Tokenizer(config)
+
+ generator = ExLlamaV2DynamicGenerator(
+     model=model,
+     cache=cache,
+     tokenizer=tokenizer,
+ )
+
+ # Inference parameters (greedy decoding)
+ gen_settings = ExLlamaV2Sampler.Settings.greedy()
+
+ # With max_seq_len=2304, an input longer than 768 tokens should raise an error;
+ # if that happens, reduce this value in steps of 256.
+ max_tokens = 1536
+
+ def apply_chat_template(item):
+     # Wrap each task input in the chat template, leaving the assistant turn open
+     messages = [
+         {"role": "user", "content": item["input"]}
+     ]
+     item["prompt"] = hf_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     return item
+
+ ds = ds.map(apply_chat_template, batched=False)
+
+ def generate_answer(batch):
+     outputs = generator.generate(
+         prompt=batch["prompt"],
+         max_new_tokens=max_tokens,
+         stop_conditions=[tokenizer.eos_token_id],
+         gen_settings=gen_settings,
+         encode_special_tokens=True,
+     )
+     # The generated text includes the prompt, so keep only the model's part
+     outputs = [text.split("<start_of_turn>model\n", 1)[-1] for text in outputs]
+     print(outputs)
+     batch["output"] = outputs
+     return batch
+
+ ds = ds.map(generate_answer, batched=True, batch_size=10)
+ ds = ds.remove_columns("prompt")
+
+ with open(args.output, "w", encoding="utf-8") as f:
+     for row in ds:
+         json.dump(row, f, ensure_ascii=False)
+         f.write("\n")
+ ```
+
+ </details><br>
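The script assumes each line of elyza-tasks-100-TV_0.jsonl is a JSON object with an `input` field (it reads `item["input"]` and adds an `output` column). A quick way to verify the file before running, assuming `jq` is installed:

```bash
# Print the keys of the first record; "input" should be among them.
head -n 1 elyza-tasks-100-TV_0.jsonl | jq 'keys'
```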
+
+ 3. Run inference as follows. When inference completes, the answers are saved to answers-Llama-Gemma-2-27b-ORPO-iter3-5.8bpw.jsonl by default.
+
+ ```bash
+ python elyza_tasks_100_tv_exllamav2.py -m Llama-Gemma-2-27b-ORPO-iter3-5.8bpw
+ ```
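To spot-check the result (assuming the task file contains 100 tasks; one JSON object is written per line):

```bash
# The line count should match the number of tasks, and the first record
# should contain both "input" and "output" fields.
wc -l answers-Llama-Gemma-2-27b-ORPO-iter3-5.8bpw.jsonl
head -n 1 answers-Llama-Gemma-2-27b-ORPO-iter3-5.8bpw.jsonl
```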
+
+ ## License
+
+ Because of the data used for training, this model is subject to the following licenses:
+
+ - It inherits the [META LLAMA 3.1 COMMUNITY LICENSE](https://www.llama.com/llama3_1/license/).
+ - It inherits the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
+ - It is affected by the [Qwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE). The license itself is not inherited, but a notice such as "Built with Qwen" must be included.