pokutuna/llm2024-competition

	---
	license: gemma
	language:
	- ja
	- en
	datasets:
	- llm-jp/magpie-sft-v1.0
	- kajuma/CC-news-2024-July-October-cleaned
	- weblab-GENIAC/aya-ja-nemotron-dpo-masked
	base_model:
	- google/gemma-2-9b
	---

	## Training Dataset

	The following datasets were sampled and preprocessed, then used for training.

	- [kajuma/CC-news-2024-July-October-cleaned](https://huggingface.co/datasets/kajuma/CC-news-2024-July-October-cleaned) (ODC-By)
	  - Filtered to entries above a minimum text length that can be regarded as standalone news articles, then extracted the body text for use
	- [llm-jp/magpie-sft-v1.0](https://huggingface.co/datasets/llm-jp/magpie-sft-v1.0) (apache-2.0)
	  - Sampled and used for instruction tuning
	- [weblab-GENIAC/aya-ja-nemotron-dpo-masked](https://huggingface.co/datasets/weblab-GENIAC/aya-ja-nemotron-dpo-masked) (apache-2.0)
	  - Sampled and used for preference tuning
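
The length filter and sampling step described for the datasets above could look roughly like the sketch below. It is only an illustration: the README does not state the actual threshold, field name, or sampling method, so `MIN_CHARS`, the `text` field, and `filter_and_sample` are all hypothetical, and the check for "standalone news article" is not implemented here.

```python
import random

# Hypothetical threshold: the README only says "above a minimum text length".
MIN_CHARS = 400


def filter_and_sample(records, sample_size, min_chars=MIN_CHARS, seed=0):
    """Keep records whose `text` has at least `min_chars` characters,
    then draw up to `sample_size` of them at random (reproducibly)."""
    kept = [r for r in records if len(r.get("text", "").strip()) >= min_chars]
    if len(kept) <= sample_size:
        # Fewer candidates than requested: return everything that passed.
        return kept
    rng = random.Random(seed)
    return rng.sample(kept, sample_size)
```

A fixed seed keeps the sampled subset stable between runs, which makes a training set easier to reproduce.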

	## How to Run (for competition graders)

	### Setup

	```
	# Install lshw (needed for GPU detection when installing ollama)
	$ apt update && apt install -y lshw

	# Install and start ollama (https://ollama.com/)
	$ curl -fsSL https://ollama.com/install.sh | sh
	$ ollama serve

	# -- Run the following from a separate terminal (against the ollama server) --

	# Download the model
	$ ollama pull hf.co/pokutuna/llm2024-gemma2:gemma2-9b-v10.gguf
	#
	# Note:
	# After the download, confirm that "success" is printed.
	# This has been verified in the exercise environment, but depending on
	# network conditions a timeout (context deadline exceeded) may occur.
	# Re-running the command a few times lets the download complete.

	# Pull the answer-generation code
	$ git clone https://github.com/pokutuna/llm2024-competition-runner.git

	# Install dependencies
	$ pip install -r llm2024-competition-runner/requirements.txt
	```
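
Since the pull may need to be repeated after a timeout, the retry can be automated. The following is a minimal sketch, assuming `ollama pull` exits with a non-zero status when it fails; `run_with_retries` and `pull_model` are hypothetical helper names and are not part of the runner repository.

```python
import subprocess
import time

MODEL = "hf.co/pokutuna/llm2024-gemma2:gemma2-9b-v10.gguf"


def run_with_retries(action, attempts=5, wait_seconds=0.0):
    """Call `action()` until it returns True or the attempts run out.

    Returns the 1-based number of the attempt that succeeded, or None.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            time.sleep(wait_seconds)
    return None


def pull_model(model=MODEL):
    """Run `ollama pull` once; True when the command exits with status 0.

    Assumption: a timeout makes the CLI exit with a non-zero status.
    """
    result = subprocess.run(["ollama", "pull", model])
    return result.returncode == 0


# Usage (requires a running ollama server):
#   run_with_retries(pull_model, attempts=5, wait_seconds=10)
```

Even with a retry wrapper, it is still worth confirming that the final attempt printed `success`.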

	### Generating Output

	```sh
	$ python ./llm2024-competition-runner/generate.py \
	  --model="hf.co/pokutuna/llm2024-gemma2:gemma2-9b-v10.gguf" \
	  --tasks=./tasks.jsonl \
	  --outfile=./output.jsonl
	```

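	- `--tasks=<path>`
	  - Path to the task data: a JSONL file in which every line has an `input` field (the same structure as `elyza-tasks-100-TV_0.jsonl` is assumed)
	- `--outfile=<path>`
	  - Output destination: each line of the task data, with an `output` key added that holds the generated result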
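
The relationship between the two files can be sketched as follows. This is not the code in `generate.py`; it only illustrates the format described above, and `add_outputs`, `process_file`, and the `generate` callable are hypothetical stand-ins for the model call.

```python
import json


def add_outputs(task_lines, generate):
    """Yield one JSON line per task, with an `output` key added.

    `generate` is a hypothetical callable that maps the `input` string
    to the model's answer; all other fields are passed through unchanged.
    """
    for line in task_lines:
        if not line.strip():
            continue  # skip blank lines
        task = json.loads(line)
        task["output"] = generate(task["input"])
        # ensure_ascii=False keeps Japanese text readable in the file.
        yield json.dumps(task, ensure_ascii=False)


def process_file(tasks_path, out_path, generate):
    """Read a tasks JSONL file and write the results JSONL file."""
    with open(tasks_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for out_line in add_outputs(src, generate):
            dst.write(out_line + "\n")
```

Because every input field is copied through, each line of the output file can be matched back to its task without relying on line order.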