本ggufモデルについて about this gguf model

gemma-2-2b-itを日本語が多く含まれる重要度行列(iMatrix)を使って量子化したgguf版です。日本語対応能力が多めに保持されている事を期待しています。
This is a quantized gguf version of gemma-2-2b-it using an importance matrix (iMatrix) that contains many Japanese words. I hope it retains more Japanese support.

また、最新のllama.cppに実装された投機的デコード(Speculative decoding)という新しいテクニックを使ってより大きいモデルの実行速度を上げる事ができます。 also, using latest llama.cpp and a new technique called speculative decoding, we can speed up larger models.

windows speculative decoding command sample(ROCm compiled version)

set HSA_OVERRIDE_GFX_VERSION=gfx1103 && .\llama-server.exe ^
    -m  .\gemma-2-27B-it-Q4_K_M-fp16.gguf ^
    -md .\gemma-2-2b-it-IQ3_XXS.gguf ^
    -ngl 10 -ngld 10 -e --temp 0 -c 4096 ^
    --draft-max 16 --draft-min 5

私のテストプロンプトの実行時間: 1576.67秒
My test prompt execution time: 1576.67 seconds

windows normal command sample

.\llama-server.exe ^
    -m  ..\gemma\gemma-2-27B-it-Q4_K_M-fp16.gguf ^
    -e --temp 0 -c 4096

私のテストプロンプトの実行時間: 4591.58秒
My test prompt execution time: 4591.58 seconds

CUDAのサンプルについてはdahara1/Qwen2.5-0.5B-Instruct-gguf-japanese-imatrix-128Kをみてください
See dahara1/Qwen2.5-0.5B-Instruct-gguf-japanese-imatrix-128K for CUDA examles.

クライアントスクリプトの例はdahara1/Qwen2.5-3B-Instruct-gguf-japanese-imatrix-128Kをご覧ください
See dahara1/Qwen2.5-3B-Instruct-gguf-japanese-imatrix-128K for cliant example.

コマンドの詳細はllama.cppの公式ページをご覧ください
For more command information, see the official llama.cpp page.