---
language:
- ja
---

## 本モデルについて About this model

[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)を[日本語が多く含まれる重要度行列(iMatrix)](https://huggingface.co/dahara1/imatrix-jpn-test)を使って量子化し、超長文(32K以上)の要約を可能にしたgguf版です。日本語対応能力が多めに保持されている事を期待しています。

This is a gguf version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) quantized with an [importance matrix (iMatrix) containing a large amount of Japanese](https://huggingface.co/dahara1/imatrix-jpn-test), enabling summarization of very long texts (over 32K tokens). We hope this preserves more of the model's Japanese capability.

少なくともQwen2.5-3B-Instruct-gguf-japanese-imatrix-128K/Qwen2.5-3B-Instruct-Q8_0-f16.ggufが32Kトークンを超える超長文を正しく要約できる事を確認済です。

It has been confirmed that at least Qwen2.5-3B-Instruct-gguf-japanese-imatrix-128K/Qwen2.5-3B-Instruct-Q8_0-f16.gguf can correctly summarize extremely long texts exceeding 32K tokens.

128Kコンテキスト延長については[unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF](https://huggingface.co/unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF)の指摘を参考にしています。ありがとう。

The 128K context extension follows the suggestion made in [unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF](https://huggingface.co/unsloth/Qwen2.5-Coder-32B-Instruct-128K-GGUF). Thank you.

## For ollama users

ollamaユーザーは[FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md)を参考にして、以下のようにcontext window sizeパラメーターを修正してください。

If you use ollama, check the [FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md) and set the context window size parameter as shown below:

```
/set parameter num_ctx 40960
```

or via the API:

```
curl http://..../api/generate -d '{
  "model": ".....",
  "prompt": "......",
  "options": {
    "num_ctx": 40960
  }
}'
```

他のツールを使っている場合も、同様にそのツールのマニュアルを調べて、コンテキストウインドウサイズを延長する事を忘れないでください。ただし、コンテキストサイズを必要以上に大きくするとモデルの実行速度が低下するので注意してください。本モデルは理論上、最大値128K(131072)に設定できますが、実行速度と品質に影響が出る事が考えられます。

If you are using other tools, be sure to extend the context window size as well by consulting the manual of your tool.
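For llama.cpp itself, the extension can be requested at server startup. The following is a hypothetical sketch, not a confirmed recipe for this repo: the YaRN flag values assume Qwen2.5's 32K base context scaled by a factor of 4 to reach 131072, in line with the unsloth note referenced above. If the gguf metadata already carries the YaRN parameters (as in the unsloth approach), passing `-c` alone may be enough; consult the llama.cpp documentation for your build.

```shell
# Hypothetical: extend context to 128K via YaRN rope scaling (flag values are assumptions)
./llama.cpp/build/bin/Release/llama-server.exe -m ./Qwen2.5-3B-Instruct-Q8_0-f16.gguf \
  -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```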
But please note that increasing the context window size more than necessary will slow down the model's execution speed. In theory, this model can be set to the maximum value of 128K (131072), but this may affect execution speed and quality.

## Sample llama.cpp script

以下は、Wikipediaの約50,000文字(34.8Kトークン)の記事を取得して内容を要約するサンプルです。

Below is a sample that retrieves a Wikipedia article of about 50,000 Japanese characters (34.8K tokens) and summarizes its contents.

llama.cpp server command sample:

```
./llama.cpp/build/bin/Release/llama-server.exe -m ./Qwen2.5-3B-Instruct-Q8_0-f16.gguf -c 40960
```

llama.cpp client script sample:

```
import json

import requests
from bs4 import BeautifulSoup
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

url = "https://ja.wikipedia.org/wiki/%E7%94%B7%E3%81%AE%E5%A8%98"


def get_wikipedia_text(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch the article. Status code: {response.status_code}")
    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = soup.find_all('p')
    return "\n".join([p.get_text() for p in paragraphs])


if __name__ == "__main__":
    html_text = get_wikipedia_text(url)
    # html_text = html_text[:40000]

    instruct = "### 指示\n\n上記の文章を日本語で3行で要約してください"

    # instruct first version (uncomment to place the instruction before the text)
    # messages = [
    #     {"role": "system", "content": "You are a helpful assistant."},
    #     {"role": "user", "content": instruct + "\n\n" + html_text},
    # ]

    # instruct last version (the instruction follows the text)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": html_text + "\n\n" + instruct},
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )
    print(prompt)

    payload = {
        "prompt": prompt,
        "n_predict": 512
    }

    api_url = "http://localhost:8080/completion"
    headers = {"Content-Type": "application/json"}
    response = requests.post(api_url, headers=headers, data=json.dumps(payload))
    if response.status_code != 200:
        raise Exception(f"Error: {response.text}")

    response_data = response.json()
    response_content = response_data.get('content', '').strip()
    print(response_content)
```

### 出力結果 (output sample)

#### This 128K model

128K instruct first version

![128K instruct first version](128k_full_head.png)

128K instruct last version

![128K instruct last version](128k_full_tail.png)

#### Standard 32K model

32K instruct first version

![32K instruct first version](32k_full_head.png)

32K instruct last version

![32K instruct last version](32k_full_tail.png)

32K instruct first versionでは要約指示がコンテキスト外になっており、指示が無視されている事に注目してください。

Notice that in the 32K instruct first version, the summary instruction falls outside the context window and is therefore ignored.
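As the 32K example shows, any part of the prompt that falls outside the context window is silently dropped, so it pays to size `num_ctx` from the prompt's token count before launching the server. The helper below is a hypothetical sketch (`choose_num_ctx` is not part of llama.cpp or this repo); it assumes a 4096-token granularity and the model's theoretical 131072-token ceiling.

```python
def choose_num_ctx(prompt_tokens: int, n_predict: int = 512,
                   cap: int = 131072, granularity: int = 4096) -> int:
    """Smallest multiple of `granularity` that fits prompt + generation, capped at 128K."""
    needed = prompt_tokens + n_predict
    # Round up to the next multiple of `granularity`, then cap at the model maximum.
    ctx = ((needed + granularity - 1) // granularity) * granularity
    return min(ctx, cap)

# e.g. the 34.8K-token article from the sample above:
print(choose_num_ctx(34800))  # → 36864
```

In the client script, `prompt_tokens` could be obtained with `len(tokenizer.encode(prompt))` before starting the server with the corresponding `-c` value.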