---
license: cc-by-nc-4.0
inference: false
language:
  - ja
---
# weblab-10b-instruction-sft-GPTQ 

The original model is [weblab-10b-instruction-sft](https://huggingface.co/matsuo-lab/weblab-10b-instruction-sft), a Japanese-centric multilingual GPT-NeoX model with 10 billion parameters created by Takeshi Kojima of matsuo-lab.

This model is a quantized (reduced-size) version of the original model (21.42 GB).

There are currently two well-known quantized versions of the original model.  
(1) GPTQ version (this model, 6.3 GB)  
It is smaller and runs faster, but inference quality may be slightly worse than the original model.  
At least one GPU is currently required due to a limitation of the Accelerate library,  
so this model cannot run on the free tier of Hugging Face Spaces.  
You need the AutoGPTQ library to use this model.  

(2) llama.cpp version (gguf) ([matsuolab-weblab-10b-instruction-sft-gguf](https://huggingface.co/mmnga/matsuolab-weblab-10b-instruction-sft-gguf), 6.03 GB),  
created by mmnga.  
You can run the gguf model with llama.cpp on a CPU-only machine.  
However, the gguf model may be a little slower than GPTQ, especially on long text.
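
To put the file sizes quoted above in perspective, the following minimal sketch computes how much smaller each quantized version is than the original. The sizes are the ones stated in this README; the function and constant names are hypothetical and only used for illustration.

```python
# File sizes in GB, as stated in this README.
ORIGINAL_GB = 21.42
GPTQ_GB = 6.3
GGUF_GB = 6.03


def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """Return how many times smaller the quantized file is than the original."""
    return original_gb / quantized_gb


for name, size_gb in (("GPTQ", GPTQ_GB), ("gguf", GGUF_GB)):
    ratio = compression_ratio(ORIGINAL_GB, size_gb)
    share = size_gb / ORIGINAL_GB
    print(f"{name}: {size_gb} GB, {ratio:.2f}x smaller, {share:.1%} of the original size")
```

Both versions end up at roughly 30% of the original size, so the choice between them is more about hardware (GPU or CPU only) and speed than about disk space.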

# How to run.

## Local PC

You can use [text-generation-webui](https://github.com/oobabooga/text-generation-webui) to run this model quickly on your local PC (about 16 tokens/s on my RTX 3060).

![text-generation-webui-sample](./text-generation-webui-sample.png "text-generation-webui")

[Instructions for installing text-generation-webui (in Japanese) are here](https://webbigdata.jp/post-19926/).

### Colab with GUI

You can try this model interactively in the free version of Colab.  
[weblab-10b-instruction-sft-GPTQ-text-generation-webui-colab](https://github.com/webbigdata-jp/python_sample/blob/main/weblab_10b_instruction_sft_GPTQ_text_generation_webui_colab.ipynb)  

![text-generation-webui-sample](./text-generation-webui-colab-sample.png "text-generation-webui-colab")

### Colab simple sample code

Currently, the model may behave differently on a local PC and on Colab. On Colab, the model may not respond if you include instructional prompts.  
[Colab Sample script](https://github.com/webbigdata-jp/python_sample/blob/main/weblab_10b_instruction_sft_GPTQ_sample.ipynb)  

If the script below raises an error (for example, something is not found or not defined), please refer to the official documentation and the Colab samples, and pin a specific library version.

```
pip install auto-gptq
```

```
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Hugging Face repository and weight-file basename of the quantized model.
quantized_model_dir = "dahara1/weblab-10b-instruction-sft-GPTQ"
model_basename = "gptq_model-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

# Load the quantized weights onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
)

# Wrap the instruction in the instruction/response prompt template.
prompt_text = "スタジオジブリの作品を5つ教えてください"
prompt_template = f'以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{prompt_text}\n\n### 応答:'

# Tokenize on the GPU, then sample up to 100 new tokens.
tokens = tokenizer(prompt_template, return_tensors="pt").to("cuda:0").input_ids
output = model.generate(input_ids=tokens, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0]))
```
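
The decoded output of the script above contains the whole prompt followed by the answer. As a minimal sketch, the prompt template can be wrapped in a small helper, and the answer can be cut out after the `### 応答:` marker. The helper names `build_prompt` and `extract_response` are hypothetical, and the end-of-sequence string is an assumption: pass `tokenizer.eos_token` in practice.

```python
from typing import Optional

# Template text copied from the sample script in this README.
PROMPT_HEADER = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
RESPONSE_MARKER = "### 応答:"


def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the instruction/response template."""
    return f"{PROMPT_HEADER}\n\n### 指示:\n{instruction}\n\n{RESPONSE_MARKER}"


def extract_response(decoded: str, eos_token: Optional[str] = None) -> str:
    """Return only the text generated after the response marker.

    eos_token is an assumption about the tokenizer; pass tokenizer.eos_token
    so that anything after the end-of-sequence string is dropped.
    """
    _, marker, answer = decoded.partition(RESPONSE_MARKER)
    if not marker:
        # Marker not found: fall back to the full decoded text.
        answer = decoded
    if eos_token:
        answer = answer.split(eos_token, 1)[0]
    return answer.strip()
```

With these helpers, the last lines of the sample script could become `prompt_template = build_prompt(prompt_text)` and `print(extract_response(tokenizer.decode(output[0]), tokenizer.eos_token))`.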

### How to make finetune data (LoRA)

There is a LoRA fine-tuning code sample in the finetune_sample directory.  
Please read [README.md](https://huggingface.co/dahara1/weblab-10b-instruction-sft-GPTQ/blob/main/finetune_sample/README.md).

### Other AutoGPTQ documents
https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md

### Benchmark

The results below are preliminary. Blank cells are still being measured.  
The scores may also change with further tuning.

* **Japanese benchmark**

    - *We used [Stability-AI/lm-evaluation-harness + gakada's AutoGPTQ PR](https://github.com/webbigdata-jp/lm-evaluation-harness) for evaluation. ([Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable) + [gakada's AutoGPTQ PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/519))*
    - *The 4-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
    - *Models are loaded with gptq_use_triton=True, and evaluation uses prompt template version 0.3 with few-shot in-context learning.*
    - *The numbers of few-shot examples are 3, 3, 3, and 2.*
   
    | Model | Average | JCommonsenseQA | JNLI | MARC-ja | JSQuAD | Link / status |
    | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
    | weblab-10b | 66.38 | 65.86 | 54.19 | 84.49 | 60.98 | [original model](https://huggingface.co/matsuo-lab/weblab-10b) |
    | weblab-10b-instruction-sft | 78.78 | 74.35 | 65.65 | 96.06 | 79.04 | [original instruction model](https://huggingface.co/matsuo-lab/weblab-10b-instruction-sft) |
    | *weblab-10b-instruction-sft-GPTQ first tuning* | 69.72 | 74.53 | 41.70 | 89.95 | 72.69 | deleted |
    | *weblab-10b-instruction-sft-GPTQ second tuning* | 74.59 | 74.08 | 60.72 | 91.85 | 71.70 | deleted |
    | *weblab-10b-instruction-sft-GPTQ third tuning* | 77.62 | 73.19 | 69.26 | 95.91 | 72.10 | current model. replaced on August 26th |
    | *weblab-10b-instruction-sft-GPTQ 4th tuning* | - | - | 14.5 | 85.46 |  | abandoned |
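
The "Average" column is the plain mean of the four task scores. The sketch below recomputes it from the table values for the rows that have all four scores; the dictionary layout and function name are hypothetical. `Decimal` with half-up rounding is used so that values such as 78.775 round to 78.78 as in the table, which binary floats do not guarantee.

```python
from decimal import Decimal, ROUND_HALF_UP

# (reported average, [JCommonsenseQA, JNLI, MARC-ja, JSQuAD]) copied from the table.
SCORES = {
    "weblab-10b": ("66.38", ["65.86", "54.19", "84.49", "60.98"]),
    "weblab-10b-instruction-sft": ("78.78", ["74.35", "65.65", "96.06", "79.04"]),
    "GPTQ first tuning": ("69.72", ["74.53", "41.70", "89.95", "72.69"]),
    "GPTQ second tuning": ("74.59", ["74.08", "60.72", "91.85", "71.70"]),
    "GPTQ third tuning": ("77.62", ["73.19", "69.26", "95.91", "72.10"]),
}


def four_task_average(task_scores):
    """Mean of the task scores, rounded half-up to two decimal places."""
    total = sum(Decimal(score) for score in task_scores)
    mean = total / len(task_scores)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for model_name, (reported, task_scores) in SCORES.items():
    computed = four_task_average(task_scores)
    status = "matches" if computed == Decimal(reported) else "differs from"
    print(f"{model_name}: computed {computed} {status} reported {reported}")
```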


  
## About this work
- **This quantization work was done by:** [webbigdata](https://webbigdata.jp/).
- [Related documentation, such as fine-tuning (in Japanese), is here.](https://webbigdata.jp/post-20104/)