dpxlbx committed bde56ab (1 parent: 2742582)

Update README_EN.md

Files changed (1): README_EN.md (+18 −38)

README_EN.md (updated sections):
## Introduction

Megrez-3B-Instruct is a large language model trained by [Infinigence AI](https://cloud.infini-ai.com/platform/ai). Megrez-3B aims to provide a fast, compact, and powerful edge-side intelligent solution through software-hardware co-design. It has the following advantages:

1. High Accuracy: Megrez-3B successfully compresses the capabilities of a previous 14-billion-parameter model into a 3-billion-parameter one, and achieves excellent performance on mainstream benchmarks.
2. High Speed: A smaller model does not necessarily bring faster speed. Megrez-3B ensures a high degree of compatibility with mainstream hardware through software-hardware co-design, delivering an inference speedup of up to 300% over previous models of the same accuracy.
3. Easy to Use: At the design stage we debated whether to pursue a unique but efficient model structure or a classic structure for ease of use. We chose the latter and adopted the original LLaMA structure, which lets developers deploy the model on various platforms without any modifications and minimizes the complexity of future development.
4. Rich Applications: We provide a full-stack WebSearch solution. Our model is functionally trained on web search tasks, enabling it to automatically determine the timing of search invocations and provide better summarization results. The complete deployment code is released on [GitHub](https://github.com/infinigence/InfiniWebSearch).
## Model Card

* Model name: Megrez-3B-Instruct
* Architecture: Llama-2 with GQA
* Context length: 32K tokens
* Params (Total): 2.92B
* Params (Backbone only, w/o Emb or Softmax): 2.29B
* Vocab size: 122,880
* Training data: 3T tokens
* Supported languages: Chinese & English
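
Since the architecture is standard Llama-2, these numbers can be sanity-checked against the released configuration with plain `transformers`. The snippet below is a minimal sketch, assuming only the public `AutoConfig`/`AutoTokenizer` APIs and the `Infinigence/Megrez-3B-Instruct` repo id used elsewhere in this README:

```python
from transformers import AutoConfig, AutoTokenizer

repo = "Infinigence/Megrez-3B-Instruct"  # repo id taken from the links in this README

# A plain Llama-type config with GQA (num_key_value_heads < num_attention_heads)
# and a 32K context window should match the Model Card above.
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
print(config.model_type)                # expected: "llama"
print(config.max_position_embeddings)   # expected: 32768 (32K tokens)
print(config.num_attention_heads, config.num_key_value_heads)

# The tokenizer should expose the 122,880-entry vocabulary listed above.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
print(tokenizer.vocab_size)             # expected: 122880
```
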
## Performance

We evaluated Megrez-3B-Instruct using the open-source evaluation tool [OpenCompass](https://github.com/open-compass/opencompass) on several important benchmarks. Some of the results are shown in the tables below.

The scatter plots below relate decode speed, accuracy, and model size; point size represents the number of model parameters.

![MMLU](assets/mmlu_en.png)
![MTBench](assets/chat_en.png)
### General Benchmarks

| Models | Chat Model | # Non-Emb Params (B) | Decode Speed (tokens/s) | C-EVAL | CMMLU | MMLU | MMLU-Pro | HumanEval | MBPP | GSM8K | MATH |
|:---------------------:|:--------:|:---------------:|:-------------------:|:------:|:-----:|:-----:|:--------:|:---------:|:-----:|:-----:|:-----:|
| Megrez-3B-Instruct | Y | 2.3 | 2329.4 | 84.8 | 74.7 | 72.8 | 46.1 | 78.7 | 71.0 | 65.5 | 28.3 |
| Qwen2-1.5B | | 1.3 | 3299.5 | 70.6 | 70.3 | 56.5 | 21.8 | 31.1 | 37.4 | 58.5 | 21.7 |

- For details of the configuration used to measure speed, please refer to <a href="https://huggingface.co/Infinigence/Megrez-3B-Instruct/blob/main/README_SPEED.md">README_SPEED.md</a>.
### Chat Benchmarks

This table only includes models with officially reported MT-Bench or AlignBench scores.

| Models | # Non-Emb Params (B) | Decode Speed (tokens/s) | MT-Bench | AlignBench (ZH) |
|---------------------|--------------------------------------|:--------------------------:|:--------:|:---------------:|
| Megrez-3B-Instruct | 2.3 | 2329.4 | 8.64 | 7.06 |
| MiniCPM-2B-sft-bf16 | 2.4 | 1930.8 | - | 4.64 |
| Baichuan2-13B-Chat | 12.6 | 756.7 | - | 5.25 |
### LLM Leaderboard

| Models | # Non-Emb Params (B) | Decode Speed (tokens/s) | IFEval | BBH | ARC_C | HellaSwag | WinoGrande | TriviaQA |
|-----------------------|--------------------------------------|:--------------------------:|:--------------------:|:----:|:-----:|:---------:|:----------:|:--------:|
| Megrez-3B-Instruct | 2.3 | 2329.4 | 68.6 | 72.6 | 95.6 | 83.9 | 78.8 | 81.6 |
#### LongBench

| | single-document-qa | multi-document-qa | summarization | few-shot-learning | synthetic-tasks | code-completion | Average |
|------------------------|:------------------:|:-----------------:|:-------------:|:-----------------:|:---------------:|:----------------:|:-------:|
| Megrez-3B-Instruct | 39.7 | 55.5 | 24.5 | 62.5 | 68.5 | 66.7 | 52.9 |
| GPT-3.5-Turbo-16k | 50.5 | 33.7 | 21.25 | 48.15 | 54.1 | 54.1 | 43.63 |
| ChatGLM3-6B-32k | 51.3 | 45.7 | 23.65 | 55.05 | 56.2 | 56.2 | 48.02 |
| InternLM2-Chat-7B-SFT | 47.3 | 45.2 | 25.3 | 59.9 | 67.2 | 43.5 | 48.07 |
#### Longbench-Chat

| | Longbench-Chat |
|--------------------------|----------------|
| Megrez-3B-Instruct | 4.98 |
| Vicuna-7b-v1.5-16k | 3.51 |
| Mistral-7B-Instruct-v0.2 | 5.84 |
| ChatGLM3-6B-128k | 6.52 |
| GLM-4-9B-Chat | 7.72 |
#### Needle In A Haystack - Pressure Test

Megrez-3B-Instruct passes the 32K text retrieval stress test.

![NeedleTest](assets/needle_test.png)
## WebSearch

We provide a full-stack WebSearch solution with the following advantages:

1. Automatic invocation timing: the model switches between search and plain conversation on its own, without being biased toward either mode.
2. In-context understanding: it generates reasonable search queries and processes search results based on multi-turn conversations.
3. Structured output: each conclusion is attributed to its source for easy verification.

Our model is functionally trained on web search tasks, and users can build their own search-augmented applications on top of it, as sketched below.
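
As an illustration of the invocation-timing idea, here is a deliberately simplified driver loop. It is not the InfiniWebSearch implementation; the `[SEARCH]` marker, the `generate` wrapper, and the `web_search` helper are hypothetical stand-ins, and the real protocol lives in the [InfiniWebSearch](https://github.com/infinigence/InfiniWebSearch) repository:

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def chat_with_optional_search(
    generate: Callable[[List[Message]], str],  # wraps any chat-style model API
    web_search: Callable[[str], str],          # hypothetical search backend
    messages: List[Message],
) -> str:
    """Let the model decide whether a web search is needed before answering."""
    draft = generate(messages)
    if draft.startswith("[SEARCH]"):
        # The model judged that fresh information is required and emitted a query.
        query = draft.removeprefix("[SEARCH]").strip()
        results = web_search(query)
        # Feed the retrieved snippets back so the model can summarize them and
        # attribute each conclusion to its source.
        messages = messages + [
            {"role": "assistant", "content": draft},
            {"role": "user", "content": f"Search results:\n{results}\nAnswer with citations."},
        ]
        return generate(messages)
    return draft  # no search needed; answer from the conversation alone
```
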
 
![WebSearchDemo](assets/websearch.gif)

## Quick Start

### Online Experience

[MaaS](https://cloud.infini-ai.com/genstudio/model/mo-c73owqiotql7lozr) (recommended)

### Inference Parameters

- For chat, text generation, and other tasks that benefit from diversity, we recommend using temperature=0.7.
- For mathematical and reasoning tasks, we recommend using temperature=0.2 for more deterministic output (see the presets sketched below).
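
In `transformers`, these recommendations map directly onto the sampling arguments of `generate`. A minimal sketch; the preset names and the `top_p` value are illustrative, not part of the original guidance:

```python
# Sampling presets following the recommendations above.
CREATIVE = dict(do_sample=True, temperature=0.7, top_p=0.9)  # chat / text generation
PRECISE = dict(do_sample=True, temperature=0.2, top_p=0.9)   # math / reasoning

# Usage with the Hugging Face example below:
#   model.generate(input_ids, max_new_tokens=1024, **PRECISE)
```
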
### Hugging Face

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# ...
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```
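
The hunk above shows only the beginning and end of the snippet. As a self-contained starting point, here is a minimal sketch of the same flow, assuming the standard `transformers` chat-template API; the prompt text and generation settings are illustrative rather than taken from the original file:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "Infinigence/Megrez-3B-Instruct"  # repo id from the links in this README

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain the basic concept of a black hole."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# temperature=0.2 follows the Inference Parameters guidance for reasoning tasks.
output_ids = model.generate(
    input_ids, max_new_tokens=1024, do_sample=True, temperature=0.2, top_p=0.9
)

# Strip the prompt tokens, then decode; this matches the visible tail of the
# original snippet (output_token_ids / responses).
output_token_ids = [output_ids[i][input_ids.shape[1]:] for i in range(len(output_ids))]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```
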
### vLLM Inference

- Installation

```bash
# Install vLLM with CUDA 12.1.
pip install vllm
```

- Example code

```python
# Usage: python inference/inference_vllm.py --model_path <hf_repo_path> --prompt_path prompts/prompt_demo.txt
from transformers import AutoTokenizer
# ...
print(outputs[0].outputs[0].text)
```
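
The body of `inference/inference_vllm.py` is elided above. For orientation, here is a minimal self-contained sketch of offline inference with vLLM's `LLM`/`SamplingParams` API; the prompt and sampling settings are illustrative:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

path = "Infinigence/Megrez-3B-Instruct"  # repo id from the links in this README

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
llm = LLM(model=path, trust_remote_code=True)

# Render the chat template to plain text, since vLLM's generate() takes string prompts.
messages = [{"role": "user", "content": "Explain the basic concept of a black hole."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

sampling_params = SamplingParams(temperature=0.2, top_p=0.9, max_tokens=1024)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)  # same access pattern as the original snippet
```
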
## License and Statement

- License: Our models are released under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0).
- Hallucination: LLMs inherently suffer from hallucination issues. Users are advised not to fully trust the content generated by the model. If accurate outputs are required, we recommend utilizing our WebSearch framework, as detailed in [InfiniWebSearch](https://github.com/paxionfull/InfiniWebSearch).
- Mathematics & Reasoning: Small LLMs tend to produce more incorrect calculations or flawed reasoning chains in tasks like mathematics and reasoning. Notably, the softmax distribution of small LLMs is less sharp than that of larger LLMs, which makes them more sensitive to sampling noise, especially at higher temperatures. For deterministic tasks such as mathematics and logical reasoning, we recommend lowering the temperature or verifying results across multiple inference attempts, as sketched below.
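
The multiple-attempt verification suggested above is essentially self-consistency sampling. A minimal sketch; `generate_answer` is a hypothetical wrapper around any of the inference paths shown earlier:

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(
    generate_answer: Callable[[str], str],  # hypothetical wrapper around model inference
    question: str,
    n_samples: int = 5,
) -> str:
    """Sample several reasoning attempts and return the majority answer.

    For math-style tasks, sampling a few times at a low but nonzero temperature
    and taking a majority vote filters out occasional flawed reasoning chains.
    """
    answers = [generate_answer(question).strip() for _ in range(n_samples)]
    majority, _ = Counter(answers).most_common(1)[0]
    return majority
```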