@@ -23,34 +23,31 @@ pipeline_tag: text-generation
 </h4>
 ## Introduction
-Megrez-3B is a large language model trained by [Infinigence AI](https://cloud.infini-ai.com/platform/ai). Megrez-3B aims to provide a fast inference, compact, and powerful edge-side intelligent solution through software-hardware co-design. Megrez-3B has the following advantages:
 1. High Accuracy: Megrez-3B compresses the capabilities of the previous 14-billion-parameter model into 3 billion parameters and achieves excellent performance on mainstream benchmarks.
 2. High Speed: A smaller model is not necessarily faster. Through software-hardware co-design, Megrez-3B stays highly compatible with mainstream hardware, yielding an inference speedup of up to 300% over previous models of the same accuracy.
 3. Easy to Use: Early on, we debated the model design: should we build a unique but efficient structure, or use a classic structure for ease of use? We chose the latter and adopted the original LLaMA structure, which lets developers deploy the model on various platforms without modification and minimizes the complexity of future development.
 4. Rich Applications: We provide a full-stack WebSearch solution. The model is trained specifically on web search tasks, so it can decide on its own when to invoke search and produce better summaries. The complete deployment code is released on [GitHub](https://github.com/infinigence/InfiniWebSearch).
 ## Model Card
 * Model name: Megrez-3B-Instruct
 * Architecture: Llama-2 with GQA
 * Context length: 32K tokens
 * Params (Total): 2.92B
-* Params (Backbone only, w/o Emb or Softmax):  2.29B
 * Vocab Size: 122880
 * Training data: 3T tokens
 * Supported languages: Chinese & English
-For text generation and other tasks that benefit from diversity, we recommend using the inference parameter temperature=0.7. For mathematical and reasoning tasks, we recommend using the inference parameter temperature=0.2.
 ## Performance
-We have evaluated Megrez-3B-Instruct using the open-source evaluation tool [OpenCompass](https://github.com/open-compass/opencompass) on several important benchmarks. Some of the evaluation results are shown in the table below.
 ### General Benchmarks
-|        Models         | chat model | # Non-Emb Params | Inference Speed (tokens/s) | C-EVAL | CMMLU | MMLU  | MMLU-Pro | HumanEval | MBPP  | GSM8K | MATH  |
 |:---------------------:|:--------:|:---------------:|:-------------------:|:------:|:-----:|:-----:|:--------:|:---------:|:-----:|:-----:|:-----:|
 | Megrez-3B-Instruct    |     Y    |       2.3       |       2329.4        |  84.8  | 74.7  | 72.8  |   46.1   |   78.7    | 71.0  | 65.5  | 28.3  |
 | Qwen2-1.5B            |          |       1.3       |       3299.5        |  70.6  | 70.3  | 56.5  |   21.8   |   31.1    | 37.4  | 58.5  | 21.7  |
@@ -74,10 +71,9 @@ We have evaluated Megrez-3B-Instruct using the open-source evaluation tool [Open
 - For details on the configuration for measuring speed, please refer to <a href="https://huggingface.co/Infinigence/Megrez-3B-Instruct/blob/main/README_SPEED.md">README_SPEED.md</a>
 ### Chat Benchmarks
 This table only includes models with official MT-Bench or AlignBench benchmarks.
-|       Models        | # Non-Emb Params | Inference Speed (tokens/s) | MT-Bench | AlignBench (ZH) |
 |---------------------|--------------------------------------|:--------------------------:|:--------:|:---------------:|
 | Megrez-3B-Instruct  |                 2.3                  |           2329.4           |   8.64   |      7.06       |
 | MiniCPM-2B-sft-bf16 |                 2.4                  |           1930.8           |     -    |      4.64       |
@@ -92,7 +88,6 @@ This table only includes models with official MT-Bench or AlignBench benchmarks.
 | Baichuan2-13B-Chat  |                 12.6                 |           756.7            |     -    |      5.25       |
 ### LLM Leaderboard
 |         Models         | # Non-Emb Params | Inference Speed (tokens/s) | IFEval |  BBH  | ARC_C | HellaSwag | WinoGrande | TriviaQA |
 |-----------------------|--------------------------------------|:--------------------------:|:--------------------:|:----:|:-----:|:---------:|:----------:|:--------:|
 | Megrez-3B-Instruct    |                 2.3                  |           2329.4           |         68.6         | 72.6 |  95.6 |    83.9   |    78.8    |   81.6   |
@@ -109,36 +104,27 @@ This table only includes models with official MT-Bench or AlignBench benchmarks.
 |                        | single-document-qa | multi-document-qa | summarization | few-shot-learning | synthetic-tasks | code-completion  | Average |
 |------------------------|:------------------:|:-----------------:|:-------------:|:-----------------:|:---------------:|:----------------:|:-------:|
-| Megrez-3B-Instruct-32K |        39.67       |       55.53       |      24.51    |        62.52      |        68.5     |       66.73      |  52.91  |
-| GPT-3.5-Turbo-16k      |        50.5        |        33.7       |     21.25     |       48.15       |       54.1      |       54.1       |  43.63  |
-| ChatGLM3-6B-32k        |        51.3        |        45.7       |     23.65     |       55.05       |       56.2      |       56.2       |  48.02  |
-| InternLM2-Chat-7B-SFT  |        47.3        |        45.2       |      25.3     |        59.9       |       67.2      |       43.5       |  48.07  |
 #### Longbench-Chat
 |                          | Longbench-Chat |
 |--------------------------|----------------|
-| Megrez-3B-Instruct(32K)  | 4.98           |
 | Vicuna-7b-v1.5-16k       | 3.51           |
 | Mistral-7B-Instruct-v0.2 | 5.84           |
 | ChatGLM3-6B-128k         | 6.52           |
 | GLM-4-9B-Chat            | 7.72           |
 #### Needle In A Haystack - Pressure Test
-Regarding the Megrez-3B-Instruct model's performance under the 32K text retrieval stress test.
 ![NeedleTest](assets/needle_test.png)
-### Inference Speed
-The scatter plot of speed, accuracy and model size is roughly as follows:
-![MMLU](assets/mmlu_en.png)
-![MTBench](assets/chat_en.png)
 ## WebSearch
 We provide a full-stack WebSearch solution with the following advantages:
 1. Automatic search timing: The model switches between search and conversation on its own, without bias toward either.
 2. In-context understanding: The model generates reasonable search queries or processes search results based on multi-turn conversations.
 3. Structured output: Each conclusion is attributed to its source for easy verification.
@@ -148,15 +134,14 @@ Our model is functionally trained on web search tasks. Users can build their own
 ![WebSearchDemo](assets/websearch.gif)
 ## Quick Start
 ### Online Experience
 [MaaS](https://cloud.infini-ai.com/genstudio/model/mo-c73owqiotql7lozr) (recommended)
 ### Huggingface
-It is recommended to use the **temperature=0.2** parameter for inference to achieve better reasoning capabilities.
 ``` python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
@@ -188,17 +173,13 @@ responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0
 print(responses)
 ```
-### vLLM 推理
 - Installation
 ```bash
 # Install vLLM with CUDA 12.1.
 pip install vllm
 ```
 - Example code
 ```python
 # Shell command (run from a terminal, not inside Python): python inference/inference_vllm.py --model_path <hf_repo_path> --prompt_path prompts/prompt_demo.txt
 from transformers import AutoTokenizer
@@ -223,7 +204,6 @@ print(outputs[0].outputs[0].text)
 ```
 ## License and Statement
 - License: Our models are released under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0).
 - Hallucination: LLMs inherently suffer from hallucination issues. Users are advised not to fully trust the content generated by the model. If accurate outputs are required, we recommend utilizing our WebSearch framework as detailed in [InfiniWebSearch](https://github.com/paxionfull/InfiniWebSearch).
 - Mathematics & Reasoning: Small LLMs tend to produce more incorrect calculations or flawed reasoning chains in tasks like mathematics and reasoning. Notably, the softmax distribution of small LLMs is less sharp than that of larger LLMs, which makes them more prone to inconsistent reasoning results, especially at higher temperature settings. We recommend lowering the temperature or verifying results over multiple inference attempts for deterministic tasks such as mathematics and logical reasoning.

 </h4>
 ## Introduction
+Megrez-3B-Instruct is a large language model trained by [Infinigence AI](https://cloud.infini-ai.com/platform/ai). Megrez-3B aims to provide a fast, compact, and powerful edge-side intelligence solution through software-hardware co-design. Megrez-3B has the following advantages:
 1. High Accuracy: Megrez-3B compresses the capabilities of the previous 14-billion-parameter model into 3 billion parameters and achieves excellent performance on mainstream benchmarks.
 2. High Speed: A smaller model is not necessarily faster. Through software-hardware co-design, Megrez-3B stays highly compatible with mainstream hardware, yielding an inference speedup of up to 300% over previous models of the same accuracy.
 3. Easy to Use: Early on, we debated the model design: should we build a unique but efficient structure, or use a classic structure for ease of use? We chose the latter and adopted the original LLaMA structure, which lets developers deploy the model on various platforms without modification and minimizes the complexity of future development.
 4. Rich Applications: We provide a full-stack WebSearch solution. The model is trained specifically on web search tasks, so it can decide on its own when to invoke search and produce better summaries. The complete deployment code is released on [GitHub](https://github.com/infinigence/InfiniWebSearch).
 ## Model Card
 * Model name: Megrez-3B-Instruct
 * Architecture: Llama-2 with GQA
 * Context length: 32K tokens
 * Params (Total): 2.92B
+* Params (Backbone only, w/o Emb or Softmax): 2.29B
 * Vocab Size: 122880
 * Training data: 3T tokens
 * Supported languages: Chinese & English
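
The gap between the two parameter counts in the model card can be checked with a little arithmetic. The sketch below assumes the input embedding and the output (softmax) projection are two separate, untied matrices of shape vocab × hidden. That is an assumption, not something the card states, and the resulting hidden width is only an estimate limited by the rounding of the published figures.

```python
# Published figures from the model card.
TOTAL_PARAMS = 2.92e9      # Params (Total)
BACKBONE_PARAMS = 2.29e9   # Params (Backbone only, w/o Emb or Softmax)
VOCAB_SIZE = 122_880


def embedding_params(total: float, backbone: float) -> float:
    """Parameters that sit outside the backbone (embedding + softmax)."""
    return total - backbone


def implied_hidden_size(total: float, backbone: float, vocab: int,
                        num_matrices: int = 2) -> float:
    """Hidden width implied by the gap.

    ASSUMPTION: the gap consists of `num_matrices` untied
    (vocab x hidden) matrices. The model card does not state this.
    """
    return embedding_params(total, backbone) / (num_matrices * vocab)


emb = embedding_params(TOTAL_PARAMS, BACKBONE_PARAMS)
hidden = implied_hidden_size(TOTAL_PARAMS, BACKBONE_PARAMS, VOCAB_SIZE)

print(f"Embedding + softmax parameters: {emb / 1e9:.2f}B")
print(f"Implied hidden size (estimate): {hidden:.0f}")
```

Under that assumption, roughly 0.63B of the 2.92B total sits in the vocabulary matrices, which is why the tables report about 2.3B non-embedding parameters.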
 ## Performance
+We evaluated Megrez-3B-Instruct using the open-source evaluation tool [OpenCompass](https://github.com/open-compass/opencompass) on several important benchmarks. Some of the evaluation results are shown in the table below.
+The scatter plots below show speed, accuracy, and model size. The point size represents the number of model parameters.
+![MMLU](assets/mmlu_en.png)
+![MTBench](assets/chat_en.png)
 ### General Benchmarks
+|        Models         | chat model | # Non-Emb Params | Decode Speed (tokens/s) | C-EVAL | CMMLU | MMLU  | MMLU-Pro | HumanEval | MBPP  | GSM8K | MATH  |
 |:---------------------:|:--------:|:---------------:|:-------------------:|:------:|:-----:|:-----:|:--------:|:---------:|:-----:|:-----:|:-----:|
 | Megrez-3B-Instruct    |     Y    |       2.3       |       2329.4        |  84.8  | 74.7  | 72.8  |   46.1   |   78.7    | 71.0  | 65.5  | 28.3  |
 | Qwen2-1.5B            |          |       1.3       |       3299.5        |  70.6  | 70.3  | 56.5  |   21.8   |   31.1    | 37.4  | 58.5  | 21.7  |
 - For details on the configuration for measuring speed, please refer to <a href="https://huggingface.co/Infinigence/Megrez-3B-Instruct/blob/main/README_SPEED.md">README_SPEED.md</a>
 ### Chat Benchmarks
 This table only includes models with official MT-Bench or AlignBench benchmarks.
+|       Models        | # Non-Emb Params | Decode Speed (tokens/s) | MT-Bench | AlignBench (ZH) |
 |---------------------|--------------------------------------|:--------------------------:|:--------:|:---------------:|
 | Megrez-3B-Instruct  |                 2.3                  |           2329.4           |   8.64   |      7.06       |
 | MiniCPM-2B-sft-bf16 |                 2.4                  |           1930.8           |     -    |      4.64       |
 | Baichuan2-13B-Chat  |                 12.6                 |           756.7            |     -    |      5.25       |
 ### LLM Leaderboard
 |         Models         | # Non-Emb Params | Inference Speed (tokens/s) | IFEval |  BBH  | ARC_C | HellaSwag | WinoGrande | TriviaQA |
 |-----------------------|--------------------------------------|:--------------------------:|:--------------------:|:----:|:-----:|:---------:|:----------:|:--------:|
 | Megrez-3B-Instruct    |                 2.3                  |           2329.4           |         68.6         | 72.6 |  95.6 |    83.9   |    78.8    |   81.6   |
 |                        | single-document-qa | multi-document-qa | summarization | few-shot-learning | synthetic-tasks | code-completion  | Average |
 |------------------------|:------------------:|:-----------------:|:-------------:|:-----------------:|:---------------:|:----------------:|:-------:|
+| Megrez-3B-Instruct    |        39.7        |        55.5       |      24.5     |        62.5       |        68.5       |       66.7      |  52.9  |
+| GPT-3.5-Turbo-16k     |        50.5        |        33.7       |     21.25     |       48.15       |       54.1      |       54.1       |  43.63  |
+| ChatGLM3-6B-32k       |        51.3        |        45.7       |     23.65     |       55.05       |       56.2      |       56.2       |  48.02  |
+| InternLM2-Chat-7B-SFT |        47.3        |        45.2       |      25.3     |        59.9       |       67.2      |       43.5       |  48.07  |
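
The Average column in the table above appears to be the unweighted mean of the six task columns. This is inferred from the numbers, not stated in the source. A minimal sketch that checks the reported averages against the per-task scores copied from the table:

```python
from statistics import mean

# Per-task scores and reported averages, copied from the table above.
LONGBENCH = {
    "Megrez-3B-Instruct": ([39.7, 55.5, 24.5, 62.5, 68.5, 66.7], 52.9),
    "GPT-3.5-Turbo-16k": ([50.5, 33.7, 21.25, 48.15, 54.1, 54.1], 43.63),
    "ChatGLM3-6B-32k": ([51.3, 45.7, 23.65, 55.05, 56.2, 56.2], 48.02),
    "InternLM2-Chat-7B-SFT": ([47.3, 45.2, 25.3, 59.9, 67.2, 43.5], 48.07),
}


def average_matches(scores: list[float], reported: float,
                    tol: float = 0.01) -> bool:
    """True if the unweighted mean agrees with the reported average.

    `tol` absorbs the rounding of the published figures.
    """
    return abs(mean(scores) - reported) <= tol


for model, (scores, reported) in LONGBENCH.items():
    status = "ok" if average_matches(scores, reported) else "mismatch"
    print(f"{model}: computed {mean(scores):.2f}, reported {reported} ({status})")
```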
 #### Longbench-Chat
 |                          | Longbench-Chat |
 |--------------------------|----------------|
+| Megrez-3B-Instruct       | 4.98           |
 | Vicuna-7b-v1.5-16k       | 3.51           |
 | Mistral-7B-Instruct-v0.2 | 5.84           |
 | ChatGLM3-6B-128k         | 6.52           |
 | GLM-4-9B-Chat            | 7.72           |
 #### Needle In A Haystack - Pressure Test
+Megrez-3B-Instruct passes the 32K text retrieval stress test.
 ![NeedleTest](assets/needle_test.png)
 ## WebSearch
 We provide a full-stack WebSearch solution with the following advantages:
 1. Automatic search timing: The model switches between search and conversation on its own, without bias toward either.
 2. In-context understanding: The model generates reasonable search queries or processes search results based on multi-turn conversations.
 3. Structured output: Each conclusion is attributed to its source for easy verification.
 ![WebSearchDemo](assets/websearch.gif)
 ## Quick Start
 ### Online Experience
 [MaaS](https://cloud.infini-ai.com/genstudio/model/mo-c73owqiotql7lozr) (recommended)
+### Inference Parameters
+- For chat, text generation, and other tasks that benefit from diversity, we recommend using temperature=0.7.
+- For mathematical and reasoning tasks, we recommend using temperature=0.2 for more deterministic output.
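
The two recommendations above can be wrapped in a small helper. This is only a sketch: the task labels and function names are hypothetical and not part of the repository. The resulting dictionary uses the `do_sample` and `temperature` keyword names accepted by the `transformers` `generate` method.

```python
# Temperatures recommended in the list above.
DIVERSE_TEMPERATURE = 0.7    # chat, text generation, diversity-friendly tasks
REASONING_TEMPERATURE = 0.2  # mathematical and reasoning tasks

# Hypothetical task labels, for illustration only.
REASONING_TASKS = {"math", "reasoning"}


def recommended_temperature(task: str) -> float:
    """Return the recommended temperature for a task label."""
    if task.lower() in REASONING_TASKS:
        return REASONING_TEMPERATURE
    return DIVERSE_TEMPERATURE


def sampling_kwargs(task: str) -> dict:
    """Build sampling keyword arguments for a generation call."""
    return {"do_sample": True, "temperature": recommended_temperature(task)}


print(sampling_kwargs("math"))
print(sampling_kwargs("chat"))
```

The dictionary could then be unpacked into a generation call, for example `model.generate(**inputs, **sampling_kwargs("math"))`.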
 ### Huggingface
 ``` python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 print(responses)
 ```
+### vLLM Inference
 - Installation
 ```bash
 # Install vLLM with CUDA 12.1.
 pip install vllm
 ```
 - Example code
 ```python
 # Shell command (run from a terminal, not inside Python): python inference/inference_vllm.py --model_path <hf_repo_path> --prompt_path prompts/prompt_demo.txt
 from transformers import AutoTokenizer
 ```
 ## License and Statement
 - License: Our models are released under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0).
 - Hallucination: LLMs inherently suffer from hallucination issues. Users are advised not to fully trust the content generated by the model. If accurate outputs are required, we recommend utilizing our WebSearch framework as detailed in [InfiniWebSearch](https://github.com/paxionfull/InfiniWebSearch).
 - Mathematics & Reasoning: Small LLMs tend to produce more incorrect calculations or flawed reasoning chains in tasks like mathematics and reasoning. Notably, the softmax distribution of small LLMs is less sharp than that of larger LLMs, which makes them more prone to inconsistent reasoning results, especially at higher temperature settings. We recommend lowering the temperature or verifying results over multiple inference attempts for deterministic tasks such as mathematics and logical reasoning.