Update README.md
README.md CHANGED
@@ -19,68 +19,18 @@ Take Hammer2.0-7b as an example, it is a fine-tuned model based on [Qwen2.5-Code
Thanks so much for your attention; a report with all the technical details behind our models will be published soon.

## Evaluation
-| 7 | 55.82 | mistral-large-2407 (FC) | 84.12 | 57.5 | 94 | 93 | 92 | 83.09 | 76.86 | 92 | 86 | 77.5 | 67.17 | 79.07 | 78.88 | 87.5 | 75 | 20.5 | 29 | 13 | 19.5 | 20.5 | N/A | 78.05 | 48.93 | Mistral AI | Proprietary |
-| 8 | 55.67 | GPT-4-turbo-2024-04-09 (Prompt) | 91.31 | 82.25 | 94.5 | 95 | 93.5 | 88.12 | 99 | 96 | 80 | 77.5 | 67.97 | 78.68 | 83.12 | 81.25 | 75 | 10.62 | 12.5 | 5.5 | 11 | 13.5 | N/A | 82.93 | 61.82 | OpenAI | Proprietary |
-| 9 | 54.83 | Claude-3.5-Sonnet-20240620 (FC) | 70.35 | 75.42 | 93.5 | 62 | 50.5 | 66.34 | 95.36 | 86 | 44 | 40 | 71.39 | 72.48 | 70.68 | 68.75 | 75 | 23.5 | 30.5 | 8 | 27 | 28.5 | N/A | 63.41 | 75.91 | Anthropic | Proprietary |
-| 10 | 53.66 | GPT-4o-2024-08-06 (Prompt) | 80.9 | 64.08 | 86.5 | 88 | 85 | 77.89 | 70.57 | 88 | 78 | 75 | 73.88 | 67.44 | 67.21 | 56.25 | 58.33 | 6.12 | 9 | 1 | 7.5 | 7 | N/A | 53.66 | 89.56 | OpenAI | Proprietary |
-| 11 | 53.43 | o1-mini-2024-09-12 (Prompt) | 75.48 | 68.92 | 89 | 73.5 | 70.5 | 76.86 | 78.93 | 88 | 78 | 62.5 | 71.17 | 62.79 | 65.09 | 68.75 | 58.33 | 11 | 16 | 2 | 12.5 | 13.5 | N/A | 46.34 | 88.07 | OpenAI | Proprietary |
-| 12 | 53.01 | Gemini-1.5-Flash-Preview-0514 (FC) | 77.1 | 65.42 | 94.5 | 71.5 | 77 | 71.23 | 57.93 | 84 | 78 | 65 | 71.17 | 62.79 | 72.61 | 56.25 | 54.17 | 13.12 | 17.5 | 4 | 15.5 | 15.5 | N/A | 60.98 | 76.15 | Google | Proprietary |
-| 13 | 52.53 | Gemini-1.5-Pro-Preview-0514 (FC) | 75.54 | 50.17 | 89.5 | 83.5 | 79 | 77.46 | 71.86 | 86 | 82 | 70 | 69.26 | 60.08 | 66.35 | 75 | 54.17 | 10.87 | 15.5 | 1.5 | 11 | 15.5 | N/A | 60.98 | 80.56 | Google | Proprietary |
-| | 51.94 | MadeAgents/Hammer2.0-1.5b (FC) | 84.31 | 75.25 | 92.5 | 87.5 | 82 | 81.8 | 83.71 | 90 | 86 | 67.5 | 63.17 | 64.73 | 67.31 | 50 | 66.67 | 11.38 | 14 | 7 | 12 | 12.5 | N/A | 92.68 | 61.83 | MadeAgents | cc-by-nc-4.0 |
-| 14 | 51.93 | GPT-3.5-Turbo-0125 (FC) | 84.52 | 74.08 | 93 | 87.5 | 83.5 | 81.66 | 95.14 | 88 | 86 | 57.5 | 59 | 65.5 | 74.16 | 56.25 | 54.17 | 19.12 | 30 | 7.5 | 23 | 16 | N/A | 97.56 | 35.83 | OpenAI | Proprietary |
-| 15 | 51.78 | FireFunction-v2 (FC) | 85.71 | 78.83 | 92 | 91 | 81 | 84.23 | 94.43 | 88 | 82 | 72.5 | 61.71 | 69.38 | 70.97 | 56.25 | 54.17 | 11.62 | 21.5 | 1.5 | 17.5 | 6 | N/A | 87.8 | 52.94 | Fireworks | Apache 2.0 |
-| 16 | 51.78 | Open-Mistral-Nemo-2407 (FC) | 80.98 | 60.92 | 92 | 85.5 | 85.5 | 81.46 | 91.36 | 86 | 86 | 62.5 | 61.44 | 68.22 | 67.98 | 75 | 62.5 | 14.25 | 21 | 10 | 13.5 | 12.5 | N/A | 65.85 | 59.14 | Mistral AI | Proprietary |
-| 17 | 51.45 | xLAM-7b-fc-r (FC) | 86.83 | 77.33 | 92.5 | 91.5 | 86 | 85.02 | 91.57 | 88 | 88 | 72.5 | 68.81 | 63.57 | 63.36 | 56.25 | 50 | 0 | 0 | 0 | 0 | 0 | N/A | 80.49 | 79.76 | Salesforce | cc-by-nc-4.0 |
-| 18 | 51.01 | Gorilla-OpenFunctions-v2 (FC) | 87.29 | 77.67 | 95 | 89 | 87.5 | 84.96 | 95.86 | 96 | 78 | 70 | 68.59 | 63.95 | 63.93 | 62.5 | 45.83 | 0 | 0 | 0 | 0 | 0 | N/A | 85.37 | 73.13 | Gorilla LLM | Apache 2.0 |
-| | 49.88 | MadeAgents/Hammer2.0-3b (FC) | 86.77 | 77.08 | 92.5 | 89.5 | 88 | 80.25 | 81.5 | 86 | 86 | 67.5 | 66.06 | 63.95 | 72.81 | 56.25 | 66.67 | 0.5 | 1 | 0 | 0.5 | 0.5 | N/A | 92.68 | 68.59 | MadeAgents | cc-by-nc-4.0 |
-| 19 | 49.63 | Claude-3-Opus-20240229 (FC tools-2024-04-04) | 58.4 | 74.08 | 89.5 | 35 | 35 | 63.16 | 84.64 | 86 | 52 | 30 | 70.5 | 64.73 | 70.4 | 43.75 | 20.83 | 15.62 | 22 | 4 | 14.5 | 22 | N/A | 73.17 | 76.4 | Anthropic | Proprietary |
-| 20 | 49.55 | Meta-Llama-3-70B-Instruct (Prompt) | 87.21 | 75.83 | 94.5 | 91.5 | 87 | 87.41 | 94.14 | 94 | 84 | 77.5 | 63.39 | 69.77 | 78.01 | 75 | 66.67 | 1.12 | 1.5 | 1.5 | 1 | 0.5 | N/A | 92.68 | 50.63 | Meta | Meta Llama 3 Community |
-| 21 | 48.14 | Command-R-Plus (Prompt) (Original) | 75.54 | 71.17 | 85 | 80 | 66 | 77.57 | 91.29 | 86 | 78 | 55 | 67.88 | 65.12 | 71.26 | 75 | 58.33 | 0.25 | 0.5 | 0 | 0 | 0.5 | N/A | 75.61 | 69.31 | Cohere For AI | cc-by-nc-4.0 |
-| 22 | 47.66 | Granite-20b-FunctionCalling (FC) | 82.67 | 73.17 | 92 | 84 | 81.5 | 82.96 | 85.36 | 90 | 84 | 72.5 | 55.89 | 57.36 | 54.1 | 37.5 | 54.17 | 3.63 | 4.5 | 1.5 | 3.5 | 5 | N/A | 95.12 | 72.43 | IBM | Apache-2.0 |
-| 23 | 45.88 | Hermes-2-Pro-Llama-3-70B (FC) | 81.73 | 65.92 | 80.5 | 90.5 | 90 | 81.29 | 80.64 | 88 | 84 | 72.5 | 58.6 | 66.67 | 62.49 | 50 | 66.67 | 0.25 | 0.5 | 0 | 0 | 0.5 | N/A | 80.49 | 53.8 | NousResearch | apache-2.0 |
-| 24 | 45.4 | xLAM-1b-fc-r (FC) | 79.17 | 73.17 | 89.5 | 77.5 | 76.5 | 80.5 | 78 | 88 | 86 | 70 | 57.57 | 56.59 | 56.12 | 50 | 58.33 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | N/A | 95.12 | 61.26 | Salesforce | cc-by-nc-4.0 |
-| 25 | 45.22 | Command-R-Plus (FC) (Original) | 77.65 | 69.58 | 88 | 82.5 | 70.5 | 77.41 | 89.14 | 86 | 82 | 52.5 | 54.24 | 58.91 | 56.89 | 50 | 54.17 | 6.12 | 9.5 | 0 | 6.5 | 8.5 | N/A | 92.68 | 52.75 | Cohere For AI | cc-by-nc-4.0 |
-| 26 | 44.28 | Hermes-2-Pro-Llama-3-8B (FC) | 77.17 | 64.17 | 91 | 79.5 | 74 | 74.05 | 68.71 | 90 | 80 | 57.5 | 57.8 | 60.47 | 58.92 | 43.75 | 41.67 | 1.88 | 2.5 | 0.5 | 2.5 | 2 | N/A | 53.66 | 55.16 | NousResearch | apache-2.0 |
-| 27 | 44.23 | Hermes-2-Pro-Mistral-7B (FC) | 73.17 | 62.67 | 85.5 | 77 | 67.5 | 74.25 | 60.5 | 90 | 84 | 62.5 | 54.11 | 59.3 | 57.47 | 43.75 | 33.33 | 9.88 | 12 | 6.5 | 10 | 11 | N/A | 75.61 | 38.55 | NousResearch | apache-2.0 |
-| 28 | 43.9 | Hermes-2-Theta-Llama-3-8B (FC) | 73.56 | 61.25 | 82.5 | 75.5 | 75 | 72.54 | 69.14 | 88 | 78 | 55 | 59.57 | 55.81 | 53.13 | 43.75 | 41.67 | 1 | 1.5 | 0 | 1 | 1.5 | N/A | 51.22 | 62.66 | NousResearch | apache-2.0 |
-| 29 | 43 | Open-Mixtral-8x22b (FC) | 56.12 | 50.5 | 95 | 8.5 | 70.5 | 59.7 | 77.79 | 92 | 24 | 45 | 65.3 | 68.99 | 70.49 | 12.5 | 54.17 | 8.88 | 12.5 | 6.5 | 8 | 8.5 | N/A | 85.37 | 44.2 | Mistral AI | Proprietary |
-| | 39.51 | MadeAgents/Hammer2.0-0.5b (FC) | 67 | 62 | 80 | 68 | 58 | 65.73 | 48.43 | 82 | 80 | 52.5 | 51.62 | 47.67 | 42.14 | 50 | 37.5 | 0 | 0 | 0 | 0 | 0 | N/A | 87.8 | 67 | MadeAgents | cc-by-nc-4.0 |
-| 30 | 38.39 | Claude-3-Haiku-20240307 (Prompt) | 62.52 | 77.58 | 93 | 47.5 | 32 | 60.73 | 89.43 | 94 | 32 | 27.5 | 58.06 | 71.71 | 75.99 | 56.25 | 58.33 | 1.62 | 2.5 | 0.5 | 1 | 2.5 | N/A | 85.37 | 18.9 | Anthropic | Proprietary |
-| 31 | 37.77 | Claude-3-Haiku-20240307 (FC tools-2024-04-04) | 42.42 | 74.17 | 93 | 2 | 0.5 | 47.16 | 90.64 | 92 | 6 | 0 | 51.98 | 71.32 | 64.9 | 0 | 4.17 | 18.5 | 25 | 6.5 | 24 | 18.5 | N/A | 97.56 | 29.08 | Anthropic | Proprietary |
-| 32 | 16.66 | Hermes-2-Theta-Llama-3-70B (FC) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 38.87 | | | | | | | | | | | | | | |
-
-In addition, we evaluated our Hammer2.0 series (0.5b, 1.5b, 3b, 7b) on other academic benchmarks to further show our model's generalization ability:
-
-| Model | Size | API-Bank L-1 (Func-Name F1) | API-Bank L-1 (Args F1) | API-Bank L-2 (Func-Name F1) | API-Bank L-2 (Args F1) | Tool-Alpaca (Func-Name F1) | Tool-Alpaca (Args F1) | SealTool Single-Tool (Func-Name F1) | SealTool Single-Tool (Args F1) | Nexus Raven (Func-Name F1) | Nexus Raven (Args F1) | Avg. Func-Name F1 | Avg. Args F1 |
-|:---------------------------:|:----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
-| GPT-4o-mini (Prompt) | -- | 95.1% | 89.3% | 84.3% | 67.5% | 64.3% | 54.7% | 87.9% | 86.0% | 91.7% | 84.6% | 84.7% | 76.4% |
-| qwen2-7b-instruct | 7B | 81.5% | 60.6% | 95.7% | 49.5% | 71.6% | 48.1% | 93.9% | 77.5% | 87.1% | 63.5% | 85.9% | 59.8% |
-| qwen1.5-4b-Chat | 4B | 55.3% | 59.8% | 46.7% | 38.5% | 35.4% | 17.0% | 48.4% | 62.3% | 29.0% | 33.7% | 43.0% | 42.2% |
-| qwen2-1.5b-instruct | 1.5B | 74.6% | 63.6% | 57.7% | 33.6% | 65.8% | 45.2% | 82.1% | 75.5% | 70.6% | 45.5% | 70.2% | 52.7% |
-| Gorilla-openfunctions-v2 | 7B | 69.2% | 70.3% | 48.8% | 54.7% | 72.9% | 51.3% | 93.2% | 91.1% | 72.8% | 68.4% | 71.4% | 67.2% |
-| GRANITE-20B-FUNCTIONCALLING | 20B | 90.4% | 77.8% | 78.9% | 59.2% | 77.3% | 58.0% | 94.9% | 92.7% | 94.5% | 75.1% | 87.2% | 72.6% |
-| xlam-7b-fc-r | 7B | 90.0% | 80.7% | 72.5% | 64.2% | 67.3% | 59.0% | 79.0% | 76.9% | 54.1% | 57.5% | 72.6% | 67.7% |
-| xlam-1b-fc-r | 1.3B | 94.9% | 83.7% | 91.8% | 64.3% | 64.9% | 50.6% | 90.7% | 80.4% | 64.4% | 54.8% | 81.3% | 66.8% |
-| Hammer-7b | 7B | 93.5% | 85.8% | 82.9% | 66.4% | 82.3% | 59.9% | 97.4% | 91.7% | 92.5% | 77.4% | 89.7% | 76.2% |
-| Hammer-4b | 4B | 91.6% | 81.5% | 77.6% | 61.0% | 85.1% | 57.0% | 96.4% | 92.4% | 81.7% | 64.9% | 86.5% | 71.4% |
-| Hammer-1.5b | 1.5B | 82.1% | 72.3% | 79.8% | 59.7% | 80.9% | 53.5% | 95.6% | 88.6% | 79.9% | 56.9% | 83.7% | 66.2% |
-| Hammer2.0-0.5B | 0.5B | 81.2% | 67.8% | 62.9% | 52.0% | 79.1% | 50.9% | 94.9% | 83.8% | 74.7% | 49.0% | 78.5% | 60.7% |
-| Hammer2.0-1.5B | 1.5B | 90.2% | 80.4% | 82.9% | 63.8% | 86.2% | 59.5% | 97.5% | 92.5% | 86.4% | 65.5% | 88.6% | 72.4% |
-| Hammer2.0-3B | 3B | 93.6% | 84.3% | 83.7% | 59.0% | 83.1% | 58.8% | 95.3% | 91.2% | 92.5% | 70.5% | 89.6% | 72.8% |
-| Hammer2.0-7B | 7B | 91.0% | 82.1% | 82.5% | 65.1% | 85.2% | 59.6% | 96.8% | 92.7% | 93.0% | 80.5% | 89.7% | 76.0% |
+The evaluation results of the Hammer 2.0 series on the Berkeley Function-Calling Leaderboard (BFCL) are presented in the following table:
+<div style="text-align: center;">
+ <img src="v2_figures/bfcl.PNG" alt="overview" width="1000" style="margin: auto;">
+</div>
+
+In addition, we evaluated Hammer2.0 on other academic benchmarks to further show our models' generalization ability:
+<div style="text-align: center;">
+ <img src="v2_figures/others.PNG" alt="overview" width="1000" style="margin: auto;">
+</div>
+
+In comparison, Hammer 2.0 outperforms models of similar size and even surpasses many larger models overall.
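The generalization benchmarks above report two F1 scores per benchmark: one for predicting the right function name and one for predicting its arguments. As a toy illustration only (this is not the evaluation code behind the reported numbers, whose exact protocol will be described in the forthcoming report), per-call name and argument F1 could be computed along these lines:

```python
# Toy sketch of function-name and argument F1 -- illustrative only,
# not the authors' evaluation protocol.

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def score_calls(pred: list[dict], gold: list[dict]) -> tuple[float, float]:
    # Function-name F1: match predicted call names against gold call names.
    pred_names = [c["name"] for c in pred]
    gold_names = [c["name"] for c in gold]
    name_tp = sum(min(pred_names.count(n), gold_names.count(n)) for n in set(pred_names))
    name_f1 = f1(name_tp, len(pred_names) - name_tp, len(gold_names) - name_tp)

    # Argument F1: match (function, key, value) triples, so a call with one
    # wrong argument value still gets partial credit.
    pred_args = {(c["name"], k, str(v)) for c in pred for k, v in c.get("arguments", {}).items()}
    gold_args = {(c["name"], k, str(v)) for c in gold for k, v in c.get("arguments", {}).items()}
    args_tp = len(pred_args & gold_args)
    args_f1 = f1(args_tp, len(pred_args - gold_args), len(gold_args - pred_args))
    return name_f1, args_f1

# Example: correct function name, one of two argument values wrong.
pred = [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "F"}}]
gold = [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}]
print(score_calls(pred, gold))  # (1.0, 0.5)
```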

## Requirements
The code for Hammer2.0-3b is supported in the latest Hugging Face Transformers, and we advise you to install `transformers>=4.37.0`.
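As a quick sanity check of the setup, loading a Hammer2.0 model through the standard Transformers chat API might look like the sketch below. This is our illustration, not taken from the model card: `device_map="auto"` additionally requires `accelerate`, and the exact prompt format for passing tool schemas to the model is described on the respective model cards.

```python
# Minimal loading sketch, assuming transformers>=4.37.0 (plus accelerate for device_map).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MadeAgents/Hammer2.0-3b"  # other Hammer2.0 sizes load the same way

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What's the weather like in Paris today?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and print only the newly produced tokens.
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```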