---
license: llama2
datasets:
- marclove/llama_functions
- timdettmers/openassistant-guanaco
language:
- en
library_name: transformers
pipeline_tag: conversational
---
# Demo
Try this model in my [demo space](https://huggingface.co/spaces/marclove/llama2-7b-chat-functions)!

# Model Card for Llama-2 7B Chat Functions

‼️ This model is still in a beta state. It will be retrained and updated at a future date, at which point its prompting format may change. If you need to depend on it in its current state, please create your own fork and provide attribution to this original repository. ‼️

Llama Functions is a further fine-tuned version of [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), using a 50/50 mix of:

1. Synthetic OpenAPI function calls with their corresponding natural language invocation, and
2. Chat completions from the [Guanaco subset of the OASST1 dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).

13B & 70B versions are coming soon.

The function calling dataset is mixed with Guanaco in order to maintain accuracy and helpfulness when calling a function is not the appropriate response. Guidelines for use, more detailed information regarding limitations, and eval stats of the 7B, 13B, and 70B models are coming soon.

There is no existing evaluation benchmark to measure the accuracy of function calls, which makes it hard during training to identify when we've maximized the balance of function calling accuracy and chat model performance. I'm working on a custom HF eval for this purpose, but until then I have chosen to mix the two datasets in equal parts to get a proxy of performance for both tasks in the eval & test stats during fine-tuning. The current checkpoint is at 1000 steps, when eval & test loss reached their lowest point.

- **Developed by:** Marc Love
- **License:** [Llama 2 Community License](https://ai.meta.com/llama/license/)
- **Finetuned from:** [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** Coming soon
- **Demo:** [llama2-7b-chat-functions](https://huggingface.co/spaces/marclove/llama2-7b-chat-functions)

## Uses

**Please note:** The synthetic data portion of the dataset was generated using OpenAI models. This model is released under the Llama 2 Community License, per the Llama 2 Community License Agreement. Since I fine-tuned the model on OpenAI-generated data, this model is released for research purposes only. I have licensed the associated `llama_functions` dataset under the [Creative Commons' Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license](https://creativecommons.org/licenses/by-sa/4.0/). Whether you may use that data to train your own models is your responsibility to determine.

## Bias, Risks, and Limitations

No additional bias is known beyond that of the underlying model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and the biases introduced by the [Guanaco subset of the OASST1 dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).

This model can hallucinate function calls that do not exist in the system prompt. While I hope to improve this by iterating on the `llama_functions` dataset, the 7B model will likely continue to struggle with this. I'm hoping to see more accuracy and less hallucination in larger models and plan to experiment with inference strategies, such as [grammar-based sampling](https://github.com/ggerganov/llama.cpp/pull/1773) and classifier-based routing, to improve performance in smaller models.

At the very minimum, I encourage you to validate outputs before attempting to use responses to call any functions. For example, several people have found Pydantic to be a convenient way to both describe functions and validate calls prior to execution.
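
As a minimal sketch of that kind of pre-execution check, the snippet below rejects calls to functions that were never offered and calls with missing, extra, or wrongly typed arguments. It uses only the Python standard library as a plain stand-in for what a Pydantic model would do. The JSON shape (`function` and `arguments` keys), the `ALLOWED_FUNCTIONS` registry, and the `get_weather` function are all hypothetical and are not this model's documented output format; adapt them to the prompt format you actually use.

```python
import json

# Hypothetical registry: the functions you described in the system prompt,
# mapped to their required argument names and expected Python types.
ALLOWED_FUNCTIONS = {
    "get_weather": {"city": str, "days": int},
}


def validate_call(raw_response: str) -> dict:
    """Return the parsed call only if it passes every check.

    Raises ValueError when the response is not a JSON object, names a
    function that was never offered, or has missing, extra, or wrongly
    typed arguments.
    """
    try:
        call = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("response must be a JSON object")

    # Assumed key names; the real output format may differ.
    name = call.get("function")
    if not isinstance(name, str) or name not in ALLOWED_FUNCTIONS:
        raise ValueError(f"unknown (possibly hallucinated) function: {name!r}")

    schema = ALLOWED_FUNCTIONS[name]
    args = call.get("arguments")
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be a JSON object")
    if set(args) != set(schema):
        raise ValueError(f"expected arguments {sorted(schema)}, got {sorted(args)}")

    for key, expected_type in schema.items():
        # Exact type match, so a JSON boolean is not accepted where an int is expected.
        if type(args[key]) is not expected_type:
            raise ValueError(f"argument {key!r} should be {expected_type.__name__}")

    return {"function": name, "arguments": args}


call = validate_call(
    '{"function": "get_weather", "arguments": {"city": "Oslo", "days": 3}}'
)
print(call["function"], call["arguments"])
```

Only dispatch to real code after `validate_call` returns; treat any `ValueError` as a signal to re-prompt or fall back to a plain chat reply.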


## Training Details

### Training Data

See the [`llama_functions` dataset](https://huggingface.co/datasets/marclove/llama_functions) for more information.

### Training Procedure 

Coming soon

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->


#### Training Hyperparameters

Coming soon

<!--- **Training regime:** [More Information Needed] <-- fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Sizes

13B & 70B chat and non-chat versions coming soon

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

## Evaluation

Coming soon
<!-- This section describes the evaluation protocols and provides the results. -->


## Citation

```
@misc{LlamaFunctions,
  title = {LlamaFunctions: An Open Dataset of Structured API Calls From Natural Language Prompts},
  author = {Marc Love},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/marclove/llama_functions}},
}
```