bofenghuang/vigogne-2-7b-chat

@@ -1,11 +1,10 @@
 ---
-language:
-- fr
 pipeline_tag: text-generation
-library_name: transformers
 inference: false
 tags:
 - LLM
 - llama
 - llama-2
 ---
@@ -14,11 +13,11 @@ tags:
 <img src="https://huggingface.co/bofenghuang/vigogne-2-7b-chat/resolve/v2.0/logo_v2.jpg" alt="Vigogne" style="width: 30%; min-width: 300px; display: block; margin: auto;">
 </p>
-# Vigogne-2-7B-Chat-V2.0: A Llama-2 based French chat LLM
-Vigogne-2-7B-Chat-V2.0 is a French chat LLM, based on [LLaMA-2-7B](https://ai.meta.com/llama), optimized to generate helpful and coherent responses in user conversations.
-Check out our [blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) and [GitHub repository](https://github.com/bofenghuang/vigogne) for more information.
 **Usage and License Notices**: Vigogne-2-7B-Chat-V2.0 follows Llama-2's [usage policy](https://ai.meta.com/llama/use-policy). A significant portion of the training data is distilled from GPT-3.5-Turbo and GPT-4, kindly use it cautiously to avoid any violations of OpenAI's [terms of use](https://openai.com/policies/terms-of-use).
@@ -27,14 +26,60 @@ Check out our [blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023
 All previous versions are accessible through branches.
 - **V1.0**: Trained on 420K chat data.
-- **V2.0**: Trained on 520K data. Check out our [blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) for more details.
 ## Usage
 ```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TextStreamer
-from vigogne.preprocess import generate_inference_chat_prompt
 model_name_or_path = "bofenghuang/vigogne-2-7b-chat"
 revision = "v2.0"
@@ -45,18 +90,22 @@ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, revision=revisi
 streamer = TextStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
-def infer(
-    utterances,
-    system_message=None,
-    temperature=0.1,
-    top_p=1.0,
-    top_k=0,
-    repetition_penalty=1.1,
-    max_new_tokens=1024,
     **kwargs,
 ):
-    prompt = generate_inference_chat_prompt(utterances, tokenizer, system_message=system_message)
-    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(model.device)
     input_length = input_ids.shape[1]
     generated_outputs = model.generate(
@@ -68,26 +117,76 @@ def infer(
             top_k=top_k,
             repetition_penalty=repetition_penalty,
             max_new_tokens=max_new_tokens,
-            eos_token_id=tokenizer.eos_token_id,
-            pad_token_id=tokenizer.pad_token_id,
             **kwargs,
         ),
         streamer=streamer,
         return_dict_in_generate=True,
     )
     generated_tokens = generated_outputs.sequences[0, input_length:]
     generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)
-    return generated_text
-user_query = "Expliquez la différence entre DoS et phishing."
-infer([[user_query, ""]])
 ```
-You can utilize the Google Colab Notebook below for inferring with the Vigogne chat models.
 <a href="https://colab.research.google.com/github/bofenghuang/vigogne/blob/main/notebooks/infer_chat.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
 ## Limitations
 Vigogne is still under development, and there are many limitations that have to be addressed. Please note that it is possible that the model generates harmful or biased content, incorrect information or generally unhelpful answers.

 ---
+language: fr
 pipeline_tag: text-generation
 inference: false
 tags:
 - LLM
+- finetuned
 - llama
 - llama-2
 ---
 <img src="https://huggingface.co/bofenghuang/vigogne-2-7b-chat/resolve/v2.0/logo_v2.jpg" alt="Vigogne" style="width: 30%; min-width: 300px; display: block; margin: auto;">
 </p>
+# Vigogne-2-7B-Chat-V2.0: A Llama-2-based French Chat LLM
+Vigogne-2-7B-Chat-V2.0 is a French chat LLM, based on [LLaMA-2-7B](https://ai.meta.com/llama), optimized to generate helpful and coherent responses in conversations with users.
+Check out our [release blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) and [GitHub repository](https://github.com/bofenghuang/vigogne) for more information.
 **Usage and License Notices**: Vigogne-2-7B-Chat-V2.0 follows Llama-2's [usage policy](https://ai.meta.com/llama/use-policy). A significant portion of the training data is distilled from GPT-3.5-Turbo and GPT-4, kindly use it cautiously to avoid any violations of OpenAI's [terms of use](https://openai.com/policies/terms-of-use).
 All previous versions are accessible through branches.
 - **V1.0**: Trained on 420K chat data.
+- **V2.0**: Trained on 520K data. Check out our [release blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) for more details.
+## Prompt Template
+We use the prefix tokens `<|system|>:`, `<|user|>:`, and `<|assistant|>:` to mark system, user, and assistant utterances.
+You can apply this formatting using the [chat template](https://huggingface.co/docs/transformers/main/chat_templating) through the `apply_chat_template()` method.
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("bofenghuang/vigogne-2-7b-chat")
+conversation = [
+    {"role": "user", "content": "Bonjour ! Comment ça va aujourd'hui ?"},
+    {"role": "assistant", "content": "Bonjour ! Je suis une IA, donc je n'ai pas de sentiments, mais je suis prêt à vous aider. Comment puis-je vous assister aujourd'hui ?"},
+    {"role": "user", "content": "Quelle est la hauteur de la Tour Eiffel ?"},
+    {"role": "assistant", "content": "La Tour Eiffel mesure environ 330 mètres de hauteur."},
+    {"role": "user", "content": "Comment monter en haut ?"},
+]
+print(tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True))
+```
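+This prints the prompt below. The default system message is inserted because the conversation does not start with a `system` message.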
+```
+<s><|system|>: Vous êtes l'assistant IA nommé Vigogne, créé par Zaion Lab (https://zaion.ai). Vous suivez extrêmement bien les instructions. Aidez autant que vous le pouvez.
+<|user|>: Bonjour ! Comment ça va aujourd'hui ?
+<|assistant|>: Bonjour ! Je suis une IA, donc je n'ai pas de sentiments, mais je suis prêt à vous aider. Comment puis-je vous assister aujourd'hui ?</s>
+<|user|>: Quelle est la hauteur de la Tour Eiffel ?
+<|assistant|>: La Tour Eiffel mesure environ 330 mètres de hauteur.</s>
+<|user|>: Comment monter en haut ?
+<|assistant|>:
+```
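To make the formatting rules easier to follow, here is a minimal pure-Python sketch that mirrors the logic of the `chat_template` string added to `tokenizer_config.json`. The function name `build_prompt` and the constant `DEFAULT_SYSTEM_MESSAGE` are hypothetical names introduced for illustration, and the `<s>` / `</s>` defaults are assumed from the printed prompt above. The tokenizer's own `apply_chat_template()` remains the authoritative implementation.

```python
from typing import Dict, List

# Default system message, copied from the chat template in tokenizer_config.json.
DEFAULT_SYSTEM_MESSAGE = (
    "Vous êtes l'assistant IA nommé Vigogne, créé par Zaion Lab (https://zaion.ai). "
    "Vous suivez extrêmement bien les instructions. Aidez autant que vous le pouvez."
)


def build_prompt(
    messages: List[Dict[str, str]],
    add_generation_prompt: bool = True,
    bos_token: str = "<s>",  # assumed from the printed prompt
    eos_token: str = "</s>",  # assumed from the printed prompt
) -> str:
    """Hypothetical re-implementation of the chat template, for illustration only."""
    # An explicit leading system message overrides the default one.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        turns = messages[1:]
    else:
        system_message = DEFAULT_SYSTEM_MESSAGE
        turns = messages

    parts = [bos_token, f"<|system|>: {system_message}\n"]
    for index, message in enumerate(turns):
        # Same alternation check as the template: user on even turns, assistant on odd.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        if message["role"] == "user":
            parts.append(f"<|user|>: {message['content'].strip()}\n")
        elif message["role"] == "assistant":
            # Assistant turns are closed with the end-of-sequence token.
            parts.append(f"<|assistant|>: {message['content'].strip()}{eos_token}\n")

    if add_generation_prompt:
        # Open the next assistant turn so the model continues from here.
        parts.append("<|assistant|>:")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Bonjour !"},
    {"role": "assistant", "content": "Bonjour ! Comment puis-je vous aider ?"},
    {"role": "user", "content": "Quelle est la hauteur de la Tour Eiffel ?"},
]
print(build_prompt(conversation))
```

Only assistant turns receive the end-of-sequence token, which matches the printed prompt above.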
 ## Usage
+### Inference using the quantized versions
+The quantized versions of this model are generously provided by [TheBloke](https://huggingface.co/TheBloke)!
+- AWQ for GPU inference: [TheBloke/Vigogne-2-7B-Chat-AWQ](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-AWQ)
+- GPTQ for GPU inference: [TheBloke/Vigogne-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GPTQ)
+- GGUF for CPU+GPU inference: [TheBloke/Vigogne-2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GGUF)
+These versions facilitate testing and development with various popular frameworks, including [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [vLLM](https://github.com/vllm-project/vllm), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [llama.cpp](https://github.com/ggerganov/llama.cpp), [text-generation-webui](https://github.com/oobabooga/text-generation-webui), and more.
+### Inference using the unquantized model with 🤗 Transformers
 ```python
+from typing import Dict, List, Optional
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, TextStreamer
 model_name_or_path = "bofenghuang/vigogne-2-7b-chat"
 revision = "v2.0"
 streamer = TextStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
+def chat(
+    query: str,
+    history: Optional[List[Dict]] = None,
+    temperature: float = 0.7,
+    top_p: float = 1.0,
+    top_k: int = 0,
+    repetition_penalty: float = 1.1,
+    max_new_tokens: int = 1024,
     **kwargs,
 ):
+    if history is None:
+        history = []
+    history.append({"role": "user", "content": query})
+    input_ids = tokenizer.apply_chat_template(history, add_generation_prompt=True, return_tensors="pt").to(model.device)
     input_length = input_ids.shape[1]
     generated_outputs = model.generate(
             top_k=top_k,
             repetition_penalty=repetition_penalty,
             max_new_tokens=max_new_tokens,
+            pad_token_id=tokenizer.eos_token_id,
             **kwargs,
         ),
         streamer=streamer,
         return_dict_in_generate=True,
     )
     generated_tokens = generated_outputs.sequences[0, input_length:]
     generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)
+    history.append({"role": "assistant", "content": generated_text})
+    return generated_text, history
+# 1st round
+response, history = chat("Un escargot parcourt 100 mètres en 5 heures. Quelle est sa vitesse ?", history=None)
+# 2nd round
+response, history = chat("Quand il peut dépasser le lapin ?", history=history)
+# 3rd round
+response, history = chat("Écris une histoire imaginative qui met en scène une compétition de course entre un escargot et un lapin.", history=history)
 ```
+You can also use the Google Colab Notebook provided below.
 <a href="https://colab.research.google.com/github/bofenghuang/vigogne/blob/main/notebooks/infer_chat.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
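The multi-turn pattern used by `chat()` can be tried without loading the model. The sketch below keeps only the history bookkeeping: each call appends the user query, obtains a reply, and appends that reply as an assistant message, so the roles keep alternating as the chat template requires. The names `chat_turn` and `fake_generate` are hypothetical; `fake_generate` is a stand-in for the real `model.generate` call and returns a placeholder string.

```python
from typing import Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]


def chat_turn(
    query: str,
    history: Optional[List[Message]],
    generate: Callable[[List[Message]], str],
) -> Tuple[str, List[Message]]:
    """Run one conversation round and return the reply with the updated history."""
    if history is None:
        history = []
    # The user query goes in first so the generator sees the full conversation.
    history.append({"role": "user", "content": query})
    reply = generate(history)
    # Store the reply so the next round is conditioned on it.
    history.append({"role": "assistant", "content": reply})
    return reply, history


def fake_generate(messages: List[Message]) -> str:
    """Hypothetical stand-in for the model: reports how many user turns it has seen."""
    user_turns = sum(1 for message in messages if message["role"] == "user")
    return f"(réponse au tour {user_turns})"


# 1st round starts a new conversation; the 2nd round reuses the returned history.
response, history = chat_turn("Un escargot parcourt 100 mètres en 5 heures.", None, fake_generate)
response, history = chat_turn("Quelle est sa vitesse ?", history, fake_generate)
for message in history:
    print(message["role"], "->", message["content"])
```

As in `chat()`, the history list is modified in place and also returned, so callers should pass a copy if they need to keep an earlier state of the conversation.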
+### Inference using the unquantized model with vLLM
+Set up an OpenAI-compatible server with the following command:
+```bash
+# Install vLLM
+# This may take 5-10 minutes.
+# pip install vllm
+# Start server for Vigogne-Chat models
+python -m vllm.entrypoints.openai.api_server --model bofenghuang/vigogne-2-7b-chat
+# List models
+# curl http://localhost:8000/v1/models
+```
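+Query the model using the `openai` Python package. The snippet below uses the pre-1.0 interface of that package (`openai.api_base`, `openai.ChatCompletion`).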
+```python
+import openai
+# Modify OpenAI's API key and API base to use vLLM's API server.
+openai.api_key = "EMPTY"
+openai.api_base = "http://localhost:8000/v1"
+# Use the first model listed by the server
+models = openai.Model.list()
+model = models["data"][0]["id"]
+# Chat completion API
+chat_completion = openai.ChatCompletion.create(
+    model=model,
+    messages=[
+        {"role": "user", "content": "Parle-moi de toi-même."},
+    ],
+    max_tokens=1024,
+    temperature=0.7,
+)
+print("Chat completion results:", chat_completion)
+```
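The same request can also be sent without the `openai` package, since the server exposes an OpenAI-compatible HTTP endpoint. The following is a minimal sketch using only the Python standard library. The helper names `build_chat_request` and `send` are hypothetical, and it assumes the server was started with the command above on the default `localhost:8000`, so the model name equals the `--model` value.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    messages: List[Dict[str, str]],
    model: str = "bofenghuang/vigogne-2-7b-chat",  # assumed to match the --model value
    base_url: str = "http://localhost:8000/v1",  # assumed default host and port
    max_tokens: int = 1024,
    temperature: float = 0.7,
) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint without sending it."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request([{"role": "user", "content": "Parle-moi de toi-même."}])
print(request.get_method(), request.full_url)
# With the server running, uncomment the next line to query the model:
# print(send(request))
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.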
 ## Limitations
 Vigogne is still under development, and there are many limitations that have to be addressed. Please note that it is possible that the model generates harmful or biased content, incorrect information or generally unhelpful answers.

tokenizer_config.json

@@ -19,6 +19,7 @@
     "single_word": false
   },
   "legacy": false,
   "model_max_length": 1000000000000000019884624838656,
   "pad_token": null,
   "padding_side": "right",

     "single_word": false
   },
   "legacy": false,
+  "chat_template": "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif true == true %}{% set loop_messages = messages %}{% set system_message = 'Vous êtes l\\'assistant IA nommé Vigogne, créé par Zaion Lab (https://zaion.ai). Vous suivez extrêmement bien les instructions. Aidez autant que vous le pouvez.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|system|>: ' + system_message + '\\n' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '<|user|>: ' + message['content'].strip() + '\\n' }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>: ' + message['content'].strip() + eos_token + '\\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>:' }}{% endif %}",
   "model_max_length": 1000000000000000019884624838656,
   "pad_token": null,
   "padding_side": "right",