Instructions to use alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit with libraries, inference providers, notebooks, and local apps.

Libraries

How to use alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit")
model = AutoModelForCausalLM.from_pretrained("alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
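
The same request can be issued from Python instead of curl. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are hypothetical, and the base URL assumes the server started above is listening on localhost port 8000, as in the curl call.

```python
import json
import urllib.request

MODEL_ID = "alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, once the server is running:
# reply = send(build_chat_request("What is the capital of France?"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.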

Use Docker

docker model run hf.co/alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit

SGLang

How to use alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
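
Since the endpoint is described as OpenAI-compatible, the reply text should sit under `choices[0].message.content` in the returned JSON. The sketch below shows one way to pull it out; `extract_reply` is a hypothetical helper, and the sample response is made up and trimmed to the fields used here.

```python
def extract_reply(response):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hypothetical response, reduced to the fields this helper reads.
# Real responses carry additional fields such as an id and token usage.
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
}

print(extract_reply(sample))  # prints "Paris."
```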

Docker Model Runner
How to use alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit with Docker Model Runner:
```
docker model run hf.co/alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit
```

Model Card for alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit

This repo contains an 8-bit quantized version (using bitsandbytes) of Mistral AI's Mistral-7B-Instruct-v0.2.
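
To give a rough sense of what 8-bit storage saves, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes a nominal 7 billion parameters (taken from the "7B" in the model name) and ignores activations, the KV cache, and any quantization metadata, so real usage will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "INT8": 1}

# Nominal parameter count; an assumption based on the "7B" model name.
NUM_PARAMS = 7_000_000_000


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
# F32: ~28 GB
# F16: ~14 GB
# INT8: ~7 GB
```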

Model Details

Model creator: Mistral AI
Original model: Mistral-7B-Instruct-v0.2

About 8-bit quantization using bitsandbytes

Paper: QLoRA: Efficient Finetuning of Quantized LLMs (arXiv)
Hugging Face blog post on 8-bit quantization using bitsandbytes: A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
Code: bitsandbytes GitHub repo
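
As a rough illustration of the idea behind 8-bit quantization, the sketch below applies plain absmax quantization to a handful of made-up weights: values are scaled so the largest magnitude maps to 127, rounded to integers, and later rescaled. This is a deliberately simplified toy, not the bitsandbytes implementation, which works on whole matrices and handles outlier values separately; the function names and sample weights are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one absmax scale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return [0] * len(values), 1.0
    scale = absmax / 127
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values and their scale."""
    return [q * scale for q in quantized]


# Made-up weights, for illustration only.
weights = [0.12, -0.5, 0.03, 0.9, -0.27]

quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(quantized)
print(f"max round-trip error: {max_error:.4f}")
```

The round-trip error per value is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts the precision of every other value.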

How to Get Started with the Model

Use the code below to get started with the model.

How to run from Python code

First, install the required packages:

!pip install --quiet bitsandbytes
!pip install --quiet --upgrade transformers # Install the latest version of transformers
!pip install --quiet --upgrade accelerate
!pip install --quiet sentencepiece
!pip install flash-attn --no-build-isolation

Import

import torch
import os
from torch import bfloat16
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig, LlamaForCausalLM

Use a pipeline as a high-level helper

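model_id_mistral = "alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit"

# Load the tokenizer and the pre-quantized 8-bit model; device_map="auto"
# lets accelerate place the weights on the available device(s).
tokenizer_mistral = AutoTokenizer.from_pretrained(model_id_mistral, use_fast=True)

model_mistral = AutoModelForCausalLM.from_pretrained(
    model_id_mistral,
    device_map="auto"
)

# Wrap the model and tokenizer in a text-generation pipeline.
pipe_mistral = pipeline(model=model_mistral, tokenizer=tokenizer_mistral, task='text-generation')

prompt_mistral = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."

# Generate up to 512 new tokens; the result is a list with one dict per returned sequence.
output_mistral = pipe_mistral(prompt_mistral, max_new_tokens=512)

print(output_mistral[0]["generated_text"])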
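
The prompt above is passed to the pipeline as plain text. Mistral's instruct models are trained with user turns wrapped in `[INST] ... [/INST]`, so wrapping the prompt may give more reliable answers. The following is a minimal sketch for a single user turn; `format_instruct_prompt` is a hypothetical helper, and it leaves the beginning-of-sequence token to the tokenizer.

```python
def format_instruct_prompt(user_message):
    """Wrap a single user message in Mistral-style instruction tags."""
    return f"[INST] {user_message.strip()} [/INST]"


prompt = format_instruct_prompt(
    "Tell me a funny joke about Large Language Models meeting a Blackhole "
    "in an intergalactic Bar."
)
print(prompt)
```

For multi-turn conversations, `tokenizer.apply_chat_template` is likely the more robust route, since it builds the prompt from the template shipped with the tokenizer.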

Uses

Direct Use

[More Information Needed]

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

[More Information Needed]

Bias, Risks, and Limitations

[More Information Needed]

Evaluation

Metrics

[More Information Needed]

Results

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]


Safetensors

Model size: 7B params
Tensor type: F32, F16

Paper for alokabhishek/Mistral-7B-Instruct-v0.2-bnb-8bit

QLoRA: Efficient Finetuning of Quantized LLMs (arXiv 2305.14314, published May 23, 2023)