Instructions to use tiiuae/falcon-7b-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use tiiuae/falcon-7b-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tiiuae/falcon-7b-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use tiiuae/falcon-7b-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tiiuae/falcon-7b-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/falcon-7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/tiiuae/falcon-7b-instruct

SGLang

How to use tiiuae/falcon-7b-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tiiuae/falcon-7b-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/falcon-7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tiiuae/falcon-7b-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiiuae/falcon-7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use tiiuae/falcon-7b-instruct with Docker Model Runner:
```
docker model run hf.co/tiiuae/falcon-7b-instruct
```

Facing Issues with Model Output and Inference Times

#78

by ankity09 - opened Aug 30, 2023

Discussion

ankity09

Aug 30, 2023

I am implementing a RAG architecture with ChromaDB as my vector store and Falcon-7B as my LLM, using LangChain's retriever to tie the two together. While testing with a single PDF and the search configured to return the top 3 matches, I ran into a number of issues.

The returned answers are not accurate (I tried different temperature settings).
The model takes a long time and then responds with the same sentence repeated multiple times. (Increasing the repetition penalty mitigated this to an extent.)
The model sometimes does not return an answer for extended periods, in some cases more than 10-15 minutes.
The model is slow to respond, up to 5x slower in some cases than models like Llama-2 7B or 13B.
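As a rough illustration of the repetition-penalty mitigation mentioned above, the sketch below collects keyword arguments that `model.generate` in Transformers accepts. The helper name `build_generation_kwargs` and the specific values (256 new tokens, a penalty of 1.2, an n-gram size of 4) are assumptions chosen for illustration, not settings confirmed in this thread; capping `max_new_tokens` is included on the assumption that unbounded generation length contributes to the long runtimes.

```python
def build_generation_kwargs(eos_token_id, max_new_tokens=256, repetition_penalty=1.2):
    """Hypothetical helper: keyword arguments to pass to `model.generate(...)`.

    All default values are illustrative assumptions and should be tuned.
    """
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {
        # Hard cap on generated tokens so a single call cannot run indefinitely.
        "max_new_tokens": max_new_tokens,
        # Greedy decoding; temperature has no effect when sampling is off.
        "do_sample": False,
        # Values above 1.0 penalize tokens that have already appeared.
        "repetition_penalty": repetition_penalty,
        # Forbid repeating any 4-token sequence verbatim.
        "no_repeat_ngram_size": 4,
        # Stop at end-of-text and reuse it as the padding token.
        "eos_token_id": eos_token_id,
        "pad_token_id": eos_token_id,
    }


# Usage (assumes `model`, `tokenizer`, and `inputs` were created as in the
# "Load model directly" snippet):
# outputs = model.generate(**inputs, **build_generation_kwargs(tokenizer.eos_token_id))
```

Whether these settings fix the repeated sentences depends on the prompt and the retrieved context, so they are best treated as a starting point.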

I reduced the number of returned search results from 3 to 1, which improved accuracy and response time somewhat; however, the model still stops responding after being queried 3-4 times.
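Since fewer retrieved results means a shorter prompt, one way to make that trade-off explicit is to cap how much retrieved text goes into the prompt. The sketch below is a minimal, library-free version of that idea. The function name `build_rag_prompt`, the prompt wording, and the 3000-character budget are all assumptions for illustration; a character count is only a crude proxy for tokens, and Falcon-7B was trained with a 2,048-token sequence length, so the real budget should be checked with the tokenizer.

```python
def build_rag_prompt(question, chunks, max_context_chars=3000):
    """Join retrieved chunks (best match first) until a character budget is used up.

    The budget counts chunk text only, not the blank lines between chunks.
    """
    kept, used = [], 0
    for chunk in chunks:
        text = chunk.strip()
        if not text:
            continue
        remaining = max_context_chars - used
        if remaining <= 0:
            break
        # Truncate the last chunk that does not fully fit.
        text = text[:remaining]
        kept.append(text)
        used += len(text)
    context = "\n\n".join(kept)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Usage with placeholder chunks standing in for the retriever's results:
prompt = build_rag_prompt(
    "What is the notice period?",
    ["First retrieved passage.", "Second retrieved passage."],
)
```

With a budget like this, the number of retrieved results can stay at 3 while the prompt length stays bounded.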

All of these issues have been reported before in one form or another:

Wrong Output

while giving a input but getting the wrong output for the particular input
falcon-7b-instruct is answering out of context
Repeats the same sentence
any success in In-context question-answering?
Model keeps generating multiple rounds of conversation

Model is Slow or does not give output

Slow inference
4th inference in a row does not work for Falcon7B in 8 or 4 bit

I am using the 16-bit version of the model, running on two T4 GPUs on AWS.

Please let me know if there are any workarounds or fixes for the above.

Thanks

ankity09

Aug 30, 2023

When I set the number of search results returned from the vector DB to 3 (larger prompt), the model takes:
1st question (answer is wrong)
CPU times: user 2min 39s, sys: 391 ms, total: 2min 39s
2nd question (answer is wrong)
CPU times: user 47.4 s, sys: 7.26 ms, total: 47.4 s
It then stops responding from the third question onwards.

When I decrease the number of results to 1 (smaller prompt), the model takes:
1st question (answer is right)
CPU times: user 44.6 s, sys: 288 ms, total: 44.9 s
2nd question (answer is somewhat right)
CPU times: user 17.4 s, sys: 0 ns, total: 17.4 s
It again stops responding from the third question onwards.
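The figures above are CPU times, which can differ from the elapsed wall-clock time when much of the work happens on the GPU. A minimal sketch for recording both per query is shown below; `timed_call` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in for the real chain or `model.generate` call.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, timings) with wall-clock and CPU seconds."""
    start_wall = time.perf_counter()
    start_cpu = time.process_time()
    result = fn(*args, **kwargs)
    timings = {
        "wall_s": time.perf_counter() - start_wall,
        "cpu_s": time.process_time() - start_cpu,
    }
    return result, timings


def fake_generate(prompt):
    """Stand-in for the real model call; sleeps briefly instead of generating."""
    time.sleep(0.05)
    return prompt + " ..."


answer, timings = timed_call(fake_generate, "Who are you?")
print(f"wall: {timings['wall_s']:.2f}s, cpu: {timings['cpu_s']:.2f}s")
```

Logging both numbers for each question in sequence would show whether the third query is actually stalled or just far slower than the first two.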
