Facing Issues with Model Output and Inference Times
I am implementing a RAG architecture with ChromaDB as my vector store and Falcon-7B as my LLM, tied together with LangChain's retriever. While testing with a single PDF and the search configured to return the top 3 matches, I am facing a number of issues:
- The returned answers are not accurate (I tried different temperature settings).
- The model takes a long time and then responds with the same sentence repeated multiple times. (Increasing the repetition penalty mitigated this to an extent.)
- The model sometimes does not return an answer for extended periods, in some cases more than 10-15 minutes.
- Model responses are slow, up to 5x slower in some cases than models like Llama-2 7B or 13B.
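For reference, the settings mentioned in the list above (temperature, repetition penalty) can be collected into one place. The following is only a sketch of keyword arguments that could be passed to `model.generate(**inputs, **kwargs)` in Transformers. The parameter names are existing `generate` options, but the helper name `build_generation_kwargs` and every default value are assumptions for illustration, not settings confirmed to fix the issues described here.

```python
def build_generation_kwargs(max_new_tokens=256, repetition_penalty=1.2, eos_token_id=None):
    """Return keyword arguments intended for model.generate(**inputs, **kwargs).

    All default values are illustrative assumptions, not tuned recommendations.
    """
    kwargs = {
        # Hard cap on generated tokens, so a single call cannot run unbounded.
        "max_new_tokens": max_new_tokens,
        # Greedy decoding; temperature has no effect while do_sample is False.
        "do_sample": False,
        # Values above 1.0 penalise tokens that were already generated.
        "repetition_penalty": repetition_penalty,
        # Forbid repeating any 3-gram verbatim (assumed value).
        "no_repeat_ngram_size": 3,
    }
    if eos_token_id is not None:
        # Stop at the end-of-sequence token and reuse it for padding.
        kwargs["eos_token_id"] = eos_token_id
        kwargs["pad_token_id"] = eos_token_id
    return kwargs


if __name__ == "__main__":
    print(build_generation_kwargs(eos_token_id=11))
```

In practice `eos_token_id` would come from `tokenizer.eos_token_id`; the value 11 in the usage line is only a placeholder.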
I reduced the number of returned search results from 3 to 1, which improved both accuracy and response time to some extent. However, the model still stops responding after being queried 3-4 times.
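Reducing the results from 3 to 1 effectively shrinks the prompt. A gentler way to get a similar effect might be to keep all retrieved chunks but cap their combined size. The sketch below assumes the retrieved chunks are available as plain strings in rank order; `trim_context`, `build_prompt`, the prompt layout, and the 2000-character budget are all hypothetical and not part of LangChain or ChromaDB.

```python
def trim_context(chunks, max_chars=2000):
    """Keep retrieved chunks, in rank order, until the character budget is used up."""
    kept = []
    used = 0
    for chunk in chunks:
        text = chunk.strip()
        if not text:
            continue  # skip empty chunks
        if used + len(text) > max_chars:
            # If even the top-ranked chunk is too long, keep a truncated copy of it.
            if not kept:
                kept.append(text[:max_chars])
            break
        kept.append(text)
        used += len(text)
    return kept


def build_prompt(question, chunks, max_chars=2000):
    """Assemble a simple question-answering prompt (layout is an assumption)."""
    context = "\n\n".join(trim_context(chunks, max_chars))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    docs = ["First retrieved passage.", "Second retrieved passage.", "Third one."]
    print(build_prompt("What does the PDF say?", docs, max_chars=40))
```

A character budget is only a rough stand-in for a token budget; counting tokens with the model's tokenizer would be more precise.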
All of these issues have been reported before in some form:
Wrong Output
- while giving a input but getting the wrong output for the particular input
- falcon-7b-instruct is answering out of context
Repeats the same sentence
- any success in In-context question-answering?
- Model keeps generating multiple rounds of conversation
Model is Slow or does not give output
- Slow inference
- 4th inference in a row does not work for Falcon7B in 8 or 4 bit
I am using the 16-bit version of the model, running on two T4 GPUs on AWS.
Please let me know if there are any workarounds or fixes for the above.
Thanks
When I set the number of search results returned from the vector DB to 3 (larger prompt), the model takes:
- 1st question (answer is wrong): CPU times: user 2min 39s, sys: 391 ms, total: 2min 39s
- 2nd question (answer is wrong): CPU times: user 47.4 s, sys: 7.26 ms, total: 47.4 s
and then does not respond from the third question onwards.
When I decrease the results to 1 (smaller prompt):
- 1st question (answer is right): CPU times: user 44.6 s, sys: 288 ms, total: 44.9 s
- 2nd question (answer is somewhat right): CPU times: user 17.4 s, sys: 0 ns, total: 17.4 s
and then, as above, it does not respond from the third question onwards.
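The figures above are CPU times only. When part of the work runs on a GPU, wall-clock time can differ from CPU time, so it may be worth recording both, along with a rough tokens-per-second rate. A minimal sketch using only the standard library follows; `timed_call` and `tokens_per_second` are hypothetical helper names, and `fake_generate` merely stands in for the real chain or `generate` call.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    return result, wall_s, cpu_s


def tokens_per_second(new_tokens, wall_s):
    """Throughput from the number of newly generated tokens and wall-clock time."""
    return new_tokens / wall_s if wall_s > 0 else 0.0


if __name__ == "__main__":
    def fake_generate(prompt):
        # Placeholder for the real model call.
        time.sleep(0.1)
        return prompt.upper()

    answer, wall_s, cpu_s = timed_call(fake_generate, "hello")
    print(f"wall={wall_s:.2f}s cpu={cpu_s:.2f}s answer={answer!r}")
```

Comparing the wall-clock numbers for the 3-result and 1-result prompts would show how much of the slowdown tracks prompt length.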