How to make it work for less experienced AI whisperers
1) git clone https://huggingface.co/tiiuae/falcon-7b -> saves the model into a local directory (the weight files come via Git LFS, so make sure git-lfs is installed before cloning)
2) Create a new Anaconda environment with transformers==4.27.4 and Python 3.9:
a) conda create --name falcon python=3.9
b) conda activate falcon
c) pip install transformers==4.27.4
d) pip install huggingface-hub
e) pip install chardet
f) pip install cchardet
g) pip install torch
h) pip install einops
i) pip install accelerate
j) conda install cudatoolkit

The following code finally gave results:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
model = "X:\\ai\\falcon-7b"  # local Windows path to the directory from step 1 (the one containing config.json)
rrmodel = AutoModelForCausalLM.from_pretrained(
    model,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model)
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
# Generate text
attention_mask = torch.ones(input_ids.shape)
output = rrmodel.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
# Decode the output
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
output >>>>>>>>>>
Once upon a time, a man named Charlie Brown walked into a candy store. He asked the lady behind the counter if she had any good chocolate. The lady said that she had some very good chocolate.
Charlie Brown said, "That sounds good. Can you give me a pound of it?"
The lady said, "Sure," and she put the pound of chocolate in a bag and rang up the sale.
Charlie Brown said, "That's $7.
<<<<<<<< output <<<<<<<<
I get NameError: name 'attention_mask' is not defined. Where does attention_mask come from?
Thanks, updated original message with fix.
Just add:
attention_mask = torch.ones(input_ids.shape)
before the .generate() call.
I also added:
input_ids = input_ids.to('cuda')
before:
attention_mask = torch.ones(input_ids.shape)
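For reference, here is the relevant section with both fixes applied (a minimal sketch, same variable names as in the post above):

input_ids = tokenizer.encode(input_text, return_tensors='pt')
input_ids = input_ids.to('cuda')  # move the inputs to the GPU
attention_mask = torch.ones(input_ids.shape)  # created on CPU, as in the thread; this worked with device_map="auto"
output = rrmodel.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)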
now working nicely! thx
@bilelm
It is a private desktop computer; I am interested in running LLMs locally:
AMD Ryzen 9 7900X 12-core (24-thread) 4.7 GHz
64 GB DDR5 RAM 4800 MHz
NVIDIA GeForce RTX 3090 24 GB
The model is loaded from a NAS over a 1 Gbit/s network (it takes ~2 min).
It took 78 seconds for:
Question: Where girrafe lives, and how tall is giraffe?
Answer: Giraffes live in the African continent, they are the tallest land animals.
The tallest one of them is the African male giraffe, which can stand up to 5.9 meters tall.
Giraffes are herbivorous animals, they feed on leaves and grasses.
They are not dangerous animals. They are very friendly and kind to humans, they don’t attack or eat people.
Giraffes have a lifespan of 20 years.<|endoftext|>
It took 156 seconds for:
Question: Where llama lives, and how tall is llama?
Answer: The llama is a South American camelid, a member of the camel family. It is a large, sturdy animal with a thick coat. Llamas are domesticated and are used for meat, wool, and milk.
The llama is a South American camelid, a member of the camel family. It is a large, sturdy animal with a thick coat. Llamas are domesticated and are used for meat, wool, and milk.
What is a llama? A llama is a South American camelid, a member of the camel family. They are domesticated and used for meat, wool, and milk.
How do llamas look? Llama’s are a large, furry animal that looks like a mix between a camel and a giraffe. They are native to South America but are now found in many other places around the world.
Where do llamas live? Llamas live in the Andes mountains, where they graze on vegetation.
I hope this helps. For what it's worth, Falcon-7B's answers are pretty good.
@FalconLLM, @Sloba Quick question: can I run it on a MacBook Pro with an Intel chip and 32 GB of RAM?
I'm trying to run this on an Apple M1 Max.
The code I use is this:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
model = "./falcon-7b"
device_name = 'cpu'
device = torch.device(device_name)
rrmodel = AutoModelForCausalLM.from_pretrained(
    model,
    trust_remote_code=True,
    device_map="auto",
)
rrmodel = rrmodel.to(device)
tokenizer = AutoTokenizer.from_pretrained(model)
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
input_ids = input_ids.to(device)
attention_mask = torch.ones(input_ids.shape)
attention_mask = attention_mask.to(device)
output = rrmodel.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
Using device_name = 'cpu', this takes 5m 50s to run.
I tried device_name = 'mps' for acceleration on the M1 chip,
but I get this error:
Traceback (most recent call last):
File "/Users/mario/Downloads/main.py", line 19, in <module>
output = rrmodel.generate(input_ids,
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/transformers/generation/utils.py", line 1565, in generate
return self.sample(
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/transformers/generation/utils.py", line 2612, in sample
outputs = self(
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/Users/mario/.cache/huggingface/modules/transformers_modules/falcon-7b/modelling_RW.py", line 753, in forward
transformer_outputs = self.transformer(
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/Users/mario/.cache/huggingface/modules/transformers_modules/falcon-7b/modelling_RW.py", line 590, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/functional.py", line 2238, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Placeholder storage has not been allocated on MPS device!
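A likely cause (my assumption, not something confirmed in this thread): device_map="auto" hands weight placement to accelerate's hooks, which conflicts with also calling .to(device) yourself, so the embedding weights never end up on the MPS device. A minimal sketch of a workaround, assuming the machine has enough RAM to hold the full model:

# Workaround sketch: skip device_map so accelerate doesn't manage placement,
# then move the model (and inputs, as before) to MPS explicitly.
rrmodel = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True)
rrmodel = rrmodel.to(torch.device('mps'))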
@BliepBlop Did you encounter this issue when you ran the code above?
ValueError: The current device_map had weights offloaded to the disk. Please provide an offload_folder for them. Alternatively, make sure you have safetensors installed if the model you are using offers the weights in this format.

I installed safetensors.
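If installing safetensors alone doesn't resolve it, the error message points at the other fix: pass an offload_folder to from_pretrained so accelerate has somewhere to write the weights that don't fit in memory. A sketch (the folder name "offload" is just an example):

rrmodel = AutoModelForCausalLM.from_pretrained(
    model,
    trust_remote_code=True,
    device_map="auto",
    offload_folder="offload",  # any writable directory works
)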
Excellent post. Thanks for providing this.
Everything worked for me on a 4090 running Ubuntu 20.04.
Alternative: https://github.com/cmp-nct/ggllm.cpp/blob/master/README.md
It includes a video on how to compile it on Windows, doesn't need a complex conda/Python backend, and runs with just a few GB of RAM (or VRAM) 10+ times faster than with Python.
It also includes exe binary releases for Windows (for CPU and CUDA) if you don't want to get into development frameworks.
Can anyone help me, please?
I have text data stored in a .txt file; it is simple information about a technology.
I want to fine-tune the Falcon model and then ask it questions based on that .txt file.
Fine-tuning typically involves a clean set of inputs and outputs, not a text of simple information.
You can look into fine-tuning projects for Falcon and what their input data looks like; it will take an elaborate effort to transform your text into good inputs and outputs, roughly of the shape sketched below.
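To make the expected shape concrete, fine-tuning data is usually a list of prompt/completion pairs; the field names here are illustrative, not a fixed schema:

# Illustrative shape of fine-tuning data built from your .txt (hypothetical fields):
train_examples = [
    {"prompt": "What is <the technology> used for?",
     "completion": "<an answer written from a passage of your .txt file>"},
    {"prompt": "How does <the technology> work?",
     "completion": "<another answer drawn from the text>"},
]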
The more likely solution is to just prompt Falcon with your text and ask it to use it as an information source. By using a good fine-tune that follows your prompt, you can increase the quality.
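A minimal sketch of that prompting approach, reusing the model and tokenizer from earlier in this thread (the filename and prompt wording are just examples, and note that Falcon-7B's context window is 2048 tokens, so the text must be fairly short):

# Prompt Falcon with the .txt contents as context instead of fine-tuning.
with open("technology.txt") as f:  # hypothetical filename
    context = f.read()
prompt = ("Use only the following text as your information source.\n\n"
          + context
          + "\n\nQuestion: What is this technology used for?\nAnswer:")
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = rrmodel.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))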