One gpu version
#22 opened by joorei
Hello, thanks for the model! Are there any plans to make a version of this that is usable on a single GPU with 24GB of VRAM?
Hi @joorei, you can try to run it in 8-bit and see if it works.
First, install the main branch of transformers:
pip install git+https://github.com/huggingface/transformers@main
Then install accelerate and bitsandbytes:
pip install accelerate bitsandbytes
Finally, load the model with the flag load_in_8bit=True when calling .from_pretrained:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
# device_map="auto" spreads the weights across the available devices;
# load_in_8bit=True quantizes them to 8-bit via bitsandbytes
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", load_in_8bit=True)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
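If you want to see how much memory the quantized weights actually take, transformers models expose get_memory_footprint(); a quick check, using the model loaded above:

# Reports the size of the loaded weights in bytes
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")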
Did it work?
What are the general system recommendations for the different versions of flan-t5?
Thank you!
Hi @Mayuresh86, you can also run it in 4-bit now.
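As a rough rule of thumb, here is a back-of-envelope sketch of the weight memory alone, assuming the ~11B parameters of flan-t5-xxl and ignoring activations, the KV cache, and quantization overhead:

# Approximate weight memory for an ~11B-parameter model at different precisions
params = 11e9
for dtype, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.1f} GB")

So 8-bit should fit comfortably on a 24GB card, and 4-bit on considerably smaller ones. To try 4-bit, first update the packages: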
pip install -U transformers bitsandbytes accelerate
then run:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
# Same as the 8-bit snippet above, but quantizing to 4-bit instead
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", load_in_4bit=True)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
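Note that recent transformers releases deprecate passing load_in_8bit/load_in_4bit directly to .from_pretrained in favor of an explicit quantization config; a minimal sketch of the equivalent 4-bit call, assuming a recent transformers version:

from transformers import T5Tokenizer, T5ForConditionalGeneration, BitsAndBytesConfig

# Route quantization through an explicit config object
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xxl",
    device_map="auto",
    quantization_config=quantization_config,
)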