Why not add system requirements on the model card?
Hi
I had to search for a while to find any info about the requirements to run this; it would be nice to have more of that on the model card!
thx
Using the newest transformers & accelerate libraries from GitHub plus a bitsandbytes config (load_in_4bit, bfloat16, and nf4 quant type), I am able to run this on a single A100 40 GB. It uses about 80 GB of disk space for saving the pretrained model.
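For reference, here is a minimal sketch of that setup (4-bit NF4 quantization with bfloat16 compute) using the standard transformers + bitsandbytes APIs; the `tiiuae/falcon-40b` model id and generation call are my own assumptions, adjust as needed:

```python
# Minimal sketch: load Falcon-40B in 4-bit (NF4) with bfloat16 compute.
# Assumes recent transformers, accelerate, and bitsandbytes are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b"  # assumption: the base 40B checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NF4 quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,  # do compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",       # let accelerate place layers on the available GPU(s)
    trust_remote_code=True,  # Falcon originally shipped custom modeling code
)

inputs = tokenizer("Hello, Falcon!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```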
Hi @Ichsan2895, I'm pretty new to this model and LLMs in general. I'd like to test this in an Azure environment, and there are many available. Do you by any chance know which of the VM sizes can be used? Many of them are unavailable due to shortages and high demand.
Thank you
Hello, sorry, I never tested it on Azure.
I tested it on a RunPod environment. It costs $0.85/hour for an A6000 with 48 GB VRAM + 58 GB RAM + 200 GB disk while running, and $0.03/hour when the system is idle, because I save the pretrained model on their disk too.
Thank you for the response. How was the performance on this machine (tokens/sec)?
Pretty slow, about 0.5-1 token/second. BTW, Guanaco-65-GPTQ is faster, but unfortunately it cannot be used commercially.
@Ichsan2895 I was able to run this on Standard_NC48ads_A100_v4, which has 160 GiB of GPU memory. I wasn't able to use the bitsandbytes module (some issue I couldn't debug). The results were surprisingly good. I could only use it for a very short time because it's pretty expensive. See https://twitter.com/this_is_tckb/status/1665814803829473280/
Is it possible to run it on an RTX 4090?
Sorry guys, but can someone tell me what 40B means? What I know is 40B x 4 bytes = 160 GB, right?
Does that mean one GPU with a total of 160 GB can load this model?
Or do I need 160 GB+ for training? And is training different from just using it?
40B means 40 billion parameters, but it does not mean it only needs 40 GB of GPU RAM. I used a 48 GB A6000 to run this model. Consumption can be lowered by activating the bitsandbytes config, which enables bfloat16 and load_in_4bit. Unfortunately it won't run in 24 GB VRAM (OOM).
Sorry, I don't know the consumption when training/fine-tuning on a new dataset.
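For what it's worth, a rough back-of-envelope calculation of the weight footprint (my own estimate, counting weights only, not activations or KV cache):

```python
# Rough memory math for a 40B-parameter model (weights only).
params = 40e9

for name, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{name:>9}: ~{gb:.0f} GB of weights")

# Roughly: fp32 ~149 GB, bf16 ~75 GB, int8 ~37 GB, 4-bit ~19 GB.
# This is why 4-bit + bfloat16 fits on a 40-48 GB card but a 24 GB card can
# still OOM once activations, KV cache, and runtime overhead are added.
```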
We have added some basic info on running the model to the card. It takes ~80-100 GB of memory to comfortably run inference with Falcon-40B. There has been some work with FalconTune on 4-bit quantization as well.