System requirements?
What are the system requirements to run this model, and how can I find them out?
From reading the config, this is a float16 model. Plugging it into the Model Memory Estimator (https://huggingface.co/spaces/hf-accelerate/model-memory-usage) gives the following specs for WizardCoder 30B (as an LLM):
| dtype | Largest Layer or Residual Group | Total Size | Training using Adam |
|---|---|---|---|
| float32 | 2.59 GB | 125.48 GB | 501.92 GB |
| int8 | 664.02 MB | 31.37 GB | 125.48 GB |
| float16/bfloat16 | 1.3 GB | 62.74 GB | 250.96 GB |
| int4 | 332.01 MB | 15.68 GB | 62.74 GB |
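As a sanity check, these numbers are just parameter count times bytes per parameter, with Adam training costing roughly 4x the weight footprint (weights + gradients + two optimizer states). A quick sketch, assuming the ~31.37B parameters the int8 row implies:

```python
PARAMS = 31.37e9  # implied by the int8 row: 1 byte/param -> 31.37 GB total

BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in BYTES_PER_PARAM.items():
    inference_gb = PARAMS * nbytes / 1e9  # memory just to hold the weights
    training_gb = inference_gb * 4        # weights + grads + two Adam states
    print(f"{dtype:>17}: ~{inference_gb:.2f} GB inference, ~{training_gb:.2f} GB Adam training")
```

This reproduces the Total Size and Training columns above (15.68 / 31.37 / 62.74 / 125.48 GB, and 4x each for training).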
So, if you pull this down, you'll need roughly 63 GB of RAM to run it in float16. I would love to quantize it to int8 so it could fit on a 4090 or an A6000, but I don't know how to do that right now.
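A hedged sketch of one way to do the int8 conversion, using transformers with bitsandbytes (the model id below is a placeholder for whichever repo you pulled; I haven't verified this on this exact model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WizardLM/WizardCoder-15B-V1.0"  # placeholder -- substitute the 30B repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # needs `pip install bitsandbytes accelerate`
    device_map="auto",   # shard layers across available GPU(s) and CPU
)
```

Note that by the table above, int8 is still ~31 GB, which fits an A6000 (48 GB) but not a single 4090 (24 GB); int4 at ~16 GB is what would fit on a 4090.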
I am able to run it on an M1 Max with 64 GB of RAM. Not super fast, but it works.
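For reference, these timings are from llama-cpp-python (hence the `Llama.generate` lines below); a minimal sketch of an equivalent setup, where the model path and parameters are illustrative rather than my exact config:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./wizardcoder-30b.ggmlv3.q4_0.bin",  # hypothetical local quantized file
    n_ctx=2048,      # context window
    n_gpu_layers=1,  # any value > 0 enables Metal offload on Apple Silicon
)

out = llm("Write a Python function that reverses a string.", max_tokens=512)
print(out["choices"][0]["text"])
```

Timings from a few generations at increasing context lengths: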
```
llama_print_timings: sample time = 1804.24 ms / 729 runs ( 2.47 ms per token, 404.05 tokens per second)
llama_print_timings: prompt eval time = 3652.04 ms / 144 tokens ( 25.36 ms per token, 39.43 tokens per second)
llama_print_timings: eval time = 94289.78 ms / 728 runs ( 129.52 ms per token, 7.72 tokens per second)
llama_print_timings: total time = 100932.23 ms
Output generated in 101.16 seconds (7.20 tokens/s, 728 tokens, context 144, seed 1690939106)

Llama.generate: prefix-match hit
llama_print_timings: load time = 3652.09 ms
llama_print_timings: sample time = 2548.89 ms / 1024 runs ( 2.49 ms per token, 401.74 tokens per second)
llama_print_timings: prompt eval time = 13158.02 ms / 751 tokens ( 17.52 ms per token, 57.08 tokens per second)
llama_print_timings: eval time = 141916.85 ms / 1023 runs ( 138.73 ms per token, 7.21 tokens per second)
llama_print_timings: total time = 159473.00 ms
Output generated in 159.71 seconds (6.41 tokens/s, 1024 tokens, context 886, seed 1686911609)

Llama.generate: prefix-match hit
llama_print_timings: load time = 3652.09 ms
llama_print_timings: sample time = 694.30 ms / 276 runs ( 2.52 ms per token, 397.52 tokens per second)
llama_print_timings: prompt eval time = 19746.02 ms / 1023 tokens ( 19.30 ms per token, 51.81 tokens per second)
llama_print_timings: eval time = 43975.35 ms / 275 runs ( 159.91 ms per token, 6.25 tokens per second)
llama_print_timings: total time = 64842.96 ms
Output generated in 65.07 seconds (4.23 tokens/s, 275 tokens, context 1909, seed 828516400)
```
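For anyone puzzling over the per-line rates versus the headline figure: each `llama_print_timings` line divides tokens by that stage's time alone, while the summary tokens/s divides generated tokens by total wall time (prompt eval + generation + sampling). Using the last run:

```python
eval_ms, eval_runs = 43975.35, 275         # generation stage
prompt_ms, prompt_tokens = 19746.02, 1023  # prompt ingestion stage

print(f"generation:  {eval_runs / (eval_ms / 1000):.2f} tokens/s")       # ~6.25
print(f"prompt eval: {prompt_tokens / (prompt_ms / 1000):.2f} tokens/s") # ~51.81
print(f"headline:    {eval_runs / 65.07:.2f} tokens/s")                  # ~4.23, over total time
```

So the headline throughput falls as context grows both because per-token eval time rises (129.52 -> 138.73 -> 159.91 ms across the three runs) and because prompt evaluation eats a larger share of the total.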