Memory usage

#2 · opened by Ainonake

Some information about memory usage.
Q4_0 barely fits into 64 GB of RAM plus 8 GB of VRAM with 2k context, leaving only a couple hundred megabytes free on my PC.
With this configuration, if you want a larger context window, I recommend using a smaller quant or renting a GPU server.

64 GB RAM + RTX 3060 Ti 8 GB VRAM, all layers offloaded to GPU in llama.cpp = 0.22 tokens/s
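For anyone trying to reproduce this, here is a minimal sketch of the same kind of setup through the llama-cpp-python bindings (the model filename, layer split, and thread count below are placeholders, not values from this thread; the equivalent llama.cpp CLI flags are -c, -ngl, and -t):

```python
from llama_cpp import Llama

# Hypothetical local path to a Q4_0 GGUF; point this at your own download.
MODEL_PATH = "./miqu-q4_0.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=2048,        # the 2k context mentioned above
    n_gpu_layers=-1,   # -1 offloads every layer to the GPU; lower it to keep layers in RAM
    n_threads=8,       # CPU threads for whatever stays on the CPU
)

out = llm("Write one sentence about llamas.", max_tokens=32)
print(out["choices"][0]["text"])
```

If VRAM runs out, dropping n_gpu_layers (or picking a smaller quant, as suggested above) is the usual fix.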

Overall, a very good model; the capabilities in languages other than English are especially surprising (at least in the original miqu). It also feels like the model has some "character", something similar to Goliath. In its writing, the model avoids the robotic language coming from GPT-4 and is more human-like.

Yeah, unusable on a 4090 + 128 GB. A 2-bit quant will cripple this.
Looking forward to a 13B or 20B version.

NeverSleep org

> Overall, a very good model; the capabilities in languages other than English are especially surprising (at least in the original miqu). It also feels like the model has some "character", something similar to Goliath. In its writing, the model avoids the robotic language coming from GPT-4 and is more human-like.

Thanks for the feedback!

> Yeah, unusable on a 4090 + 128 GB. A 2-bit quant will cripple this.
> Looking forward to a 13B or 20B version.

An IQ2_XS version allows a full offload on a 3090/4090 and won't cripple it too much (perplexity +0.7-0.9), and an IQ3_XXS will allow an 80-90% offload, depending on your context size, while having the output quality of a Q3_K_S.
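Rough back-of-the-envelope numbers on why that works out, assuming a ~70B-parameter model and approximate bits-per-weight for each quant type (both are my assumptions, and the exact figures vary a bit between llama.cpp versions):

```python
# Assumed, approximate bits-per-weight for a few llama.cpp quant types.
BPW = {"IQ2_XS": 2.31, "IQ3_XXS": 3.06, "Q3_K_S": 3.50, "Q4_0": 4.55}

PARAMS = 70e9  # assumption: roughly 70B parameters

for quant, bpw in BPW.items():
    size_gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant:8s} ~{size_gb:.0f} GB of weights")
```

IQ2_XS comes out around 20 GB of weights, which together with the KV cache just about fits a 24 GB 3090/4090, while IQ3_XXS is closer to 27 GB, hence the 80-90% offload.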

> Yeah, unusable on a 4090 + 128 GB. A 2-bit quant will cripple this.
> Looking forward to a 13B or 20B version.

That's only possible if Mistral has such a model and the will to release it.
Or if they have another client with yet another overly enthusiastic employee ;D
