Memory usage
Some information about memory usage.
Q4_0 barely fits into 64 GB RAM and 8 GB VRAM with a 2k context, leaving only a couple hundred megabytes free on my PC.
With this configuration, if you want a larger context window, I recommend using a smaller quant or renting a GPU server.
64 GB RAM + RTX 3060 Ti 8 GB VRAM, all layers offloaded to GPU in llama.cpp = 0.22 tokens/s
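For anyone who wants to try a similar setup, here is a minimal sketch using the llama-cpp-python bindings (the post above used llama.cpp directly; the model filename and the number of offloaded layers are placeholders, not values from this thread):

```python
from llama_cpp import Llama

# Roughly the setup described above: a Q4_0 70B GGUF, 2k context, and a
# partial GPU offload. Tune n_gpu_layers to whatever fits in 8 GB VRAM.
llm = Llama(
    model_path="miqu-1-70b.q4_0.gguf",  # hypothetical path, not from the thread
    n_ctx=2048,        # the 2k context window mentioned above
    n_gpu_layers=12,   # only as many layers as the 8 GB card can hold
)

out = llm("Write one sentence about llamas.", max_tokens=32)
print(out["choices"][0]["text"])
```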
Overall, a very good model; the capabilities in languages other than English are especially surprising (at least for the original miqu). It also feels like the model has some "character", similar to Goliath. In writing, the model avoids the robotic language of GPT-4 and is more human-like.
Yeah, unusable on a 4090 + 128 GB. A 2-bit quant will cripple this.
Looking forward to a 13B or 20B version.
Thanks for the feedback!
An IQ2_XS version allows a full offload on a 3090/4090 and won't cripple it too much (perplexity +0.7-0.9), and an IQ3_XXS will allow an 80-90% offload, depending on your context size, while having the output quality of a Q3_K_S.
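As a rough illustration of the full-offload case (again with llama-cpp-python; the filename is a placeholder, and how much context still fits in 24 GB VRAM depends on your settings, as noted above):

```python
from llama_cpp import Llama

# The IQ2_XS "full offload" case: n_gpu_layers=-1 asks llama.cpp to put
# every layer on the GPU, which a 3090/4090 can hold for this quant.
llm = Llama(
    model_path="miqu-1-70b.IQ2_XS.gguf",  # hypothetical path, not from the thread
    n_ctx=4096,        # larger contexts eat into the remaining VRAM
    n_gpu_layers=-1,   # -1 offloads all layers
)
```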
That's only possible if Mistral has such a model and the will to release it.
Or they have another client with yet another overly enthusiastic employee ;D