Llamacpp Quantizations of Ina-v11.1

Ina interprets persona definitions as executable instructions. The model follows <<CHARACTER_DESCRIPTION>> blocks with extremely high fidelity even during 10k–15k token erotic or dark-fiction role-play sessions.

Fine-tuned by BaiAI and Eric Hartford (QuixiAI) using QLoRA + DPO on large volumes of RP logs, creator-voice datasets, and persona modules.

Contributor Credits:


Quantized using llama.cpp.

Original model: https://huggingface.co/QuixiAI/Ina-v11.1

Run them in LM Studio

Run them directly with llama.cpp, or any other llama.cpp-based project.
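For example, a minimal llama.cpp invocation might look like the sketch below. The quant filename, context size, and GPU layer count are assumptions; adjust them to whichever file you download and to your hardware:

```bash
# Sketch only; adjust the filename to the quant you actually downloaded.
# -c   : context length in tokens
# -ngl : number of layers to offload to the GPU (set to 0 for CPU-only inference)
./llama-cli -m ./Ina-v11.1-Q4_K_M.gguf -c 8192 -ngl 99 -p "Hello, who are you?"
```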

Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Ina-v11.1-Q8_0.gguf | Q8_0 | 70GB | true | Extremely high quality, generally unneeded but max available quant. |
| Ina-v11.1-Q6_K.gguf | Q6_K | 54GB | true | Very high quality, near perfect, recommended. |
| Ina-v11.1-Q5_K_M.gguf | Q5_K_M | 47GB | true | High quality, recommended. |
| Ina-v11.1-Q5_K_S.gguf | Q5_K_S | 38GB | true | High quality, recommended. |
| Ina-v11.1-Q4_K_M.gguf | Q4_K_M | 40GB | true | Good quality, default size for most use cases, recommended. |
| Ina-v11.1-Q4_1.gguf | Q4_1 | 41GB | true | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| Ina-v11.1-Q4_K_S.gguf | Q4_K_S | 38GB | true | Slightly lower quality with more space savings, recommended. |
| Ina-v11.1-Q4_0.gguf | Q4_0 | 37GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| Ina-v11.1-IQ4_NL.gguf | IQ4_NL | 38GB | true | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| Ina-v11.1-IQ4_XS.gguf | IQ4_XS | 36GB | true | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| Ina-v11.1-Q3_K_L.gguf | Q3_K_L | 35GB | true | Lower quality but usable, good for low RAM availability. |
| Ina-v11.1-Q3_K_M.gguf | Q3_K_M | 32GB | true | Low quality. |
| Ina-v11.1-IQ3_M.gguf | IQ3_M | 30GB | true | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| Ina-v11.1-Q3_K_S.gguf | Q3_K_S | 29GB | true | Low quality, not recommended. |
| Ina-v11.1-IQ3_XS.gguf | IQ3_XS | 27GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |

Downloading using huggingface-cli


First, make sure you have huggingface-cli installed:

```bash
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```bash
huggingface-cli download QuixiAI/Ina-v11.1-gguf --include "Ina-v11.1-Q4_K_M.gguf" --local-dir ./
```

If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run:

```bash
huggingface-cli download QuixiAI/Ina-v11.1-gguf --include "Ina-v11.1-Q8_0/*" --local-dir ./
```

You can either specify a new local-dir (e.g. Ina-v11.1-Q8_0) or download them all in place (./).
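Once the shards are downloaded, point llama.cpp at the first shard and it will load the remaining ones automatically, as long as they sit in the same directory. A minimal sketch, assuming the Q8_0 quant is split into two parts (the shard name below is illustrative; check the actual filenames in the downloaded folder):

```bash
# llama.cpp picks up the -0000N-of-0000M companions of the first shard on its own.
./llama-cli -m ./Ina-v11.1-Q8_0/Ina-v11.1-Q8_0-00001-of-00002.gguf -p "Hello"
```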

Which file should I choose?


A great write-up with charts comparing the performance of the various quant types is provided by Artefact2 here

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
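If you're not sure what you have, here's one quick way to check on a Linux machine with an NVIDIA GPU (just an example; use Task Manager, Activity Monitor, or rocm-smi as appropriate for your setup):

```bash
# Total VRAM per NVIDIA GPU (requires the NVIDIA driver / nvidia-smi)
nvidia-smi --query-gpu=memory.total --format=csv,noheader
# Total system RAM
free -h | awk '/^Mem:/ {print $2}'
```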

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.

If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.

If you want to get more into the weeds, you can check out this extremely useful feature chart:

llama.cpp feature matrix

But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.

These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-quality tradeoff you'll have to weigh.
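If you want to measure that tradeoff on your own hardware, llama.cpp ships with a llama-bench tool you can point at two similarly sized quants; a rough sketch (the filenames are just examples from the table above):

```bash
# CPU-only throughput comparison: an I-quant vs a K-quant of similar size
./llama-bench -m Ina-v11.1-IQ3_M.gguf -ngl 0
./llama-bench -m Ina-v11.1-Q3_K_M.gguf -ngl 0
```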

Credits

I copied Bartowski's model card and made it my own, cheers!
