(GGUF) New Flash Attention Implementation Without Tensor Cores
Using the Flash Attention implementation in KoboldCPP, it is possible to fit 16K of context into 8GB of VRAM @Q4_K_M.
When running off an iGPU I can fit 16K @Q5_K_S with FA and a 512 batch size into 8GB. For the usual use case, with a monitor running on the GPU, it's still possible. This is with one monitor on my GPU using 16K context.
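For reference, a launch along these lines is what I mean (a minimal sketch; flag names are from memory, so double-check them against `koboldcpp.py --help`, and the model path and layer count are placeholders):

```python
# Minimal launch sketch, assuming these KoboldCpp flags exist as I remember them.
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "your-model-Q4_K_M.gguf",  # placeholder path
    "--contextsize", "16384",             # 16K context
    "--blasbatchsize", "512",             # the 512 batch size mentioned above
    "--flashattention",                   # the new FA path
    "--usecublas",                        # CUDA backend
    "--gpulayers", "99",                  # placeholder: offload as many layers as fit
])
```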
And FA support for cards without tensor cores is coming: https://github.com/LostRuins/koboldcpp/issues/844
That's great news. Hurray. I'll still keep my usual recommendation for now because of the Tensor Core requirement, but if that's lifted I'll add this as an additional recommendation if the speeds are good.
Just adding a small data point: with KoboldCPP compiled with this, using a Q8_K 11B model on a 2 x 1080 Ti (Pascal) setup, I get:
~20.2 T/s avg (proc + gen) with FP32 FA enabled.
~13.4 T/s avg (proc + gen) with FP32 FA disabled.
So a significant improvement in my case, whereas with FP16 FA I saw a decrease (which tracks, since consumer Pascal cards have very limited FP16 throughput). So it definitely has utility for a subset of users.
This and the PR graphs look very promising!
@ABX-AI @Virt-io
Using Nexesenex's KCPP since it already merged this; things look good, performance is good, and it seems to work well.
I've only seen a slight increase in processing speed with FA, from like 1K t/s to 1.1K t/s when ingesting 8K context (Turing).
I imagine it'll be a big deal for Pascal users though.
I'm trying out how well it squishes Phi-3's context now.
16K without FA
16K with FA
Phi-3 is cursed with insane memory usage; it's worse than Llama 3 and somehow uses like 2GB of extra VRAM.
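If I'm reading the public configs right, that tracks with the KV cache math: Phi-3-mini doesn't use GQA, so its cache per token is a lot bigger than Llama 3 8B's. A rough sketch (the layer/head numbers are my assumptions from the configs, and real usage adds weights, compute buffers, etc.):

```python
# Back-of-the-envelope KV cache size at FP16, ignoring everything except the cache itself.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem  # 2x for K and V

ctx = 16384
llama3 = kv_cache_bytes(32, 8, 128, ctx)   # Llama 3 8B: GQA with 8 KV heads -> ~2.0 GiB
phi3   = kv_cache_bytes(32, 32, 96, ctx)   # Phi-3-mini: no GQA, 32 KV heads -> ~6.0 GiB
print(f"Llama 3 8B: {llama3 / 2**30:.1f} GiB | Phi-3-mini: {phi3 / 2**30:.1f} GiB")
```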
I noticed some higher token numbers too, but didn't compare directly to get accurate measurements. It's at least not worse, and bigger context for the same amount of VRAM is a win-win if the quality remains the same.
I can at least continue to act smug over the EXL2 users and cope that LlamaCpp is the best thing to ever exist.
I am unsure if this is caused by Flash Attention F32, but Llama 3 is suddenly running at 50+ t/s?
These are fresh responses too, not regens. It's kinda insane.
Is it possible that the gains are also from CUDA 12?
Or did you test against CUDA 12 koboldcpp?
Old testing was done with the CUDA 12 Nexesenex forks (their forks have been on Cublas 12+ since like V1.58?)
New testing uses Nexesenex forks too, Cublas 12.2
@Ardvark123 You can test quickly using Nexesenex's KoboldCpp; it's good news indeed.
It gets even better for older GPUs.
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.66d_b2902%2B2
@saishf
Original discussion started here.
Discussion continues here.
We're eating good boys.
- Lewdiculous
Quick testing: CUDA 11.7 runs a smidge faster on my Turing card, by like 3 t/s on average. Why? I don't know.
Same vram usage, and same context ingest speed for me
I'll do some benchmarks when I'm sat at my own machine again. Curious to see if I get the same results on my Pascal.
I'm just glad people aren't doing what Nvidia wants and culling all the old GPUs to buy 40 series
I'm patiently waiting until the A6000 & A100-pcie become the new P40
We're eating even better now.
N-Nani? Is this with the Nexesenex version?
I mean, those speeds beat exl2, and by far, which is its main "selling point". Damn, 50 t/s at 10K.
@saishf What's the quant used in these examples? Q4_K_M?
My Q4_K_M-imat quants report 4.89 BPW.
Let us know, @saishf, tell the secrets!
Are the numbers correct?
@Meggido From what I hear about EXL2 now, the other selling point is the 4-bit KV cache for context, which makes context much more memory efficient; we're still waiting for that implementation in GGUF form.
One of the obstacles was getting Flash Attention in, and that was initially done 2 weeks ago. Now we wait for the Q4 cache. That will be truly huge.
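For a sense of scale, the same back-of-the-envelope formula as the Phi-3 comparison above, just with fewer bits per element (I'm assuming roughly 4.5 bits/element for a Q4-style cache once block scales are included; treat the exact figure as a guess):

```python
# Rough KV cache comparison for Llama 3 8B at 16K context: FP16 vs a ~4-bit cache.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

ctx = 16384
fp16_cache = kv_cache_gib(32, 8, 128, ctx, 2.0)      # ~2.0 GiB
q4_cache   = kv_cache_gib(32, 8, 128, ctx, 4.5 / 8)  # ~0.56 GiB
print(f"FP16: {fp16_cache:.2f} GiB | ~4-bit: {q4_cache:.2f} GiB")
```

Roughly a quarter of the cache memory for the same context, hence the hype.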
Relevant links to keep the copium tanks filled:
Issue
Discussion
Oh yeah, that's exactly what I'm using at the moment: exl2 + 4-bit cache (more or less as accurate as FP16 from what I've read).
But yeah, if GGUF is getting that fast with... let's say Q6_K (since I use 6.5bpw), I might reconsider.
Honestly, Q5_K_S/M (that 5.50 BPW+ range) is the sweet spot for GGUF quants in my opinion. That's awesome speed and quality!
@Lewdiculous I've been meaning to ask this, but is the difference between Q4_K_M and Q5_K_M that great in terms of intelligence/creativity? I know it's a fairly subjective question, and I know imatrix gives a boost.
Also, unfortunately the Nexesenex builds force my 2060 6GB to use more VRAM, FA on or off. I'm hoping it's possible down the line to set FP32 FA with a flag, since my tensor cores are likely too few (and immature compared to the 30xx line). I did finally disable MMQ though, which gave an unexpected increase since the last time I'd checked (kobold 1.65 main).
@Sovy I can't comment much on the state of the 2060 for FA; you might want to open an issue or add to the existing ones asking for an option to force the new implementation with a flag. That'd be handy.
For quants, I usually recommend this gist write-up, you can check the tables and the graphs to see how each quant stacks up:
https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9
Q4 is where the quants get decently good and widely usable. Q5 is the best balance of size and quality in my opinion, with minimal loss, although for my usage the Q4s are enough.
Imatrix calibration helps to bring them up closer, especially the Q4s.
The IQ3s "work", but they're at the limit, for really VRAM-starved situations; they should benefit a lot from imatrix calibration.
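As a rough rule of thumb for picking a quant that fits, file size is roughly parameter count times BPW; a quick sketch (back-of-the-envelope only, real GGUF files add some overhead for metadata and the tensors kept at higher precision):

```python
# Estimate a GGUF file size from parameter count and bits-per-weight (BPW).
def approx_size_gib(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 2**30

# Example: an 8B model at the 4.89 BPW reported for Q4_K_M vs ~5.5 BPW for Q5_K_S/M.
for bpw in (4.89, 5.5):
    print(f"8B @ {bpw} BPW ~ {approx_size_gib(8.0e9, bpw):.1f} GiB")
```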
@Sovy I'd open an issue/question, as I see lower VRAM usage, by about 200MB, with my 2080 using Nexesenex's builds.
I believe the tensor cores are the same, so I don't know what would be causing the difference.
I learnt that 3-bit quants still aren't great for constant usage; a 13B at 3-bit wasn't good compared to an 11B at 4-bit. Both were with imatrix, and the 3-bit felt like it had short-term memory loss.
5-bit Llama 3 just feels more like the character. Not really smarter or more creative, just a little more like the character it's portraying.
Short version, with that considered:
Q3 is really the absolute limit, not good 'enough' and only for situations where nothing else is possible. Q4 is the point where things are really usable for me, with reasonable quality and great speed. Q5 is a great balance of quality and speed.
Benchmarks at 16K Context...
GPU: Stock Speeds - GTX 1070Ti (Pascal)
For me, CUDA 12.2 gave a 3.1% improvement in final T/s speeds over CUDA 11.7.
VRAM usage was identical.
GPU: Overclocked - GTX 1070Ti (Pascal)
Faster overall, with smaller differences between the two.
For me, CUDA 12.2 gave a 1.0% improvement in final T/s speeds over CUDA 11.7.
VRAM usage was identical.
Officially part of the original KCPP now.
Just got through testing it. Woooo it is so good.
Since this is now implemented officially in KoboldCpp and is considered to be working as intended, this will be marked as closed, but should anything about it need to be discussed, reply anyway and it shall be opened again!