(GGUF) New Flash Attention Implementation Without Tensor Cores

#11
by Lewdiculous - opened
LWDCLS Research org • edited May 16

Using the Flash Attention implementation in KoboldCPP it is possible to fit 16K context into 8GB of VRAM @ Q4_K_M.
When running the display off an iGPU I can fit 16K @ Q5_K_S with FA and a 512 batch size into 8GB.

For the usual use case, with a monitor running on the GPU, it's still possible. This is with one monitor on my GPU at 16K context.
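
Not from the thread, just a rough sketch of what that kind of setup looks like if you drive llama.cpp from Python instead of the KoboldCPP launcher. It assumes a recent llama-cpp-python build that exposes the `flash_attn` option; the model path is a placeholder.

```python
# Minimal sketch: load a Q4_K_M GGUF with Flash Attention, 16K context,
# and a 512 prompt-processing batch, roughly mirroring the settings above.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",  # placeholder path to a Q4_K_M quant
    n_gpu_layers=-1,                 # offload all layers to the GPU
    n_ctx=16384,                     # 16K context
    n_batch=512,                     # 512 batch size for prompt ingestion
    flash_attn=True,                 # enable the Flash Attention path
)

print(llm("Hello,", max_tokens=16)["choices"][0]["text"])
```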

And FA support for cards without tensor cores is coming: https://github.com/LostRuins/koboldcpp/issues/844


That's great news. Hurray. I still keep my usual recommendation for now because of the Tensor Core requirements, but if that's lifted I'll add it as an additional recommendation if speeds are good.

Just adding a small data point: with KoboldCPP compiled with this, running a Q8_K 11B model on a 2x 1080 Ti (Pascal) setup, I get:

~20.2 T/s avg (proc + gen) with FP32 FA enabled.
~13.4 T/s avg (proc + gen) with FP32 FA disabled.
So a significant improvement in my case, whereas with FP16 FA I saw a decrease. So it definitely has utility for a subset of users.
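
For reference, the relative speedup those two averages imply, plain arithmetic on the numbers quoted above:

```python
# Relative speedup from enabling FP32 FA on the 2x 1080 Ti setup above.
fa_on, fa_off = 20.2, 13.4  # avg T/s (proc + gen) as reported

speedup = fa_on / fa_off
print(f"FP32 FA speedup: {speedup:.2f}x ({(speedup - 1) * 100:.0f}% faster)")
# -> roughly 1.51x, i.e. about 51% faster
```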

This and the PR graphs look very promising!



@ABX-AI @Virt-io

Using Nexesenex's KCPP since it already merged it, things look good, performance is good and it seems to work well.


> Using Nexesenex's KCPP since it already merged it, things look good, performance is good and it seems to work well.

I've only seen a slight increase in processing speed with FA, from like 1K T/s to 1.1K T/s when ingesting 8K context (Turing).
I imagine it'll be a big deal for Pascal users though.
I'm trying out how well it squishes Phi-3's context now.


16K without FA: (screenshot)
16K with FA: (screenshot)
Phi-3 is cursed with insane memory usage; it's worse than Llama-3 and somehow uses like 2GB of extra VRAM.


I noticed some higher token numbers too but didn't compare directly to get accurate measures. It's at least not worse, and bigger context for the same amount of VRAM is a win-win if the quality remains the same.

I can at least continue to act smug over the EXL2 users and cope that LlamaCpp is the best thing to ever exist.


I am unsure if this is caused by Flash Attention F32, but Llama-3 is suddenly running @ 50+ T/s?
These are fresh responses too, not regens

It's kinda insane 😭


Is it possible that the gains are also from CUDA 12?

Or did you test against CUDA 12 koboldcpp?


> Is it possible that the gains are also from CUDA 12?
>
> Or did you test against CUDA 12 koboldcpp?

Old testing was done with the CUDA 12 Nexesenex forks (their forks have been on cuBLAS 12+ since like v1.58?)
New testing uses the Nexesenex forks too, cuBLAS 12.2


@Ardvark123 You can test quickly using Nexesenex's KoboldCpp, it's good news indeed.


It gets even better for older GPUs:
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.66d_b2902%2B2

@saishf

Original discussion started here.

Discussion continues here.

LWDCLS Research org

We're eating good, boys.
– Lewdiculous

Lewdiculous changed discussion title from (GGUF) Flash Attention Without Tensor Cores to (GGUF) New Flash Attention Implementation Without Tensor Cores

Quick testing. 11.7 runs a smidge faster on my Turing card, by like 3 T/s on average. Why? I don't know 😺
Same VRAM usage, and same context ingest speed for me.

LWDCLS Research org

I'll do some benchmarks when I'm sat at my own machine again. Curious to see if I get the same results on my Pascal.

I'm just glad people aren't doing what Nvidia wants and culling all the old GPUs to buy 40 series
I'm patiently waiting until the A6000 & A100-pcie become the new P40

With the recent release I'm seeing some insane speeds at low context 😭


That's Llama-3 at 100 T/s on a 2080 😶

LWDCLS Research org

We're eating even better now.

N-Nani? Is this with the Nexesenex version?

LWDCLS Research org • edited May 17

Yes, @Meggido! It will be merged into the next version of official KoboldCpp as well.

I mean, those speeds beat (and by far) exl2, which is the main "selling point." Damn, 50 T/s at 10K 😮.

@saishf What's the quant used for these examples? Q4_K_M?

LWDCLS Research org • edited May 17

My Q4_K_M-imat quants report 4.89 BPW.
Let us know, @saishf, tell the secrets!

Are the numbers correct?
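
For context, BPW for a GGUF quant is usually just the file size over the parameter count. A tiny illustrative sketch; the file size and parameter count below are made-up round numbers, not the exact figures behind that 4.89 BPW report:

```python
# Illustrative only: bits-per-weight as it is commonly estimated for a GGUF
# quant, i.e. total file size in bits divided by the parameter count.
def bits_per_weight(file_size_bytes: int, n_params: int) -> float:
    return file_size_bytes * 8 / n_params

# Hypothetical numbers for an ~11B-parameter file of ~6.7 GB:
print(f"{bits_per_weight(6_700_000_000, 11_000_000_000):.2f} BPW")  # ~4.87
```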


@Meggido From what I hear about EXL2 now, the other selling point is the 4-bit KV cache for context, which makes context much more memory efficient; we're still waiting for that implementation in GGUF form.

One of the obstacles was getting Flash Attention in, and that was initially done two weeks ago. Now we wait for the Q4 cache. That will be truly huge.

Relevant links to keep the copium tanks filled:

Issue

Discussion
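
To put rough numbers on why a 4-bit KV cache matters, here's a back-of-the-envelope sketch. The model dimensions (32 layers, 8 KV heads of size 128, a Llama-3-8B-like GQA layout) are assumptions for illustration, and a real quantized cache adds some overhead for scales:

```python
# Back-of-the-envelope KV-cache sizing under assumed Llama-3-8B-like dims.
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    # 2x for keys and values, one cache entry per layer per token
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

n_ctx = 16384
fp16 = kv_cache_bytes(n_ctx, bytes_per_elem=2.0)   # FP16 cache
q4   = kv_cache_bytes(n_ctx, bytes_per_elem=0.5)   # ~4-bit cache (scales ignored)
print(f"FP16 KV cache : {fp16 / 2**30:.1f} GiB")   # ~2.0 GiB at 16K context
print(f"4-bit KV cache: {q4 / 2**30:.1f} GiB")     # ~0.5 GiB at 16K context
```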

Oh yeah, that's exactly what I'm using at the moment: exl2 + 4-bit cache (more or less as accurate as FP16 from what I've read).
But yeah, if GGUF is getting that fast with... let's say Q6_K (since I use 6.5bpw), I might reconsider 😁.

> I mean, those speeds beat (and by far) exl2, which is the main "selling point." Damn, 50 T/s at 10K 😮.
>
> @saishf What's the quant used for these examples? Q4_K_M?

Q5_K_S, 5.57 BPW.
I don't know which EXL2 backend is the equivalent to KoboldCpp for comparisons.

LWDCLS Research org • edited May 17

Honestly, Q5_K_S/M (that 5.50 BPW+ range) is the sweet spot for GGUF quants in my opinion. That's awesome speed and quality!

@Lewdiculous I've been meaning to ask this, but is the difference between 4KM vs 5KM that great in terms of intelligence/creativity? I know it's a fairly subjective question, and I know imatrix gives a boost.

Also, unfortunately the Nexesenex builds force my 2060 6GB to use more VRAM, FA on or off. I'm hoping it's possible down the line to set FA32 with a flag, since my tensor cores are likely too few (and immature compared to the 30xx line). I did finally disable MMQ though, which gave an unexpected increase since the last time I'd checked (kobold 1.65 main).

LWDCLS Research org • edited May 17

@Sovy I can't comment much on the state of the 2060 for FA, you might want to open an issue or add to the existing ones for the option to force the new implementation with a flag. That'd be handy.

For quants, I usually recommend this gist write-up, you can check the tables and the graphs to see how each quant stacks up:

https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

Q4 is where the quants get decently good and widely usable. Q5 is the best balance of size and quality in my opinion, with minimal loss, although for my usage the Q4s are enough.

Imatrix calibration helps to bring them up closer, especially the Q4s.

The IQ3s "work" but it's at the limit, for really VRAM starved situations, they should benefit a lot from imatrix calibration.

> @Lewdiculous I've been meaning to ask this, but is the difference between 4KM vs 5KM that great in terms of intelligence/creativity? I know it's a fairly subjective question, and I know imatrix gives a boost.
>
> Also, unfortunately the Nexesenex builds force my 2060 6GB to use more VRAM, FA on or off. I'm hoping it's possible down the line to set FA32 with a flag, since my tensor cores are likely too few (and immature compared to the 30xx line). I did finally disable MMQ though, which gave an unexpected increase since the last time I'd checked (kobold 1.65 main).

I'd open an issue/question, as I see lower VRAM usage (by about 200MB) with my 2080 using Nexesenex's builds.
I believe the tensor cores are the same, so I don't know what would be causing the difference.

> @Sovy I can't comment much on the state of the 2060 for FA, you might want to open an issue or add to the existing ones for the option to force the new implementation with a flag. That'd be handy.
>
> For quants, I usually recommend this gist write-up, you can check the tables and the graphs to see how each quant stacks up:
>
> https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9
>
> Q4 is where the quants get decently good and widely usable. Q5 is the best balance of size and quality in my opinion, with minimal loss, although for my usage the Q4s are enough.
>
> Imatrix calibration helps to bring them up closer, especially the Q4s.
>
> The IQ3s "work" but it's at the limit, for really VRAM starved situations, they should benefit a lot from imatrix calibration.

I learnt that 3-bit quants still aren't great for constant usage; a 13B 3-bit wasn't good compared to a 4-bit 11B. Both with imatrix, it felt like it had short-term memory loss.
5-bit Llama-3 just feels more like the character. Not really smarter or more creative, just a little more like the character it's portraying.

LWDCLS Research org • edited May 18

Short version with that considered:

Q3 is really the absolute limit, not good enough, and only for situations where nothing else is possible; Q4 is the point where things become really usable for me, with reasonable quality and great speed; Q5 is a great balance of quality and speed.

LWDCLS Research org • edited May 18

Benchmarks at 16K Context...

GPU: Stock Speeds - GTX 1070Ti (Pascal)

For me, CUDA 12.2 gave a 3.1% improvement in final T/s speed over CUDA 11.7.
VRAM usage was identical.

GPU: Overclocked - GTX 1070Ti (Pascal)

Faster overall, with smaller differences between the two.
For me, CUDA 12.2 gave a 1.0% improvement in final T/s speed over CUDA 11.7.
VRAM usage was identical.

LWDCLS Research org

Officially part of the original KCPP now.

Just got through testing it. Woooo it is so good.

LWDCLS Research org

Since this is now implemented officially in KoboldCpp and is considered working as intended, this will be marked as closed, but should anything need to be discussed about it, reply anyway and it shall be opened again!

Lewdiculous changed discussion status to closed
