Gemma 2's Flash Attention 2 implementation is strange...
I tested with torch.manual_seed(0).
eager attention => normal result
flash attention 2 => 1's not. The 2's "to be's for.3' for4. 2 That 4 2 the 4 that 4 for. 4's 4' to 4''' the 4'' to. 4' 4 4 4to lose to. 4 the' 4 4 4' 4' 4 the 4 the 4 4 4 ...
It is almost the same as having no attention at all.
With "eager", it works well.
Yes, it should be fixed if you install the new version of flash-attention from source.
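For reference, the usual command for building it from source is something like the following (compiling the CUDA kernels can take a long time):

pip install -U flash-attn --no-build-isolation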
I installed it yesterday 😅
And on Windows, so it took a few hours 😨
pip freeze | findstr flash-attn
flash-attn==2.5.9.post1
It took 2 hours, but I finally installed flash-attention >= 2.6.0.
Output sample after the upgrade:
4. 4: 4's over 4. I 4's. 4's. 4.
5's a great. 5' 5' 5' 6' 6. 6' 6' 6' 6' 6. 6' 6' 6' 6' 6 7. 7' 7' 7' 8' 8 8' 8' 8 8 9 9 9 9' 9 9 0 9 9 9 9 9 9 9 ...
It is still producing weird output.
pip freeze | findstr flash-attn
flash-attn==2.6.0.post1
pip install --upgrade transformers
- Go to transformers/utils/import_utils.py and change line 815:
def is_flash_attn_greater_or_equal(str_version):
    if not _is_package_available("flash_attn"):
        return False
    return version.parse(importlib.metadata.version("flash_attn")) >= version.parse(str_version)
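A quick way to check whether that gate passes in your environment (assuming the helper is importable from transformers.utils; otherwise import it from transformers.utils.import_utils):

from transformers.utils import is_flash_attn_greater_or_equal

# Should print True once flash-attn >= 2.6.0 is installed
print(is_flash_attn_greater_or_equal("2.6.0"))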
Still facing the same issues...
Yeah, I installed 2.6 from 11 July via pip and it still does not work. I thought this was because pip did not include the latest fix.
flash-attn 2.6.0.post1
Maybe something went wrong with the install, even though I am on the latest transformers.
I'm investigating (hard)
I made two scripts (with torch.manual_seed(0)) that are EXACTLY the same except for the max_new_tokens kwarg.
32 as max_new_tokens gives a normal output, but 1024 as max_new_tokens gives just weird outputs.
I tested with all powers of two and concluded that the bigger max_new_tokens is, the worse the precision gets.
1-126: good
256 or more: bad
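A sketch of what such a sweep can look like, assuming model, tokenizer and inputs are set up as in the earlier snippet with attn_implementation="flash_attention_2":

import torch

# Sweep max_new_tokens over powers of two and inspect where the output degrades
for max_new_tokens in (32, 64, 128, 256, 512, 1024):
    torch.manual_seed(0)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"--- max_new_tokens={max_new_tokens} ---")
    print(text[-200:])  # the tail of the generation is where it degrades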
Another problem with Flash Attention 2! I'm tired of all these stupid bugs!
There is no direct link to the max_new_tokens kwarg.
attn_implementation="sdpa"
Does this work?
SDPA = Scaled Dot Product Attention = the equivalent of Flash Attention, but in native PyTorch.
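Concretely, a minimal load sketch (the checkpoint name is a placeholder):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",        # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",    # PyTorch's scaled_dot_product_attention backend
).to("cuda")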
aha, but does this work?
But SDPA is ~ 5 times slower
So I recommend just google/gemma-2b or google/gemma-9b.
Why would you use it? With FA it is not that slow.
But FA2 with Gemma 2 is buggy.
I mean: without using FA at all, is it faster than with SDPA or not? I will try it, though.
I made tests:
The eager attention implementation took 13.01 seconds to complete on 100 tokens with a context length of 1k.
The sdpa attention implementation took 11.83 seconds to complete on 100 tokens with a context length of 1k.
The flash_attention_2 attention implementation took 11.30 seconds to complete on 100 tokens with a context length of 1k (but is weird).
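For what it's worth, a rough sketch of the kind of timing loop behind numbers like these (checkpoint name and prompt are placeholders; it times 100 new tokens on a ~1k-token context for each implementation):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
long_prompt = "word " * 1000        # crude ~1k-token context
inputs = tokenizer(long_prompt, return_tensors="pt").to("cuda")

for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, attn_implementation=impl
    ).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=100, do_sample=False)
    torch.cuda.synchronize()
    print(f"The {impl} attention implementation took "
          f"{time.perf_counter() - start:.2f} seconds for 100 tokens.")
    del model
    torch.cuda.empty_cache()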
Awesome, sdpa is a bit faster and probably uses less memory as well.
For me FA2 generates much longer text, so it is around 2x slower than PyTorch.
I get the same time and memory usage for both sdpa and eager, so it probably does not work. Around 9.3 GB VRAM in 4-bit.
And I installed 2.6.1; it still does not work well.
What is your GPU?
3090
Not sure, should flash attention work with 4-bit as well?
The script worked for you? I use quite a similar one. I guess there will be an FA update soon.
Thanks for clarifying; then FA2 is the problem. I will do a complete reinstall afterwards.
The time difference between them is mainly caused by the time to first token, so it's not a problem to use FA1 (a.k.a. SDPA).
Started process.
The eager attention implementation took 1.53s to first token.
Started process.
The sdpa attention implementation took 0.14s to first token.
Started process.
The flash_attention_2 attention implementation took 0.15s to first token.
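One way to measure time to first token is with a streamer, e.g. this sketch (assumes model, tokenizer and inputs from the earlier snippets; note the streamer yields decoded text chunks, so the first chunk is only an approximation of the first token):

import time
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
generation = Thread(
    target=model.generate,
    kwargs=dict(**inputs, max_new_tokens=100, streamer=streamer),
)
start = time.perf_counter()
generation.start()
first_chunk = next(iter(streamer))  # blocks until the first decoded chunk arrives
print(f"time to first token: {time.perf_counter() - start:.2f}s")
generation.join()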
You can now install transformers-4.43.0.dev0 from source!
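i.e. something like:

pip install git+https://github.com/huggingface/transformers.git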
But it doesn't fix FA2:
Started process.
The eager attention implementation took 13.43s to infer 100 tokens.
The eager attention works great!
Started process.
The sdpa attention implementation took 12.06s to infer 100 tokens.
The sdpa attention works great!
Started process.
The flash_attention_2 attention implementation took 11.53s to infer 100 tokens.
The flash_attention_2 attention is weird!
With transformers-4.43.0.dev0 and flash-attn 2.6.1, flash attention 2 seems a bit off, but not too much.
On longer contexts, it fails.
Wait... I think I've got it
what is the solution actually?
Use Gemma 1 👇
Perfect, thanks.
It also seems like just 4-5 lines to add in the gemma2 modeling file, so it is pretty easy to apply.
It works just by changing these lines; it is a bit slower than without flash attention and it uses the same amount of memory.
Maybe there is still something broken.
It does output a good response.
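For context on what the change relates to (as I understand it): Gemma 2 applies logit softcapping inside attention, and flash-attn only gained a matching softcap argument in 2.6.0, so the FA2 path has to pass it through. A tiny standalone sketch of that kernel argument, assuming flash-attn >= 2.6.0 exposes softcap and that 50.0 matches Gemma 2's attn_logit_softcapping (both are my assumptions, not taken from this thread):

import torch
from flash_attn import flash_attn_func

# Random toy tensors in flash-attn's (batch, seqlen, num_heads, head_dim) layout
q = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.bfloat16)

# softcap is assumed to be available from flash-attn 2.6.0 onward;
# 50.0 mirrors (to my knowledge) Gemma 2's attn_logit_softcapping
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)
print(out.shape)  # (1, 16, 8, 64)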
Started process with the eager attn_implementation.
The eager attn_implementation took 15.17s to infer {tokens} tokens.
Started process with the sdpa attn_implementation.
The sdpa attn_implementation took 21.51s to infer {tokens} tokens.
Started process with the flash_attention_2 attn_implementation.
The flash_attention_2 attn_implementation took 30.53s to infer {tokens} tokens.
Yes, something very wrong. Probably won't be fixed.
This might fix it, at least the memory part: https://github.com/huggingface/transformers/pull/31292
I know, but we need to ask for it to be applied to gemma2, not only to gemma (1).
All Gemmas are included, as far as I know.
I looked at the commits, and it changes the global generation utils AND "the most used models", which includes Gemma (1) but not Gemma 2.
Gemma2 was not released yet when I started this, but don't worry I will add it as well, it's on the roadmap 🤗