Gemma 2's Flash Attention 2 implementation is strange...
I tested with torch.manual_seed(0).
eager attention => normal result
flash attention 2 => 1's not. The 2's "to be's for.3' for4. 2 That 4 2 the 4 that 4 for. 4's 4' to 4''' the 4'' to. 4' 4 4 4to lose to. 4 the' 4 4 4' 4' 4 the 4 the 4 4 4 ...
It is almost the same as having no attention at all.
With "eager", it works well.
Yes, it should be fixed if you install the new version of flash-attention from source.
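For reference, the usual command for building it from source is something like the following (compiling the CUDA kernels can take a long time):

pip install -U flash-attn --no-build-isolation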
I installed it yesterday 😅
And on Windows, so it took a few hours 😨
pip freeze | findstr flash-attn
flash-attn==2.5.9.post1
It took 2 hours, but I finally installed flash-attention >= 2.6.0.
Output sample after the upgrade:
4. 4: 4's over 4. I 4's. 4's. 4.
5's a great. 5' 5' 5' 6' 6. 6' 6' 6' 6' 6. 6' 6' 6' 6' 6 7. 7' 7' 7' 8' 8 8' 8' 8 8 9 9 9 9' 9 9 0 9 9 9 9 9 9 9 ...
It is still producing weird output.
pip freeze | findstr flash-attn
flash-attn==2.6.0.post1
pip install --upgrade transformers
- Go to transformers/utils/import_utils.py and change line 815:
def is_flash_attn_greater_or_equal(str_version):
    if not _is_package_available("flash_attn"):
        return False
    return version.parse(importlib.metadata.version("flash_attn")) >= version.parse(str_version)
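A quick way to check whether that gate passes in your environment (assuming the helper is importable from transformers.utils; otherwise import it from transformers.utils.import_utils):

from transformers.utils import is_flash_attn_greater_or_equal

# Should print True once flash-attn >= 2.6.0 is installed
print(is_flash_attn_greater_or_equal("2.6.0"))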
Still facing the same issues...
Yeah, I installed 2.6 from 11 July via pip and it still does not work. I thought this was because pip did not include the latest fix.
flash-attn 2.6.0.post1
Maybe something went wrong with the install, even though I am on the latest transformers.
I'm investigating (hard)
I made two scripts (with torch.manual_seed(0)) that are EXACTLY the same except for the max_new_tokens kwarg.
32 as max_new_tokens gives a normal output, but 1024 as max_new_tokens gives just weird outputs.
I tested with all powers of two and concluded that the bigger max_new_tokens is, the worse the precision gets.
1-126: good
256 or more: bad
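A sketch of what such a sweep can look like, assuming model, tokenizer and inputs are set up as in the earlier snippet with attn_implementation="flash_attention_2":

import torch

# Sweep max_new_tokens over powers of two and inspect where the output degrades
for max_new_tokens in (32, 64, 128, 256, 512, 1024):
    torch.manual_seed(0)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"--- max_new_tokens={max_new_tokens} ---")
    print(text[-200:])  # the tail of the generation is where it degrades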
Another problem with Flash Attention 2! I'm tired of all these stupid bugs!
There is no direct link to the max_new_tokens kwarg.
attn_implementation="sdpa"
Does this work?
SDPA = Scaled Dot Product Attention = the equivalent of Flash Attention, but in native PyTorch.
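Concretely, a minimal load sketch (the checkpoint name is a placeholder):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",        # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",    # PyTorch's scaled_dot_product_attention backend
).to("cuda")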
aha, but does this work?
But SDPA is ~ 5 times slower
So I recommend just google/gemma-2b or google/gemma-9b.
Why would you use it? With FA it is not that slow.
But FA2 with Gemma 2 is buggy.
I mean: without using FA at all, is it faster than with SDPA or not? I will try it, though.
I made tests:
The eager attention implementation took 13.01 seconds to complete on 100 tokens with a context length of 1k.
The sdpa attention implementation took 11.83 seconds to complete on 100 tokens with a context length of 1k.
The flash_attention_2 attention implementation took 11.30 seconds to complete on 100 tokens with a context length of 1k (but is weird).
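For what it's worth, a rough sketch of the kind of timing loop behind numbers like these (checkpoint name and prompt are placeholders; it times 100 new tokens on a ~1k-token context for each implementation):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
long_prompt = "word " * 1000        # crude ~1k-token context
inputs = tokenizer(long_prompt, return_tensors="pt").to("cuda")

for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, attn_implementation=impl
    ).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=100, do_sample=False)
    torch.cuda.synchronize()
    print(f"The {impl} attention implementation took "
          f"{time.perf_counter() - start:.2f} seconds for 100 tokens.")
    del model
    torch.cuda.empty_cache()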
Awesome, sdpa is a bit faster and probably uses less memory as well.
For me FA2 generates much longer text, so it is around 2x slower than PyTorch.
I get the same time and memory usage for both sdpa and eager, so it probably does not work. Around 9.3 GB VRAM in 4-bit.
And I installed 2.6.1; it still does not work well.
What is your GPU?
3090
Not sure, should flash attention work with 4-bit as well?
The script worked for you? I use quite a similar one. I guess there will be an FA update soon.
Thanks for clarifying; then FA2 is the problem. I will do a complete reinstall afterwards.
The time difference between them is mainly caused by the time to first token, so it's not a problem to use FA1 (a.k.a. SDPA).
Started process.
The eager attention implementation took 1.53s to first token.
Started process.
The sdpa attention implementation took 0.14s to first token.
Started process.
The flash_attention_2 attention implementation took 0.15s to first token.
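One way to measure time to first token is with a streamer, e.g. this sketch (assumes model, tokenizer and inputs from the earlier snippets; note the streamer yields decoded text chunks, so the first chunk is only an approximation of the first token):

import time
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
generation = Thread(
    target=model.generate,
    kwargs=dict(**inputs, max_new_tokens=100, streamer=streamer),
)
start = time.perf_counter()
generation.start()
first_chunk = next(iter(streamer))  # blocks until the first decoded chunk arrives
print(f"time to first token: {time.perf_counter() - start:.2f}s")
generation.join()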
You can now install transformers-4.43.0.dev0 from source!
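i.e. something like:

pip install git+https://github.com/huggingface/transformers.git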
But it doesn't fix FA2:
Started process.
The eager attention implementation took 13.43s to infer 100 tokens.
The eager attention works great!
Started process.
The sdpa attention implementation took 12.06s to infer 100 tokens.
The sdpa attention works great!
Started process.
The flash_attention_2 attention implementation took 11.53s to infer 100 tokens.
The flash_attention_2 attention is weird!
With transformers-4.43.0.dev0 and flash-attn 2.6.1, flash attention 2 seems a bit off, but not too much.
On longer contexts, it fails.
Wait... I think I've got it
what is the solution actually?
Use Gemma 1 👇
Perfect, thanks.
It also seems like just 4-5 lines to add in the gemma2 modeling file, so it is pretty easy to apply.
It works just by changing these lines; it is a bit slower than without flash attention and it uses the same amount of memory.
Maybe there is still something broken.
It does output a good response.
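For context on what the change relates to (as I understand it): Gemma 2 applies logit softcapping inside attention, and flash-attn only gained a matching softcap argument in 2.6.0, so the FA2 path has to pass it through. A tiny standalone sketch of that kernel argument, assuming flash-attn >= 2.6.0 exposes softcap and that 50.0 matches Gemma 2's attn_logit_softcapping (both are my assumptions, not taken from this thread):

import torch
from flash_attn import flash_attn_func

# Random toy tensors in flash-attn's (batch, seqlen, num_heads, head_dim) layout
q = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.bfloat16)

# softcap is assumed to be available from flash-attn 2.6.0 onward;
# 50.0 mirrors (to my knowledge) Gemma 2's attn_logit_softcapping
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)
print(out.shape)  # (1, 16, 8, 64)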
Started process with the eager attn_implementation.
The eager attn_implementation took 15.17s to infer {tokens} tokens.
Started process with the sdpa attn_implementation.
The sdpa attn_implementation took 21.51s to infer {tokens} tokens.
Started process with the flash_attention_2 attn_implementation.
The flash_attention_2 attn_implementation took 30.53s to infer {tokens} tokens.
Yes, something very wrong. Probably won't be fixed.
This might fix it, at least the memory part: https://github.com/huggingface/transformers/pull/31292
I know, but we need to ask for it to be applied to gemma2, not only to gemma (1).
All Gemmas are included, as far as I know.
I looked at the commits, and it changes the global generation utils AND "the most used models", which includes Gemma (1) but not Gemma 2.
Gemma2 was not released yet when I started this, but don't worry I will add it as well, it's on the roadmap 🤗