Fix CPU Fallback for NewAttention with xformers BlockDiagonalMask

#25
by m7mdhka - opened

Description:
This PR addresses two critical issues when running the model on CPU:

  1. Memory Efficient Attention CPU Fallback:
    • Added CPU device detection in NewAttention initialization
    • Automatically disables xformers memory efficient attention when running on CPU
    • Prevents the NotImplementedError that xformers raises because its memory efficient attention operators only support CUDA devices

  2. BlockDiagonalMask Handling in Standard Attention:
    • Added proper handling of xformers BlockDiagonalMask in the standard attention path
    • Materializes the BlockDiagonalMask to a tensor before adding it to the attention scores

The fix ensures a smooth fallback to the standard attention mechanism when running on CPU, while maintaining compatibility with xformers mask types.
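For reviewers, here is a minimal sketch of the two ideas, not the actual diff. The helper names `should_use_memory_efficient_attention` and `standard_attention` are hypothetical and do not come from the model code. The sketch assumes the mask object exposes `materialize(shape, dtype, device)`, as xformers' BlockDiagonalMask does, and that q/k/v use the (batch, seq, heads, head_dim) layout shown in the error log below.

```python
import math

import torch


def should_use_memory_efficient_attention(requested: bool, device) -> bool:
    """Hypothetical helper: keep the xformers path only on CUDA devices."""
    return requested and torch.device(device).type == "cuda"


def standard_attention(q, k, v, attn_bias=None):
    """Plain softmax attention that also accepts a mask object.

    q, k, v: tensors of shape (batch, seq, heads, head_dim).
    attn_bias: None, an additive tensor, or an object with a
    materialize(shape, dtype, device) method (assumption: this matches
    xformers' BlockDiagonalMask).
    """
    # Move heads ahead of the sequence axis: (batch, heads, seq, head_dim).
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))

    if attn_bias is not None:
        if not torch.is_tensor(attn_bias):
            # A mask object cannot be added to a tensor directly, so turn it
            # into a dense additive bias with the same shape as the scores.
            attn_bias = attn_bias.materialize(
                scores.shape, dtype=scores.dtype, device=scores.device
            )
        scores = scores + attn_bias

    probs = torch.softmax(scores, dim=-1)
    out = torch.matmul(probs, v)
    # Restore the (batch, seq, heads, head_dim) layout.
    return out.transpose(1, 2)
```

With this shape of fallback, a CPU run never reaches the xformers operator, and the block-diagonal structure of the mask is still respected because masked positions receive a bias of negative infinity before the softmax.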

Related issue:
NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
query : shape=(1, 5471, 16, 64) (torch.float32)
key : shape=(1, 5471, 16, 64) (torch.float32)
value : shape=(1, 5471, 16, 64) (torch.float32)
attn_bias : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalMask'>
p : 0.0
fa2F@v2.5.7-pt is not supported because:
device=cpu (supported: {'cuda'})
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
cutlassF-pt is not supported because:
device=cpu (supported: {'cuda'})

