Fix CPU Fallback for NewAttention with xformers BlockDiagonalMask
Description:
This PR addresses two critical issues when running the model on CPU:
- Memory Efficient Attention CPU Fallback:
  - Added CPU device detection in NewAttention initialization
  - Automatically disables xformers memory efficient attention when running on CPU
  - Prevents the NotImplementedError raised by xformers, whose memory efficient attention operators only support CUDA devices
- xformers BlockDiagonalMask handling in standard attention:
  - Materializes the BlockDiagonalMask to a tensor before adding it to the attention scores
With this fix, the model falls back to the standard attention mechanism when running on CPU while remaining compatible with xformers mask types.
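For reviewers, the two changes can be sketched roughly as follows. This is a minimal, framework-agnostic illustration and not the code in this PR: the helper names `resolve_use_memory_efficient_attention` and `to_additive_bias` are hypothetical, and the sketch assumes the mask object exposes a `materialize(shape, dtype=..., device=...)` method, as xformers attention-bias classes such as BlockDiagonalMask do.

```python
from typing import Any, Optional, Tuple


def resolve_use_memory_efficient_attention(requested: bool, device_type: str) -> bool:
    """Decide whether the xformers path may be used (hypothetical helper).

    The operators listed in the traceback only support CUDA, so the
    memory efficient path is disabled on any other device type.
    """
    return requested and device_type == "cuda"


def to_additive_bias(
    attn_bias: Any,
    scores_shape: Tuple[int, ...],
    dtype: Optional[Any] = None,
    device: Optional[Any] = None,
) -> Any:
    """Return a bias that can be added to attention scores (hypothetical helper).

    Structured masks are converted to a dense tensor of shape `scores_shape`;
    plain tensors and None are passed through unchanged.
    """
    if attn_bias is None:
        return None
    materialize = getattr(attn_bias, "materialize", None)
    if callable(materialize):
        # Assumption: materialize(shape, dtype=..., device=...) returns a dense
        # additive mask whose last two dimensions are (query_len, key_len).
        kwargs = {}
        if dtype is not None:
            kwargs["dtype"] = dtype
        if device is not None:
            kwargs["device"] = device
        return materialize(scores_shape, **kwargs)
    # Already a dense bias (for example a regular tensor).
    return attn_bias
```

In the standard attention path, the scores would then be computed as usual and the dense bias added before the softmax, for example `scores = scores + to_additive_bias(attn_bias, tuple(scores.shape), scores.dtype, scores.device)` when a bias is present.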
Related issue:
NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
    query : shape=(1, 5471, 16, 64) (torch.float32)
    key : shape=(1, 5471, 16, 64) (torch.float32)
    value : shape=(1, 5471, 16, 64) (torch.float32)
    attn_bias : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalMask'>
    p : 0.0
fa2F@v2.5.7-pt is not supported because:
    device=cpu (supported: {'cuda'})
    dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
cutlassF-pt is not supported because:
    device=cpu (supported: {'cuda'})