Mistral sliding_window implementation and flash_attn_func

#154
by SadRick - opened

I am trying to fine-tune Mistral-7B with the Hugging Face Trainer and flash-attention, and I have noticed strange behaviour of sliding_window: changing its size has no effect on training at all. I assumed that by reducing the sliding_window size I might be able to fit longer sequences into the model, but neither VRAM usage nor training time is affected by the sliding_window size.
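
For context, this is roughly how I override sliding_window when loading the model. A simplified sketch, not my exact training script; the model id and window size are illustrative, and attn_implementation="flash_attention_2" assumes a recent transformers release with flash-attn installed:

```python
# Simplified sketch of the model setup (not my exact training script).
# Assumes a recent transformers release and flash-attn installed;
# the model id and the window size of 512 are illustrative.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)
config.sliding_window = 512  # the value I vary between runs (1 / 512 / 2048)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```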

Sliding_window in the transformers library

I added some checks to the MistralFlashAttention2 class (tested with a sequence length of 2048 and sliding_window sizes of 1, 512 and 2048) and found that the sliding_window value was being passed through correctly:

  • the use_sliding_window flag was True
  • the sliding_window size was printed correctly
  • flash_attn_func was called with the sliding window forwarded through its window_size argument (see the sketch after this list)
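
For reference, flash_attn_func takes the window through its window_size argument. A standalone sketch with illustrative shapes (assumes flash-attn 2.x on a CUDA GPU; the head layout mimics Mistral-7B's grouped-query attention):

```python
# Standalone sketch: passing a sliding window to flash_attn_func.
# Assumes flash-attn 2.x and a CUDA GPU; shapes are (batch, seq_len, n_heads, head_dim)
# with Mistral-style GQA (32 query heads, 8 key/value heads).
import torch
from flash_attn import flash_attn_func

q = torch.randn(1, 2048, 32, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 2048, 8, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 2048, 8, 128, dtype=torch.float16, device="cuda")

sliding_window = 512  # one of the sizes I tested
out = flash_attn_func(
    q,
    k,
    v,
    causal=True,
    window_size=(sliding_window, sliding_window),  # (left, right); the right half is moot with causal=True
)
print(out.shape)  # torch.Size([1, 2048, 32, 128])
```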

Memory usage and training time in SageMaker

I measured the peak GPU memory allocated and the wall-clock time for each training step, for example:
Peak GPU memory allocated at step 92: 8239454208 bytes
Step 92 took 472.382896900177 seconds.
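
These per-step numbers come from a simple hook around each optimizer step; roughly like this (a simplified sketch using a TrainerCallback, not my exact measurement code):

```python
# Simplified sketch of the per-step measurement (my actual code differs slightly).
import time
import torch
from transformers import TrainerCallback

class StepProfilerCallback(TrainerCallback):
    """Log peak GPU memory and wall-clock time for every optimizer step."""

    def on_step_begin(self, args, state, control, **kwargs):
        torch.cuda.reset_peak_memory_stats()
        self._t0 = time.time()

    def on_step_end(self, args, state, control, **kwargs):
        peak = torch.cuda.max_memory_allocated()
        print(f"Peak GPU memory allocated at step {state.global_step}: {peak} bytes")
        print(f"Step {state.global_step} took {time.time() - self._t0} seconds.")

# passed to the Trainer via callbacks=[StepProfilerCallback()]
```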

These measurements did not change regardless of what sliding_window was set to (the same is true for the system logs on wandb). This seems odd to me; can someone help me understand this behaviour?
