Is it possible to run inference on MiniMonkey without using Flash Attention?
I have an old GPU, and I keep getting the following error:
"FlashAttention only supports Ampere GPUs or newer."
I can usually bypass this by setting
attn_implementation='eager' or use_flash_attn=False
But neither of these has worked so far. I am about to go through the code in detail, but before I spend time on that, I would like to know whether it is possible at all to run inference on this model without Flash Attention.
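For context, this is roughly what I mean by setting those flags (a minimal sketch; the kwargs are passed straight to from_pretrained, which is the usual place for them, and whether MiniMonkey's remote code honours either one is exactly what I am unsure about):

import torch
from transformers import AutoModel

path = 'mx262/MiniMonkey'
# What usually works for me on other models: request the eager attention path
model = AutoModel.from_pretrained(
    path,
    attn_implementation="eager",   # standard transformers kwarg
    use_flash_attn=False,          # kwarg accepted by some remote-code models
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()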
Thank you for the suggestion. I did what you suggested, and it turns out you also have to set use_flash_attn to False.
Below is the code that worked for me. I hope it helps anyone else who runs into a similar error.
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

path = 'mx262/MiniMonkey'
config, unused_kwargs = AutoConfig.from_pretrained(path, trust_remote_code=True, return_unused_kwargs=True)

# Modify 'attn_implementation' inside 'llm_config' and disable flash attention in the vision encoder
config.llm_config.attn_implementation = "eager"
config.vision_config.use_flash_attn = False

model = AutoModel.from_pretrained(
    path,
    config=config,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
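As a quick sanity check before running inference, you can confirm the overrides actually took effect on the loaded model (this assumes the model keeps the composite config with the llm_config and vision_config sub-configs edited above):

# Both values should reflect the overrides made on the config above.
# The attribute paths are an assumption based on the composite config edited earlier.
print(model.config.llm_config.attn_implementation)   # expected: "eager"
print(model.config.vision_config.use_flash_attn)     # expected: False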