Error when running the model with the code instructions from the model card
I was trying to run this model using the code instructions from the model card and ran into the error below. Note that I am running it on a Google Colab T4 runtime.
Code to replicate:
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_dir = 'cmeraki/OpenHathi-7B-Hi-v0.1-Base-gptq'
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_dir, fast=True)
tokens = tokenizer("do aur do", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**tokens, max_length=1024)[0]))
```
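For reference, the library versions in the runtime can be printed with a snippet like the one below (this is my addition for context, not part of the model card instructions); I suspect a transformers / auto-gptq version mismatch may be involved.

```python
# Library versions in the Colab runtime (added for context; not part of the
# model card instructions). "auto-gptq" is the distribution name on PyPI.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "auto-gptq", "optimum", "torch"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```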
Error:
```
WARNING:auto_gptq.nn_modules.fused_llama_mlp:Skipping module injection for FusedLlamaMLPForQuantizedModel as currently not supported with use_triton=False.
TypeError Traceback (most recent call last)
in <cell line: 10>()
8 tokens = tokenizer("do aur do", return_tensors="pt").to(model.device)
9
---> 10 print(tokenizer.decode(model.generate(**tokens, max_length=1024)[0]))
4 frames
/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py in prepare_inputs_for_generation(self, input_ids, past_key_values, attention_mask, inputs_embeds, **kwargs)
1082 ):
1083 if past_key_values is not None:
-> 1084 past_length = past_key_values[0][0].shape[2]
1085
1086 # Some generation methods already pass only the last input ID
TypeError: 'NoneType' object is not subscriptable
```
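From the traceback, past_key_values is None inside prepare_inputs_for_generation, which looks related to the fused attention module that auto_gptq injects by default (the warning above already shows fused MLP injection being skipped). Below is a rough, untested sketch of a possible workaround, assuming from_quantized accepts the inject_fused_attention / inject_fused_mlp flags in the installed auto_gptq version:

```python
# Untested workaround sketch: load without auto_gptq's fused module injection.
# Assumes inject_fused_attention / inject_fused_mlp are supported by the
# installed auto_gptq version.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = 'cmeraki/OpenHathi-7B-Hi-v0.1-Base-gptq'
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    inject_fused_attention=False,
    inject_fused_mlp=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

tokens = tokenizer("do aur do", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**tokens, max_length=1024)[0]))
```

I have not verified that this avoids the error; sharing it in case it helps narrow down the cause.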