Missing steps

#8
by Martins6 - opened

Thanks for the awesome project and for sharing the weights! You guys rock!

In the llava module, the load_pretrained_model function has the following line:

model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)

However, the class it's calling can't be found. I know this may be a llava problem, but maybe you can point me to a solution? Otherwise, your code seems unusable at the moment.
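
For reference, in the LLaVA-NeXT repo the class seems to be defined in llava/model/language_model/llava_qwen.py, so (assuming the llava package is installed from that repo) the import would look like this:

from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM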

Martins6 changed discussion status to closed

So I inspected it further. It seems there are just some missing steps.

First, I had to install all of these packages; it would be nice to document this (there's a quick import check after the list):

"accelerate>=1.0.1",
"av>=13.1.0",
"boto3>=1.35.46",
"decord>=0.6.0",
"einops>=0.6.0",
"flash-attn",
"llava",
"open-clip-torch>=2.28.0",
"transformers>=4.45.2",

Second, the load_pretrained_model function simply stopped working when loading the Qwen model.
I had to write a new function to load everything that was necessary:

import torch
from transformers import AutoTokenizer

# These come from the LLaVA-NeXT ("llava") package.
from llava.constants import (
    DEFAULT_IMAGE_PATCH_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
)
from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM


def load_model():
    model_name = "llava_qwen"
    device_map = "auto"

    model_path = "lmms-lab/LLaVA-Video-7B-Qwen2"
    attn_implementation = None  # set to "flash_attention_2" if flash-attn is installed
    kwargs = {"device_map": device_map, "torch_dtype": torch.float16}

    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model = LlavaQwenForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        attn_implementation=attn_implementation,
        **kwargs,
    )

    image_processor = None
    if "llava" in model_name.lower():
        # Register the extra multimodal tokens the checkpoint expects.
        mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
        mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
        if mm_use_im_patch_token:
            tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
        if mm_use_im_start_end:
            tokenizer.add_tokens(
                [DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
            )
        model.resize_token_embeddings(len(tokenizer))

        # Make sure the vision tower is loaded on the right device.
        vision_tower = model.get_vision_tower()
        if not vision_tower.is_loaded:
            vision_tower.load_model(device_map=device_map)
        if device_map != "auto":
            vision_tower.to(device="cuda", dtype=torch.float16)
        image_processor = vision_tower.image_processor

    return model, tokenizer, image_processor
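
For completeness, here's a rough sketch of how I then run a single video through the model. The helpers (tokenizer_image_token, conv_templates, DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX) come from the llava package, the "qwen_1_5" conversation template and the uniform frame sampling are just what worked for me, so treat this as a starting point rather than the official recipe:

import numpy as np
import torch
from decord import VideoReader, cpu
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token


def run_video_inference(model, tokenizer, image_processor,
                        video_path, question, num_frames=16):
    # Uniformly sample frames from the video with decord.
    vr = VideoReader(video_path, ctx=cpu(0))
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int).tolist()
    frames = vr.get_batch(idx).asnumpy()  # (num_frames, H, W, 3)

    # Preprocess the frames and move them to the model's device/dtype.
    video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
    video = video.to(model.device, dtype=torch.float16)

    # Build the prompt with the image placeholder token.
    conv = conv_templates["qwen_1_5"].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=[video],
            modalities=["video"],
            do_sample=False,
            max_new_tokens=256,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()


model, tokenizer, image_processor = load_model()
print(run_video_inference(model, tokenizer, image_processor,
                          "my_video.mp4", "Please describe this video in detail."))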
Martins6 changed discussion status to open
Martins6 changed discussion title from LlavaQwenForCausalLM not found. to tokenizer_image_token not working
Martins6 changed discussion title from tokenizer_image_token not working to Missing steps

Hi, I successfully ran the inference code with the 7B model, but encountered an issue when switching to the 32B model. Have you experienced any problems running the 32B model?

hey @RachelZhou , I don't have enough compute to test that :/
But if I do, I'll report back to you! Hope the tips I gave here help you out on your project.

hey @RachelZhou , I did try it and got some buggy results too. I don't have the traceback, unfortunately. But the 72B model runs super smoothly! Hope it helps!

I could run the original code once I ensured flash-attn was successfully installed!
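
For reference, with flash-attn available that just means passing the standard transformers flag to the same call, roughly as below (adjust dtype/device_map for your setup):

model = LlavaQwenForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
)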

Thank you for sharing your experience!!
