Is it possible to only input text in Qwen/Qwen2-VL-7B-Instruct model?

#61
by ai-bond - opened

I found workaround for llava model, but how about qwen?

https://huggingface.co/llava-hf/llava-1.5-7b-hf/discussions/38#6742aec0fd1e7992dc9c5070

Currently, I modified the modeling_llava.py in line 487 and successfully managed to only input text in LLaVa model

            # prefill stage vs decoding stage (legacy behavior copied)
            if input_ids.shape[1] != 1:
                if image_features is not None: ##add this
                    inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                        image_features, inputs_embeds, input_ids, attention_mask, labels
                    )

Main questions is - can i train vision model on text only QA and when need it - train on image + text

If i use ShareGPT style text only datasets with FastVisionModel

File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/image_processing_qwen2_vl.py", line 79, in make_batched_images raise
ValueError(f"Could not make batched images from {images}")
ValueError: Could not make batched images from [[{'from': 'system', 'value': 'bla ....bla ....

Sign up or log in to comment