Is it possible to only input text in Qwen/Qwen2-VL-7B-Instruct model?
#61 · opened by ai-bond
I found a workaround for the LLaVA model, but how about Qwen?
https://huggingface.co/llava-hf/llava-1.5-7b-hf/discussions/38#6742aec0fd1e7992dc9c5070
Currently, I modified modeling_llava.py at line 487 and successfully managed to input only text into the LLaVA model:
```python
# prefill stage vs decoding stage (legacy behavior copied)
if input_ids.shape[1] != 1:
    if image_features is not None:  # add this
        inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
            image_features, inputs_embeds, input_ids, attention_mask, labels
        )
```
My main question is: can I train the vision model on text-only QA and, when needed, train it on image + text?
If I use a ShareGPT-style text-only dataset with FastVisionModel, I get:
```
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/image_processing_qwen2_vl.py", line 79, in make_batched_images
    raise ValueError(f"Could not make batched images from {images}")
ValueError: Could not make batched images from [[{'from': 'system', 'value': 'bla ....bla ....
```
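Judging from the traceback, the ShareGPT message list itself is being handed to the image processor as if it were a batch of images. One possible workaround (a minimal sketch, not Unsloth's or Transformers' actual collator; the sample format with an optional `"image"` key is an assumption) is to split each batch so text-only samples never reach the image processor:

```python
# Hypothetical sketch: route ShareGPT-style samples so that text-only ones
# are never passed to the image processor (which raises the ValueError above).
# Assumed sample shape: {"conversations": [...], "image": <image or None>}.

def split_batch(samples):
    """Separate samples that carry an image from pure-text samples."""
    with_image = [s for s in samples if s.get("image") is not None]
    text_only = [s for s in samples if s.get("image") is None]
    return with_image, text_only

# In a custom data collator you could then call, for example:
#   processor(text=texts, images=imgs)   # for the with_image half
#   processor(text=texts, images=None)   # for the text_only half
# (images=None should skip image processing entirely, but verify this
#  against your transformers version.)
```

This keeps mixed text-only / image+text training possible without touching the modeling code, at the cost of a custom collator.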