What processor to use?

#4
by floschne - opened

Hi, and thanks for sharing this nice piece of work :-)

In your example code, the pixel_values are random numbers, so I am wondering which processor to use for real images and text.
I already cloned this repo and tried to use the SiglipProcessor from processing_siglip.py, but I get the following error: OSError: HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit/main' for available files.

Does anybody know a solution?

hi @floschne
thanks!
I think you indeed would need to create a preprocessor_config.json file. Would you like to create a PR for that?
I suspect that most of the arguments can be copied over from https://huggingface.co/HuggingFaceM4/idefics2-8b/blob/main/preprocessor_config.json

Actually, thinking a little more about it, the official processor from the official SigLIP integration in transformers would be the best processor to use (potentially modifying the image size).
We mainly created this duplicate model to expand the resolution and integrate flash attention into the attention computation.
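
A rough sketch of that approach (the base SigLIP checkpoint and the override below are illustrative, not something this repo pins down):

```python
# Load the official SigLIP image processor from transformers and override the
# target size to 980x980; the checkpoint name here is only an example.
from PIL import Image
from transformers import SiglipImageProcessor

processor = SiglipImageProcessor.from_pretrained(
    "google/siglip-so400m-patch14-384",
    size={"height": 980, "width": 980},
)
image = Image.open("example.jpg")
pixel_values = processor(images=image, return_tensors="pt").pixel_values  # (1, 3, 980, 980)
```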

Hello, thanks for the nice work as well, but to be honest, I don't think this modification can properly be called NaViT, since it doesn't actually make full use of NaViT.

At first glance, your model just resizes the original image to 980x980, so how is the aspect ratio preserved? Even though you say you fit the image in without an aspect-ratio-changing resize and pad the remaining pixels with certain values, this is still not what NaViT does in its original paper.

From my understanding, NaViT handles multiple images of different sizes by patchifying them directly and packing them one after another into a single long sequence.
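
For illustration, the packing idea I mean looks roughly like this (a toy sketch, not code from this repo; shapes and helper names are made up):

```python
# Toy sketch of NaViT-style patch-and-pack: patchify differently sized images
# and concatenate their patch tokens into one long sequence with per-image ids.
import torch

def patchify(img, patch=14):
    """Split a CxHxW image into (num_patches, C*patch*patch) tokens."""
    c, h, w = img.shape
    img = img[:, : h - h % patch, : w - w % patch]  # crop to a multiple of the patch size
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

imgs = [torch.randn(3, 224, 336), torch.randn(3, 448, 112)]
tokens = [patchify(im) for im in imgs]
packed = torch.cat(tokens, dim=0)  # (sum_i N_i, 588)
image_ids = torch.cat([torch.full((t.shape[0],), i) for i, t in enumerate(tokens)])
# image_ids lets attention be masked so patches only attend within their own image
```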

So I am quite confused about how to really make full use of this repo. My questions are:

  1. How do I keep my image from being resized to 980x980?
  2. What is the benefit compared with directly resizing to 980x980, interpolating the position embeddings, and taking the final result?

hi @lucasjin
the block in https://huggingface.co/HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit/blob/main/modeling_siglip.py#L316 takes care of the "preserving the original image ratio" logic.
It computes the position ids to extract so as to accommodate the input image, and more specifically its size and aspect ratio.
Images are only downsampled IF one of the sides is > 980.
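
In spirit, the position-id selection looks something like this (a simplified sketch of the idea, not the exact code at that line; names are illustrative):

```python
# Simplified sketch: map the unpadded n_h x n_w patch grid of an image onto the
# full 70x70 position-embedding grid by bucketizing fractional coordinates.
import torch

def position_ids_for_image(n_h, n_w, num_patches_per_side=70):
    boundaries = torch.arange(1 / num_patches_per_side, 1.0, 1 / num_patches_per_side)
    frac_h = torch.arange(n_h) / n_h  # fractional row coordinates in [0, 1)
    frac_w = torch.arange(n_w) / n_w  # fractional column coordinates in [0, 1)
    bucket_h = torch.bucketize(frac_h, boundaries, right=True)
    bucket_w = torch.bucketize(frac_w, boundaries, right=True)
    # 2D absolute position id = row_bucket * grid_width + col_bucket
    return (bucket_h[:, None] * num_patches_per_side + bucket_w[None, :]).flatten()

# e.g. a 490x980 image with patch size 14 gives a 35x70 patch grid
ids = position_ids_for_image(35, 70)
```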

@VictorSanh Hi, thanks for the reply. Let's look at the very first inputs: the images are first preprocessed via SiglipImageProcessor, which is set to 980x980 from the config, so the pixel_values going into the ViT are Bx3x980x980. How does the model tell what the aspect ratio was?

Hoping for your detailed analysis. Thank you!

@VictorSanh I have the same question about the NaViT implementation.
I think the shape of vision_model.embeddings.position_embedding.weight should be (70 * 2, 1152) instead of (70 * 70, 1152) if NaViT's factorized fractional position embeddings were used.
The current implementation and weights seem to indicate that idefics2 just re-trains SigLIP at 980x980 resolution from scratch.

Oh man, I found too many bugs in the idefics2 implementation...

@VictorSanh Hi, thanks for the reply. Let's look at the very first inputs: the images are first preprocessed via SiglipImageProcessor, which is set to 980x980 from the config, so the pixel_values going into the ViT are Bx3x980x980. How does the model tell what the aspect ratio was?

You should modify the processor to NOT resize. We don't use the processor in this current repo for Idefics; we use the Idefics one.
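
Concretely, something along these lines (a sketch only; padding/masking up to the 980x980 canvas still has to be handled, e.g. by the idefics processor):

```python
# Sketch: turn off resizing in the image processor so the original resolution
# and aspect ratio are preserved; padding to 980x980 is handled elsewhere.
from PIL import Image
from transformers import SiglipImageProcessor

processor = SiglipImageProcessor.from_pretrained(
    "google/siglip-so400m-patch14-384",  # illustrative base checkpoint
    do_resize=False,
)
image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")  # keeps the original H x W
```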

@VictorSanh I have the same question about the NaViT implementation.
I think the shape of vision_model.embeddings.position_embedding.weight should be (70 * 2, 1152) instead of (70 * 70, 1152) if NaViT's factorized fractional position embeddings were used.

@YuntaoChen, we use 2D absolute position embeddings, not factorized fractional position embeddings.
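
To illustrate the distinction in parameter shapes (a rough sketch using the numbers from this thread, 980 / patch size 14 = 70 patches per side):

```python
import torch.nn as nn

hidden, grid = 1152, 70

# 2D absolute position embeddings (what this repo uses, per the reply above):
abs_pos = nn.Embedding(grid * grid, hidden)  # weight: (4900, 1152)

# NaViT-style factorized fractional embeddings would instead be two tables:
row_pos = nn.Embedding(grid, hidden)  # weight: (70, 1152)
col_pos = nn.Embedding(grid, hidden)  # summed per patch with the row embedding
```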
