Image encoding / rescaling Question

#11
by ayyylemao - opened

You can choose the default resolution the images will be rescaled to by adding size={"longest_edge": N*364} when initializing the processor (AutoProcessor.from_pretrained), with N your desired value. N=4 works best in practice (this is the default value), but for very large images, it could be interesting to pass N=5. This will have an impact on the number of visual tokens passed to the language model. If you are GPU-memory-constrained, you can decrease N, and choose for example N=3 or N=2, especially for low-resolution images.
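For reference, here is a minimal sketch of the setting the quoted passage describes. The helper names size_for and load_processor are hypothetical, and model_id is a placeholder for whichever checkpoint you are loading; only the size={"longest_edge": N*364} argument comes from the passage itself.

```python
PATCH = 364  # side length used in the quoted passage


def size_for(n: int) -> dict:
    """Build the `size` argument for a given N (hypothetical helper)."""
    if n < 1:
        raise ValueError("N must be a positive integer")
    return {"longest_edge": n * PATCH}


def load_processor(model_id: str, n: int = 4):
    """Load a processor with a custom longest edge (hypothetical helper).

    `model_id` is a placeholder; transformers is imported lazily so that
    `size_for` can be used without the library installed.
    """
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(model_id, size=size_for(n))
```

For example, size_for(3) gives a longest edge of 1092 pixels, which is the kind of reduction the passage suggests when GPU memory is tight.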

How should this be understood?

  • For a square image, does setting longest_edge with N=1 mean that the image is downscaled to 364x364 and the embedding uses only 169 visual tokens for the whole image?
  • With N=5, is the original image upscaled or downscaled to 1820x1820 px (again for a square source) and divided into 25 sub-images of 364x364, or does no upscaling happen when the submitted image is smaller than longest_edge specifies?
  • What about non-square aspect ratios? Are the images squashed into a 1:1 aspect ratio?

PS: I tried going above N=5, but there seems to be a hard-coded limit. It might be nice to test N=6 for OCR tasks.

In any case, the resize is not going to respect the original aspect ratio exactly. However, the result might not be too far from it if N is not too small.

By default, for each image, we rescale the longest side L to N*364. To compute the other side, we first compute the value it would have if we respected the original aspect ratio given L. Then, we round this number up to the next multiple of 364. We can then resize the image.

Therefore, for N=5, you don’t get 1820x1820 images in general. Only one side will be 1820; the other depends on the aspect ratio.
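The resizing rule described above can be sketched as follows. This is only an illustration of the description in this thread, not the library's actual code: the function names are made up, and "the following multiple of 364" is read as rounding up.

```python
PATCH = 364  # side of one sub-image, as discussed in this thread


def target_size(width: int, height: int, n: int = 4, patch: int = PATCH):
    """Return (new_width, new_height) under the rule described above.

    The longest side becomes n * patch. The other side is first scaled to
    keep the aspect ratio, then rounded up to the next multiple of patch.
    """
    longest = n * patch
    if width >= height:
        # Integer ceiling of (height * longest / width) / patch
        k = (height * longest + width * patch - 1) // (width * patch)
        return longest, k * patch
    k = (width * longest + height * patch - 1) // (height * patch)
    return k * patch, longest


def tile_grid(width: int, height: int, n: int = 4, patch: int = PATCH):
    """Return (columns, rows) of patch x patch sub-images after resizing."""
    new_w, new_h = target_size(width, height, n, patch)
    return new_w // patch, new_h // patch


if __name__ == "__main__":
    print(target_size(1600, 900, n=5))    # 16:9 landscape input
    print(tile_grid(1600, 900, n=5))
    print(target_size(1000, 1000, n=5))   # square input
```

With N=5, a 1600x900 input comes out as 1820x1092 (a 5x3 grid of 364x364 sub-images) rather than 1820x1820, while a square input does end up at 1820x1820.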

N=6 can work but might require further fine-tuning. Going from N=4 to N=5, we noticed a clear boost only on DocVQA.

If you get an error saying that you can’t go higher, it’s probably because the default maximum size is 5*364, but you can probably change that by modifying a parameter.
