Optimal input size?
Is it possible to have more information on the optimal size of input images to this model? The preprocessor config has this, which is just the default for Deformable-DETR and seems rather suspicious (too large and not square) to me:
"size": {
"longest_edge": 1333,
"shortest_edge": 800
}
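For anyone else poking at this, the size dict can be overridden when loading the processor, so you can at least experiment without editing the config; a minimal sketch (the model id is a placeholder):

```python
from transformers import AutoImageProcessor

# "model-id" is a placeholder -- substitute the actual checkpoint name.
processor = AutoImageProcessor.from_pretrained(
    "model-id",
    size={"shortest_edge": 800, "longest_edge": 800},  # override the 800/1333 default
)
print(processor.size)  # confirm what the processor will actually resize to
```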
DocLayNet images are 1025 x 1025, all stretched to a 1:1 aspect ratio, but we have no visibility into how this model was trained, so I'm not sure whether that's what it should accept.
Despite being an NLP person rather than an image person, I'm somewhat aware that Deformable-DETR training uses multi-scale data augmentation, so aspect ratio and size shouldn't matter much at inference; but memory and CPU/GPU consumption appear to blow up (quadratically, presumably) with larger inputs. RT-DETR and YOLO tend to use 640x640.
Empirically, stretching inputs to 1:1 for inference gives worse results than keeping the original aspect ratio. Also empirically, my GT 1030 with 2GB of VRAM can't handle anything bigger than 800x800 ;-)
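In case it helps anyone, this is roughly the aspect-preserving downscale I mean (800 is just what my card tolerates, not a recommendation):

```python
from PIL import Image

def shrink_to_max_edge(img: Image.Image, max_edge: int = 800) -> Image.Image:
    """Downscale so the longest edge is at most max_edge, keeping aspect ratio."""
    w, h = img.size
    scale = max_edge / max(w, h)
    if scale >= 1.0:
        return img  # already small enough, leave it alone
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
```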
P.S. Rendering PDFs at 200 dpi and then downsampling them to (whatever) for inference, as Sycamore and Unstructured.io do, is not a good use of CPU cycles.
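If you control the rendering, you can pick the DPI to land near the target resolution directly; a sketch with pdf2image (assuming poppler is installed), where the dpi value is a back-of-the-envelope number for a US Letter page:

```python
from pdf2image import convert_from_path

# For a US Letter page (11 in on the long side), 800 px / 11 in ~= 73 dpi,
# so rendering at ~72 dpi lands near the model's input size directly
# instead of rasterizing at 200 dpi and throwing most of the pixels away.
pages = convert_from_path("doc.pdf", dpi=72)
```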
Hey, thanks for using the model.
In our training process, we don't manipulate the input image size much and just follow the default image augmentation. If you take a look at the data-loading code, RandomResize rescales each image so its shorter edge matches a random value in [480, 512, ..., 800] (a step of 32), while guaranteeing that neither edge exceeds the max length of 1333 in our config. So although the DocLayNet images are 1025x1025, the largest image fed into training is still 800x800, and the aspect ratio during training is always 1:1.
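In pseudocode terms, the resize logic is roughly this (a paraphrase of the behavior described above, not the verbatim training code):

```python
import random

scales = list(range(480, 801, 32))  # [480, 512, ..., 800]

def random_resize_target(w: int, h: int, max_size: int = 1333) -> tuple[int, int]:
    """Pick a random target for the shorter edge, capping the longer edge at max_size."""
    short = random.choice(scales)
    scale = short / min(w, h)
    if max(w, h) * scale > max_size:  # keep the longer edge under the config limit
        scale = max_size / max(w, h)
    return round(w * scale), round(h * scale)

# A square 1025x1025 DocLayNet page therefore never exceeds 800x800 in training:
print(random_resize_target(1025, 1025))
```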
On the other hand, we do observe that even larger (>800) image sizes give better results at inference time. You might still test and tune a bit to figure out what works best for your own data.
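For example, a simple sweep over candidate sizes (the model id is a placeholder, and you would plug in your own evaluation loop):

```python
from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

processor = AutoImageProcessor.from_pretrained("model-id")   # placeholder id
model = DeformableDetrForObjectDetection.from_pretrained("model-id")

for edge in (800, 960, 1120, 1280):      # candidate shortest-edge sizes
    processor.size = {"shortest_edge": edge, "longest_edge": 1333}
    # run your held-out pages through processor(...) + model(...) here and
    # record accuracy vs. latency/memory to find the best trade-off
```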