Challenges and Limitations in Fine-Tuning the Dots.OCR Model

#39 · opened by JamesGs

Fine-tuning this model has been extremely challenging, and there have been no recent updates. Despite over 1,170,000 downloads, there's only one fine-tuned model available (and that one appears to be broken). This suggests either that the model has already reached its performance ceiling or that its long-prompt pretraining has led to significant overfitting.

After training the model on several private datasets, I observed only minimal improvements in accuracy. In practice, it seems more effective to either use the base model directly or fine-tune other available vision-language models (VLMs) instead.

I really wish Dots.OCR would share more details about their training setup and the datasets they used; there's a chance I might be missing something crucial in my own fine-tuning process.

Could you describe in more detail what challenges you're facing? Here are some example questions that I think can help us narrow down the problem:

For example, are you fine-tuning the model on the four specific tasks it was originally trained on, and are you evaluating on those four tasks?
On the data side, what data did you gather, and what are its quantity and quality?
What kind of regularization are you applying to make sure the model is not overfitting to the additionally collected data?
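
For context, one common way to keep a small domain fine-tune from overfitting is parameter-efficient tuning with a low rank and learning rate. A minimal sketch with peft (not dots.ocr-specific; the target module names are placeholders and depend on the actual checkpoint):

```python
# Minimal sketch: LoRA as a light-touch way to add domain data without
# letting the full model drift. Target module names are placeholders;
# inspect model.named_modules() on the real checkpoint.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "rednote-hilab/dots.ocr", trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=16,                                 # small rank limits capacity to overfit
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)   # may need tweaks for custom architectures
model.print_trainable_parameters()        # typically well under 1% trainable
```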

@Hwijeen Can we use dots.ocr specifically for layout detection?

Can you provide the specific prompt as an example? The link you provided lists many prompts, which is slightly confusing.

Yes, using the prompt_layout_only_en prompt, the model will generate layout detection results without OCR. This is mentioned in the README of the repo.
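
For reference, a minimal layout-only inference sketch, assuming the transformers-based usage shown in the repo README; the prompt string below is paraphrased, so copy the exact prompt_layout_only_en text shipped with the repo:

```python
# Minimal sketch of layout-only inference with dots.ocr via transformers.
# Exact processor behavior and the prompt string may differ from the repo;
# copy both from the README / prompt definitions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "rednote-hilab/dots.ocr"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Paraphrase of prompt_layout_only_en: layout categories + bboxes, no text.
layout_only_prompt = (
    "Please output the layout information from the image, including each "
    "layout element's bbox and its category, without the text content."
)

image = Image.open("page.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": layout_only_prompt},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=2048)

print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```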


To answer your first question above: to simplify things, I trained the model using only the “prompt_layout_all_en” prompt, aiming to improve bounding box detection and OCR performance. So yes, I both trained and evaluated on “prompt_layout_all_en” only.
However, I’m wondering: is it actually necessary to train on all four tasks (i.e., the four available prompts)?
For the data, I focused primarily on complex tables, such as financial statements with many merged columns and rows, which is exactly where the model tends to struggle. The dataset consists of about 1,200 training samples and 400 evaluation samples, all manually annotated. It includes a mix of poor-quality scans and high-quality HTML tables.
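
For concreteness, here is roughly how a single prompt_layout_all_en training sample can be laid out; the target JSON keys and file paths below are illustrative and should be checked against what the base model actually emits:

```python
# Illustrative layout of one prompt_layout_all_en training sample as a JSONL
# record. Keys in the target ("bbox", "category", "text") mirror typical
# dots.ocr outputs, but verify against real base-model outputs first.
import json

layout_all_prompt = "..."  # paste the exact prompt_layout_all_en string from the repo

target_layout = [
    {"bbox": [84, 120, 980, 410], "category": "Table",
     "text": "<table><tr><td colspan=\"2\">Revenue</td></tr></table>"},
    {"bbox": [84, 430, 980, 470], "category": "Text",
     "text": "All figures are in thousands of USD."},
]

sample = {
    "images": ["scans/statement_page_001.png"],   # hypothetical path
    "messages": [
        {"role": "user", "content": layout_all_prompt},
        {"role": "assistant", "content": json.dumps(target_layout, ensure_ascii=False)},
    ],
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```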

In my experience, long prompts tend to overfit, and this was confirmed during full fine-tuning. While the training and evaluation losses were very low, real-world testing showed that the model wasn’t truly generalizing. It performed better in some areas (for example, category detection improved, and stamps were detected as images instead of text), but the HTML generation remained unreliable, especially with incorrect column and row spans.
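
One cheap way to catch this beyond the loss curves is to compare predicted and reference tables on structure alone (cell counts and spans). A crude sketch with BeautifulSoup, not a full TEDS-style metric:

```python
# Crude structural check for generated HTML tables: compare cell counts and
# colspan/rowspan usage between prediction and reference. Not a substitute
# for a tree-edit-distance metric like TEDS, but it flags span errors fast.
from bs4 import BeautifulSoup

def table_signature(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all(["td", "th"])
    spans = [(int(c.get("colspan", 1)), int(c.get("rowspan", 1))) for c in cells]
    return {
        "rows": len(soup.find_all("tr")),
        "cells": len(cells),
        "merged_cells": sum(1 for cs, rs in spans if cs > 1 or rs > 1),
        "covered_grid_cells": sum(cs * rs for cs, rs in spans),
    }

pred = table_signature(open("pred_table.html", encoding="utf-8").read())
ref = table_signature(open("ref_table.html", encoding="utf-8").read())
print("prediction:", pred)
print("reference: ", ref)
print("structure matches:", pred == ref)
```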

This makes me suspect that the issue might not be with the vision model itself, but rather with the language model component responsible for generating the HTML structure.
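
If that suspicion holds, a cheap experiment is to freeze the vision encoder and fine-tune only the language-model side, then see whether table quality moves. A rough sketch (the "vision" name filter is a guess; check named_parameters() on the real checkpoint):

```python
# Cheap ablation sketch: freeze anything that looks like the vision encoder
# and train only the language-model side. The "vision" name filter is a
# guess; print named_parameters() on the real checkpoint to get true names.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "rednote-hilab/dots.ocr", trust_remote_code=True
)

frozen, trainable = 0, 0
for name, param in model.named_parameters():
    if "vision" in name.lower():      # adjust to the actual vision-tower prefix
        param.requires_grad_(False)
        frozen += param.numel()
    else:
        trainable += param.numel()

print(f"frozen (vision) params:  {frozen / 1e6:.1f}M")
print(f"trainable (LM) params:   {trainable / 1e6:.1f}M")
```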

I really wish dots.ocr shared more training details; that would help the community contribute improvements. olmOCR2 claims to outperform dots.ocr (both are Qwen2 fine-tunes), but in my tests on my use case, that’s not what I see. What’s great about olmOCR2, though, is that they actually share their training data and methodology. Maybe it’s time to move on from dots.ocr and experiment with that instead.

