Question:
- Encoder: ViT5-base
- Max length: 32
- Pre-Processing: lowercase, remove special characters
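
A minimal sketch of the question pre-processing listed above, assuming that "special characters" means anything other than letters, digits, underscores, and whitespace. The function name `preprocess_question` and the NFC normalization step are illustrative assumptions, not taken from the original configuration. The 32-token limit would be applied later by the encoder's tokenizer and is not shown here.

```python
import re
import unicodedata


def preprocess_question(text: str) -> str:
    """Lowercase a question and strip special characters (hypothetical helper)."""
    # Assumption: normalize to NFC so Vietnamese letters with diacritics are
    # single precomposed code points and survive the character filter below.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace every character that is not a word character or whitespace
    # with a space. In Python 3, \w is Unicode-aware for str patterns.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the removal.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess_question("Biển hiệu này GHI gì?!"))
```

Normalizing before filtering matters here: a decomposed diacritic is a combining mark, which `\w` does not match, so it would otherwise be removed along with the punctuation.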
Image:
- Encoder: ViT-base
- Pre-Processing: None
OCR:
- Text Detection: PaddleOCR
- Text Recognition: VietOCR
- Threshold: 0.8
- Max length: 128
- Post-processing: group layout, divide=4
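
The threshold and max-length settings above could be applied roughly as follows. This is a sketch under stated assumptions: that the 0.8 threshold is a recognition confidence score, that tokens scoring exactly 0.8 are kept, and that the max length of 128 counts OCR tokens. The name `filter_ocr_tokens` and the `(text, confidence)` tuple layout are hypothetical and do not reflect the PaddleOCR or VietOCR output formats. The layout grouping step (`divide=4`) is not sketched, since the configuration does not define it.

```python
from typing import List, Tuple

# Hypothetical representation: (recognized text, confidence score in [0, 1]).
OcrToken = Tuple[str, float]


def filter_ocr_tokens(
    tokens: List[OcrToken],
    threshold: float = 0.8,   # from the configuration above
    max_len: int = 128,       # from the configuration above
) -> List[str]:
    """Keep confident OCR tokens in their original order, then truncate."""
    # Assumption: a token is kept when its confidence is >= threshold.
    kept = [text for text, conf in tokens if conf >= threshold]
    # Assumption: the length limit counts OCR tokens, not subword pieces.
    return kept[:max_len]


if __name__ == "__main__":
    sample = [("cà phê", 0.95), ("xyz", 0.42), ("mở cửa", 0.80)]
    print(filter_ocr_tokens(sample))
```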
Answer:
- Max length: 56
Result:
- Dev:
  - CIDEr: 3.4616
  - BLEU: 0.4689