Question:
- Encoder: ViT5-base
- Max length: 32
- Pre-processing: lowercase, remove special characters

Image:
- Encoder: ViT-base
- Pre-processing: None

OCR:
- Text detection: PaddleOCR
- Text recognition: VietOCR
- Threshold: 0.8
- Max length: 128
- Post-processing: group layout, divide=4

Answer:
- Max length: 56

Result:
- Dev:
  - CIDEr: 3.4616
  - BLEU: 0.4689
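
The question pre-processing (lowercase, remove special characters) and the OCR threshold of 0.8 listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `preprocess_question` and `filter_ocr_tokens` are hypothetical, the `(text, confidence)` pair format for OCR output is an assumption, and so is the choice to keep tokens whose confidence is greater than or equal to the threshold. Truncation to the max lengths (32, 128, 56) is assumed to be handled by the tokenizer and is not shown, and the "group layout, divide=4" post-processing step is not reproduced because the source does not describe it.

```python
import re
import unicodedata


def preprocess_question(text: str) -> str:
    """Lowercase a question and strip special characters (hypothetical helper)."""
    # Normalize to NFC so Vietnamese letters with diacritics are single
    # precomposed code points and survive the character filter below.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    # \w is Unicode-aware in Python 3; the underscore is removed explicitly.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse repeated whitespace and trim both ends.
    return " ".join(text.split())


def filter_ocr_tokens(tokens, threshold=0.8):
    """Keep OCR texts whose confidence meets the threshold (assumed: >=).

    `tokens` is assumed to be an iterable of (text, confidence) pairs.
    """
    return [text for text, conf in tokens if conf >= threshold]


if __name__ == "__main__":
    print(preprocess_question("  What IS  this_sign?! "))
    # -> what is this sign
    print(filter_ocr_tokens([("pho", 0.95), ("xyz", 0.42), ("24h", 0.8)]))
    # -> ['pho', '24h']
```

Normalizing to NFC before filtering is a design choice for Vietnamese text: without it, decomposed diacritics would be treated as special characters and dropped.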