Clarification needed about the Indic-to-Indic model
- Is the Indic-to-Indic translation model a combination of separate Indic-to-English and English-to-Indic translation models?
- If it is a stitched model, has there been any fine-tuning or additional training on the combined model to improve translation quality?
- Alternatively, was a dedicated Indic-to-Indic model trained from scratch specifically for direct translations?
Thanks for your interest in our work. Here are the responses to your questions:
Is the Indic-to-Indic translation model a combination of separate Indic-to-English and English-to-Indic translation models?
Yes, we initialize our Indic-to-Indic model by combining the pre-trained encoder from our Indic-En model with the pre-trained decoder from our En-Indic model. These components were trained individually on large parallel corpora for their respective translation directions. However, because their embedding spaces are not aligned, the stitched model cannot perform zero-shot translation directly after initialization. To bridge this gap, we fine-tune the combined Indic-to-Indic model on a limited amount of high-quality data (the BPCC-H Wiki data and synthetic data).
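For concreteness, here is a minimal sketch of how such an initialization could be done in PyTorch. It assumes fairseq-style checkpoints whose parameters live under a "model" key and are prefixed with `encoder.` / `decoder.`; the file paths and checkpoint layout are placeholders, not the exact IndicTrans2 training code.

```python
import torch

# Load the two English-centric checkpoints (paths and checkpoint layout are assumptions).
indic_en = torch.load("indic-en.pt", map_location="cpu")["model"]
en_indic = torch.load("en-indic.pt", map_location="cpu")["model"]

# Take the encoder from the Indic-En model and the decoder from the En-Indic model.
stitched = {}
stitched.update({k: v for k, v in indic_en.items() if k.startswith("encoder.")})
stitched.update({k: v for k, v in en_indic.items() if k.startswith("decoder.")})

torch.save({"model": stitched}, "indic-indic-init.pt")
# The embedding spaces of the two halves are not aligned, so this checkpoint
# still needs fine-tuning on Indic-Indic data before it translates well.
```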
If it is a stitched model, has there been any fine-tuning or additional training on the combined model to improve translation quality?
Yes, we fine-tune the initialized Indic-to-Indic model. We use a pivoted version of the BPCC-H Wiki dataset comprising 9.2 million pairs spanning the 462 Indic-Indic directions. In addition, 100K synthetic bitext pairs were created for each direction, for a total of 46.2 million pairs across the 462 directions. The synthetic data is created by selecting 100K English monolingual sentences from IndicCorp v2 and translating them into all 22 supported languages with the IndicTrans2 En-Indic model. This yields an n-way seed corpus with 100K sentences per direction across the 462 directions.
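As a rough illustration of this synthetic-data step, the sketch below builds an n-way corpus from a shared English seed. The `translate_en_to_indic` helper, the tiny in-line seed list, and the abbreviated language list are placeholders rather than the actual generation pipeline.

```python
from itertools import permutations

def translate_en_to_indic(sentences, tgt_lang):
    # Placeholder for batched inference with the En-Indic model;
    # here it just tags the input so the sketch runs end to end.
    return [f"[{tgt_lang}] {s}" for s in sentences]

# In the real setup this would be 100K English monolingual sentences from IndicCorp v2.
english_seed = ["An example English sentence.", "Another English sentence."]

# Illustrative subset of the 22 supported languages.
indic_langs = ["hin_Deva", "ben_Beng", "tam_Taml"]

# Translate the same English seed into every Indic language -> an n-way corpus.
nway = {lang: translate_en_to_indic(english_seed, lang) for lang in indic_langs}

# Pair translations of the same English sentence for every ordered
# Indic-Indic direction (22 * 21 = 462 directions in the full setup).
synthetic_bitext = {
    (src, tgt): list(zip(nway[src], nway[tgt]))
    for src, tgt in permutations(indic_langs, 2)
}
```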
In total, the fine-tuning data comprises 55.4 million pairs covering the complete set of 462 supported translation directions.
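A quick back-of-the-envelope check of these totals, assuming 22 languages and 100K synthetic sentences per direction:

```python
# Sanity check of the data sizes quoted above.
num_langs = 22
directions = num_langs * (num_langs - 1)   # 462 Indic-Indic directions
synthetic = directions * 100_000           # 46.2M synthetic pairs
total = 9_200_000 + synthetic              # + 9.2M pivoted BPCC-H Wiki pairs = 55.4M
print(directions, synthetic, total)        # 462 46200000 55400000
```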
Note that support for all 462 Indic-Indic directions is achieved with only about 25% of the data used to train the IndicTrans2 auxiliary models, which supported 25 English-centric directions.
Alternatively, was a dedicated Indic-to-Indic model trained from scratch specifically for direct translations?
No, we opted not to train an Indic-to-Indic model from scratch for the following reasons:
- Data scarcity: There is close to no parallel data for low-resource pairs, particularly in the Indic-Indic setting.
- Data imbalance: Hindi-centric pairs dominate the available data, which can hurt performance on other pairs.
- Compute cost: Indic-Indic data is much smaller in scale, so training from scratch would likely require mixing English-centric and Indic-Indic data, which is computationally expensive.
Thank you for the response; this explanation helped a lot.
Is there any available comparison between direct Indic-to-Indic translation models and translations performed by first converting Indic to English and then English to Indic?
The figure below shows the performance differences between the pivot and M2M models. Blue bars represent the performance difference when translating from the language on the x-axis to any other Indic language, while red bars indicate the performance difference when translating from any other Indic language to the language on the x-axis.
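To make the comparison concrete, the sketch below contrasts the two setups being measured. The `.translate` method and the model handles are assumed placeholders for the respective IndicTrans2 checkpoints, not the actual inference API.

```python
def pivot_translate(sentence, src_lang, tgt_lang, indic_en_model, en_indic_model):
    """Indic -> English -> Indic: two decoding passes through the English-centric models."""
    english = indic_en_model.translate(sentence, src_lang=src_lang, tgt_lang="eng_Latn")
    return en_indic_model.translate(english, src_lang="eng_Latn", tgt_lang=tgt_lang)

def direct_translate(sentence, src_lang, tgt_lang, indic_indic_model):
    """Single decoding pass through the stitched and fine-tuned Indic-Indic (M2M) model."""
    return indic_indic_model.translate(sentence, src_lang=src_lang, tgt_lang=tgt_lang)
```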
Thank you so much, this is helpful!