---
library_name: transformers
base_model: danasone/bart-small-ru-en
tags:
- generated_from_trainer
metrics:
- bleu
model-index:
- name: bart_hin_eng_mt
  results: []
---

# bart_hin_eng_mt

This model is a fine-tuned version of danasone/bart-small-ru-en on the cfilt/iitb-english-hindi dataset. It achieves the following results on the evaluation set (a sketch of how these metrics are typically computed follows the list):

- Loss: 1.9000
- Bleu: 12.0235
- Gen Len: 33.4107
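
This card is generated from the Trainer, so the Bleu and Gen Len values above most likely come from a `compute_metrics` hook evaluated with generation enabled. The exact hook is not included in the card; the sketch below follows the standard translation-example pattern (sacreBLEU score on a 0-100 scale, generation length counted in non-pad tokens) and is an assumption, not the original code.

```python
import numpy as np
import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ar5entum/bart_hin_eng_mt")
sacrebleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    # eval_preds is a (predictions, labels) pair of token-id arrays from Seq2SeqTrainer.
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace the -100 padding used for the loss with real pad tokens before decoding.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = sacrebleu.compute(
        predictions=[p.strip() for p in decoded_preds],
        references=[[ref.strip()] for ref in decoded_labels],
    )
    # Average generated length in tokens, ignoring padding.
    gen_len = np.mean([np.count_nonzero(p != tokenizer.pad_token_id) for p in preds])
    return {"bleu": round(result["score"], 4), "gen_len": round(float(gen_len), 4)}
```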

## Model description

A Hindi-to-English machine translation model based on the BART-small architecture, fine-tuned from danasone/bart-small-ru-en.
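
For a quick single-sentence check without the helper class defined below, the model can also be loaded through the generic `translation` pipeline. This is a minimal sketch; the generation settings shown are assumptions rather than values published with the model.

```python
from transformers import pipeline

# Quick-start sketch; generation settings here are assumptions, not published values.
translator = pipeline("translation", model="ar5entum/bart_hin_eng_mt")
result = translator("वह दो बेटियों व एक बेटे का पिता था।", max_length=512, num_beams=4)
print(result[0]["translation_text"])
```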

## Inference and evaluation

```python
import torch
import evaluate
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class BartSmall:
    def __init__(self, model_path='ar5entum/bart_hin_eng_mt', device=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.model.to(device)
        self.model.eval()  # inference only

    def predict(self, input_text):
        # Translate a single Hindi sentence to English with beam search.
        inputs = self.tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True).to(self.device)
        with torch.no_grad():
            pred_ids = self.model.generate(inputs.input_ids, max_length=512, num_beams=4, early_stopping=True)
        prediction = self.tokenizer.decode(pred_ids[0], skip_special_tokens=True)
        return prediction
    
    def predict_batch(self, input_texts, batch_size=32):
        all_predictions = []
        for i in range(0, len(input_texts), batch_size):
            batch_texts = input_texts[i:i+batch_size]
            inputs = self.tokenizer(batch_texts, return_tensors="pt", max_length=512, 
                                    truncation=True, padding=True).to(self.device)
            
            with torch.no_grad():
                # Pass the attention mask along with input_ids so padded positions are ignored.
                pred_ids = self.model.generate(**inputs,
                                               max_length=512,
                                               num_beams=4,
                                               early_stopping=True)
            
            predictions = self.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
            all_predictions.extend(predictions)

        return all_predictions

model = BartSmall(device='cuda')

input_texts = [
    "यह शोध्य रकम है।", 
    "जानने के लिए देखें ये वीडियो.",
    "वह दो बेटियों व एक बेटे का पिता था।"
    ]
ground_truths = [
    "This is a repayable amount.",
    "Watch this video to find out.",
    "He was a father of two daughters and a son."
    ]
import time
start = time.time()

predictions = model.predict_batch(input_texts, batch_size=len(input_texts))
end = time.time()
print("TIME: ", end-start)
for i in range(len(input_texts)):
    print("‾‾‾‾‾‾‾‾‾‾‾‾")
    print("Input text:\t", input_texts[i])
    print("Prediction:\t", predictions[i])
    print("Ground Truth:\t", ground_truths[i])
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=ground_truths)
print(results)

# TIME:  1.2374696731567383
# ‾‾‾‾‾‾‾‾‾‾‾‾
# Input text:	 यह शोध्य रकम है।
# Prediction:	 This is a repayable amount.
# Ground Truth:	 This is a repayable amount.
# ‾‾‾‾‾‾‾‾‾‾‾‾
# Input text:	 जानने के लिए देखें ये वीडियो.
# Prediction:	 View these videos to know.
# Ground Truth:	 Watch this video to find out.
# ‾‾‾‾‾‾‾‾‾‾‾‾
# Input text:	 वह दो बेटियों व एक बेटे का पिता था।
# Prediction:	 He was a father of two daughters and a son.
# Ground Truth:	 He was a father of two daughters and a son.
# {'bleu': 0.747875245486914, 'precisions': [0.8260869565217391, 0.75, 0.7647058823529411, 0.7857142857142857], 'brevity_penalty': 0.9574533680683809, 'length_ratio': 0.9583333333333334, 'translation_length': 23, 'reference_length': 24}
```
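
To evaluate on the IITB test split rather than a handful of sentences, the same `predict_batch` pattern can be combined with `datasets` and sacreBLEU. This is a hedged sketch of such a reproduction run using the `BartSmall` helper defined above: the exact split, preprocessing, and generation settings behind the numbers reported at the top of this card are not specified, so the score obtained this way may differ.

```python
import evaluate
from datasets import load_dataset

# Assumes the standard translation-dict layout of cfilt/iitb-english-hindi.
test = load_dataset("cfilt/iitb-english-hindi", split="test")
sources = [ex["translation"]["hi"] for ex in test]
references = [ex["translation"]["en"] for ex in test]

model = BartSmall(device="cuda")  # class defined above
predictions = model.predict_batch(sources, batch_size=32)

sacrebleu = evaluate.load("sacrebleu")
print(sacrebleu.compute(predictions=predictions, references=[[r] for r in references]))
```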

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the Seq2SeqTrainingArguments sketch after this list):

- learning_rate: 0.0001
- train_batch_size: 100
- eval_batch_size: 40
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 200
- total_eval_batch_size: 80
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 15.0
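
The settings above map roughly onto `Seq2SeqTrainingArguments` as sketched below. This is an illustrative reconstruction, not the original training script: per-device batch sizes are inferred from the listed totals and the two GPUs, and options such as the output directory, evaluation strategy, and `predict_with_generate` are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the listed hyperparameters (not the original script).
training_args = Seq2SeqTrainingArguments(
    output_dir="bart_hin_eng_mt",        # assumed
    learning_rate=1e-4,
    per_device_train_batch_size=100,     # 2 GPUs -> total train batch size of 200
    per_device_eval_batch_size=40,       # 2 GPUs -> total eval batch size of 80
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=15.0,
    eval_strategy="epoch",               # assumed; the card reports per-epoch metrics
    predict_with_generate=True,          # assumed; needed for Bleu/Gen Len during eval
)
```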

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Bleu    | Gen Len |
|:-------------:|:-----:|:------:|:---------------:|:-------:|:-------:|
| 2.6298        | 1.0   | 8265   | 2.6192          | 4.5435  | 39.8786 |
| 2.2656        | 2.0   | 16530  | 2.2836          | 8.2498  | 35.8339 |
| 2.0625        | 3.0   | 24795  | 2.1747          | 9.9182  | 35.5214 |
| 1.974         | 4.0   | 33060  | 2.0760          | 10.1515 | 33.9732 |
| 1.925         | 5.0   | 41325  | 2.0285          | 10.7702 | 34.175  |
| 1.8076        | 6.0   | 49590  | 1.9860          | 11.4286 | 34.8875 |
| 1.7817        | 7.0   | 57855  | 1.9664          | 11.4579 | 32.6411 |
| 1.7025        | 8.0   | 66120  | 1.9561          | 11.9226 | 33.5179 |
| 1.6691        | 9.0   | 74385  | 1.9354          | 11.7352 | 33.2161 |
| 1.6631        | 10.0  | 82650  | 1.9231          | 11.9303 | 32.7679 |
| 1.6317        | 11.0  | 90915  | 1.9264          | 11.5889 | 32.625  |
| 1.6449        | 12.0  | 99180  | 1.9047          | 11.8451 | 33.8554 |
| 1.6165        | 13.0  | 107445 | 1.9040          | 12.0755 | 32.7661 |
| 1.5826        | 14.0  | 115710 | 1.9000          | 12.3137 | 33.3536 |
| 1.5835        | 15.0  | 123975 | 1.9000          | 12.0235 | 33.4107 |

### Framework versions

- Transformers 4.45.0.dev0
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1