---
library_name: transformers
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-mul-en
tags:
- generated_from_trainer
- transliteration
metrics:
- bleu
model-index:
- name: marianMT_bi_dev_rom_tl
  results: []
language:
- hi
- en
datasets:
- ar5entum/hindi-english-roman-devnagiri-transliteration-corpus
---

# marianMT_bi_dev_rom_tl

This model is a fine-tuned version of [Helsinki-NLP/opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) on the [ar5entum/hindi-english-roman-devnagiri-transliteration-corpus](https://huggingface.co/datasets/ar5entum/hindi-english-roman-devnagiri-transliteration-corpus) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0947
- BLEU: 73.5282
- Gen Len: 40.8725

## Model description

The model transliterates text between the Devanagari and Roman scripts in both directions. It was trained on Hindi and English sentences written in both scripts; the script of the input is indicated by prepending a `[dev]` (Devanagari) or `[rom]` (Roman) tag, as in the example below.

```python
import time

import evaluate
from transformers import MarianMTModel, MarianTokenizer


class HinEngCS:
    """Thin wrapper around the bidirectional transliteration checkpoint."""

    def __init__(self, model_name='ar5entum/marianMT_bi_dev_rom_tl'):
        self.model_name = model_name
        self.tokenizer = MarianTokenizer.from_pretrained(model_name)
        self.model = MarianMTModel.from_pretrained(model_name)

    def predict(self, input_text):
        # Inputs carry a '[dev]' or '[rom]' prefix naming their script.
        tokenized_text = self.tokenizer(input_text, return_tensors='pt')
        translated = self.model.generate(**tokenized_text)
        return self.tokenizer.decode(translated[0], skip_special_tokens=True)

model = HinEngCS()

devnagiri = [
    "यह अभिषेक जल, इक्षुरस, दुध, चावल का आटा, लाल चंदन, हल्दी, अष्टगंध, चंदन चुरा, चार कलश, केसर वृष्टि, आरती, सुगंधित कलश, महाशांतिधारा एवं महाअर्घ्य के साथ भगवान नेमिनाथ को समर्पित किया जाता है।",
    "कुछ ने कहा ये चांद है कुछ ने कहा चेहरा तेरा"
]
roman = [
    "yah abhishek jal, ikshuras, dudh, chaval ka ataa, laal chandan, haldi, ashtagandh, chandan chura, char kalash, kesar vrishti, aarti, sugandhit kalash, mahashantidhara evam mahaarghya ke saath bhagvan Neminath ko samarpit kiya jata hai.",
    "kuch ne kaha ye chand hai kuch ne kaha chehra ter"
]

bleu = evaluate.load("bleu")

# Devanagari -> Roman
start = time.time()
predictions = [model.predict('[dev] ' + d) for d in devnagiri]
print("TIME: ", time.time() - start)
for inp, pred, ref in zip(devnagiri, predictions, roman):
    print("‾‾‾‾‾‾‾‾‾‾‾‾")
    print("Input text:\t", inp)
    print("Prediction:\t", pred)
    print("Ground Truth:\t", ref)
print(bleu.compute(predictions=predictions, references=roman))

# Roman -> Devanagari (reset the timer so this direction is timed on its own)
start = time.time()
predictions = [model.predict('[rom] ' + r) for r in roman]
print("TIME: ", time.time() - start)
for inp, pred, ref in zip(roman, predictions, devnagiri):
    print("‾‾‾‾‾‾‾‾‾‾‾‾")
    print("Input text:\t", inp)
    print("Prediction:\t", pred)
    print("Ground Truth:\t", ref)
print(bleu.compute(predictions=predictions, references=devnagiri))

# Sample output:
# TIME:  1.8382132053375244
# ‾‾‾‾‾‾‾‾‾‾‾‾
# Input text:	 यह अभिषेक जल, इक्षुरस, दुध, चावल का आटा, लाल चंदन, हल्दी, अष्टगंध, चंदन चुरा, चार कलश, केसर वृष्टि, आरती, सुगंधित कलश, महाशांतिधारा एवं महाअर्घ्य के साथ भगवान नेमिनाथ को समर्पित किया जाता है।
# Prediction:	 yah abhishek jal, ikshuras, dudh, chaval ka ataa, laal chandan, haldi, ashtagandh, chandan chura, char kalash, kesar vrishti, aarti, sugandhit kalash, mahashantidhara evam mahaarghya ke saath bhagvan Neminath ko samarpit kiya jata hai.
# Ground Truth:	 yah abhishek jal, ikshuras, dudh, chaval ka ataa, laal chandan, haldi, ashtagandh, chandan chura, char kalash, kesar vrishti, aarti, sugandhit kalash, mahashantidhara evam mahaarghya ke saath bhagvan Neminath ko samarpit kiya jata hai.
# ‾‾‾‾‾‾‾‾‾‾‾‾
# Input text:	 कुछ ने कहा ये चांद है कुछ ने कहा चेहरा तेरा
# Prediction:	 uchh ne kaha ye chand hai kuch ne kaha chehra tera
# Ground Truth:	 kuch ne kaha ye chand hai kuch ne kaha chehra ter
# {'bleu': 0.9628980475343849, 'precisions': [0.9649122807017544, 0.9636363636363636, 0.9622641509433962, 0.9607843137254902], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 57, 'reference_length': 57}
#
# TIME:  5.650054216384888
# ‾‾‾‾‾‾‾‾‾‾‾‾
# Input text:	 yah abhishek jal, ikshuras, dudh, chaval ka ataa, laal chandan, haldi, ashtagandh, chandan chura, char kalash, kesar vrishti, aarti, sugandhit kalash, mahashantidhara evam mahaarghya ke saath bhagvan Neminath ko samarpit kiya jata hai.
# Prediction:	 यह अभिषेक जल, इक्षुरस, दुध, चावल का आता, लाल चंदन, हल्दी, अष्टगंध, चंदन चुरा, चार कलश, केसर व्टि, आरती, सुगंधित कलश, महाशांतारा
# Ground Truth:	 यह अभिषेक जल, इक्षुरस, दुध, चावल का आटा, लाल चंदन, हल्दी, अष्टगंध, चंदन चुरा, चार कलश, केसर वृष्टि, आरती, सुगंधित कलश, महाशांतिधारा एवं महाअर्घ्य के साथ भगवान नेमिनाथ को समर्पित किया जाता है।
# ‾‾‾‾‾‾‾‾‾‾‾‾
# Input text:	 kuch ne kaha ye chand hai kuch ne kaha chehra ter
# Prediction:	 कुछ ने कहा ये चाँद है कुछ ने कहा चेहरा तेर
# Ground Truth:	 कुछ ने कहा ये चांद है कुछ ने कहा चेहरा तेरा
# {'bleu': 0.5977286781346162, 'precisions': [0.8888888888888888, 0.813953488372093, 0.7317073170731707, 0.6410256410256411], 'brevity_penalty': 0.7831394949065555, 'length_ratio': 0.8035714285714286, 'translation_length': 45, 'reference_length': 56}
```
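For a quick one-off check without the wrapper class, the generic `pipeline` helper should also work. This is a minimal sketch, not the author's documented usage: it assumes the `translation` pipeline task picks up this checkpoint's generation defaults, and it reuses the `[dev]`/`[rom]` prefix convention from the example above.

```python
from transformers import pipeline

# Sketch: MarianMT checkpoints are served by the "translation" pipeline task.
transliterate = pipeline("translation", model="ar5entum/marianMT_bi_dev_rom_tl")

result = transliterate("[rom] kuch ne kaha ye chand hai")
print(result[0]["translation_text"])  # expected: the Devanagari rendering
```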

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a `Seq2SeqTrainingArguments` sketch follows the list):
- learning_rate: 7e-05
- train_batch_size: 60
- eval_batch_size: 20
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 120
- total_eval_batch_size: 40
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 500.0
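A minimal sketch of how the values above might map onto `Seq2SeqTrainingArguments`. The per-device batch sizes follow from the two-GPU totals; `output_dir`, `predict_with_generate`, and anything else not listed are placeholders or assumptions, not the author's actual training script.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="marianMT_bi_dev_rom_tl",  # placeholder, not stated in the card
    learning_rate=7e-5,
    per_device_train_batch_size=60,       # 2 GPUs -> total train batch size 120
    per_device_eval_batch_size=20,        # 2 GPUs -> total eval batch size 40
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=500.0,
    # Adam betas=(0.9, 0.999) and epsilon=1e-8 are the optimizer defaults.
    predict_with_generate=True,           # assumption: needed for BLEU/gen-len eval
)
```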

### Framework versions

- Transformers 4.45.0.dev0
- PyTorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1