billingsmoore committed d7ced32 (1 parent: dfba1f7)

Update README.md

Files changed (1): README.md (+202, -3)
README.md CHANGED

The previous README contained only the front matter `license: cc-by-nc-4.0`; it is replaced by the full model card below.
---
library_name: transformers
language:
- bo
- en
base_model:
- google-t5/t5-large
- billingsmoore/phonetic-tibetan-to-english-translation
license: cc-by-nc-4.0
metrics:
- bleu
pipeline_tag: translation
datasets:
- billingsmoore/tibetan-to-english-translation-dataset
tags:
- tibetan
- english
- translation
- nlp
- buddhism
- dharma
---

# Model Card for tibetan-to-english-translation

This is a neural machine translation model for translating Literary Tibetan to English.

The model expects Tibetan text, either in Tibetan script or transliterated according to THL Simplified Phonetic Transliteration, as input and produces an English translation as output.

The model was evaluated using the BLEU metric as implemented by [sacreBLEU](https://pypi.org/project/sacrebleu/), with a final score of 59.3431.

This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license.

## Model Details

### Model Description

This model is a finetuned T5 model with 770 million parameters.

- **Developed by:** billingsmoore
- **Model type:** Sequence-to-sequence translation model (finetuned T5)
- **Language(s) (NLP):** Tibetan, English
- **License:** [Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/)
- **Finetuned from model:** [google-t5/t5-large](https://huggingface.co/google-t5/t5-large)

### Model Sources

- **Repository:** [MLotsawa on GitHub](https://github.com/billingsmoore/MLotsawa)

## Uses

This model is intended to serve as the translation model in the larger MLotsawa software, but it can also be used on its own in a Jupyter notebook or Python script.

### Direct Use

To use this model for translation, you can use the following code:

```python
from transformers import pipeline

# Load the translation pipeline with this model checkpoint
translator = pipeline('translation', 'billingsmoore/tibetan-to-english-translation')

# Replace the placeholder with your Tibetan text
# (Tibetan script or THL Simplified Phonetic Transliteration)
input_text = "<your Tibetan text>"

translation = translator(input_text)

print(translation)
```
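
Alternatively, if you prefer to work with the tokenizer and model objects directly rather than through `pipeline`, the sketch below shows one way to do so with the standard `transformers` seq2seq API. The task prefix is borrowed from the finetuning example further down; whether it is required at inference time is an assumption, not something stated in this card.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = 'billingsmoore/tibetan-to-english-translation'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Placeholder input; Tibetan script or THL phonetic transliteration
input_text = "<your Tibetan text>"

# Prefix taken from the finetuning example below (an assumption, see note above)
inputs = tokenizer("translate Tibetan to English: " + input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```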

### Downstream Use

The model can be further finetuned using the following code:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorForSeq2Seq,
    AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments,
    Seq2SeqTrainer, EarlyStoppingCallback, Adafactor
)
import evaluate
import numpy as np

# Load your own translation dataset (replace the placeholder path)
dataset = load_dataset("<path_to_your_dataset>")

checkpoint = "billingsmoore/tibetan-to-english-translation"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

source_lang = 'bo'
target_lang = 'en'
prefix = "translate Tibetan to English: "

def preprocess_function(examples):
    # Prepend the task prefix to each source sentence and tokenize both sides
    inputs = [prefix + example[source_lang] for example in examples['translation']]
    targets = [example[target_lang] for example in examples['translation']]

    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 (positions ignored by the loss) with the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

early_stop = EarlyStoppingCallback()

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,  # set to True if your hardware supports mixed-precision training
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

trainer.train()
```
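
Once training finishes, the finetuned weights can be saved and reloaded for inference. The snippet below continues from the script above and is only a minimal sketch using standard `transformers` calls; the output directory name is an illustrative placeholder.

```python
# Save the finetuned model and tokenizer (the directory name is illustrative)
trainer.save_model("finetuned-tibetan-to-english")
tokenizer.save_pretrained("finetuned-tibetan-to-english")

# Reload the finetuned checkpoint for inference
from transformers import pipeline

translator = pipeline('translation', 'finetuned-tibetan-to-english')
print(translator("<your Tibetan text>"))
```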

## Training Details

### Training Data

[The training data for this project is available here.](https://www.kaggle.com/datasets/billingsmoore/classical-tibetan-to-english-translation-dataset)

This dataset consists of 100,000 pairs of sentences or phrases. The first member of each pair is a sentence or phrase in Classical Tibetan; the second is its English translation.

The pairs are drawn from texts sourced from Lotsawa House (lotsawahouse.org) and are offered under the same license as the original texts from which they were taken.

This data was scraped, cleaned, and formatted programmatically.
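
For quick inspection, the Hugging Face copy of the data can be loaded with the `datasets` library. The sketch below assumes a `train` split and a `translation` field with `bo` and `en` keys, as implied by the preprocessing code above.

```python
from datasets import load_dataset

# Load the Hugging Face copy of the training data
dataset = load_dataset("billingsmoore/tibetan-to-english-translation-dataset")

# Print one Tibetan/English pair; the 'train' split and the 'translation'
# field with 'bo'/'en' keys are assumed from the preprocessing code above
example = dataset['train'][0]
print(example['translation']['bo'])
print(example['translation']['en'])
```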

### Training Procedure

In addition to the training already performed for ['billingsmoore/phonetic-tibetan-to-english-translation'](https://huggingface.co/billingsmoore/phonetic-tibetan-to-english-translation), which is described in full in that model's card, this model was trained for 9 epochs on the dataset ['billingsmoore/tibetan-to-english-translation-dataset'](https://huggingface.co/datasets/billingsmoore/tibetan-to-english-translation-dataset).

#### Training Hyperparameters

- This model was trained using the Adafactor optimizer with a learning rate of 2e-5 (see the sketch below).
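
For reference, an Adafactor optimizer with that learning rate could be configured as follows. Only the 2e-5 learning rate comes from this card; the remaining arguments are assumptions carried over from the finetuning example above, not a record of the exact training setup.

```python
from transformers import AutoModelForSeq2SeqLM, Adafactor

model = AutoModelForSeq2SeqLM.from_pretrained("billingsmoore/tibetan-to-english-translation")

# Only the 2e-5 learning rate comes from the model card; the remaining
# Adafactor arguments are assumptions mirrored from the finetuning example.
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=2e-5
)
```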

## Evaluation

The evaluation metric for this model was the BLEU score as implemented by [sacreBLEU](https://pypi.org/project/sacrebleu/). BLEU (Bilingual Evaluation Understudy) scores measure the quality of machine-generated translations by comparing them to human-provided reference translations. The score ranges from 0 to 100, where 100 represents a perfect match with the reference translations. It evaluates the precision of n-grams (word sequences) in the generated text, with higher scores indicating closer alignment to the reference translations. A brevity penalty is applied to discourage translations that are too short.
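
For illustration, a corpus-level BLEU score can be computed with the `evaluate` wrapper around sacreBLEU as in the sketch below; the prediction and reference strings are placeholders, not examples from the actual evaluation set.

```python
import evaluate

metric = evaluate.load("sacrebleu")

# Placeholder strings; each prediction is paired with a list of one or
# more reference translations
predictions = ["<model translation>"]
references = [["<reference translation>"]]

result = metric.compute(predictions=predictions, references=references)
print(result["score"])  # BLEU on a 0-100 scale
```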

The final BLEU score was 59.3431.