---
library_name: transformers
language:
- bo
- en
base_model:
- google-t5/t5-large
- billingsmoore/phonetic-tibetan-to-english-translation
license: cc
metrics:
- bleu
pipeline_tag: translation
datasets:
- billingsmoore/tibetan-to-english-translation-dataset
tags:
- tibetan
- english
- translation
- nlp
- buddhism
- dharma
---

# Model Card for tibetan-to-english-translation

This model is a neural machine translation model for translating Literary Tibetan to English.

The model expects Tibetan text as input, either in Tibetan script or transliterated according to THL Simplified Phonetic Transliteration, and outputs an English translation.

The model was evaluated using the BLEU metric as implemented by [sacreBLEU](https://pypi.org/project/sacrebleu/), with a final score of 59.3431.

This work is licensed under [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/).

## Model Details

### Model Description

This model is a finetuned T5 model with 770 million parameters.

- **Developed by:** billingsmoore
- **Model type:** text-to-text (sequence-to-sequence) translation model
- **Language(s) (NLP):** Tibetan, English
- **License:** [Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/)
- **Finetuned from model:** google-t5/t5-large

### Model Sources

- **Repository:** [MLotsawa on Github](https://github.com/billingsmoore/MLotsawa)

## Uses

This model is intended to be used as the translation model in the larger MLotsawa software, but can also be used in a Jupyter notebook or Python script.

### Direct Use

To use this model for translation, you can use the following code:

```python
from transformers import pipeline

translator = pipeline('translation', 'billingsmoore/tibetan-to-english-translation')

# Tibetan script or THL phonetic transliteration both work as input
input_text = "<your transliterated Tibetan text>"

translation = translator(input_text)

print(translation)
```

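The pipeline returns a list with one dictionary per input; the English text is stored under the `translation_text` key:

```python
# Extract the translated string from the pipeline output above
print(translation[0]['translation_text'])
```
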
### Downstream Use

The model can be further finetuned using the following code:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorForSeq2Seq,
    AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments,
    Seq2SeqTrainer, EarlyStoppingCallback, Adafactor
)
import evaluate
import numpy as np

# Load your own dataset of Tibetan-English pairs (placeholder path)
dataset = load_dataset("path_to_your_dataset")

checkpoint = "billingsmoore/tibetan-to-english-translation"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

source_lang = 'bo'
target_lang = 'en'
prefix = "translate Tibetan to English: "

def preprocess_function(examples):
    # Prepend the task prefix to each source sentence
    inputs = [prefix + example[source_lang] for example in examples['translation']]
    targets = [example[target_lang] for example in examples['translation']]

    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 (ignored positions) with the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

early_stop = EarlyStoppingCallback()

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,  # set to True to use mixed precision on supported GPUs
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

trainer.train()
```

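A note on the design: Adafactor with a fixed (non-relative) learning rate is the optimizer family T5 was originally trained with, and it uses considerably less memory than Adam for a 770M-parameter model. Because `load_best_model_at_end=True` is combined with epoch-level evaluation and saving, the `EarlyStoppingCallback` halts training once the evaluation loss stops improving and the best checkpoint is restored at the end.
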
## Training Details

### Training Data

[Training data for this project is available here.](https://www.kaggle.com/datasets/billingsmoore/classical-tibetan-to-english-translation-dataset)

This dataset consists of 100,000 pairs of sentences or phrases. The first member of each pair is a sentence or phrase in Classical Tibetan. The second member is the English translation of the first.

The pairs are pulled from texts sourced from Lotsawa House (lotsawahouse.org) and are offered under the same license as the original texts from which they were drawn.

This data was scraped, cleaned, and formatted programmatically.

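For a quick look at the data, the Hub version of the dataset can be loaded directly. This is a minimal sketch; the `translation` column with `bo`/`en` keys is an assumption carried over from the preprocessing code in the fine-tuning example above.

```python
from datasets import load_dataset

dataset = load_dataset("billingsmoore/tibetan-to-english-translation-dataset")

# Inspect one Tibetan-English pair (assumes a 'train' split and a
# 'translation' column holding {'bo': ..., 'en': ...} dicts)
pair = dataset["train"][0]["translation"]
print(pair["bo"])
print(pair["en"])
```
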
### Training Procedure

Beyond the training for ['billingsmoore/phonetic-tibetan-to-english-translation'](https://huggingface.co/billingsmoore/phonetic-tibetan-to-english-translation), whose full training is described in its model card, this model was trained for 9 epochs on the dataset ['billingsmoore/tibetan-to-english-translation-dataset'](https://huggingface.co/datasets/billingsmoore/tibetan-to-english-translation-dataset).

#### Training Hyperparameters

- This model was trained using the Adafactor optimizer with a learning rate of 2e-5 (see the sketch below).

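For reference, a minimal sketch of that optimizer configuration; the flags other than the learning rate mirror the Adafactor call in the fine-tuning example above, and their use in the original run is an assumption.

```python
from transformers import Adafactor, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-large")

# Adafactor with a fixed (non-relative) learning rate of 2e-5, per the
# hyperparameters reported above
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=2e-5
)
```
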
## Evaluation

The evaluation metric for this model was the BLEU score as implemented by [sacreBLEU](https://pypi.org/project/sacrebleu/). BLEU (Bilingual Evaluation Understudy) scores measure the quality of machine-generated translations by comparing them to human-provided reference translations. The score ranges from 0 to 100, where 100 represents a perfect match with the reference translations. It evaluates the precision of n-grams (word sequences) in the generated text, with higher scores indicating closer alignment to the reference translations. A brevity penalty is applied to discourage translations that are too short.

The final BLEU score was 59.3431.
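
For illustration, this is how such a score can be computed with sacreBLEU directly (the hypothesis and reference strings here are invented examples):

```python
import sacrebleu

# Model outputs and human reference translations (invented examples)
hypotheses = ["The nature of mind is clear light."]
references = [["The nature of the mind is clear light."]]

# corpus_bleu takes a list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 4))  # BLEU on the 0-100 scale
```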