---
base_model:
- google-t5/t5-large
- billingsmoore/phonetic-tibetan-to-english-translation
datasets:
- billingsmoore/tibetan-to-english-translation-dataset
language:
- bo
- en
library_name: transformers
license: cc-by-nc-4.0
metrics:
- bleu
pipeline_tag: translation
tags:
- tibetan
- english
- translation
- nlp
- buddhism
- dharma
---

# Model Card for tibetan-to-english-translation

This model is a neural machine translation model for translating Literary Tibetan to English.

The model accepts Tibetan text, either in Tibetan script or transliterated according to THL Simplified Phonetic Transliteration, as input and produces an English translation.

The model was evaluated using the BLEU metric as implemented by [sacreBLEU](https://pypi.org/project/sacrebleu/), with a final score of 59.3431.

This work is licensed under [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/).

## Model Details

### Model Description

This model is a finetuned T5 model with 770 million parameters.

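
If you want to confirm the size locally, a quick illustrative sketch (using the model ID from this card; loading t5-large-sized weights requires several gigabytes of memory):

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("billingsmoore/tibetan-to-english-translation")
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # roughly 770M
```
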
- **Developed by:** billingsmoore
- **Model type:** Text-to-text (seq2seq) translation model
- **Language(s) (NLP):** Tibetan, English
- **License:** [Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/)
- **Finetuned from model:** [google-t5/t5-large](https://huggingface.co/google-t5/t5-large)

### Model Sources

- **Repository:** [MLotsawa on GitHub](https://github.com/billingsmoore/MLotsawa)

## Uses

This model is intended to be used as the translation model in the larger MLotsawa software, but can also be used in a Jupyter notebook or Python script.

### Direct Use

To use this model for translation, you can use the following code:

```python
from transformers import pipeline

# Load the translation pipeline from the Hugging Face Hub
translator = pipeline('translation', 'billingsmoore/tibetan-to-english-translation')

input_text = "<your Tibetan text>"  # Tibetan script or THL phonetic transliteration

translation = translator(input_text)

print(translation)
```

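
The pipeline also accepts a list of strings, which is convenient for translating a text line by line. A minimal sketch (the `max_length` value is an assumption, not a tuned setting):

```python
lines = [
    "<first line of Tibetan text>",
    "<second line of Tibetan text>",
]

# Each result is a dict with a 'translation_text' key
results = translator(lines, max_length=128)
for result in results:
    print(result['translation_text'])
```
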
### Downstream Use

The model can be further finetuned using the following code:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorForSeq2Seq,
    AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments,
    Seq2SeqTrainer, EarlyStoppingCallback, Adafactor
)
import evaluate
import numpy as np

# Load your own dataset (expects a 'translation' column with 'bo'/'en' keys)
dataset = load_dataset("<path_to_your_dataset>")

checkpoint = "billingsmoore/tibetan-to-english-translation"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

source_lang = 'bo'
target_lang = 'en'
prefix = "translate Tibetan to English: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples['translation']]
    targets = [example[target_lang] for example in examples['translation']]

    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 (ignored positions) with the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

early_stop = EarlyStoppingCallback()

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,  # set to True if your GPU supports mixed-precision training
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],  # assumes your dataset has a 'test' split
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

trainer.train()
```

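
After training, the best checkpoint (per `load_best_model_at_end`) can be saved locally and, optionally, shared on the Hub. The directory and repository names below are placeholders:

```python
trainer.save_model("finetuned-tibetan-to-english")
tokenizer.save_pretrained("finetuned-tibetan-to-english")

# Optionally push to the Hugging Face Hub (requires `huggingface-cli login` first):
# model.push_to_hub("<your-username>/<your-model-name>")
# tokenizer.push_to_hub("<your-username>/<your-model-name>")
```
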
## Training Details

### Training Data

[Training data for this project is available here.](https://www.kaggle.com/datasets/billingsmoore/classical-tibetan-to-english-translation-dataset)

This dataset consists of 100,000 pairs of sentences or phrases. The first member of each pair is a sentence or phrase in Classical Tibetan. The second member is the English translation of the first.

The pairs are pulled from texts sourced from Lotsawa House (lotsawahouse.org) and are offered under the same license as the original texts from which they were drawn.

This data was scraped, cleaned, and formatted programmatically.

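
The dataset used for the additional training described below, ['billingsmoore/tibetan-to-english-translation-dataset'](https://huggingface.co/datasets/billingsmoore/tibetan-to-english-translation-dataset), can also be loaded and inspected directly from the Hugging Face Hub. A minimal sketch, assuming a `train` split and the `translation` column layout with `bo` and `en` keys used in the finetuning example above:

```python
from datasets import load_dataset

dataset = load_dataset("billingsmoore/tibetan-to-english-translation-dataset")

print(dataset)                             # splits and sizes
print(dataset["train"][0]["translation"])  # e.g. {'bo': '...', 'en': '...'}
```
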
### Training Procedure

The T5 tokenizer was updated in the same manner as ['billingsmoore/tibetan-phonetic-transliteration'](https://huggingface.co/billingsmoore/tibetan-phonetic-transliteration), the procedure for which can be found on that model card.

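
For orientation only, the sketch below shows one common way of extending a T5 tokenizer with additional tokens and resizing the embedding layer to match; the characters listed are placeholders, and the authoritative procedure is the one described on the linked model card:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-large")

# Placeholder examples of characters to add; not the actual list used for this model.
new_tokens = ["ཀ", "ཁ", "ག"]
tokenizer.add_tokens([tok for tok in new_tokens if tok not in tokenizer.get_vocab()])

# Grow the model's embedding matrix to cover the new vocabulary entries.
model.resize_token_embeddings(len(tokenizer))
```
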
Beyond the training for ['billingsmoore/phonetic-tibetan-to-english-translation'](https://huggingface.co/billingsmoore/phonetic-tibetan-to-english-translation), whose full training is described in its model card, this model was trained for 9 epochs on the dataset ['billingsmoore/tibetan-to-english-translation-dataset'](https://huggingface.co/datasets/billingsmoore/tibetan-to-english-translation-dataset).

#### Training Hyperparameters

- This model was trained using the Adafactor optimizer with a learning rate of 2e-5.

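
As a reference point, those settings correspond to an Adafactor configuration like the one below (mirroring the finetuning example above; every argument other than the learning rate is an assumption):

```python
from transformers import Adafactor

optimizer = Adafactor(
    model.parameters(),   # model as loaded in the finetuning example above
    lr=2e-5,              # learning rate reported for this model's training
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
)
```
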
## Evaluation

The evaluation metric for this model was the BLEU score as implemented by [sacreBLEU](https://pypi.org/project/sacrebleu/). BLEU (Bilingual Evaluation Understudy) scores measure the quality of machine-generated translations by comparing them to human-provided reference translations. The score ranges from 0 to 100, where 100 represents a perfect match with the reference translations. It evaluates the precision of n-grams (word sequences) in the generated text, with higher scores indicating closer alignment to the reference translations. A brevity penalty is applied to discourage translations that are too short.

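
For illustration, a score can be computed on toy data with the `evaluate` wrapper around sacreBLEU (the sentences below are placeholders, not drawn from the evaluation set):

```python
import evaluate

metric = evaluate.load("sacrebleu")

predictions = ["This is a placeholder translation."]
references = [["This is a placeholder reference translation."]]

result = metric.compute(predictions=predictions, references=references)
print(result["score"])  # corpus-level BLEU, between 0 and 100
```
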
The final BLEU score was 59.3431.