---
library_name: transformers
language:
- bo
- en
base_model:
- google-t5/t5-large
- billingsmoore/phonetic-tibetan-to-english-translation
license: cc
metrics:
- bleu
pipeline_tag: translation
datasets:
- billingsmoore/tibetan-to-english-translation-dataset
tags:
- tibetan
- english
- translation
- nlp
- buddhism
- dharma
---

# Model Card for tibetan-to-english-translation

This model is a neural machine translation model for translating Literary Tibetan to English.

The model expects Tibetan text as input, either in Tibetan script or transliterated according to THL Simplified Phonetic Transliteration, and outputs an English translation.

The model was evaluated using the BLEU metric as implemented by [sacreBLEU](https://pypi.org/project/sacrebleu/), with a final score of 59.3431.

This work is licensed under [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/).

## Model Details

### Model Description

This model is a finetuned T5 model with 770 million parameters.

- **Developed by:** billingsmoore
- **Model type:** text-to-text (sequence-to-sequence) translation model
- **Language(s) (NLP):** Tibetan, English
- **License:** [Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/)
- **Finetuned from model:** google-t5/t5-large

### Model Sources

- **Repository:** [MLotsawa on Github](https://github.com/billingsmoore/MLotsawa)

## Uses

This model is intended to be used as the translation model in the larger MLotsawa software, but can also be used in a Jupyter notebook or Python script.

### Direct Use

To use this model for translation, you can use the following code:

```python
from transformers import pipeline

translator = pipeline('translation', 'billingsmoore/tibetan-to-english-translation')

# Tibetan script or THL phonetic transliteration both work as input
input_text = "<your transliterated Tibetan text>"

translation = translator(input_text)

print(translation)
```

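The pipeline returns a list with one dictionary per input; the English text is stored under the `translation_text` key:

```python
# Extract the translated string from the pipeline output above
print(translation[0]['translation_text'])
```
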
### Downstream Use

The model can be further finetuned using the following code:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, DataCollatorForSeq2Seq,
    AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments,
    Seq2SeqTrainer, EarlyStoppingCallback, Adafactor
)
import evaluate
import numpy as np

# Load your own dataset of Tibetan-English pairs (placeholder path)
dataset = load_dataset("path_to_your_dataset")

checkpoint = "billingsmoore/tibetan-to-english-translation"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

source_lang = 'bo'
target_lang = 'en'
prefix = "translate Tibetan to English: "

def preprocess_function(examples):
    # Prepend the task prefix to each source sentence
    inputs = [prefix + example[source_lang] for example in examples['translation']]
    targets = [example[target_lang] for example in examples['translation']]

    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 (ignored positions) with the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

early_stop = EarlyStoppingCallback()

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,  # set to True to use mixed precision on supported GPUs
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

trainer.train()
```

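A note on the design: Adafactor with a fixed (non-relative) learning rate is the optimizer family T5 was originally trained with, and it uses considerably less memory than Adam for a 770M-parameter model. Because `load_best_model_at_end=True` is combined with epoch-level evaluation and saving, the `EarlyStoppingCallback` halts training once the evaluation loss stops improving and the best checkpoint is restored at the end.
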
## Training Details

### Training Data

[Training data for this project is available here.](https://www.kaggle.com/datasets/billingsmoore/classical-tibetan-to-english-translation-dataset)

This dataset consists of 100,000 pairs of sentences or phrases. The first member of each pair is a sentence or phrase in Classical Tibetan. The second member is the English translation of the first.

The pairs are pulled from texts sourced from Lotsawa House (lotsawahouse.org) and are offered under the same license as the original texts from which they were drawn.

This data was scraped, cleaned, and formatted programmatically.

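For a quick look at the data, the Hub version of the dataset can be loaded directly. This is a minimal sketch; the `translation` column with `bo`/`en` keys is an assumption carried over from the preprocessing code in the fine-tuning example above.

```python
from datasets import load_dataset

dataset = load_dataset("billingsmoore/tibetan-to-english-translation-dataset")

# Inspect one Tibetan-English pair (assumes a 'train' split and a
# 'translation' column holding {'bo': ..., 'en': ...} dicts)
pair = dataset["train"][0]["translation"]
print(pair["bo"])
print(pair["en"])
```
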
### Training Procedure

Beyond the training for ['billingsmoore/phonetic-tibetan-to-english-translation'](https://huggingface.co/billingsmoore/phonetic-tibetan-to-english-translation), whose full training is described in its model card, this model was trained for 9 epochs on the dataset ['billingsmoore/tibetan-to-english-translation-dataset'](https://huggingface.co/datasets/billingsmoore/tibetan-to-english-translation-dataset).

#### Training Hyperparameters

- This model was trained using the Adafactor optimizer with a learning rate of 2e-5 (see the sketch below).

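For reference, a minimal sketch of that optimizer configuration; the flags other than the learning rate mirror the Adafactor call in the fine-tuning example above, and their use in the original run is an assumption.

```python
from transformers import Adafactor, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-large")

# Adafactor with a fixed (non-relative) learning rate of 2e-5, per the
# hyperparameters reported above
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=2e-5
)
```
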
## Evaluation

The evaluation metric for this model was the BLEU score as implemented by [sacreBLEU](https://pypi.org/project/sacrebleu/). BLEU (Bilingual Evaluation Understudy) scores measure the quality of machine-generated translations by comparing them to human-provided reference translations. The score ranges from 0 to 100, where 100 represents a perfect match with the reference translations. It evaluates the precision of n-grams (word sequences) in the generated text, with higher scores indicating closer alignment to the reference translations. A brevity penalty is applied to discourage translations that are too short.

The final BLEU score was 59.3431.
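
For illustration, this is how such a score can be computed with sacreBLEU directly (the hypothesis and reference strings here are invented examples):

```python
import sacrebleu

# Model outputs and human reference translations (invented examples)
hypotheses = ["The nature of mind is clear light."]
references = [["The nature of the mind is clear light."]]

# corpus_bleu takes a list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 4))  # BLEU on the 0-100 scale
```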