metadata
license: cc-by-nc-4.0
base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
metrics:
  - bleu
model-index:
  - name: Terjman-Large-v2
    results: []
datasets:
  - atlasia/darija_english
language:
  - ar
  - en

Terjman-Large-v2 (240M params)

Our model is built on the Transformer architecture. It is a fine-tuned version of Helsinki-NLP/opus-mt-tc-big-en-ar, trained on the darija_english dataset enhanced with curated corpora aimed at high-quality, accurate translations. This model is an improvement over the previous version, Terjman-Large.

Fine-tuning was conducted on a single A100-40GB GPU and took 17 hours.

Try it out on our dedicated Terjman-Large-v2 Space 🤗

Usage

Using our model for translation is simple and straightforward. You can integrate it into your projects or workflows via the Hugging Face Transformers library. Here's a basic example of how to use the model in Python:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large-v2")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large-v2")

# Define the English text to translate
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
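
To translate several sentences at once, the same calls can be wrapped in a small batching helper. This is a minimal sketch, not part of the original model card: the function names `chunked` and `translate_batch`, and the default values for `batch_size` and `max_new_tokens`, are assumptions chosen for illustration. It expects the `tokenizer` and `model` objects loaded in the snippet above.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_batch(texts, tokenizer, model, batch_size=8, max_new_tokens=128):
    """Translate a list of English sentences in batches.

    `batch_size` and `max_new_tokens` are illustrative defaults (assumptions),
    not values recommended by the model authors.
    """
    translations = []
    for batch in chunked(texts, batch_size):
        # Pad to the longest sentence in the batch so tensors are rectangular.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Decode every sequence in the batch, dropping special tokens.
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

For example, `translate_batch(["Hello!", "How are you?"], tokenizer, model)` would return one translation per input sentence, in the same order.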

Example

Let's see an example of translating English to Moroccan Darija:

Input: "Hi my friend, can you tell me a joke in moroccan darija? I'd be happy to hear that from you!"

Output: "سلام صاحبتي ممكن تقولي ليا نكتة بالدارجة المغربية؟ نفرح نسمعها منك!"

Limitations

This version has some limitations, mainly due to the tokenizer. We're currently collecting more data with the aim of continuous improvement.

Feedback

We're continuously working to improve the model's performance and usability, and we will be improving it incrementally. If you have any feedback or suggestions, or if you encounter any issues, please don't hesitate to reach out to us.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-04
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 30
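
The hyperparameters listed above could be expressed as keyword arguments for the Transformers `Seq2SeqTrainingArguments` class, roughly as sketched below. The authors' actual training script is not shown in this card, so the mapping of names (for example `train_batch_size` to `per_device_train_batch_size`) and the `output_dir` value are assumptions. `total_train_batch_size` is a value reported by the trainer rather than an argument, so it is not set here.

```python
# Keyword arguments mirroring the hyperparameters listed above.
# The mapping to Seq2SeqTrainingArguments names is an assumption of this sketch.
TRAINING_KWARGS = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "gradient_accumulation_steps": 4,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.03,
    "num_train_epochs": 30,
}


def build_training_args(output_dir="terjman-finetune"):
    """Create Seq2SeqTrainingArguments from TRAINING_KWARGS.

    `output_dir` is a hypothetical path. transformers is imported lazily so
    the dictionary above can be inspected without the library installed.
    """
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```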

Framework versions

  • Transformers 4.39.2
  • PyTorch 2.2.2+cpu
  • Datasets 2.18.0
  • Tokenizers 0.15.2