# Model Card: English–Faroese Translation Adapter

## Model Details

### Model Description
- Developed by: Barbara Scalvini
- Model type: Language model adapter for English → Faroese translation
- Language(s): English, Faroese
- License: This adapter inherits the license from the original Llama 3.1 8B model.
- Finetuned from model: meta-llama/Meta-Llama-3.1-8B
- Library used: PEFT 0.13.0
### Model Sources
- Paper: [COMING SOON]
## Uses

### Direct Use
This adapter is intended to perform English→Faroese translation, leveraging a parameter-efficient fine-tuning (PEFT) approach.
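Because the released weights are a PEFT adapter, they can also be loaded explicitly on top of the base model. A minimal sketch, assuming the barbaroo/llama3.1_translate_8B repo holds the adapter weights and tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the translation adapter on top of it
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", device_map="auto"
)
model = PeftModel.from_pretrained(base, "barbaroo/llama3.1_translate_8B")
tokenizer = AutoTokenizer.from_pretrained("barbaroo/llama3.1_translate_8B")
```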
### Downstream Use
- Can be integrated into broader multilingual or localization workflows.
### Out-of-Scope Use
- Any uses that rely on languages other than English or Faroese will likely yield suboptimal results.
- Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.
## Bias, Risks, and Limitations
- Biases: The model could reflect biases present in the training data, such as historical or societal biases in English or Faroese texts.
- Recommendation: Users should critically evaluate outputs, especially in sensitive or high-stakes applications.
## How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the trained model and tokenizer from the checkpoint
checkpoint_dir = "barbaroo/llama3.1_translate_8B"

model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir, device_map="auto", load_in_8bit=True
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

MAX_SEQ_LENGTH = 512
sentences = ["What's your name?"]

# Define the prompt template (same as in training)
alpaca_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""

# Inference loop
for sentence in sentences:
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",  # Instruction
                sentence,  # The input sentence to translate
                "",  # Leave the response blank for generation
            )
        ],
        return_tensors="pt",
        padding=True,
        truncation=True,  # Make sure the input is not too long
        max_length=MAX_SEQ_LENGTH,  # Enforce the max length if necessary
    ).to("cuda")

    # Generate the translation
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,  # Limit the number of new tokens generated
        eos_token_id=tokenizer.eos_token_id,  # Stop at the end-of-sequence token
        pad_token_id=tokenizer.pad_token_id,  # Use the tokenizer's padding token
        do_sample=True,  # Enable sampling so the temperature setting takes effect
        temperature=0.1,  # Low temperature keeps output close to greedy
        top_p=1.0,  # Nucleus-sampling threshold (1.0 = no truncation)
        use_cache=True,  # Use the KV cache for efficiency
    )

    # Decode the generated tokens into text
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(f"Input: {sentence}")
    print(f"Generated Translation: {output_string}")
```
## Training Details

### Training Data
We used the Sprotin parallel corpus for English–Faroese translation: barbaroo/Sprotin_parallel.
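The corpus can be pulled directly from the Hugging Face Hub. A minimal sketch; the available splits and column names should be checked on the dataset page:

```python
from datasets import load_dataset

# English–Faroese parallel corpus used for fine-tuning
dataset = load_dataset("barbaroo/Sprotin_parallel")
print(dataset)  # inspect the splits and column names
```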
### Training Procedure

#### Preprocessing

- Tokenization: We used the tokenizer from the base model, meta-llama/Llama-3.1-8B.
- The Alpaca prompt format was used, with Instruction, Input, and Response fields (see the sketch below).
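For reference, a training example would be formatted roughly as follows. This is a sketch only; the exact instruction string and EOS handling are assumptions based on the inference code above:

```python
def format_training_example(source: str, target: str, tokenizer) -> str:
    # Fill in all three Alpaca fields at training time; appending EOS
    # teaches the model to stop after emitting the translation.
    text = alpaca_prompt.format(
        "Translate this sentence from English to Faroese:",  # Instruction
        source,  # English input
        target,  # Faroese reference translation
    )
    return text + tokenizer.eos_token
```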
#### Training Hyperparameters

- Epochs: 3 total, with an early-stopping criterion monitoring validation loss.
- Batch size: 2, with 4 gradient-accumulation steps (effective batch size of 8).
- Learning rate: 2e-4
- Optimizer: AdamW with a linear learning-rate scheduler and warm-up.
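The exact trainer configuration is not published here; the following is a hypothetical transformers `TrainingArguments` setup consistent with the values above (output directory, warm-up ratio, and evaluation cadence are assumptions):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3.1_translate_8B",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    warmup_ratio=0.03,  # warm-up fraction is an assumption
    eval_strategy="epoch",  # periodic evaluation, needed for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,
)
```

Early stopping itself would be attached via transformers' `EarlyStoppingCallback`.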
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- The model was evaluated on the FLORES-200 benchmark (~1,012 English–Faroese sentence pairs).
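The benchmark can be loaded from the Hub. A sketch; the loader, config name, and split are assumptions and should be checked against the dataset card:

```python
from datasets import load_dataset

# FLORES-200, English–Faroese direction (devtest split)
flores = load_dataset(
    "facebook/flores", "eng_Latn-fao_Latn",
    split="devtest", trust_remote_code=True,
)
sources = flores["sentence_eng_Latn"]
references = flores["sentence_fao_Latn"]
```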
#### Metrics and Results

- BLEU: 0.175
- chrF: 49.5
- BERTScore (F1): 0.948

Human evaluation was also performed (see paper).
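Comparable scores can be computed with the evaluate library. A sketch: `preds` would be the model's translations of `sources` from the snippet above, and the multilingual BERT backbone for BERTScore is an assumption, since the backbone used in the reported score is not stated:

```python
import evaluate

bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=preds, references=[[r] for r in references]))
print(chrf.compute(predictions=preds, references=[[r] for r in references]))
print(bertscore.compute(
    predictions=preds,
    references=references,
    model_type="bert-base-multilingual-cased",  # assumption: backbone not stated
))
```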
## Citation
[COMING SOON]
### Framework versions
- PEFT 0.13.0
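If the adapter is loaded with PEFT as shown above, it can optionally be merged into the base weights for deployment without the PEFT dependency. A sketch, assuming `model` is a LoRA-style `PeftModel` loaded in full precision (8-bit quantized weights cannot be merged directly):

```python
# Merge the adapter into the base weights and save a standalone model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama3.1_translate_8B_merged")
tokenizer.save_pretrained("llama3.1_translate_8B_merged")
```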