Model Card: English–Faroese Translation Adapter

Model Details

Model Description

  • Developed by: Barbara Scalvini
  • Model type: Language model adapter for English → Faroese translation
  • Language(s): English, Faroese
  • License: This adapter inherits the license from the original GPT-SW3 6.7B model.
  • Finetuned from model: AI-Sweden-Models/gpt-sw3-6.7b-v2
  • Library used: PEFT 0.13.0

Model Sources

  • Paper: [COMING SOON]

Uses

Direct Use

This adapter translates English text into Faroese. It was trained with a parameter-efficient fine-tuning (PEFT) approach on top of the base model.

Downstream Use

  • Can be integrated into broader multilingual or localization workflows.

Out-of-Scope Use

  • Any uses that rely on languages other than English or Faroese will likely yield suboptimal results.
  • Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.

Bias, Risks, and Limitations

  • Biases: The model could reflect biases present in the training data, such as historical or societal biases in English or Faroese texts.
  • Recommendation: Users should critically evaluate outputs, especially in sensitive or high-stakes applications.

How to Get Started with the Model

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ADAPTER_REPO = "barbaroo/gptsw3_translate_6.7B"
BASE_MODEL = "AI-Sweden-Models/gpt-sw3-6.7b-v2"

# Load the tokenizer from the base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Load the base model with the adapter weights applied
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_REPO,
    load_in_8bit=True,             # Optional: 8-bit quantization for GPU memory efficiency (requires bitsandbytes)
    device_map="auto",             # Automatically spread layers across available GPUs
)

# Ensure the model is in evaluation mode
model.eval()

# Alpaca-style prompt template
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

# EOS token from the tokenizer
EOS_TOKEN = tokenizer.eos_token
print(EOS_TOKEN)

sentences = ['hello world']

translations = []

for sentence in sentences:
    # Build the prompt for this sentence, leaving the response slot empty
    prompt = alpaca_prompt.format(
        "Translate this sentence from English to Faroese:",  # instruction
        sentence,  # input sentence to translate
        "",  # response: left blank for generation
    )

    # Tokenize and move the tensors to the device the model was loaded on
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

    # Generate the output without tracking gradients
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=2000,
            eos_token_id=tokenizer.eos_token_id,  # stop at the EOS token
            pad_token_id=tokenizer.pad_token_id,  # token used for padding
            use_cache=True,
            do_sample=True,
            temperature=0.1,
            top_p=1.0,
        )

    # Decode the generated tokens into a string
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]

    # Split on the response marker to extract the translation, then drop
    # the EOS token and surrounding whitespace
    marker = 'Response:\n'
    if marker in output_string:
        translation = output_string.split(marker, 1)[1].replace(EOS_TOKEN, '').strip()
    else:
        translation = ''  # marker not found: record an empty translation
    translations.append(translation)

    print(translation)
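
The post-processing step in the loop above can be isolated into a small helper, which makes it easy to check without loading the model. This is a minimal sketch: the function name `extract_translation` is hypothetical, the decoded string is a hand-written illustration rather than real model output, and the helper additionally strips surrounding whitespace.

```python
def extract_translation(output_string: str, eos_token: str,
                        marker: str = "Response:\n") -> str:
    """Return the text after the response marker, without the EOS token."""
    if marker not in output_string:
        return ""  # marker missing: nothing to extract
    response = output_string.split(marker, 1)[1]
    return response.replace(eos_token, "").strip()


# Hand-written stand-in for a decoded generation (not real model output)
decoded = (
    "\n### Instruction:\nTranslate this sentence from English to Faroese:\n\n"
    "### Input:\nhello world\n\n"
    "### Response:\nhalló heimur<|endoftext|>"
)
print(extract_translation(decoded, "<|endoftext|>"))
```

In real use the `eos_token` argument would come from `tokenizer.eos_token` rather than a hard-coded string.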

Training Details

Training Data

We used the Sprotin parallel corpus for English–Faroese translation: barbaroo/Sprotin_parallel.

Training Procedure

Preprocessing

  • Tokenization: We used the tokenizer from the base model AI-Sweden-Models/gpt-sw3-6.7b-v2.
  • The Alpaca prompt format was used, with Instruction, Input and Response.
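
As an illustration of the prompt format, the sketch below fills the same template used in the inference example with one sentence pair. Several details are assumptions rather than facts from this card: the helper name `format_example` is hypothetical, appending the EOS token to each training example is a common convention that is not confirmed here, the EOS string is a placeholder for `tokenizer.eos_token`, and the sentence pair is illustrative rather than taken from the corpus.

```python
ALPACA_PROMPT = """
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

INSTRUCTION = "Translate this sentence from English to Faroese:"


def format_example(english: str, faroese: str, eos_token: str) -> str:
    """Fill the template with one sentence pair and append the EOS token."""
    # Assumption: the EOS token marks the end of each training example
    return ALPACA_PROMPT.format(INSTRUCTION, english, faroese) + eos_token


# Illustrative pair and placeholder EOS string (both assumptions)
example = format_example("Good morning.", "Góðan morgun.", "<|endoftext|>")
print(example)
```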

Training Hyperparameters

  • Epochs: 3 total, with an early stopping criterion monitoring validation loss.
  • Batch size: 2, with 4 gradient accumulation steps
  • Learning Rate: 2e-4
  • Optimizer: AdamW with a linear learning-rate scheduler and warm-up.
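
To make the batch size and schedule settings concrete, here is a minimal pure-Python sketch of a linear learning-rate schedule with warm-up, peaking at the learning rate listed above. The total and warm-up step counts are hypothetical placeholders, since this card does not state them, and the function is an illustration of the general technique rather than the training code that was actually used.

```python
def linear_schedule_with_warmup(step: int, total_steps: int,
                                warmup_steps: int, peak_lr: float = 2e-4) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Gradient accumulation: gradients from 4 batches of 2 examples are summed
# before each optimizer step (assuming the batch size is per device)
batch_size = 2
gradient_accumulation_steps = 4
effective_batch_size = batch_size * gradient_accumulation_steps  # 8

TOTAL_STEPS = 1000   # hypothetical, not stated in this card
WARMUP_STEPS = 100   # hypothetical, not stated in this card
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_with_warmup(step, TOTAL_STEPS, WARMUP_STEPS))
```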

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • The model was evaluated on the FLORES-200 benchmark (~1012 English–Faroese sentence pairs).

Metrics and Results

  • BLEU: 0.183
  • chrF: 50.3
  • BERTScore F1: 0.951
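
For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: an F-score over character n-gram overlap between a hypothesis and a reference, reported on a 0 to 100 scale. It is a simplified sentence-level variant written for illustration. The function names are hypothetical, the defaults (n-grams up to length 6, beta of 2, whitespace ignored) are assumptions, and it is not the tool used to produce the score above, so its values will not match standard implementations exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("hello world", "hello word"))
```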

Human evaluation was also performed (see the forthcoming paper).

Citation

[COMING SOON]


Framework versions

  • PEFT 0.13.0