metadata

library_name: transformers
license: llama3.1
language:
  - ko
  - vi
  - id
  - km
  - th
metrics:
  - bleu
  - rouge
base_model:
  - meta-llama/Llama-3.1-8B-Instruct

Model Card for MLP-KTLim/llama-3.1-8B-Asian-Translator

This is a multilingual translation model fine-tuned from the Llama 3.1 8B Instruct base model. It translates in both directions between the following Asian languages:

Korean
Vietnamese
Indonesian
Cambodian (Khmer)
Thai

Model Details

The model is designed for translating short text segments between any pair of the supported languages.

Supported language pairs:

Korean ↔ Vietnamese
Korean ↔ Indonesian
Korean ↔ Cambodian
Korean ↔ Thai
Vietnamese ↔ Indonesian
Vietnamese ↔ Cambodian
Vietnamese ↔ Thai
Indonesian ↔ Cambodian
Indonesian ↔ Thai
Cambodian ↔ Thai
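
The ten pairs above, and the twenty translation directions they imply, can be enumerated programmatically. The following is a minimal sketch; the names `LANGUAGES`, `pairs`, and `directions` are illustrative and not part of the model's API.

```python
from itertools import combinations, permutations

# Language names as they appear in the list above.
LANGUAGES = ["Korean", "Vietnamese", "Indonesian", "Cambodian", "Thai"]

# Unordered pairs (A <-> B): 5 choose 2 = 10.
pairs = list(combinations(LANGUAGES, 2))

# Ordered translation directions (A -> B): 5 * 4 = 20.
directions = list(permutations(LANGUAGES, 2))

for source, target in pairs:
    print(f"{source} <-> {target}")

print(len(pairs), "pairs,", len(directions), "directions")
```

Each pair counts twice as a direction because the model translates both ways, which is why ten pairs correspond to twenty directions.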

Model Description

This model is optimized for translation between Korean and four Southeast Asian languages (Vietnamese, Indonesian, Khmer, and Thai), with the goal of supporting communication between these language communities.

It was trained on 20M examples (1M for each of the 20 translation directions), which gives it broad coverage of common expressions and basic conversations in these languages.

Model Architecture

Base Model: meta-llama/Llama-3.1-8B-Instruct

Bias, Risks, and Limitations

Performance is limited to short sentences and phrases
May not handle complex or lengthy text effectively
Translation quality may vary depending on language pair and content complexity
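
Because the model targets short segments, one possible workaround for longer input is to split it into sentence-sized pieces and translate each piece separately. The sketch below shows only the splitting step. `split_into_segments` is a hypothetical helper, not part of this model or of `transformers`, and the punctuation-based rule is an assumption: it will not segment Thai, which does not normally mark sentence ends with punctuation, and translating sentences in isolation can lose context.

```python
import re

# Whitespace that follows a sentence-final mark: ASCII . ! ? or the Khmer khan.
_SENTENCE_END = re.compile(r"(?<=[.!?។])\s+")


def split_into_segments(text: str) -> list[str]:
    """Split text into sentence-like segments on terminal punctuation.

    Hypothetical helper: a simple heuristic, not a full sentence segmenter.
    """
    parts = _SENTENCE_END.split(text.strip())
    return [part for part in parts if part]


segments = split_into_segments("안녕하세요? 아시아 언어 번역 모델 입니다.")
print(len(segments), "segments")
```

Each resulting segment could then be translated on its own and the outputs joined in order.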

Evaluation Results

Source Language	Target Language	BLEU Score	ROUGE-1	ROUGE-L
Korean	Vietnamese	56.70	81.64	76.66
Korean	Cambodian	71.69	89.26	88.20
Korean	Indonesian	58.32	80.39	76.63
Korean	Thai	63.26	78.88	72.29
Vietnamese	Korean	49.01	75.57	72.74
Vietnamese	Cambodian	78.26	90.74	90.32
Vietnamese	Indonesian	65.96	83.08	81.46
Vietnamese	Thai	65.93	81.09	76.57
Cambodian	Korean	49.10	72.67	69.75
Cambodian	Vietnamese	63.42	81.56	79.09
Cambodian	Indonesian	61.41	79.67	77.75
Cambodian	Thai	70.91	81.85	77.66
Indonesian	Korean	53.61	77.14	74.29
Indonesian	Vietnamese	68.21	85.41	83.10
Indonesian	Cambodian	78.84	90.81	90.35
Indonesian	Thai	67.12	81.54	77.19
Thai	Korean	45.59	72.48	69.46
Thai	Vietnamese	61.55	81.01	78.24
Thai	Cambodian	78.52	91.47	91.16
Thai	Indonesian	58.99	78.56	76.40
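
To make the table easier to read across directions, the BLEU column can be averaged per target language. The sketch below copies the BLEU scores from the table above; the variable and function names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (source, target, BLEU) rows copied from the evaluation table above.
BLEU_SCORES = [
    ("Korean", "Vietnamese", 56.70),
    ("Korean", "Cambodian", 71.69),
    ("Korean", "Indonesian", 58.32),
    ("Korean", "Thai", 63.26),
    ("Vietnamese", "Korean", 49.01),
    ("Vietnamese", "Cambodian", 78.26),
    ("Vietnamese", "Indonesian", 65.96),
    ("Vietnamese", "Thai", 65.93),
    ("Cambodian", "Korean", 49.10),
    ("Cambodian", "Vietnamese", 63.42),
    ("Cambodian", "Indonesian", 61.41),
    ("Cambodian", "Thai", 70.91),
    ("Indonesian", "Korean", 53.61),
    ("Indonesian", "Vietnamese", 68.21),
    ("Indonesian", "Cambodian", 78.84),
    ("Indonesian", "Thai", 67.12),
    ("Thai", "Korean", 45.59),
    ("Thai", "Vietnamese", 61.55),
    ("Thai", "Cambodian", 78.52),
    ("Thai", "Indonesian", 58.99),
]


def mean_bleu_by_target(rows):
    """Average BLEU over all source languages for each target language."""
    grouped = defaultdict(list)
    for _source, target, bleu in rows:
        grouped[target].append(bleu)
    return {target: mean(scores) for target, scores in grouped.items()}


averages = mean_bleu_by_target(BLEU_SCORES)
# Print target languages from lowest to highest average BLEU.
for target, score in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{target}: {score:.2f}")
```

On these figures, translation into Korean has the lowest average BLEU (about 49.3) and translation into Cambodian the highest (about 76.8).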

Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "MLP-KTLim/llama-3.1-8B-Asian-Translator",
    torch_dtype="auto",
    device_map="auto",
)

# Load the tokenizer from the same repository as the model.
tokenizer = AutoTokenizer.from_pretrained(
    "MLP-KTLim/llama-3.1-8B-Asian-Translator",
)

# Korean for: "Hello? This is an Asian language translation model."
input_text = "안녕하세요? 아시아 언어 번역 모델 입니다."

def get_input_ids(source_lang, target_lang, message):
    assert source_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]
    assert target_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]
    
    input_ids = tokenizer.apply_chat_template(
        conversation=[
            {"role": "system", "content": f"You are a useful translation AI. Please translate the sentence given in {source_lang} into {target_lang}."},
            {"role": "user", "content": message},
        ],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True,
    )
    return input_ids

input_ids = get_input_ids(
    source_lang="Korean",
    target_lang="Vietnamese",
    message=input_text,
)

output = model.generate(
    input_ids.to(model.device),
    max_new_tokens=128,
)

# Decode only the newly generated tokens, skipping the prompt tokens.
print(tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True))
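
The example validates language names with `assert`, which Python skips when run with the `-O` flag. One possible variant, sketched below, separates prompt construction from tokenization and raises `ValueError` instead. `SUPPORTED_LANGUAGES` and `build_conversation` are hypothetical names introduced here; the system prompt is copied verbatim from the example above, and rejecting identical source and target languages is an added assumption.

```python
SUPPORTED_LANGUAGES = ("Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian")


def build_conversation(source_lang: str, target_lang: str, message: str) -> list[dict]:
    """Return the chat messages for one translation request (hypothetical helper)."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {lang!r}")
    if source_lang == target_lang:
        # Assumption: only pairs of distinct languages are supported.
        raise ValueError("source_lang and target_lang must differ")
    # Same system prompt wording as in the example above.
    system_prompt = (
        "You are a useful translation AI. Please translate the sentence "
        f"given in {source_lang} into {target_lang}."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": message},
    ]


conversation = build_conversation("Korean", "Vietnamese", "안녕하세요?")
print(conversation[0]["content"])
```

The returned list can be passed as the `conversation` argument of `tokenizer.apply_chat_template` in place of the inline list in `get_input_ids`.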

Contributors

원인호 (wih1226@seoultech.ac.kr)
김민준 (mjkmain@seoultech.ac.kr)