---
library_name: transformers
license: llama3.1
language:
- ko
- vi
- id
- km
- th
metrics:
- bleu
- rouge
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Model Card for Model ID

This is a multilingual translation model fine-tuned from the Llama 3.1 8B Instruct base model. It translates in both directions between any pair of the following Asian languages:

- Korean
- Vietnamese
- Indonesian
- Cambodian (Khmer)
- Thai

## Model Details

The model is designed for translating short text segments between any pair of the supported languages.

Supported language pairs:

- Korean ↔ Vietnamese
- Korean ↔ Indonesian
- Korean ↔ Cambodian
- Korean ↔ Thai
- Vietnamese ↔ Indonesian
- Vietnamese ↔ Cambodian
- Vietnamese ↔ Thai
- Indonesian ↔ Cambodian
- Indonesian ↔ Thai
- Cambodian ↔ Thai

### Model Description

This model is optimized for translation among Korean and four Southeast Asian languages (Vietnamese, Indonesian, Khmer, and Thai), with the aim of supporting communication between these language communities.

The training data comprises 20M examples, 1M for each of the 20 translation directions (10 language pairs, both directions), which gives the model broad coverage of common expressions and basic conversations in these languages.
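
As a quick check on these figures, the sketch below enumerates the language pairs and translation directions with Python's `itertools`. The names `LANGUAGES` and `EXAMPLES_PER_DIRECTION` are illustrative and not part of the model's code; the 1M figure is taken from the description above.

```python
from itertools import combinations, permutations

# Language names as used in this model card.
LANGUAGES = ["Korean", "Vietnamese", "Indonesian", "Cambodian", "Thai"]
EXAMPLES_PER_DIRECTION = 1_000_000  # figure stated in the model description

pairs = list(combinations(LANGUAGES, 2))       # unordered language pairs
directions = list(permutations(LANGUAGES, 2))  # ordered (source, target) pairs

total_examples = len(directions) * EXAMPLES_PER_DIRECTION

print(len(pairs))       # 10
print(len(directions))  # 20
print(total_examples)   # 20000000
```

Five languages give 10 unordered pairs and 20 ordered directions, which matches the stated total of 20M examples.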

### Model Architecture

Base Model: meta-llama/Llama-3.1-8B-Instruct

## Bias, Risks, and Limitations

- Reliable performance is limited to short sentences and phrases
- Complex or lengthy text may not be handled effectively
- Translation quality may vary depending on language pair and content complexity
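
Because the model targets short segments, one possible workaround for longer input is to split it into sentence-sized pieces and translate each piece separately. The sketch below is a naive splitter and not part of the model's tooling: `split_into_segments` and its `max_chars` parameter are hypothetical names, and the punctuation-based rule is an assumption that will not suit every language (Thai, for example, usually does not mark sentence ends with punctuation).

```python
import re

def split_into_segments(text, max_chars=200):
    """Split text at sentence-ending punctuation, then pack consecutive
    sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept whole.
    """
    # Split after ., !, ? or the Khmer full stop (U+17D4) when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?\u17d4])\s+", text) if s.strip()]

    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be translated on its own with the generation code in the Example section. Splitting discards cross-sentence context, so this is a trade-off rather than a fix.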

## Evaluation results

| Source Language | Target Language | BLEU | ROUGE-1 | ROUGE-L |
|-----------------|-----------------|-------|---------|---------|
| Korean | Vietnamese | 56.70 | 81.64 | 76.66 |
| Korean | Cambodian | 71.69 | 89.26 | 88.20 |
| Korean | Indonesian | 58.32 | 80.39 | 76.63 |
| Korean | Thai | 63.26 | 78.88 | 72.29 |
| Vietnamese | Korean | 49.01 | 75.57 | 72.74 |
| Vietnamese | Cambodian | 78.26 | 90.74 | 90.32 |
| Vietnamese | Indonesian | 65.96 | 83.08 | 81.46 |
| Vietnamese | Thai | 65.93 | 81.09 | 76.57 |
| Cambodian | Korean | 49.10 | 72.67 | 69.75 |
| Cambodian | Vietnamese | 63.42 | 81.56 | 79.09 |
| Cambodian | Indonesian | 61.41 | 79.67 | 77.75 |
| Cambodian | Thai | 70.91 | 81.85 | 77.66 |
| Indonesian | Korean | 53.61 | 77.14 | 74.29 |
| Indonesian | Vietnamese | 68.21 | 85.41 | 83.10 |
| Indonesian | Cambodian | 78.84 | 90.81 | 90.35 |
| Indonesian | Thai | 67.12 | 81.54 | 77.19 |
| Thai | Korean | 45.59 | 72.48 | 69.46 |
| Thai | Vietnamese | 61.55 | 81.01 | 78.24 |
| Thai | Cambodian | 78.52 | 91.47 | 91.16 |
| Thai | Indonesian | 58.99 | 78.56 | 76.40 |
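
To make the variation across language pairs easier to see, the sketch below averages the BLEU column by target language. The scores are copied from the table; the `BLEU` dictionary and the `mean_bleu_by_target` helper are illustrative names, and the unweighted mean is only one assumed way to summarize the results.

```python
from collections import defaultdict

# (source, target) -> BLEU, copied from the evaluation table.
BLEU = {
    ("Korean", "Vietnamese"): 56.70, ("Korean", "Cambodian"): 71.69,
    ("Korean", "Indonesian"): 58.32, ("Korean", "Thai"): 63.26,
    ("Vietnamese", "Korean"): 49.01, ("Vietnamese", "Cambodian"): 78.26,
    ("Vietnamese", "Indonesian"): 65.96, ("Vietnamese", "Thai"): 65.93,
    ("Cambodian", "Korean"): 49.10, ("Cambodian", "Vietnamese"): 63.42,
    ("Cambodian", "Indonesian"): 61.41, ("Cambodian", "Thai"): 70.91,
    ("Indonesian", "Korean"): 53.61, ("Indonesian", "Vietnamese"): 68.21,
    ("Indonesian", "Cambodian"): 78.84, ("Indonesian", "Thai"): 67.12,
    ("Thai", "Korean"): 45.59, ("Thai", "Vietnamese"): 61.55,
    ("Thai", "Cambodian"): 78.52, ("Thai", "Indonesian"): 58.99,
}

def mean_bleu_by_target(scores):
    """Return the unweighted mean BLEU for each target language."""
    by_target = defaultdict(list)
    for (_source, target), bleu in scores.items():
        by_target[target].append(bleu)
    return {target: sum(vals) / len(vals) for target, vals in by_target.items()}

# Print target languages from lowest to highest mean BLEU.
for target, mean in sorted(mean_bleu_by_target(BLEU).items(), key=lambda kv: kv[1]):
    print(f"{target}: {mean:.2f}")
```

On these figures, translation into Korean has the lowest mean BLEU (about 49.3) and translation into Cambodian the highest (about 76.8).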

## Example

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model; dtype and device placement are selected automatically.
model = AutoModelForCausalLM.from_pretrained(
    "MLP-KTLim/llama-3.1-8B-Asian-Translator",
    torch_dtype="auto",
    device_map="auto",
)

# The tokenizer repo ID below differs from the model repo ID above;
# confirm that both match the checkpoint you intend to use.
tokenizer = AutoTokenizer.from_pretrained(
    "MLP-KTLim/llama-3.1-Asian-Bllossom-3B-Translator",
)

# Korean for: "Hello? This is an Asian language translation model."
input_text = "안녕하세요? 아시아 언어 번역 모델 입니다."

def get_input_ids(source_lang, target_lang, message):
    assert source_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]
    assert target_lang in ["Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian"]

    # The system prompt names the translation direction; the user turn carries the text.
    input_ids = tokenizer.apply_chat_template(
        conversation=[
            {"role": "system", "content": f"You are a useful translation AI. Please translate the sentence given in {source_lang} into {target_lang}."},
            {"role": "user", "content": message},
        ],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True,
    )
    return input_ids

input_ids = get_input_ids(
    source_lang="Korean",
    target_lang="Vietnamese",
    message=input_text,
)

output = model.generate(
    input_ids.to(model.device),
    max_new_tokens=128,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True))
```
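
The prompt construction in the example can also be separated from tokenization, which makes it easy to check without loading the model. The sketch below is one way to do that. `build_messages` is a hypothetical helper, the system prompt is copied from the example, and both raising `ValueError` instead of using `assert` and rejecting identical source and target languages are assumptions added for illustration.

```python
# Language names accepted by the example's get_input_ids function.
SUPPORTED_LANGUAGES = ("Korean", "Vietnamese", "Indonesian", "Thai", "Cambodian")

def build_messages(source_lang, target_lang, message):
    """Build the chat messages for one translation request."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {lang!r}")
    if source_lang == target_lang:
        # Assumption: translating a language into itself is treated as an error.
        raise ValueError("source_lang and target_lang must differ")

    system_prompt = (
        "You are a useful translation AI. "
        f"Please translate the sentence given in {source_lang} into {target_lang}."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": message},
    ]
```

The returned list can be passed as the `conversation` argument of `tokenizer.apply_chat_template` in place of the inline list used in the example.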

## Contributor

- 원인호 (wih1226@seoultech.ac.kr)
- 김민준 (mjkmain@seoultech.ac.kr)