---
language:
- ko
- uz
- en
- ru
- zh
- ja
- km
- my
- si
- tl
- th
- vi
- kk
- bn
- mn
- id
- ne
- pt
tags:
- translation
- multilingual
- korean
- uzbek
datasets:
- custom_parallel_corpus
license: mit
---

# QWEN2.5-7B-Bnk-7e

## Model Description

QWEN2.5-7B-Bnk-7e is a multilingual translation model based on the Qwen2.5 architecture with 7 billion parameters. It specializes in translating from multiple source languages into Korean and Uzbek.

## Intended Uses & Limitations

The model is designed for translating text from various Asian and European languages into Korean and Uzbek. It can be used for tasks such as:

- Multilingual document translation
- Cross-lingual information retrieval
- Language learning applications
- International communication assistance

Although the model aims for accuracy, its translations may contain errors, especially for idiomatic expressions or highly context-dependent content.

## Training and Evaluation Data

The model was fine-tuned on a diverse dataset of parallel texts covering the supported languages. Evaluation was performed on held-out test sets for each language pair.

## Training Procedure

Fine-tuning was performed on the Qwen2.5 7B base model using custom datasets for the specific language pairs.

## Supported Languages

The model supports translation from the following languages into Korean and Uzbek:

- Kazakh (kk)
- Russian (ru)
- Thai (th)
- Chinese (Simplified) (zh)
- Chinese (Traditional) (zh-tw, zh-hant)
- Bengali (bn)
- Mongolian (mn)
- Indonesian (id)
- Nepali (ne)
- English (en)
- Khmer (km)
- Portuguese (pt)
- Sinhala (si)
- Korean (ko)
- Tagalog (tl)
- Burmese (my)
- Vietnamese (vi)
- Japanese (ja)

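The language codes listed above can be captured in a small lookup so that unsupported pairs are rejected before the model is called. The following is a minimal sketch only: `SOURCE_LANGS`, `TARGET_LANGS`, and `build_messages` are hypothetical helper names that are not part of the model or of `transformers`, and the system prompt simply reuses the wording from the usage example in this card.

```python
# Hypothetical helper (not shipped with the model): codes copied from the
# "Supported Languages" list in this card.
SOURCE_LANGS = {
    "kk": "Kazakh", "ru": "Russian", "th": "Thai",
    "zh": "Chinese (Simplified)",
    "zh-tw": "Chinese (Traditional)", "zh-hant": "Chinese (Traditional)",
    "bn": "Bengali", "mn": "Mongolian", "id": "Indonesian", "ne": "Nepali",
    "en": "English", "km": "Khmer", "pt": "Portuguese", "si": "Sinhala",
    "ko": "Korean", "tl": "Tagalog", "my": "Burmese", "vi": "Vietnamese",
    "ja": "Japanese",
}
TARGET_LANGS = {"ko": "Korean", "uz": "Uzbek"}


def build_messages(source_text, source_lang, target_lang):
    """Return chat messages for one translation request, or raise ValueError."""
    if source_lang not in SOURCE_LANGS:
        raise ValueError(f"Unsupported source language: {source_lang!r}")
    if target_lang not in TARGET_LANGS:
        raise ValueError(f"Unsupported target language: {target_lang!r}")
    if source_lang == target_lang:
        raise ValueError("Source and target languages must differ")
    return [
        {"role": "system",
         "content": f"Translate {source_lang} to {target_lang} word by word correctly."},
        {"role": "user", "content": source_text},
    ]


messages = build_messages("Hello, how are you?", "en", "ko")
print(messages[0]["content"])  # Translate en to ko word by word correctly.
```

The returned list has the shape expected by `tokenizer.apply_chat_template`, so it could be passed there directly.
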
## How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "FINGU-AI/QWEN2.5-7B-Bnk-5e"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Qwen2.5 is a decoder-only model, so it is loaded as a causal LM
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
model.to("cuda")  # the example assumes a CUDA-capable GPU

# Example usage
source_text = "Hello, how are you?"
source_lang = "en"
target_lang = "ko"  # or "uz" for Uzbek

messages = [
    {"role": "system", "content": f"Translate {source_lang} to {target_lang} word by word correctly."},
    {"role": "user", "content": source_text},
]

# Apply chat template and move the prompt to the same device as the model
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Cap the number of generated tokens (the prompt is not counted)
outputs = model.generate(input_ids, max_new_tokens=100)
# Drop the prompt tokens so that only the translation is decoded
response = outputs[0][input_ids.shape[-1]:]
translated_text = tokenizer.decode(response, skip_special_tokens=True)
print(translated_text)
```
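
The usage example caps generation at 100 tokens, so a longer document would likely need to be split into smaller pieces and translated piece by piece. The card does not describe how long inputs should be handled, so the following is only one possible approach: it packs blank-line-separated paragraphs into chunks under a character budget. The function name `chunk_paragraphs` and the `max_chars` value are assumptions chosen for illustration, and a character count is only a rough proxy for the token count.

```python
def chunk_paragraphs(text, max_chars=400):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty paragraphs
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


document = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in chunk_paragraphs(document, max_chars=35):
    print(repr(chunk))
```

Each chunk could then be used as `source_text` in the example above, with the translated pieces joined by blank lines afterwards. Splitting on paragraph boundaries keeps sentences intact, at the cost of losing context that spans chunks.
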

## Performance

## Limitations

- The model's performance may vary across different language pairs and domains.
- It may struggle with very colloquial or highly specialized text.
- The model may not always capture cultural nuances or context-dependent meanings accurately.

## Ethical Considerations

- The model should not be used for generating or propagating harmful, biased, or misleading content.
- Users should be aware of potential biases in the training data that may affect translations.
- The model's outputs should not be considered as certified translations for official or legal purposes without human verification.

## Citation

```bibtex
@misc{fingu2023qwen25,
  author = {FINGU AI and AI Team},
  title = {QWEN2.5-7B-Bnk-7e: A Multilingual Translation Model},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/FINGU-AI/QWEN2.5-7B-Bnk-5e}}
}
```