QWEN2.5-7B-Bnk-3e / README.md
metadata
language:
  - ko
  - uz
  - en
  - ru
  - zh
  - ja
  - km
  - my
  - si
  - tl
  - th
  - vi
  - bn
  - mn
  - id
  - ne
  - pt
tags:
  - translation
  - multilingual
  - korean
  - uzbek
datasets:
  - custom_parallel_corpus
license: mit

QWEN2.5-7B-Bnk-7e

Model Description

QWEN2.5-7B-Bnk-5e is a multilingual translation model based on the QWEN 2.5 architecture with 7 billion parameters. It specializes in translating from multiple source languages into Korean and Uzbek.

Intended Uses & Limitations

The model is designed for translating text from various Asian and European languages to Korean and Uzbek. It can be used for tasks such as:

  • Multilingual document translation
  • Cross-lingual information retrieval
  • Language learning applications
  • International communication assistance

Please note that while the model strives for accuracy, it may not always produce perfect translations, especially for idiomatic expressions or highly context-dependent content.

Training and Evaluation Data

The model was fine-tuned on a diverse dataset of parallel texts covering the supported languages. Evaluation was performed on held-out test sets for each language pair.
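The card does not describe how the held-out sets were built. As one possible illustration only, a per-language-pair split could look like the sketch below. The 10% test fraction, the fixed seed, the tuple layout, and the `split_by_pair` helper name are all assumptions made for this example, not the authors' procedure.

```python
import random
from collections import defaultdict


def split_by_pair(examples, test_fraction=0.1, seed=0):
    """Split parallel examples into a training list and per-pair test sets.

    `examples` is an iterable of tuples
    (source_lang, target_lang, source_text, target_text).
    This layout is an assumption for illustration.
    """
    # Group examples by (source_lang, target_lang).
    by_pair = defaultdict(list)
    for ex in examples:
        by_pair[(ex[0], ex[1])].append(ex)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], {}
    for pair, items in sorted(by_pair.items()):
        rng.shuffle(items)
        # Hold out at least one example per pair when the pair has more than one.
        n_test = max(1, int(len(items) * test_fraction)) if len(items) > 1 else 0
        test[pair] = items[:n_test]
        train.extend(items[n_test:])
    return train, test
```

Keeping a separate test set per pair makes it possible to report quality for each direction individually, which matters because performance may differ between language pairs.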

Training Procedure

Fine-tuning was performed on the QWEN 2.5 7B base model using custom datasets for the specific language pairs.

Supported Languages

The model supports translation from the following languages to Korean and Uzbek:

  • Uzbek (uz)
  • Russian (ru)
  • Thai (th)
  • Chinese (Simplified) (zh)
  • Chinese (Traditional) (zh-tw, zh-hant)
  • Bengali (bn)
  • Mongolian (mn)
  • Indonesian (id)
  • Nepali (ne)
  • English (en)
  • Khmer (km)
  • Portuguese (pt)
  • Sinhala (si)
  • Korean (ko)
  • Tagalog (tl)
  • Myanmar (my)
  • Vietnamese (vi)
  • Japanese (ja)
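Because the source languages are open-ended while the targets are fixed to Korean and Uzbek, it can help to validate a language pair before building a prompt. The following is a minimal sketch: the `SUPPORTED_SOURCES` table, the `TARGETS` set, and the `check_pair` helper are hypothetical names introduced here and are not part of the model repository. The codes are copied from the list above; rejecting identical source and target codes is an assumption.

```python
# Hypothetical lookup table built from the language list in this card.
SUPPORTED_SOURCES = {
    "uz": "Uzbek", "ru": "Russian", "th": "Thai",
    "zh": "Chinese (Simplified)",
    "zh-tw": "Chinese (Traditional)", "zh-hant": "Chinese (Traditional)",
    "bn": "Bengali", "mn": "Mongolian", "id": "Indonesian",
    "ne": "Nepali", "en": "English", "km": "Khmer",
    "pt": "Portuguese", "si": "Sinhala", "ko": "Korean",
    "tl": "Tagalog", "my": "Myanmar", "vi": "Vietnamese",
    "ja": "Japanese",
}
TARGETS = {"ko", "uz"}  # the card lists Korean and Uzbek as the only targets


def check_pair(source_lang: str, target_lang: str) -> tuple[str, str]:
    """Normalize a language pair and raise ValueError if it is not supported."""
    src = source_lang.strip().lower()
    tgt = target_lang.strip().lower()
    if src not in SUPPORTED_SOURCES:
        raise ValueError(f"unsupported source language: {source_lang!r}")
    if tgt not in TARGETS:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    if src == tgt:
        # Assumption: translating a language into itself is treated as a caller error.
        raise ValueError("source and target languages must differ")
    return src, tgt
```

For example, `check_pair("EN", "ko")` returns `("en", "ko")`, while `check_pair("fr", "ko")` raises `ValueError` because French is not in the list.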

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "FINGU-AI/QWEN2.5-7B-Bnk-5e"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Qwen2.5 is a decoder-only model, so it is loaded as a causal LM rather than a seq2seq model.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to("cuda")

# Example usage
source_text = "Hello, how are you?"
source_lang = "en"
target_lang = "ko"  # or "uz" for Uzbek

messages = [
    {"role": "system", "content": f"Translate {source_lang} to {target_lang} word by word correctly."},
    {"role": "user", "content": source_text},
]

# Apply the chat template and move the prompt to the same device as the model
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# max_new_tokens bounds only the generated translation, not the prompt
outputs = model.generate(input_ids, max_new_tokens=100)
# Drop the prompt tokens and decode only the newly generated part
response = outputs[0][input_ids.shape[-1]:]
translated_text = tokenizer.decode(response, skip_special_tokens=True)
print(translated_text)
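To translate several sentences with the same prompt format, the message construction can be factored into a small function. This is a sketch that assumes the system prompt shown above is the one the model expects. `build_messages` is a hypothetical helper name, not part of `transformers` or the model repository. It uses only the standard library, and each returned list can be passed to `tokenizer.apply_chat_template` in the same way as `messages` in the snippet above.

```python
def build_messages(source_text: str, source_lang: str, target_lang: str) -> list[dict]:
    """Build chat messages in the format used by the usage snippet."""
    if not source_text.strip():
        raise ValueError("source_text must not be empty")
    return [
        {
            "role": "system",
            "content": f"Translate {source_lang} to {target_lang} word by word correctly.",
        },
        {"role": "user", "content": source_text},
    ]


# Build one message list per sentence; each list is a separate prompt.
sentences = ["Hello, how are you?", "Where is the station?"]
batch = [build_messages(s, "en", "ko") for s in sentences]
print(batch[0][0]["content"])  # prints: Translate en to ko word by word correctly.
```

Keeping the prompt text in one place avoids the system and user messages drifting apart when the source or target language changes between calls.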

Performance

Limitations

  • The model's performance may vary across different language pairs and domains.
  • It may struggle with very colloquial or highly specialized text.
  • The model may not always capture cultural nuances or context-dependent meanings accurately.

Ethical Considerations

  • The model should not be used for generating or propagating harmful, biased, or misleading content.
  • Users should be aware of potential biases in the training data that may affect translations.
  • The model's outputs should not be considered as certified translations for official or legal purposes without human verification.

Citation

@misc{fingu2024qwen25,
  author = {FINGU AI and AI Team},
  title = {QWEN2.5-7B-Bnk-7e: A Multilingual Translation Model},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/FINGU-AI/QWEN2.5-7B-Bnk-5e}}
}