---
library_name: peft
base_model:
- beomi/Llama-3-Open-Ko-8B
license: mit
datasets:
- traintogpb/aihub-mmt-integrated-prime-base-300k
language:
- en
- ko
- ja
- zh
pipeline_tag: translation
---
### Pretrained LM
- [beomi/Llama-3-Open-Ko-8B](https://huggingface.co/beomi/Llama-3-Open-Ko-8B) (MIT License)

### Training Dataset
- [traintogpb/aihub-mmt-integrated-prime-base-300k](https://huggingface.co/datasets/traintogpb/aihub-mmt-integrated-prime-base-300k)
- Supports Korean <-> English / Japanese / Chinese translation (Korean-centered: Korean must be either the source or the target language)

### Prompt
- Template:
  ```python
    # one of 'src_lang' and 'tgt_lang' should be "한국어"
    src_lang = "English" # English, 한국어, 日本語, 中文
    tgt_lang = "한국어" # English, 한국어, 日本語, 中文
    text = "New era, same empire. T1 is your 2024 Worlds champion!"
  
    # task part
    task_xml_dict = {
      'head': "<task>",
      'body': f"Translate the source sentence from {src_lang} to {tgt_lang}.\nBe sure to reflect the guidelines below when translating.",
      'tail': "</task>"
    }
    task = f"{task_xml_dict['head']}\n{task_xml_dict['body']}\n{task_xml_dict['tail']}"
  
    # instruction part
    instruction_xml_dict = {
      'head': "<instruction>",
      'body': ["Translate without any condition."],
      'tail': "</instruction>"
    }
    instruction_xml_body = '\n'.join([f'- {body}' for body in instruction_xml_dict['body']])
    instruction = f"{instruction_xml_dict['head']}\n{instruction_xml_body}\n{instruction_xml_dict['tail']}"
  
    # translation part
    src_xml_dict = {
      'head': f"<source><{src_lang}>",
      'body': text.strip(),
      'tail': f"</{src_lang}></source>"
    }
    tgt_xml_dict = {
      'head': f"<target><{tgt_lang}>",
    }
    src = f"{src_xml_dict['head']}\n{src_xml_dict['body']}\n{src_xml_dict['tail']}"
    tgt = f"{tgt_xml_dict['head']}\n"
    translation_xml_dict = {
      'head': "<translation>",
      'body': f"{src}\n{tgt}",
    }
    translation = f"{translation_xml_dict['head']}\n{translation_xml_dict['body']}"
  
    # final prompt
    prompt = f"{task}\n\n{instruction}\n\n{translation}"
  ```

- Example Input:
  ```
  <task>
  Translate the source sentence from English to 한국어.
  Be sure to reflect the guidelines below when translating.
  </task>

  <instruction>
  - Translate without any condition.
  </instruction>

  <translation>
  <source><English>
  New era, same empire. T1 is your 2024 Worlds champion!
  </English></source>
  <target><한국어>
  ```

- Expected Output:
  ```
  새로운 시대, 여전한 왕조. 티원이 2024 월즈의 챔피언입니다!
  </한국어></target>
  </translation>
  ```
  The model generates the closing XML tags itself.
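
The template above can be wrapped in a single function so the prompt is built the same way every time. This is a minimal sketch, not part of the original card: `build_prompt`, `SUPPORTED_LANGS`, and the `guidelines` parameter are hypothetical names, and the language check only encodes the Korean-centered constraint stated in the template comment.

```python
# Hypothetical helper: reproduces the prompt template from this card as a function.
SUPPORTED_LANGS = ("English", "한국어", "日本語", "中文")


def build_prompt(text, src_lang, tgt_lang,
                 guidelines=("Translate without any condition.",)):
    """Return the XML-style prompt, ending right after the opening target tag."""
    if src_lang not in SUPPORTED_LANGS or tgt_lang not in SUPPORTED_LANGS:
        raise ValueError(f"languages must be one of {SUPPORTED_LANGS}")
    if src_lang == tgt_lang or "한국어" not in (src_lang, tgt_lang):
        raise ValueError("exactly one of src_lang and tgt_lang must be '한국어'")

    # task part
    task = (
        "<task>\n"
        f"Translate the source sentence from {src_lang} to {tgt_lang}.\n"
        "Be sure to reflect the guidelines below when translating.\n"
        "</task>"
    )

    # instruction part: one bullet per guideline
    bullets = "\n".join(f"- {g}" for g in guidelines)
    instruction = f"<instruction>\n{bullets}\n</instruction>"

    # translation part: the target tag is left open for the model to complete
    translation = (
        "<translation>\n"
        f"<source><{src_lang}>\n{text.strip()}\n</{src_lang}></source>\n"
        f"<target><{tgt_lang}>\n"
    )

    return f"{task}\n\n{instruction}\n\n{translation}"


if __name__ == "__main__":
    print(build_prompt(
        "New era, same empire. T1 is your 2024 Worlds champion!",
        src_lang="English",
        tgt_lang="한국어",
    ))
```

Called with the example sentence, this yields the same text as the Example Input above, including the trailing newline after `<target><한국어>`.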

### Training
- Trained with a LoRA adapter
  - PLM: bfloat16
  - Adapter: bfloat16
  - Applied to all linear layers (around 2.05%)
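
The 2.05% figure is most naturally read as the share of trainable (adapter) parameters. The arithmetic below is a rough check under stated assumptions, not the author's training configuration: it uses the publicly documented Llama-3-8B dimensions, assumes "all linear layers" means the seven projection matrices in each decoder block (with `lm_head` excluded), and assumes a LoRA rank of 64. The rank is not given in this card; it is used here only because it is consistent with the stated percentage.

```python
# Publicly documented Llama-3-8B architecture dimensions.
HIDDEN = 4096
INTERMEDIATE = 14336
NUM_LAYERS = 32
KV_DIM = 1024   # 8 key/value heads x 128 head dim (grouped-query attention)
VOCAB = 128256

RANK = 64       # ASSUMPTION: not stated in this card

# (in_features, out_features) of the seven linear projections in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank):
    """LoRA adds A (rank x in) and B (out x rank) to every adapted layer."""
    per_block = sum(rank * (i + o) for i, o in LINEAR_SHAPES.values())
    return per_block * NUM_LAYERS


def base_param_count():
    """Parameters of the frozen base model."""
    # Seven projections plus two RMSNorm weight vectors per decoder block.
    per_block = sum(i * o for i, o in LINEAR_SHAPES.values()) + 2 * HIDDEN
    embeddings = 2 * VOCAB * HIDDEN  # input embeddings + untied lm_head
    return per_block * NUM_LAYERS + embeddings + HIDDEN  # + final RMSNorm


lora = lora_param_count(RANK)
total = base_param_count() + lora
print(f"{lora:,} / {total:,} = {100 * lora / total:.2f}%")
# prints: 167,772,160 / 8,198,033,408 = 2.05%
```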

### Usage (IMPORTANT)
- Remove the EOS token from the end of the tokenized prompt before calling `generate`.
  ```python
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # MODEL
    model_name = 'beomi/Llama-3-Open-Ko-8B'
    adapter_name = 'traintogpb/llama-3-mmt-xml-it-sft-adapter'

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        max_length=4096,
        attn_implementation='flash_attention_2',
        torch_dtype=torch.bfloat16,
    )
    # The adapter repo id is passed positionally (it is the `model_id` argument)
    model = PeftModel.from_pretrained(
        model,
        adapter_name,
        torch_dtype=torch.bfloat16,
    )
    model.to('cuda') # flash_attention_2 only runs on a CUDA device

    tokenizer = AutoTokenizer.from_pretrained(adapter_name)
    tokenizer.pad_token_id = 128002 # eos_token_id and pad_token_id should be different

    text = "New era, same empire. T1 is your 2024 Worlds champion!"
    input_prompt = "<task> ~ <target><{tgt_lang}>" # placeholder: build the full prompt with the template above
    inputs = tokenizer(input_prompt, max_length=2000, truncation=True, return_tensors='pt')

    # Drop a trailing EOS token so the model continues the prompt instead of stopping
    if inputs['input_ids'][0][-1] == tokenizer.eos_token_id:
        inputs['input_ids'] = inputs['input_ids'][0][:-1].unsqueeze(dim=0)
        inputs['attention_mask'] = inputs['attention_mask'][0][:-1].unsqueeze(dim=0)
    inputs = inputs.to(model.device)

    outputs = model.generate(**inputs, max_length=2000, eos_token_id=tokenizer.eos_token_id)

    # Decode only the newly generated tokens, not the prompt
    input_len = len(inputs['input_ids'].squeeze())
    translation = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)
    print(translation)
  ```
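
Because the model also emits the closing XML tags (see Expected Output), the decoded string usually still ends with `</한국어></target>` and `</translation>`. A minimal sketch of stripping them is shown below; `extract_translation` is a hypothetical helper, not part of the original card, and it assumes the translation itself never contains the closing target-language tag.

```python
def extract_translation(generated: str, tgt_lang: str) -> str:
    """Return the text before the closing target-language tag.

    If the closing tag is missing (e.g. generation was cut off by max_length),
    the whole string is returned, stripped of surrounding whitespace.
    """
    end_tag = f"</{tgt_lang}>"
    body = generated.split(end_tag, 1)[0]
    return body.strip()


if __name__ == "__main__":
    raw = (
        "새로운 시대, 여전한 왕조. 티원이 2024 월즈의 챔피언입니다!\n"
        "</한국어></target>\n"
        "</translation>"
    )
    print(extract_translation(raw, "한국어"))
```

Splitting on the first closing tag, rather than trimming a fixed suffix, keeps the helper tolerant of extra whitespace or missing outer tags in the generated text.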

### Framework versions

- PEFT 0.8.2