---
library_name: peft
base_model:
- beomi/Llama-3-Open-Ko-8B
license: mit
datasets:
- traintogpb/aihub-mmt-integrated-prime-base-300k
language:
- en
- ko
- ja
- zh
pipeline_tag: translation
---
### Pretrained LM
- [beomi/Llama-3-Open-Ko-8B](https://huggingface.co/beomi/Llama-3-Open-Ko-8B) (MIT License)
### Training Dataset
- [traintogpb/aihub-mmt-integrated-prime-base-300k](https://huggingface.co/datasets/traintogpb/aihub-mmt-integrated-prime-base-300k)
- Supports Korean-centered translation: Korean <-> English / Japanese / Chinese (one side of every pair is Korean)
### Prompt
- Template:
```python
# either 'src_lang' or 'tgt_lang' must be "한국어"
src_lang = "English"  # English, 한국어, 日本語, 中文
tgt_lang = "한국어"  # English, 한국어, 日本語, 中文
text = "New era, same empire. T1 is your 2024 Worlds champion!"

# task part
task_xml_dict = {
    'head': "<task>",
    'body': f"Translate the source sentence from {src_lang} to {tgt_lang}.\nBe sure to reflect the guidelines below when translating.",
    'tail': "</task>"
}
task = f"{task_xml_dict['head']}\n{task_xml_dict['body']}\n{task_xml_dict['tail']}"

# instruction part
instruction_xml_dict = {
    'head': "<instruction>",
    'body': ["Translate without any condition."],
    'tail': "</instruction>"
}
instruction_xml_body = '\n'.join([f'- {body}' for body in instruction_xml_dict['body']])
instruction = f"{instruction_xml_dict['head']}\n{instruction_xml_body}\n{instruction_xml_dict['tail']}"

# translation part
src_xml_dict = {
    'head': f"<source><{src_lang}>",
    'body': text.strip(),
    'tail': f"</{src_lang}></source>"
}
# the target tag is left open so that the model completes it
tgt_xml_dict = {
    'head': f"<target><{tgt_lang}>",
}
src = f"{src_xml_dict['head']}\n{src_xml_dict['body']}\n{src_xml_dict['tail']}"
tgt = f"{tgt_xml_dict['head']}\n"
translation_xml_dict = {
    'head': "<translation>",
    'body': f"{src}\n{tgt}",
}
translation = f"{translation_xml_dict['head']}\n{translation_xml_dict['body']}"

# final prompt
prompt = f"{task}\n\n{instruction}\n\n{translation}"
```
- Example Input:
```
<task>
Translate the source sentence from English to 한국어.
Be sure to reflect the guidelines below when translating.
</task>

<instruction>
- Translate without any condition.
</instruction>

<translation>
<source><English>
New era, same empire. T1 is your 2024 Worlds champion!
</English></source>
<target><한국어>
```
- Expected Output:
```
새로운 시대, 여전한 왕조. 티원이 2024 월즈의 챔피언입니다!
</한국어></target>
</translation>
```
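The template above can be wrapped in a small function so that the same prompt is built for every language pair. This is a minimal sketch rather than the author's own helper: the name `build_prompt` and its `guidelines` parameter are hypothetical, and it assumes the language names are passed exactly as they appear in the tags (`English`, `한국어`, `日本語`, `中文`).

```python
def build_prompt(src_lang: str, tgt_lang: str, text: str,
                 guidelines: tuple[str, ...] = ("Translate without any condition.",)) -> str:
    """Assemble the XML-style prompt (hypothetical helper mirroring the template)."""
    task = (
        "<task>\n"
        f"Translate the source sentence from {src_lang} to {tgt_lang}.\n"
        "Be sure to reflect the guidelines below when translating.\n"
        "</task>"
    )
    # One "- ..." line per guideline, as in the instruction part of the template.
    bullet_lines = "\n".join(f"- {g}" for g in guidelines)
    instruction = f"<instruction>\n{bullet_lines}\n</instruction>"
    # The target tag is left open; the model is expected to fill it in and close it.
    translation = (
        "<translation>\n"
        f"<source><{src_lang}>\n{text.strip()}\n</{src_lang}></source>\n"
        f"<target><{tgt_lang}>\n"
    )
    return f"{task}\n\n{instruction}\n\n{translation}"


prompt = build_prompt("English", "한국어",
                      "New era, same empire. T1 is your 2024 Worlds champion!")
print(prompt)
```

The printed prompt should match the example input above, including the blank lines between the three parts and the trailing newline after the open target tag.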
### Training
- Trained with a LoRA adapter
- PLM: bfloat16
- Adapter: bfloat16
- LoRA applied to all linear layers (around 2.05% trainable parameters)
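As a rough sanity check on the "around 2.05%" figure, the arithmetic below counts the LoRA parameters added to the linear projections of each decoder block. It is a sketch under assumptions that this card does not state: the base model is taken to have the standard Llama-3-8B dimensions, the output head is assumed not to be adapted, and the rank `r = 64` is a guess chosen only because it reproduces the reported ratio.

```python
# Assumed dimensions of the standard Llama-3-8B architecture (not stated in this card).
HIDDEN, INTERMEDIATE, KV_DIM = 4096, 14336, 1024
NUM_LAYERS, VOCAB = 32, 128256

# (in_features, out_features) of the seven linear projections in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def base_params() -> int:
    """Parameter count of the frozen base model under the assumed dimensions."""
    per_block = sum(i * o for i, o in LINEAR_SHAPES.values()) + 2 * HIDDEN  # two norm weights
    return NUM_LAYERS * per_block + 2 * VOCAB * HIDDEN + HIDDEN  # embeddings, LM head, final norm


def lora_params(rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    return NUM_LAYERS * sum(rank * (i + o) for i, o in LINEAR_SHAPES.values())


def trainable_percent(rank: int) -> float:
    """Trainable share of all parameters (base plus adapter), in percent."""
    lora = lora_params(rank)
    return 100 * lora / (base_params() + lora)


print(f"{trainable_percent(64):.2f}%")
```

Under these assumptions the adapter has about 168M parameters against roughly 8.03B frozen ones, which lands close to the reported ratio; a different rank or target set would give a different number.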
### Usage (IMPORTANT)
- Remove the EOS token from the end of the tokenized prompt, if the tokenizer appended one, before calling `generate`.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# MODEL
model_name = 'beomi/Llama-3-Open-Ko-8B'
adapter_name = 'traintogpb/llama-3-mmt-xml-it-sft-adapter'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    max_length=4000,
    attn_implementation='flash_attention_2',  # FlashAttention-2 requires a CUDA GPU
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(
    model,
    adapter_name,  # passed positionally as the adapter's model_id
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(adapter_name)
tokenizer.pad_token_id = 128002  # eos_token_id and pad_token_id should be different

text = "New era, same empire. T1 is your 2024 Worlds champion!"
input_prompt = "<task> ~ <target><{tgt_lang}>"  # placeholder: use the full prompt built with the template above
inputs = tokenizer(input_prompt, max_length=2000, truncation=True, return_tensors='pt')

# drop a trailing EOS token so the model continues the prompt instead of seeing a finished sequence
if inputs['input_ids'][0][-1] == tokenizer.eos_token_id:
    inputs['input_ids'] = inputs['input_ids'][0][:-1].unsqueeze(dim=0)
    inputs['attention_mask'] = inputs['attention_mask'][0][:-1].unsqueeze(dim=0)

outputs = model.generate(**inputs, max_length=2000, eos_token_id=tokenizer.eos_token_id)

# decode only the newly generated tokens, not the prompt
input_len = len(inputs['input_ids'].squeeze())
translation = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)
print(translation)
```
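Going by the expected output shown in the Prompt section, the decoded text may still end with the closing `</{tgt_lang}></target>` and `</translation>` tags, since they are ordinary text rather than special tokens. A minimal post-processing sketch follows; `extract_translation` is a hypothetical helper, not part of the adapter or of any library.

```python
def extract_translation(generated: str, tgt_lang: str) -> str:
    """Keep only the text before the closing target tag (hypothetical helper)."""
    closing_tag = f"</{tgt_lang}></target>"
    # partition() returns the whole string as `body` when the tag is absent,
    # so untagged output passes through unchanged apart from whitespace trimming.
    body, _, _ = generated.partition(closing_tag)
    return body.strip()


raw_output = (
    "새로운 시대, 여전한 왕조. 티원이 2024 월즈의 챔피언입니다!\n"
    "</한국어></target>\n"
    "</translation>"
)
print(extract_translation(raw_output, "한국어"))
```

Cutting at the first closing tag also discards anything the model might generate after `</translation>` if generation does not stop there.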
### Framework versions
- PEFT 0.8.2