---
license: mit
language:
- ja
- ko
pipeline_tag: translation
inference: false
---

# Japanese to Korean translator

A Japanese-to-Korean translation model based on [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoder-decoder), combining a [bert-japanese](https://huggingface.co/cl-tohoku/bert-base-japanese) encoder with a [kogpt2](https://github.com/SKT-AI/KoGPT2) decoder.
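
For context, the sketch below shows roughly how such an encoder-decoder pair is assembled with the `transformers` API. This is an illustration, not the exact training code; the checkpoint names match those used in the inference example below.

```python
from transformers import EncoderDecoderModel

# Illustrative sketch: pair a pretrained Japanese BERT encoder with a
# pretrained Korean GPT-2 decoder. The cross-attention weights are newly
# initialized, so the combined model still has to be fine-tuned.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "cl-tohoku/bert-base-japanese-v2",  # encoder
    "skt/kogpt2-base-v2",               # decoder
)
```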

# Usage
## Demo
Please visit https://huggingface.co/spaces/sappho192/aihub-ja-ko-translator-demo

## Dependencies (PyPI)

- torch
- transformers
- fugashi
- unidic-lite

## Inference

```python
from transformers import (
    EncoderDecoderModel,
    PreTrainedTokenizerFast,
    BertJapaneseTokenizer,
)

import torch

encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

# The Japanese (source) and Korean (target) sides use different tokenizers.
src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)

model = EncoderDecoderModel.from_pretrained("sappho192/aihub-ja-ko-translator")

text = "εˆγ‚γΎγ—γ¦γ€‚γ‚ˆγ‚γ—γγŠι‘˜γ„γ—γΎγ™γ€‚"

def translate(text_src):
    # Tokenize the Japanese input; only input_ids are needed for generation.
    embeddings = src_tokenizer(text_src, return_attention_mask=False,
                               return_token_type_ids=False, return_tensors='pt')
    # Generate the Korean output, dropping the leading BOS and trailing EOS tokens.
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg

print(translate(text))
```
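
The snippet above runs on CPU. If a GPU is available, the model and inputs can be moved onto it; a minimal sketch reusing the objects defined above (`translate_gpu` is a hypothetical helper name, not part of the original card):

```python
# Sketch: run the same translation on a GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def translate_gpu(text_src):
    embeddings = src_tokenizer(text_src, return_attention_mask=False,
                               return_token_type_ids=False, return_tensors='pt')
    embeddings = {k: v.to(device) for k, v in embeddings.items()}  # move inputs to the device
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    return trg_tokenizer.decode(output.cpu())
```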

# Dataset

This model was trained on datasets from 'The Open AI Dataset Project (AI-Hub, South Korea)'.  
All information about the data can be accessed through 'AI-Hub ([aihub.or.kr](https://www.aihub.or.kr))'.  
(**In order for a corporation, organization, or individual located outside of Korea to use the AI data, a separate agreement is required** with the performing organization and the Korea National Information Society Agency (NIA). Exporting the AI data outside the country likewise requires a separate agreement with the performing organization and the NIA. [Link](https://aihub.or.kr/intrcn/guid/usagepolicy.do?currMenu=151&topMenu=105))  

이 λͺ¨λΈμ€ κ³Όν•™κΈ°μˆ μ •λ³΄ν†΅μ‹ λΆ€μ˜ μž¬μ›μœΌλ‘œ ν•œκ΅­μ§€λŠ₯μ •λ³΄μ‚¬νšŒμ§„ν₯μ›μ˜ 지원을 λ°›μ•„ κ΅¬μΆ•λœ 데이터셋을 ν™œμš©ν•˜μ—¬ μˆ˜ν–‰λœ μ—°κ΅¬μž…λ‹ˆλ‹€.  
λ³Έ λͺ¨λΈμ— ν™œμš©λœ λ°μ΄ν„°λŠ” AI ν—ˆλΈŒ([aihub.or.kr](https://www.aihub.or.kr))μ—μ„œ λ‹€μš΄λ‘œλ“œ λ°›μœΌμ‹€ 수 μžˆμŠ΅λ‹ˆλ‹€.  
(**ꡭ외에 μ†Œμž¬ν•˜λŠ” 법인, 단체 λ˜λŠ” 개인이 AI데이터 등을 μ΄μš©ν•˜κΈ° μœ„ν•΄μ„œλŠ”** μˆ˜ν–‰κΈ°κ΄€ λ“± 및 ν•œκ΅­μ§€λŠ₯μ •λ³΄μ‚¬νšŒμ§„ν₯원과 λ³„λ„λ‘œ ν•©μ˜κ°€ ν•„μš”ν•©λ‹ˆλ‹€.  
**λ³Έ AI데이터 λ“±μ˜ κ΅­μ™Έ λ°˜μΆœμ„ μœ„ν•΄μ„œλŠ”** μˆ˜ν–‰κΈ°κ΄€ λ“± 및 ν•œκ΅­μ§€λŠ₯μ •λ³΄μ‚¬νšŒμ§„ν₯원과 λ³„λ„λ‘œ ν•©μ˜κ°€ ν•„μš”ν•©λ‹ˆλ‹€. [[좜처](https://aihub.or.kr/intrcn/guid/usagepolicy.do?currMenu=151&topMenu=105)])

## Dataset list

The training dataset was created by merging the following AI-Hub sub-datasets:  

- 027. μΌμƒμƒν™œ 및 ꡬ어체 ν•œ-쀑, ν•œ-일 λ²ˆμ—­ 병렬 λ§λ­‰μΉ˜ 데이터 (daily-life and colloquial Korean-Chinese / Korean-Japanese parallel translation corpus) [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=546)]
- 053. ν•œκ΅­μ–΄-λ‹€κ΅­μ–΄(μ˜μ–΄ μ œμ™Έ) λ²ˆμ—­ λ§λ­‰μΉ˜(κΈ°μˆ κ³Όν•™) (Korean-multilingual translation corpus excluding English, science and technology) [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71493)]
- 054. ν•œκ΅­μ–΄-λ‹€κ΅­μ–΄ λ²ˆμ—­ λ§λ­‰μΉ˜(κΈ°μ΄ˆκ³Όν•™) (Korean-multilingual translation corpus, basic science) [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71496)]
- 055. ν•œκ΅­μ–΄-λ‹€κ΅­μ–΄ λ²ˆμ—­ λ§λ­‰μΉ˜ (인문학) (Korean-multilingual translation corpus, humanities) [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=71498)]
- ν•œκ΅­μ–΄-일본어 λ²ˆμ—­ λ§λ­‰μΉ˜ (Korean-Japanese translation corpus) [[Link](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&dataSetSn=127)]

To reproduce the merged dataset, you can use the code in the following repository:  
https://github.com/sappho192/aihub-translation-dataset