sappho192 committed
Commit bd96d3b • 1 Parent(s): a547c2c

Update README.md

Files changed (1)
  1. README.md +37 -0
README.md CHANGED
@@ -8,6 +8,43 @@ pipeline_tag: translation
 
 # Japanese to Korean translator
 
+ A Japanese-to-Korean translation model based on [EncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/encoder-decoder), pairing a [bert-japanese](https://huggingface.co/cl-tohoku/bert-base-japanese) encoder with a [kogpt2](https://github.com/SKT-AI/KoGPT2) decoder.
+
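+ The encoder/decoder pairing above follows the standard `EncoderDecoderModel` recipe in Transformers. The snippet below is a rough sketch of that assembly step, not necessarily the exact training setup used for this checkpoint:
+
+ ```Python
+ from transformers import EncoderDecoderModel
+
+ # Sketch: combine a pretrained Japanese BERT encoder with a pretrained KoGPT2 decoder.
+ # The decoder's cross-attention weights are newly initialized, so the combined model
+ # still has to be fine-tuned on Japanese-Korean parallel data before it can translate.
+ model = EncoderDecoderModel.from_encoder_decoder_pretrained(
+     "cl-tohoku/bert-base-japanese-v2",  # encoder
+     "skt/kogpt2-base-v2",               # decoder
+ )
+ ```
+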
+ # Usage
+
+ ## Inference
+
+ ```Python
+ from transformers import (
+     EncoderDecoderModel,
+     PreTrainedTokenizerFast,
+     BertJapaneseTokenizer,
+ )
+
+ import torch
+
+ encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
+ decoder_model_name = "skt/kogpt2-base-v2"
+
+ # Source (Japanese) and target (Korean) tokenizers
+ src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
+ trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)
+
+ # Fine-tuned Japanese-to-Korean encoder-decoder model
+ model = EncoderDecoderModel.from_pretrained("sappho192/aihub-ja-ko-translator")
+
+ text = "初めまして。よろしくお願いします。"  # "Nice to meet you. I look forward to working with you."
+
+ def translate(text_src):
+     # Tokenize the Japanese source text into a batch of input_ids
+     embeddings = src_tokenizer(text_src, return_attention_mask=False,
+                                return_token_type_ids=False, return_tensors='pt')
+     # Copy into a plain dict (convenient if the tensors need to be moved to another device)
+     embeddings = {k: v for k, v in embeddings.items()}
+     # Generate Korean token ids and drop the leading BOS and trailing EOS tokens
+     output = model.generate(**embeddings)[0, 1:-1]
+     text_trg = trg_tokenizer.decode(output.cpu())
+     return text_trg
+
+ print(translate(text))
+ ```
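+
+ The snippet above runs on CPU. If a GPU is available, something like the following should work (a sketch; the helper name is only illustrative, adjust to your setup):
+
+ ```Python
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+
+ def translate_on_device(text_src):
+     embeddings = src_tokenizer(text_src, return_attention_mask=False,
+                                return_token_type_ids=False, return_tensors='pt')
+     # Move every input tensor to the same device as the model before generating
+     embeddings = {k: v.to(device) for k, v in embeddings.items()}
+     output = model.generate(**embeddings)[0, 1:-1]
+     return trg_tokenizer.decode(output.cpu())
+ ```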
+
+ # Dataset
+
This model used datasets from 'The Open AI Dataset Project (AI-Hub, South Korea)'.
All data information can be accessed through 'AI-Hub ([aihub.or.kr](https://www.aihub.or.kr))'.
(**In order for a corporation, organization, or individual located outside of Korea to use AI data, etc., a separate agreement is required** with the performing organization and the Korea National Information Society Agency (NIA). In order to export AI data, etc. outside the country, a separate agreement is required with the performing organization and the NIA. [Link](https://aihub.or.kr/intrcn/guid/usagepolicy.do?currMenu=151&topMenu=105))