Files changed (1)
  1. README.md +64 -54
README.md CHANGED
@@ -11,87 +11,97 @@ license: apache-2.0
11
 
12
  ### zho-eng
13
 
14
- * source group: Chinese
15
- * target group: English
16
- * OPUS readme: [zho-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/zho-eng/README.md)
17
-
18
- * model: transformer
19
- * source language(s): cjy_Hans cjy_Hant cmn cmn_Hans cmn_Hant gan lzh lzh_Hans nan wuu yue yue_Hans yue_Hant
20
- * target language(s): eng
21
- * model: transformer
22
- * pre-processing: normalization + SentencePiece (spm32k,spm32k)
23
- * download original weights: [opus-2020-07-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.zip)
24
- * test set translations: [opus-2020-07-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.test.txt)
25
- * test set scores: [opus-2020-07-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.eval.txt)
26
-
27
- ## Benchmarks
28
-
29
- | testset | BLEU | chr-F |
30
- |-----------------------|-------|-------|
31
- | Tatoeba-test.zho.eng | 36.1 | 0.548 |
32
-
33
 
34
- ### System Info:
35
- - hf_name: zho-eng
36
 
37
- - source_languages: zho
38
 
39
- - target_languages: eng
40
 
41
- - opus_readme_url: https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/zho-eng/README.md
42
 
43
- - original_repo: Tatoeba-Challenge
44
 
45
- - tags: ['translation']
46
 
47
- - languages: ['zh', 'en']
48
 
49
- - src_constituents: {'cmn_Hans', 'nan', 'nan_Hani', 'gan', 'yue', 'cmn_Kana', 'yue_Hani', 'wuu_Bopo', 'cmn_Latn', 'yue_Hira', 'cmn_Hani', 'cjy_Hans', 'cmn', 'lzh_Hang', 'lzh_Hira', 'cmn_Hant', 'lzh_Bopo', 'zho', 'zho_Hans', 'zho_Hant', 'lzh_Hani', 'yue_Hang', 'wuu', 'yue_Kana', 'wuu_Latn', 'yue_Bopo', 'cjy_Hant', 'yue_Hans', 'lzh', 'cmn_Hira', 'lzh_Yiii', 'lzh_Hans', 'cmn_Bopo', 'cmn_Hang', 'hak_Hani', 'cmn_Yiii', 'yue_Hant', 'lzh_Kana', 'wuu_Hani'}
50
 
51
- - tgt_constituents: {'eng'}
52
 
53
- - src_multilingual: False
54
 
55
- - tgt_multilingual: False
56
 
57
- - prepro: normalization + SentencePiece (spm32k,spm32k)
58
 
59
- - url_model: https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.zip
60
-
61
- - url_test_set: https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.test.txt
62
-
63
- - src_alpha3: zho
 
64
 
65
- - tgt_alpha3: eng
66
 
67
- - short_pair: zh-en
68
 
69
- - chrF2_score: 0.5479999999999999
70
 
71
- - bleu: 36.1
72
 
73
- - brevity_penalty: 0.948
74
 
75
- - ref_len: 82826.0
76
 
77
- - src_name: Chinese
78
 
79
- - tgt_name: English
80
 
81
- - train_date: 2020-07-17
 
 
82
 
83
- - src_alpha2: zh
84
 
85
- - tgt_alpha2: en
86
 
87
- - prefer_old: False
88
 
89
- - long_pair: zho-eng
 
90
 
91
- - helsinki_git_sha: 480fcbe0ee1bf4774bcbe6226ad9f58e63f6c535
92
 
93
- - transformers_git_sha: 2207e5d8cb224e954a7cba69fa4ac2309e9ff30b
 
94
 
95
- - port_machine: brutasse
96
 
97
- - port_time: 2020-08-21-14:41
 
11
 
12
  ### zho-eng
13
 
14
+ ## Table of Contents
15
+ - [Model Details](#model-details)
16
+ - [Uses](#uses)
17
+ - [Risks, Limitations and Biases](#risks-limitations-and-biases)
18
+ - [Training](#training)
19
+ - [Evaluation](#evaluation)
20
+ - [Citation Information](#citation-information)
21
+ - [How to Get Started With the Model](#how-to-get-started-with-the-model)
22
 
23
+ ## Model Details
24
+ - **Model Description:**
25
+ - **Developed by:** Language Technology Research Group at the University of Helsinki
26
+ - **Model Type:** Translation
27
+ - **Language(s):**
28
+ - Source Language: Chinese
29
+ - Target Language: English
30
+ - **License:** Apache-2.0
31
+ - **Resources for more information:**
32
+ - [GitHub Repo](https://github.com/Helsinki-NLP/OPUS-MT-train)
33
 
 
34
 
35
+ ## Uses
36
 
37
+ #### Direct Use
38
 
39
+ This model can be used for translation and text-to-text generation.
40
 
 
41
 
42
+ ## Risks, Limitations and Biases
43
 
44
+ **CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**
45
 
46
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
47
 
48
+ Further details about the dataset for this model can be found in the OPUS readme: [zho-eng](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/zho-eng/README.md)
49
 
50
+ ## Training
51
 
52
+ #### System Information
53
+ * helsinki_git_sha: 480fcbe0ee1bf4774bcbe6226ad9f58e63f6c535
54
+ * transformers_git_sha: 2207e5d8cb224e954a7cba69fa4ac2309e9ff30b
55
+ * port_machine: brutasse
56
+ * port_time: 2020-08-21-14:41
57
+ * src_multilingual: False
58
+ * tgt_multilingual: False
59
 
60
+ #### Training Data
61
+ ##### Preprocessing
62
+ * pre-processing: normalization + SentencePiece (spm32k,spm32k)
63
+ * ref_len: 82826.0
64
+ * dataset: [opus](https://github.com/Helsinki-NLP/Opus-MT)
65
+ * download original weights: [opus-2020-07-17.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.zip)
66
 
67
+ * test set translations: [opus-2020-07-17.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.test.txt)
68
 
 
69
 
70
+ ## Evaluation
71
 
72
+ #### Results
73
 
74
+ * test set scores: [opus-2020-07-17.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/zho-eng/opus-2020-07-17.eval.txt)
75
 
76
+ * brevity_penalty: 0.948
77
 
 
78
 
79
+ ## Benchmarks
80
 
81
+ | testset | BLEU | chr-F |
82
+ |-----------------------|-------|-------|
83
+ | Tatoeba-test.zho.eng | 36.1 | 0.548 |
84
 
85
+ ## Citation Information
86
 
87
+ ```bibtex
88
+ @InProceedings{TiedemannThottingal:EAMT2020,
89
+ author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
90
+ title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
91
+ booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
92
+ year = {2020},
93
+ address = {Lisbon, Portugal}
94
+ }
95
+ ```
96
 
97
+ ## How to Get Started With the Model
98
 
99
+ ```python
100
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
101
 
102
+ tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
103
 
104
+ model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
105
+ ```
106
 
 
107
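The "How to Get Started With the Model" snippet added in this change stops once the tokenizer and model are loaded. As a minimal sketch of the remaining step, the block below runs a batch of sentences through the model. The helper names `load_zh_en` and `translate` are hypothetical and not part of the README or of `transformers`; the sketch also assumes that `transformers`, PyTorch, and `sentencepiece` are installed, and the `max_new_tokens` default is an arbitrary choice.

```python
from typing import List, Tuple


def load_zh_en(name: str = "Helsinki-NLP/opus-mt-zh-en") -> Tuple[object, object]:
    """Load the tokenizer and model named in the README (hypothetical helper)."""
    # Imported lazily so that translate() can be used without transformers installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    return tokenizer, model


def translate(texts: List[str], tokenizer, model, max_new_tokens: int = 128) -> List[str]:
    """Translate a batch of Chinese sentences to English (hypothetical helper)."""
    if not texts:
        return []
    # Pad so sentences of different lengths fit in one batch; truncate overlong input.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # generate() produces output token ids; max_new_tokens caps the output length.
    generated = model.generate(**batch, max_new_tokens=max_new_tokens)
    # Decode ids back to text, dropping padding and end-of-sequence tokens.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["我喜欢学习语言。"], *load_zh_en())` downloads the weights on first use and returns a list with one English string. Keeping the loading step separate from `translate` means the model is loaded once and reused across batches.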