shibing624 committed on
Commit
b2d8e3b
·
1 Parent(s): d881a76

Create README.md

Files changed (1)
  1. README.md +119 -0
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ language:
+ - zh
+ tags:
+ - bert
+ license: "apache-2.0"
+ ---
+
+ # Please use the 'Bert' classes (e.g. `BertTokenizer`, `BertForMaskedLM`) to load this model!
+
+
+ `macbert4csc-base-chinese` evaluated on SIGHAN15:
+
+ Sentence level: acc: 0.825492, precision: 0.993085, recall: 0.825376, f1: 0.901497
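
As a quick sanity check, the reported F1 can be recomputed from the precision and recall above. This is a minimal sketch that assumes the usual harmonic-mean definition of F1; the README itself does not state which formula was used.

```python
# Sentence-level scores reported above for SIGHAN15.
precision = 0.993085
recall = 0.825376

# Assumption: F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"f1 = {f1:.6f}")
```

Under that assumption, the recomputed value should agree with the reported f1 up to rounding.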
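+
+
+ ## Usage
+
+ This model is open-sourced as part of the Chinese text correction project [pycorrector](https://github.com/shibing624/pycorrector), which supports BERT models. It can be called as follows:
+
+ ```python
+ from pycorrector.macbert.macbert_corrector import MacBertCorrector
+
+ # Load the model and keep a handle to its correction method.
+ correct = MacBertCorrector("shibing624/macbert4csc-base-chinese").macbert_correct
+
+ # The input contains a deliberate typo: 新情 should be 心情.
+ result = correct('今天新情很好')
+ print(result)
+ ```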
+
+ You can also call the model through the official huggingface/transformers library:
+
+ ```python
+ import torch
+ from transformers import BertTokenizer, BertForMaskedLM
+
+ tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
+ model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
+
+ texts = ["今天心情很好", "你找到你最喜欢的工作,我也很高心。"]
+ # Inference only, so gradient tracking is not needed.
+ with torch.no_grad():
+     outputs = model(**tokenizer(texts, padding=True, return_tensors='pt'))
+ corrected_texts = []
+ for ids, text in zip(outputs.logits, texts):
+     # Take the most likely token at every position and join the characters.
+     _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
+     # Truncate to the input length to drop anything predicted past the original text.
+     corrected_texts.append(_text[:len(text)])
+
+ print(corrected_texts)
+ ```
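
The decoding step above takes the model output character by character, so punctuation or non-Chinese characters can come back altered. One way to guard against that is to accept a change only where both the source and the predicted character are Chinese ideographs. The sketch below illustrates the idea; `is_cjk` and `merge_correction` are hypothetical helpers written for this example, not functions from pycorrector or transformers, and the predicted string is made up for illustration.

```python
def is_cjk(ch: str) -> bool:
    """Return True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def merge_correction(original: str, predicted: str):
    """Keep a predicted character only where both sides are CJK ideographs.

    Returns the merged text and a list of (wrong, right, position) tuples.
    """
    merged, details = [], []
    for pos, src in enumerate(original):
        # Fall back to the source character if the prediction is too short.
        tgt = predicted[pos] if pos < len(predicted) else src
        if src != tgt and is_cjk(src) and is_cjk(tgt):
            merged.append(tgt)
            details.append((src, tgt, pos))
        else:
            merged.append(src)
    return "".join(merged), details


# Hypothetical prediction: the typo is fixed, but the full stop was altered.
text, errors = merge_correction("我也很高心。", "我也很高兴.")
print(text, errors)
```

The helper keeps the original punctuation and reports only the character-level substitution, which is often more useful downstream than the raw decoded string.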
+
+ ### Training datasets
+ #### SIGHAN+Wang271K Chinese correction dataset
+
+
+ | Dataset | Corpus | Download link | Archive size |
+ | :------- | :--------- | :---------: | :---------: |
+ | **`SIGHAN+Wang271K Chinese correction dataset`** | SIGHAN+Wang271K (270K samples) | [Baidu Netdisk (password: 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ)| 106M |
+ | **`Original SIGHAN dataset`** | SIGHAN13 14 15 | [Official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html)| 339K |
+ | **`Original Wang271K dataset`** | Wang271K | [Automatic-Corpus-Generation, provided by dimmywang](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml)| 93M |
+
+
+ Data format of the SIGHAN+Wang271K Chinese correction dataset:
+ ```json
+ [
+     {
+         "id": "B2-4029-3",
+         "original_text": "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。",
+         "wrong_ids": [
+             5,
+             31
+         ],
+         "correct_text": "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。"
+     }
+ ]
+ ```
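
A record in this format can be checked with a few lines of Python. The sketch below assumes that `wrong_ids` holds 0-based character indices into `original_text`; that reading is inferred from the sample record and is not stated by the dataset authors. `diff_positions` is a hypothetical helper written for this example.

```python
import json

# Sample record copied from the format example above.
RAW = """
[
  {
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [5, 31],
    "correct_text": "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。"
  }
]
"""


def diff_positions(original: str, correct: str) -> list:
    """Return the indices at which two equal-length strings differ."""
    if len(original) != len(correct):
        raise ValueError("texts must have the same length")
    return [i for i, (a, b) in enumerate(zip(original, correct)) if a != b]


records = json.loads(RAW)
for rec in records:
    positions = diff_positions(rec["original_text"], rec["correct_text"])
    # Compare the recomputed positions with the annotated ones.
    print(rec["id"], positions, positions == rec["wrong_ids"])
```

Running the same check over a full data file is a cheap way to catch misaligned annotations before training.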
+
+ ```shell
+ macbert4csc
+ ├── config.json
+ ├── pytorch_model.bin
+ ├── special_tokens_map.json
+ ├── tokenizer_config.json
+ └── vocab.txt
+ ```
+
+ To train macbert4csc, see [https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert](https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert)
+
+
+ ### About MacBERT
+ **MacBERT** is an improved BERT with a novel **M**LM **a**s **c**orrection pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.
+
+ Here is an example of our pre-training task.
+
+ | | Example |
+ | -------------- | ----------------- |
+ | **Original Sentence** | we use a language model to predict the probability of the next word. |
+ | **MLM** | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word . |
+ | **Whole word masking** | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word . |
+ | **N-gram masking** | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word . |
+ | **MLM as correction** | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word . |
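
The last row of the table can be reproduced at the word level with a toy substitution table. This is only an illustrative sketch: the `SIMILAR` table and the `corrupt` helper are hypothetical, the replacements are copied from the example row, and the selection of positions and the similar-word lookup used in real pre-training are not modeled here.

```python
# Toy substitution table copied from the "MLM as correction" row above.
# Assumption: real pre-training picks replacements from a similar-word
# lookup rather than a fixed table like this one.
SIMILAR = {
    ("language", "model"): ("text", "system"),
    ("predict",): ("calculate",),
    ("probability",): ("possibility",),
}


def corrupt(words, table, max_n=2):
    """Replace the longest matching n-gram at each position with its substitute."""
    out, i = [], 0
    while i < len(words):
        for n in range(max_n, 0, -1):
            key = tuple(words[i:i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            # No n-gram starting here is in the table, so keep the word.
            out.append(words[i])
            i += 1
    return out


original = "we use a language model to predict the probability of the next word .".split()
corrupted = corrupt(original, SIMILAR)
# The model input is the corrupted sentence; the training target is the original.
print(" ".join(corrupted))
```

Because the input contains real words instead of `[M]` tokens, it looks like the text the model sees during fine-tuning, which is the discrepancy the task is meant to reduce.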
+
+ In addition to the new pre-training task, we also incorporate the following techniques.
+
+ - Whole Word Masking (WWM)
+ - N-gram masking
+ - Sentence-Order Prediction (SOP)
+
+ **Note that MacBERT can directly replace the original BERT, as there are no differences in the main neural architecture.**
+
+ For more technical details, please check our paper: [Revisiting Pre-trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922)
+
+
+ ## Citation
+
+ ```bibtex
+ @software{pycorrector,
+   author = {Xu Ming},
+   title = {pycorrector: Text Error Correction Tool},
+   year = {2020},
+   url = {https://github.com/shibing624/pycorrector},
+ }
+ ```