---
language:
- zh
license: apache-2.0
tags:
- t5
- text error correction
widget:
- text: "今天天气不太好,我的心情也不是很偷快"
  example_title: "Example 1"
- text: "能不能帮我买点淇淋,好久没吃了。"
  example_title: "Example 2"
- text: "脑子有点胡涂了,这道题冥冥学过还没有做出来"
  example_title: "Example 3"
inference:
  parameters:
    max_length: 256
    num_beams: 10
    no_repeat_ngram_size: 5
    do_sample: True
    early_stopping: True
---
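
## Overview

T5Corrector: a Chinese text correction model for phonetic and glyph-based errors

This model is fine-tuned from mengzi-t5-base for text correction. The training data starts from more than 20 million sentences. A parallel correction corpus of more than 200 million sentence pairs was built from them by replacing words with homophones, near-homophones, and visually similar characters, by randomly inserting phrases into a sentence, by deleting some of the characters in a phrase, and by shuffling the order of characters and words. The model was trained for 66,000 steps in total.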
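
The corpus construction described above can be illustrated with a minimal sketch. This is not the author's pipeline: the confusion tables `HOMOPHONES` and `SIMILAR_GLYPHS`, the helper names, and the choice of one operation per sentence are all assumptions made for illustration, and phrase insertion is simplified to duplicating a single character.

```python
import random

# Hypothetical, tiny confusion tables (assumption); a real pipeline
# would rely on large homophone and similar-glyph dictionaries.
HOMOPHONES = {"明": ["冥", "名"], "糊": ["胡", "湖"]}
SIMILAR_GLYPHS = {"愉": ["偷", "喻"], "茅": ["毛", "矛"]}


def substitute(chars, table, rng):
    """Replace one character that has an entry in the confusion table."""
    candidates = [i for i, c in enumerate(chars) if c in table]
    if candidates:
        i = rng.choice(candidates)
        chars[i] = rng.choice(table[chars[i]])
    return chars


def insert_char(chars, rng):
    """Duplicate a random character at a random position."""
    chars.insert(rng.randrange(len(chars) + 1), rng.choice(chars))
    return chars


def delete_char(chars, rng):
    """Drop one random character."""
    if len(chars) > 1:
        del chars[rng.randrange(len(chars))]
    return chars


def swap_adjacent(chars, rng):
    """Swap two neighbouring characters to simulate ordering errors."""
    if len(chars) > 1:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return chars


def make_pair(sentence, rng=None):
    """Return a (noisy, clean) pair after applying one random operation."""
    rng = rng or random.Random()
    ops = [
        lambda c: substitute(c, HOMOPHONES, rng),
        lambda c: substitute(c, SIMILAR_GLYPHS, rng),
        lambda c: insert_char(c, rng),
        lambda c: delete_char(c, rng),
        lambda c: swap_adjacent(c, rng),
    ]
    noisy = rng.choice(ops)(list(sentence))
    return "".join(noisy), sentence
```

Calling `make_pair` several times on the same clean sentence yields several different noisy variants, which is one way a corpus of sentence pairs can grow well beyond the number of source sentences.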

<a href='https://github.com/Macielyoung/T5Corrector'>GitHub repository</a>

Load the model:

```python
# Load the tokenizer and the model
from transformers import T5Tokenizer, T5ForConditionalGeneration

pretrained = "Maciel/T5Corrector-base-v2"
tokenizer = T5Tokenizer.from_pretrained(pretrained)
model = T5ForConditionalGeneration.from_pretrained(pretrained)
```

Run inference with the model:

```python
import torch

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def correct(text, max_length):
    # Tokenize the input, truncating it to max_length tokens
    model_inputs = tokenizer(text,
                             max_length=max_length,
                             truncation=True,
                             return_tensors="pt").to(device)
    # Beam search with sampling; do_sample=True means results can vary between runs
    output = model.generate(**model_inputs,
                            num_beams=5,
                            no_repeat_ngram_size=4,
                            do_sample=True,
                            early_stopping=True,
                            max_length=max_length,
                            return_dict_in_generate=True,
                            output_scores=True)
    # Decode the first returned sequence back to text
    pred_output = tokenizer.batch_decode(output.sequences, skip_special_tokens=True)[0]
    return pred_output


text = "贵州毛台现在多少钱一瓶啊,想买两瓶尝尝味道。"
correction = correct(text, max_length=32)
print(correction)
```
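
Because the tokenizer call above truncates the input to `max_length` tokens, longer passages lose their tail. One possible workaround, sketched here as an assumption and not as part of the original project, is to split the text into sentence-sized chunks first and correct each chunk separately. The helper name `split_sentences` and the `max_chars` limit are hypothetical, and the limit counts characters, not tokens.

```python
import re


def split_sentences(text, max_chars=30):
    """Split after sentence-ending punctuation, then hard-wrap long pieces."""
    # Zero-width split right after Chinese or ASCII sentence terminators
    pieces = [p for p in re.split(r"(?<=[。!?!?;;])", text) if p]
    chunks = []
    for piece in pieces:
        # Pieces longer than max_chars are cut into fixed-size slices
        for start in range(0, len(piece), max_chars):
            chunks.append(piece[start:start + max_chars])
    return chunks


# Usage with the correct() function defined above (requires the loaded model):
# corrected = "".join(correct(chunk, max_length=32) for chunk in split_sentences(long_text))
```

Joining the chunks returned by `split_sentences` always reproduces the original text, so nothing is dropped before correction.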

### Examples

```
Example 1:
input: 能不能帮我买点淇淋,好久没吃了。
output: 能不能帮我买点冰淇淋,好久没吃了。

Example 2:
input: 脑子有点胡涂了,这道题冥冥学过还没有做出来
output: 脑子有点糊涂了,这道题明明学过还没有做出来

Example 3:
input: 今天天气不太好,我的心情也不是很偷快
output: 今天天气不太好,我的心情也不是很愉快
```
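
To see exactly which characters the model changed in pairs like the ones above, the input and output can be compared with Python's standard `difflib` module. The following is a small sketch; the helper name `diff_corrections` is an assumption and is not part of the T5Corrector project.

```python
import difflib


def diff_corrections(source, corrected):
    """List (operation, original_span, corrected_span) for every edit."""
    matcher = difflib.SequenceMatcher(a=source, b=corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(diff_corrections("能不能帮我买点淇淋,好久没吃了。",
                       "能不能帮我买点冰淇淋,好久没吃了。"))
# [('insert', '', '冰')]
```

An empty list means the model left the sentence unchanged.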