---
tags:
- text2text-generation
- Chinese
- seq2seq
- grammar
language: zh
license: apache-2.0
---

# Pseudo-Native-BART-CGEC

This is a cutting-edge Chinese Grammatical Error Correction (CGEC) model based on [Chinese BART-large](https://huggingface.co/fnlp/bart-large-chinese).
It is trained on about 100M pseudo native-speaker CGEC training samples generated by heuristic rules, together with human-annotated training data for the media domain.
More details can be found in our [GitHub repository](https://github.com/HillZhang1999/NaSGEC) and the [paper](https://arxiv.org/pdf/2305.16023.pdf).

## Usage

```
pip install transformers
```

```
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("HillZhang/pseudo_native_bart_CGEC_media")
model = BartForConditionalGeneration.from_pretrained("HillZhang/pseudo_native_bart_CGEC_media")

# Example sentences containing grammatical errors for the model to correct.
encoded_input = tokenizer(["北京是中国的都。", "他说:”我最爱的运动是打蓝球“", "我每天大约喝5次水左右。", "今天,我非常开开心。"], return_tensors="pt", padding=True, truncation=True)

# BertTokenizer emits token_type_ids, which BART's generate() does not accept, so drop them.
if "token_type_ids" in encoded_input:
    del encoded_input["token_type_ids"]

output = model.generate(**encoded_input)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```
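
The call above decodes with the model's default generation settings; beam search is a common alternative for GEC. A minimal sketch continuing from the snippet above, assuming standard Hugging Face `generate()` keyword arguments (the `num_beams` and `max_length` values are illustrative, not settings recommended by the authors):

```
# Illustrative beam-search decoding; num_beams and max_length are
# assumptions, not values published by the model authors.
output = model.generate(**encoded_input, num_beams=5, max_length=128)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```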

## Citation

```
@inproceedings{zhang-etal-2023-nasgec,
    title = "{Na}{SGEC}: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts",
    author = "Zhang, Yue and
      Zhang, Bo and
      Jiang, Haochen and
      Li, Zhenghua and
      Li, Chen and
      Huang, Fei and
      Zhang, Min",
    booktitle = "Findings of ACL",
    year = "2023"
}
```