language:
  - zh
license: apache-2.0
tags:
  - roberta
  - NLU
  - NLI
  - Chinese
inference: true
widget:
  - text: 鲸鱼是哺乳动物,所有哺乳动物都是恒温动物[SEP]鲸鱼也是恒温动物
  - text: 葡萄树是阔叶植物,所有阔叶植物都不会落叶[SEP]葡萄树是落叶植物
  - text: 玉米价格持续上涨,饲料主要的来源是玉米[SEP]饲料价格可能会上涨

Erlangshen-Roberta-330M-Causal-Chinese

Brief Introduction

This is a Chinese causality discrimination model obtained by continuing training from chinese-roberta-wwm-ext-large.

Model Taxonomy

| Demand | Task | Series | Model | Parameter | Extra |
| --- | --- | --- | --- | --- | --- |
| General | Natural Language Understanding (NLU) | Erlangshen | Roberta | 330M | Chinese-Causal (Chinese causal inference) |

Model Information

Corpus Preparation

  • Wudao Causal Corpus: The same corpus used for Randeng-TransformerXL-5B-Deduction-Chinese. Sentence pairs with a causal relation were extracted from the Wudao corpus (280G version) through logic-indicator (connective) matching, manual annotation plus GTSFactory filtering, and data cleaning.

  • Reconstructed NLI data: After cleaning the CMNLI and OCNLI datasets, we converted them into binary classification datasets by treating the "entailment" category as the positive class and all other categories as the negative class.

  • Warm-up dataset: Built on the reconstructed CMNLI dataset, with samples from the Wudao Causal Corpus added to balance the number of positive and negative examples.
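The NLI reconstruction and balancing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the field names (`premise`, `hypothesis`, `label`), the label string `"entailment"`, and both helper names are assumptions. The balancing strategy shown, topping up the positive class with causal pairs until the counts match, is one plausible reading of "balanced".

```python
import random

def to_binary(nli_examples, positive_label="entailment"):
    """Map 3-way NLI labels to binary: entailment -> 1, everything else -> 0."""
    return [
        {"premise": ex["premise"], "hypothesis": ex["hypothesis"],
         "label": 1 if ex["label"] == positive_label else 0}
        for ex in nli_examples
    ]

def balance_with_causal(binary_examples, causal_pairs, seed=0):
    """Add causal (premise, result) pairs as positives until classes are even."""
    n_pos = sum(ex["label"] for ex in binary_examples)
    n_neg = len(binary_examples) - n_pos
    n_needed = max(0, n_neg - n_pos)  # assumes positives are the minority class
    rng = random.Random(seed)
    extra = rng.sample(causal_pairs, min(n_needed, len(causal_pairs)))
    added = [{"premise": p, "hypothesis": r, "label": 1} for p, r in extra]
    return binary_examples + added
```

In a 3-way NLI dataset with roughly even classes, the "entailment" class alone is the minority after binarization, which is why the sketch adds positives rather than negatives.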

Model Training

  1. Fine-tune chinese-roberta-wwm-ext-large on the warm-up dataset.
  2. Use the resulting model as the discriminator in self-consistent closed-loop training together with Randeng-TransformerXL-5B-Deduction-Chinese and Randeng-TransformerXL-5B-Abduction-Chinese.

In the second step, each sentence pair in the warm-up dataset was split: the premise was fed into Randeng-TransformerXL-5B-Deduction-Chinese and the result into Randeng-TransformerXL-5B-Abduction-Chinese. The two generative models performed deductive and abductive reasoning on each sample respectively, generating a large number of pseudo-samples. Erlangshen-Roberta-330M-Causal-Chinese then scored the causality of the pseudo-samples and selected the training data for itself and for the generative models in the next iteration.
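One round of the self-consistent loop described above might look like the sketch below. The three callables stand in for the deduction model, the abduction model, and this discriminator; their signatures, the function name `self_consistent_round`, and the fixed score threshold are all assumptions made for illustration. The source does not say how the discriminator's scores are turned into a selection.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (premise, result)

def self_consistent_round(
    seed_pairs: List[Pair],
    deduce: Callable[[str], str],        # premise -> generated result
    abduce: Callable[[str], str],        # result -> generated premise
    score: Callable[[str, str], float],  # discriminator causality score in [0, 1]
    threshold: float = 0.9,              # hypothetical cut-off, not from the source
) -> List[Pair]:
    """Generate pseudo-samples in both directions and keep the high-scoring ones."""
    pseudo: List[Pair] = []
    for premise, result in seed_pairs:
        pseudo.append((premise, deduce(premise)))  # deductive pseudo-sample
        pseudo.append((abduce(result), result))    # abductive pseudo-sample
    return [(p, r) for p, r in pseudo if score(p, r) >= threshold]
```

The returned pairs would then serve as training data for the next iteration of all three models.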

Performance (for reference)

| Metric | Warm-up dataset (dev) | Warm-up dataset (test) | ocnli (zero-shot) |
| --- | --- | --- | --- |
| F1-score | 84.06 | 84.04 | 78.43 |
| Precision | 79.95 | 79.71 | 81.51 |
| Recall | 88.63 | 88.88 | 75.57 |
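As a quick sanity check, the F1 scores in the table above can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The sketch assumes the table uses that definition; the recomputed values agree with the reported ones to within about 0.01, which is consistent with rounding of the inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the table above
reported = {
    "warm-up dev": (79.95, 88.63, 84.06),
    "warm-up test": (79.71, 88.88, 84.04),
    "ocnli zero-shot": (81.51, 75.57, 78.43),
}
for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {f1:.2f}")
```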

Usage

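from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Model name as given in this card; it must resolve to a local directory or a Hub repo id.
model_name = 'Erlangshen-Roberta-330M-Causal-Chinese'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

texta = '鲸鱼是哺乳动物,所有哺乳动物都是恒温动物'  # premise
textb = '鲸鱼也是恒温动物'  # candidate result
# Encode the pair as one sequence: [CLS] texta [SEP] textb [SEP]
input_ids = torch.tensor([tokenizer.encode(texta, textb)])
with torch.no_grad():  # inference only, no gradients needed
    output = model(input_ids)
# Softmax over the logits gives the class probabilities for the pair
print(torch.nn.functional.softmax(output.logits, dim=-1))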
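The widget examples in this card's metadata write both sentences as a single string joined by `[SEP]`, whereas the snippet above passes them as two arguments. A small helper along these lines could convert between the two forms; `split_widget_text` is a hypothetical name introduced here, not part of transformers or of this repository.

```python
from typing import Tuple

def split_widget_text(text: str, sep: str = "[SEP]") -> Tuple[str, str]:
    """Split 'premise[SEP]hypothesis' into a (premise, hypothesis) pair."""
    premise, found, hypothesis = text.partition(sep)
    if not found:
        raise ValueError(f"expected {sep!r} between the two sentences")
    return premise.strip(), hypothesis.strip()

# One of the widget examples from this card's metadata
texta, textb = split_widget_text("玉米价格持续上涨,饲料主要的来源是玉米[SEP]饲料价格可能会上涨")
```

The resulting `texta` and `textb` can be passed to `tokenizer.encode(texta, textb)` exactly as in the snippet above.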

Citation

If you use this resource in your work, please cite our paper:

@article{fengshenbang,
  author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}

You can also cite our website:

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}