Pretrained Chinese Longformer | Longformer_ZH with PyTorch
Compared with the O(n^2) complexity of the standard Transformer, Longformer provides an efficient method for processing long documents of up to 4K characters with linear complexity. Longformer's attention mechanism is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention, which helps the model learn from very long sequences.
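To make the complexity claim concrete, the sketch below counts how many query-key pairs are scored under full self-attention versus a sliding window plus a few global tokens. The window size of 512 and the single global token are illustrative assumptions, not values stated in this README, and the count is a simple upper bound that ignores sequence boundaries.

```python
def full_attention_pairs(n: int) -> int:
    """Full self-attention: every token scores every token, n * n pairs."""
    return n * n


def longformer_attention_pairs(n: int, window: int = 512, num_global: int = 1) -> int:
    """Upper bound on scored pairs for sliding-window plus global attention.

    `window` and `num_global` are illustrative assumptions.
    Each token scores at most `window + 1` positions (its window and itself).
    Each global token attends to all n tokens, and all n tokens attend to it.
    Overlap between local and global pairs is not subtracted.
    """
    local_pairs = n * min(n, window + 1)
    global_pairs = 2 * num_global * n
    return local_pairs + global_pairs


for n in (512, 1024, 2048, 4096):
    full = full_attention_pairs(n)
    sparse = longformer_attention_pairs(n)
    print(f"n={n}: full={full}, windowed+global<={sparse}")
```

Once the sequence is longer than the window, doubling the length doubles the windowed count, while the full-attention count quadruples.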
Few resources exist for Chinese Longformer models or Chinese long-sequence tasks. We therefore open-source our pretrained Chinese Longformer parameters, together with the corresponding loading method and the pretraining scripts.
Load the model
You can download Longformer_zh from Google Drive or Baidu Yun.
- Google Drive: https://drive.google.com/file/d/1IDJ4aVTfSFUQLIqCYBtoRpnfbgHPoxB4/view?usp=sharing
- Baidu Yun: https://pan.baidu.com/s/1HaVDENx52I7ryPFpnQmq1w (extraction code: y601)
We also provide automatic download through HuggingFace Transformers.
from Longformer_zh import LongformerZhForMaksedLM
LongformerZhForMaksedLM.from_pretrained('ValkyriaLenneth/longformer_zh')
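The notice in this README recommends loading the checkpoint directly with transformers.LongformerModel.from_pretrained. A minimal sketch of that call is shown here; the helper name load_longformer_zh is hypothetical, and the import is placed inside the function so the snippet can be read and imported without the transformers package installed.

```python
MODEL_ID = "ValkyriaLenneth/longformer_zh"  # repository name taken from this README


def load_longformer_zh(model_id: str = MODEL_ID):
    """Load the checkpoint with the stock Longformer class from transformers.

    Hypothetical helper for illustration. Calling it requires the
    transformers package (with PyTorch) and downloads the weights.
    """
    from transformers import LongformerModel  # lazy import

    model = LongformerModel.from_pretrained(model_id)
    model.eval()  # switch to inference mode (disables dropout)
    return model
```

Calling load_longformer_zh() returns the encoder without a task head; a task-specific class would be needed for fine-tuning on a downstream task.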
Notice
Please use transformers.LongformerModel.from_pretrained to load the model directly.
The following notices are deprecated; please ignore them.
Unlike the original English Longformer, Longformer_zh is based on Roberta_zh, which is essentially a Transformers.BertModel and not a RobertaModel, so it cannot be loaded with the original code.
We provide a modified Longformer_zh class that you can use directly to load the parameters.
If you want to use our model on more downstream tasks, please refer to Longformer_zh.py and replace the attention layer with the Longformer attention layer.
About Pretraining
The pretraining corpus comes from https://github.com/brightmart/nlp_chinese_corpus. Following the setup in the Longformer paper, we pretrain on a mixture of 4 different Chinese corpora.
Our model is based on Roberta_zh_mid (https://github.com/brightmart/roberta_zh). The pretraining script is adapted from https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb.
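One step in the referenced conversion notebook, as we understand it, is to initialize the longer position-embedding table with repeated copies of the original, shorter one. The sketch below illustrates only that tiling idea on plain Python lists. The function name and the toy sizes are hypothetical, and the actual script used for this model may differ.

```python
from typing import List


def tile_position_embeddings(old: List[List[float]], new_len: int) -> List[List[float]]:
    """Build a table with new_len rows by repeating the rows of `old` in order.

    Hypothetical helper: row i of the result is a copy of row (i mod len(old)).
    """
    if not old:
        raise ValueError("old embedding table must not be empty")
    old_len = len(old)
    return [list(old[i % old_len]) for i in range(new_len)]


# Toy example: 4 learned positions with 2 dimensions, extended to 10 positions.
old_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
new_table = tile_position_embeddings(old_table, 10)
print(len(new_table), new_table[4], new_table[9])
```

Starting from copies of trained embeddings, instead of random values, gives the extended model a reasonable initialization before further pretraining.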
On top of the original setup, we introduce Whole-Word-Masking (WWM) into pretraining to better fit the characteristics of Chinese.
Our WWM code is refactored from the TensorFlow version of Roberta_zh; as far as we know, it is the first open-source Whole-Word-Masking script in PyTorch.
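The idea behind Whole-Word-Masking can be sketched as follows: BERT-style Chinese tokenization splits text into single characters, so the masking decision is made per segmented word and then applied to every character of that word. The sketch takes already segmented words as input (this README uses Jieba for segmentation). The function name, the default 15% rate, and the omission of the usual random/keep replacement are simplifying assumptions; this is not the repository's script.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"


def whole_word_mask(words: List[str], mask_prob: float = 0.15,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Return (tokens, labels) at character level.

    Each word is selected with probability mask_prob. When a word is
    selected, every one of its characters becomes [MASK]. labels holds the
    original character at masked positions and "" elsewhere.
    """
    rng = random.Random(seed)
    tokens: List[str] = []
    labels: List[str] = []
    for word in words:
        masked = rng.random() < mask_prob  # one decision per whole word
        for ch in word:
            tokens.append(MASK if masked else ch)
            labels.append(ch if masked else "")
    return tokens, labels


# Words as a segmenter might return them (hypothetical example input).
words = ["我们", "使用", "中文", "预训练", "模型"]
tokens, labels = whole_word_mask(words, mask_prob=0.5, seed=1)
print(tokens)
```

With character-level masking, a model could often recover one masked character from the other half of the same word; masking the whole word removes that shortcut.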
The maximum sequence length is max_seq_length = 4096, and pretraining for 3K steps took about 4 days on 4 * Titan RTX.
We use Nvidia.Apex for mixed-precision training to accelerate pretraining.
For data preprocessing, we use Jieba for Chinese word segmentation and JIONLP for data cleaning.
For more details, please check our pretraining scripts.
Evaluation
CCF Sentiment Analysis
- Since open-source Chinese NLP tasks at the long-sequence level are scarce, we use the CCF-Sentiment-Analysis task for evaluation.
Model | Dev F |
---|---|
Bert | 80.3 |
Bert-wwm-ext | 80.5 |
Roberta-mid | 80.5 |
Roberta-large | 81.25 |
Longformer_SC | 79.37 |
Longformer_ZH | 80.51 |
Pretraining BPC
- We also report the pretraining BPC (bits-per-character). The lower the BPC, the better the language model; it can be read in a similar way to perplexity (PPL).
Model | BPC |
---|---|
Longformer before training | 14.78 |
Longformer after training | 3.10 |
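BPC and perplexity are two views of the same cross-entropy. Assuming the evaluation loss is the mean cross-entropy in nats per predicted unit (an assumption about how the score was computed, not something this README states), BPC = loss / ln 2 and PPL = 2^BPC = e^loss. The helper names below are hypothetical.

```python
import math


def bpc_from_loss(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss in nats to bits: loss / ln(2)."""
    return loss_nats / math.log(2)


def perplexity_from_bpc(bpc: float) -> float:
    """Perplexity per predicted unit: 2 ** bpc."""
    return 2.0 ** bpc


# BPC values reported in the table above, converted back to a loss in nats.
for bpc in (14.78, 3.10):
    print(f"BPC {bpc:.2f} -> loss {bpc * math.log(2):.3f} nats")
```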
CMRC(Chinese Machine Reading Comprehension)
Model | F1 | EM |
---|---|---|
Bert | 85.87 | 64.90 |
Roberta | 86.45 | 66.57 |
Longformer_zh | 86.15 | 66.84 |
Chinese Coreference Resolution
Model | Conll-F1 | Precision | Recall |
---|---|---|---|
Bert | 66.82 | 70.30 | 63.67 |
Roberta | 67.77 | 69.28 | 66.32 |
Longformer_zh | 67.81 | 70.13 | 65.64 |
Acknowledgements
Thanks to the Okumura·Funakoshi Lab at Tokyo Institute of Technology for providing the computing resources and the opportunity for me to finish this project.