metadata

language:
  - zh
tags:
  - cpm
license: mit
datasets:
  - 100GB Chinese corpus

CPM-Generate

Model description

CPM (Chinese Pre-trained Language Model) is a Transformer-based autoregressive language model, with 2.6 billion parameters and 100GB Chinese training data. To the best of our knowledge, CPM is the largest Chinese pre-trained language model, which could facilitate downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. [Project] [Model] [Paper]

Intended uses & limitations

How to use

from transformers import TextGenerationPipeline, AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")
model = AutoModelWithLMHead.from_pretrained("TsinghuaAI/CPM-Generate")

text_generator = TextGenerationPipeline(model, tokenizer)
text_generator('清华大学', max_length=50, do_sample=True, top_p=0.9)

Limitations and bias

The text generated by CPM is automatically generated by a neural network model trained on a large number of texts, which does not represent our official attitudes and preferences. The text generated by CPM is only used for technical and scientific purposes. If it infringes on your rights and interests or violates social morality, please do not propagate it, but contact us and we will deal with it promptly.

Training data

We collect different kinds of texts in our pre-training, including encyclopedia, news, novels, and Q&A. The details of our training data are shown as follows.

Data Source	Encyclopedia	Webpage	Story	News	Dialog
Size	~40GB	~39GB	~10GB	~10GB	~1GB

Training procedure

Based on the hyper-parameter searching on the learning rate and batch size, we set the learning rate as $1.5\times10^{-4}$ and the batch size as $3,072$, which makes the model training more stable. In the first version, we still adopt the dense attention and the max sequence length is $1,024$. We will implement sparse attention in the future. We pre-train our model for $20,000$ steps, and the first $5,000$ steps are for warm-up. The optimizer is Adam. It takes two weeks to train our largest model using $64$ NVIDIA V100.

Eval results

	n_param	n_layers	d_model	n_heads	d_head
CPM-Small	109M	12	768	12	64
CPM-Medium	334M	24	1,024	16	64
CPM-Large	2.6B	32	2,560	32	80

We evaluate CPM with different numbers of parameters (the details are shown above) on various Chinese NLP tasks in the few-shot (even zero-shot) settings. With the increase of parameters, CPM performs better on most datasets, indicating that larger models are more proficient at language generation and language understanding. We provide results of text classification, chinese idiom cloze test, and short text conversation generation as follows. Please refer to our paper for more detailed results.

Zero-shot performance on text classification tasks

	TNEWS	IFLYTEK	OCNLI
CPM-Small	0.626	0.584	0.378
CPM-Medium	0.618	0.635	0.379
CPM-Large	0.703	0.708	0.442

Performance on Chinese Idiom Cloze (ChID) dataset

	Supervised	Unsupervised
CPM-Small	0.657	0.433
CPM-Medium	0.695	0.524
CPM-Large	0.804	0.685

Performance on Short Text Conversation Generation (STC) dataset

	Average	Extrema	Greedy	Dist-1	Dist-2
Few-shot (Unsupervised)
CDial-GPT	0.899	0.797	0.810	1,963 / 0.011	20,814 / 0.126
CPM-Large	0.928	0.805	0.815	3,229 / 0.007	68,008 / 0.154
Supervised
CDial-GPT	0.933	0.814	0.826	2,468 / 0.008	35,634 / 0.127
CPM-Large	0.934	0.810	0.819	3,352 / 0.011	67,310 / 0.233

BibTeX entry and citation info

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu, and Zhou, Hao, and Ke, Pei, and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}