hhou435 committed on
Commit
c790c1b
1 Parent(s): 0a50d2f

Update README

  1. README.md +103 -0
README.md ADDED
@@ -0,0 +1,103 @@
+ ---
+ language: Chinese
+ datasets: CLUECorpusSmall
+ widget:
+ - text: "这是很久之前的事情了"
+
+
+ ---
+
+
+ # Chinese GPT2 Model
+
+ ## Model description
+
+ The model generates Chinese text. You can download it either from the [GPT2-Chinese GitHub page](https://github.com/Morizeyao/GPT2-Chinese) or from Hugging Face at [gpt2-distil-chinese-cluecorpussmall](https://huggingface.co/uer/gpt2-distil-chinese-cluecorpussmall).
+
+ ## How to use
+
+ You can use the model directly with a pipeline for text generation:
+
+ ```python
+ >>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
+ >>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
+ >>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
+ >>> text_generator = TextGenerationPipeline(model, tokenizer)
+ >>> text_generator("这是很久之前的事情了", max_length=100, do_sample=True)
+ [{'generated_text': '这是很久之前的事情了 ! 这 件 事 情 之 后 我 每 天 都 问 自 己 , 对 未 来 的 影 响 是 什 么 ? 在 这 个 过 程 中 我 一 直 提 高 自 己 的 理 论 和 实 践 能 力 , 比 如 说 , 我 们 现 在 有 很 多 很 多 的 投 资 行 为 可 以 赚 钱 , 在 美 国 有 很 多 交 易 行 为 , 是 一 个 比 较 灵 活 的 模'}]
+ ```
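The sample output above separates the generated tokens with spaces, most likely because the BERT-style tokenizer joins decoded tokens with a space. A minimal sketch for turning that into running Chinese text, assuming you are happy to drop every whitespace character. `join_tokens` is a hypothetical helper, not part of `transformers`, and it would also remove meaningful spaces between Latin-script words.

```python
def join_tokens(generated_text: str) -> str:
    """Remove the whitespace that separates decoded tokens."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces with "" drops all spaces.
    return "".join(generated_text.split())


# Shortened copy of the sample output shown above.
sample = "这是很久之前的事情了 ! 这 件 事 情 之 后"
print(join_tokens(sample))
```

You would apply it to the `generated_text` field of each dictionary the pipeline returns.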
+
+
+
+ ## Training data
+
+ [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.
+
+ ## Training procedure
+
+ The model is pre-trained with [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). Pre-training runs for 1,000,000 steps with a sequence length of 128, followed by 250,000 additional steps with a sequence length of 1024.
+
+ Stage1:
+
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
+ --vocab_path models/google_zh_vocab.txt \
+ --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
+ --seq_length 128 --processes_num 32 --target lm
+ ```
+
+ ```
+ python3 pretrain.py --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
+ --vocab_path models/google_zh_vocab.txt \
+ --output_model_path models/cluecorpussmall_gpt2_distil_seq128_model.bin \
+ --config_path models/gpt2/distil_config.json \
+ --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+ --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+ --learning_rate 1e-4 --batch_size 64 \
+ --embedding word_pos --remove_embedding_layernorm \
+ --encoder transformer --mask causal --layernorm_positioning pre \
+ --target lm --tie_weight
+ ```
+
+ Stage2:
+
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
+ --vocab_path models/google_zh_vocab.txt \
+ --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
+ --seq_length 1024 --processes_num 32 --target lm
+ ```
+
+ ```
+ python3 pretrain.py --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
+ --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq128_model.bin-1000000 \
+ --vocab_path models/google_zh_vocab.txt \
+ --output_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
+ --config_path models/gpt2/distil_config.json \
+ --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+ --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
+ --learning_rate 5e-5 --batch_size 16 \
+ --embedding word_pos --remove_embedding_layernorm \
+ --encoder transformer --mask causal --layernorm_positioning pre \
+ --target lm --tie_weight
+ ```
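To compare the two stages, it can help to estimate how many tokens each one processes. The sketch below multiplies the step count, batch size, GPU count and sequence length taken from the commands above. It assumes that `--batch_size` is the per-GPU batch size and that every sequence is filled to `--seq_length`, so the figures are rough upper bounds rather than measured values. `tokens_seen` is a hypothetical helper.

```python
def tokens_seen(steps: int, batch_size: int, world_size: int, seq_length: int) -> int:
    """Upper bound on tokens processed, assuming batch_size is per GPU."""
    return steps * batch_size * world_size * seq_length


# Values copied from the two pretrain.py commands above.
stage1 = tokens_seen(steps=1_000_000, batch_size=64, world_size=8, seq_length=128)
stage2 = tokens_seen(steps=250_000, batch_size=16, world_size=8, seq_length=1024)

print(f"stage 1: {stage1:,} tokens")
print(f"stage 2: {stage2:,} tokens")
```

Under these assumptions, stage 2 processes half as many tokens as stage 1 while using sequences eight times longer.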
+
+ Finally, we convert the pre-trained model into Hugging Face format:
+
+ ```
+ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin-250000 \
+ --output_model_path pytorch_model.bin \
+ --layers_num 6
+ ```
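The `--layers_num` value has to match the number of transformer blocks in the checkpoint being converted. A minimal sketch for checking that from parameter names, assuming the keys contain an `h.<index>.` segment as in Hugging Face GPT-2 state dicts. `count_layers` is a hypothetical helper, and the key list is illustrative rather than read from the real file.

```python
import re


def count_layers(keys):
    """Count distinct block indices in names like 'transformer.h.3.ln_1.weight'."""
    pattern = re.compile(r"(?:^|\.)h\.(\d+)\.")
    indices = set()
    for key in keys:
        match = pattern.search(key)
        if match:
            indices.add(int(match.group(1)))
    return len(indices)


# Illustrative key names only. With a real checkpoint you would pass
# torch.load("pytorch_model.bin", map_location="cpu").keys() instead.
example_keys = [
    "transformer.wte.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.1.ln_1.weight",
    "transformer.ln_f.weight",
]
print(count_layers(example_keys))  # 2
```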
+
+ ### BibTeX entry and citation info
+
+ ```
+ @article{zhao2019uer,
+ title={UER: An Open-Source Toolkit for Pre-training Models},
+ author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+ journal={EMNLP-IJCNLP 2019},
+ pages={241},
+ year={2019}
+ }
+ ```