Commit b47bd71 (parent 9cac031) by idea-teacher: Update README.md

To improve the model performance on the topic classification task, we collected numerous topic classification datasets for pre-training based on general prompts.

### 下游效果 Performance

我们为每个数据集设计了两个prompt模板。

We customized two prompt templates for each dataset.

第一个prompt模板:

For ***prompt template 1***:

| Dataset | Prompt template 1 |
|---------|:------------------------:|
| TNEWS   | 下面是一则关于__的新闻: |
| CSLDCP  | 这一句描述__的内容如下: |
| IFLYTEK | 这一句描述__的内容如下: |

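At inference time, the `__` blank is filled with `[MASK]` tokens. As a quick illustration, here is a minimal sketch of our own (not from the original README), assuming one `[MASK]` per label character and a hypothetical label 房产, matching the masked prompt used in the usage examples below:

```python
# Minimal sketch: instantiate a prompt template for masked-label prediction.
# The label "房产" is a hypothetical example; real label sets come from the dataset.
def build_prompt(template: str, label: str, sentence: str) -> str:
    # One [MASK] token per character of the candidate label.
    return template.replace("__", "[MASK]" * len(label)) + sentence

print(build_prompt("下面是一则关于__的新闻:", "房产", "怎样的房子才算户型方正?"))
# -> 下面是一则关于[MASK][MASK]的新闻:怎样的房子才算户型方正?
```
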
第一个prompt模板的微调实验结果:

The **fine-tuning** results for prompt template 1:

| Model | TNEWS | CSLDCP | IFLYTEK |
|-----------------|:------:|:------:|:-------:|
| Macbert-base | 55.02 | 57.37 | 51.34 |
| Macbert-large | 55.77 | 58.99 | 50.31 |
| Erlangshen-1.3B | 57.36 | 62.35 | 53.23 |
| TCBert-base<sub>110M</sub>-Classification-Chinese | 55.57 | 58.60 | 49.63 |
| TCBert-large<sub>330M</sub>-Classification-Chinese | 56.17 | 60.06 | 51.34 |
| TCBert-1.3B<sub>1.3B</sub>-Classification-Chinese | 57.41 | 65.10 | 53.75 |
| TCBert-base<sub>110M</sub>-Sentence-Embedding-Chinese | 54.68 | 59.78 | 49.40 |
| TCBert-large<sub>330M</sub>-Sentence-Embedding-Chinese | 55.32 | 62.07 | 51.11 |
| TCBert-1.3B<sub>1.3B</sub>-Sentence-Embedding-Chinese | 57.46 | 65.04 | 53.06 |

第一个prompt模板的句子相似度结果:

The **sentence similarity** results for prompt template 1:

| Model | TNEWS (reference) | TNEWS (whitening) | CSLDCP (reference) | CSLDCP (whitening) | IFLYTEK (reference) | IFLYTEK (whitening) |
|-----------------|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Macbert-base | 43.53 | 47.16 | 33.50 | 36.53 | 28.99 | 33.85 |
| Macbert-large | 46.17 | 49.35 | 37.65 | 39.38 | 32.36 | 35.33 |
| Erlangshen-1.3B | 45.72 | 49.60 | 40.56 | 44.26 | 29.33 | 36.48 |
| TCBert-base<sub>110M</sub>-Classification-Chinese | 48.61 | 51.99 | 43.31 | 45.15 | 33.45 | 37.28 |
| TCBert-large<sub>330M</sub>-Classification-Chinese | 50.50 | 52.79 | 52.89 | 53.89 | 34.93 | 38.31 |
| TCBert-1.3B<sub>1.3B</sub>-Classification-Chinese | 50.80 | 51.59 | 51.93 | 54.12 | 33.96 | 38.08 |
| TCBert-base<sub>110M</sub>-Sentence-Embedding-Chinese | 45.82 | 47.06 | 42.91 | 43.87 | 33.28 | 34.76 |
| TCBert-large<sub>330M</sub>-Sentence-Embedding-Chinese | 50.10 | 50.90 | 53.78 | 53.33 | 37.62 | 36.94 |
| TCBert-1.3B<sub>1.3B</sub>-Sentence-Embedding-Chinese | 50.70 | 53.48 | 52.66 | 54.40 | 36.88 | 38.48 |

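We read the "reference" columns as cosine similarity on the raw mean-pooled sentence vectors and "whitening" as the same after a whitening post-processing. The README does not spell out the recipe; the following is a minimal NumPy sketch of the standard whitening transform for sentence embeddings (Su et al., 2021), under the assumption that this is the variant used:

```python
# Whitening sketch for sentence embeddings (assumed recipe, Su et al., 2021).
import numpy as np

def fit_whitening(embeddings: np.ndarray):
    # embeddings: (N, D) matrix of sentence vectors.
    mu = embeddings.mean(axis=0, keepdims=True)   # (1, D) mean vector
    cov = np.cov((embeddings - mu).T)             # (D, D) covariance
    u, s, _ = np.linalg.svd(cov)                  # eigendecomposition of cov
    kernel = u @ np.diag(1.0 / np.sqrt(s))        # whitening matrix
    return kernel, mu

def apply_whitening(embeddings: np.ndarray, kernel: np.ndarray, mu: np.ndarray):
    # Transformed vectors have approximately identity covariance.
    return (embeddings - mu) @ kernel
```
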
第二个prompt模板:

For ***prompt template 2***:

| Dataset | Prompt template 2 |
|---------|:------------------------:|
| TNEWS   | 接下来的新闻,是跟__相关的内容: |
| CSLDCP  | 接下来的学科,是跟__相关: |
| IFLYTEK | 接下来的生活内容,是跟__相关: |

第二个prompt模板的微调结果:

The **fine-tuning** results for prompt template 2:

| Model | TNEWS | CSLDCP | IFLYTEK |
|-----------------|:------:|:------:|:-------:|
| Macbert-base | 54.78 | 58.38 | 50.83 |
| Macbert-large | 56.77 | 60.22 | 51.63 |
| Erlangshen-1.3B | 57.81 | 62.80 | 52.77 |
| TCBert-base<sub>110M</sub>-Classification-Chinese | 54.58 | 59.16 | 49.80 |
| TCBert-large<sub>330M</sub>-Classification-Chinese | 56.22 | 61.23 | 50.77 |
| TCBert-1.3B<sub>1.3B</sub>-Classification-Chinese | 57.41 | 64.82 | 53.34 |
| TCBert-base<sub>110M</sub>-Sentence-Embedding-Chinese | 54.68 | 59.78 | 49.40 |
| TCBert-large<sub>330M</sub>-Sentence-Embedding-Chinese | 55.32 | 62.07 | 51.11 |
| TCBert-1.3B<sub>1.3B</sub>-Sentence-Embedding-Chinese | 56.87 | 65.83 | 52.94 |

第二个prompt模板的句子相似度结果:

The **sentence similarity** results for prompt template 2:

| Model | TNEWS (reference) | TNEWS (whitening) | CSLDCP (reference) | CSLDCP (whitening) | IFLYTEK (reference) | IFLYTEK (whitening) |
|-----------------|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Macbert-base | 42.29 | 45.22 | 34.23 | 37.48 | 29.62 | 34.13 |
| Macbert-large | 46.22 | 49.60 | 40.11 | 44.26 | 32.36 | 35.16 |
| Erlangshen-1.3B | 46.17 | 49.10 | 40.45 | 45.88 | 30.36 | 36.88 |
| TCBert-base<sub>110M</sub>-Classification-Chinese | 48.31 | 51.34 | 43.42 | 45.27 | 33.10 | 36.19 |
| TCBert-large<sub>330M</sub>-Classification-Chinese | 51.19 | 51.69 | 52.55 | 53.28 | 34.31 | 37.45 |
| TCBert-1.3B<sub>1.3B</sub>-Classification-Chinese | 52.14 | 52.39 | 51.71 | 53.89 | 33.62 | 38.14 |
| TCBert-base<sub>110M</sub>-Sentence-Embedding-Chinese | 46.72 | 48.86 | 43.19 | 43.53 | 34.08 | 35.79 |
| TCBert-large<sub>330M</sub>-Sentence-Embedding-Chinese | 50.65 | 51.94 | 53.84 | 53.67 | 37.74 | 36.65 |
| TCBert-1.3B<sub>1.3B</sub>-Sentence-Embedding-Chinese | 50.75 | 54.78 | 51.43 | 54.34 | 36.48 | 38.36 |

更多关于TCBERTs的细节,请参考我们的技术报告。基于新的数据,我们会更新TCBERTs,请留意我们仓库的更新。

For more details about TCBERTs, please refer to our technical report. We may update TCBERTs as new data becomes available, so please keep an eye on the repo!

## 使用 Usage

### 使用示例 Usage Examples

```python
# Prompt-based MLM fine-tuning
from transformers import BertForMaskedLM, BertTokenizer
import torch

# Loading models
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-1.3B-Classification-Chinese")
model = BertForMaskedLM.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-1.3B-Classification-Chinese")

# Prepare the data: the prompt masks the two-character label "房产"
inputs = tokenizer("下面是一则关于[MASK][MASK]的新闻:怎样的房子才算户型方正?", return_tensors="pt")
labels = tokenizer("下面是一则关于房产的新闻:怎样的房子才算户型方正?", return_tensors="pt")["input_ids"]
# Compute the loss only at the [MASK] positions (-100 is ignored by the MLM loss)
labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)

# Output the loss
outputs = model(**inputs, labels=labels)
loss = outputs.loss
```

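The block above computes the training loss for one prompted example. At inference time, the same masked prompt can be used to rank candidate labels; the sketch below is our own illustration (not from the original README), reuses the `tokenizer` and `model` defined above, and assumes a hypothetical set of two-character labels:

```python
# Hypothetical inference sketch: rank candidate labels by the MLM
# log-probability of their characters at the [MASK] positions.
import torch

candidate_labels = ["房产", "体育", "财经"]  # illustrative labels, all 2 characters
inputs = tokenizer("下面是一则关于[MASK][MASK]的新闻:怎样的房子才算户型方正?", return_tensors="pt")
mask_positions = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, vocab)
log_probs = torch.log_softmax(logits[0, mask_positions], dim=-1)

scores = {
    label: sum(log_probs[i, tokenizer.convert_tokens_to_ids(ch)].item()
               for i, ch in enumerate(label))
    for label in candidate_labels
}
print(max(scores, key=scores.get))                       # best-scoring label
```
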
```python
# Prompt-based Sentence Similarity
# To extract sentence representations.
from transformers import BertForMaskedLM, BertTokenizer
import torch

# Loading models
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-1.3B-Classification-Chinese")
model = BertForMaskedLM.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-1.3B-Classification-Chinese")

# Cosine similarity function
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-8)

with torch.no_grad():

    # To extract sentence representations for the training data
    training_input = tokenizer("怎样的房子才算户型方正?", return_tensors="pt")
    training_output = model(**training_input, output_hidden_states=True)
    # Mean-pool the final-layer hidden states into a single sentence vector
    training_representation = torch.mean(training_output.hidden_states[-1].squeeze(), dim=0)

    # To extract sentence representations for the test data
    test_input = tokenizer("下面是一则关于[MASK][MASK]的新闻:股票放量下跌,大资金出逃谁在接盘?", return_tensors="pt")
    test_output = model(**test_input, output_hidden_states=True)
    test_representation = torch.mean(test_output.hidden_states[-1].squeeze(), dim=0)

    # Calculate similarity scores
    similarity_score = cos(training_representation, test_representation)
```

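A note on the design: the sentence representation here is simply the mean of the final-layer hidden states over all tokens, compared with plain cosine similarity. The "whitening" numbers in the Performance tables post-process these same vectors, as sketched earlier.
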
## 引用 Citation

如果您在您的工作中使用了我们的模型,可以引用我们的[技术报告](https://arxiv.org/abs/2211.11304):

If you use our model in your work, please cite the following paper:

```
@article{han2022tcbert,
  title={TCBERT: A Technical Report for Chinese Topic Classification BERT},
  author={Han, Ting and Pan, Kunhao and Chen, Xinyu and Song, Dingjie and Fan, Yuchen and Gao, Xinyu and Gan, Ruyi and Zhang, Jiaxing},
  journal={arXiv preprint arXiv:2211.11304},
  year={2022}
}
```

如果您在您的工作中使用了我们的模型,可以引用我们的[网站](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):