---
language: 
  - zh

license: apache-2.0

tags:
  - classification

inference: false

---

# IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese

- Main Page: [Fengshenbang](https://fengshenbang-lm.com/)
- GitHub: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)

## 简介 Brief Introduction

330M参数的句子表征Topic Classification BERT (TCBert)。

This is the 330M-parameter Topic Classification BERT (TCBert), pre-trained to produce sentence representations for Chinese topic classification tasks.

## 模型分类 Model Taxonomy

|  需求 Demand  | 任务 Task       | 系列 Series      | 模型 Model    | 参数 Parameter | 额外 Extra |
|  :----:  | :----:  | :----:  | :----:  | :----:  | :----:  |
| 通用 General  | 句子表征 Sentence Representation | 二郎神 Erlangshen | TCBert (sentence representation)      |      330M     |   Chinese     |

## 模型信息 Model Information


为了提高模型在话题分类上句子表征效果,我们收集了大量话题分类数据进行基于prompts的对比学习预训练。

To improve sentence representations for topic classification, we collected a large amount of topic classification data and used it for prompt-based contrastive pre-training.

### 下游效果 Performance

我们为每个数据集设计了两个prompt模板。

We design two prompt templates for each dataset.

第一个prompt模板:

For ***prompt template 1***:

| Dataset |     Prompt template 1    |
|---------|:------------------------:|
| TNEWS   | 下面是一则关于__的新闻: |
| CSLDCP  | 这一句描述__的内容如下: |
| IFLYTEK | 这一句描述__的内容如下: |


第一个prompt模板的微调实验结果:

The **fine-tuning** results for prompt template 1:

| Model           |  TNEWS | CSLDCP | IFLYTEK |
|-----------------|:------:|:------:|:-------:|
| Macbert-base    | 55.02  | 57.37  |  51.34  |
| Macbert-large   | 55.77  | 58.99  |  50.31  |
| Erlangshen-1.3B | 57.36  | 62.35  |  53.23  |
| TCBert-base<sub>110M-Classification-Chinese</sub>     | 55.57  | 58.60  |  49.63  |
| TCBert-large<sub>330M-Classification-Chinese</sub>    | 56.17  | 60.06  |  51.34  |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub>     | 57.41  | 65.10  |  53.75  |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub>     | 54.68  | 59.78  |  49.40  |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub>    | 55.32  | 62.07  |  51.11  |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub>     | 57.46  | 65.04  |  53.06  |


第一个prompt模板的句子相似度结果:

The **sentence similarity** results for prompt template 1:

| Model           | TNEWS (reference) | TNEWS (whitening) | CSLDCP (reference) | CSLDCP (whitening) | IFLYTEK (reference) | IFLYTEK (whitening) |
|-----------------|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Macbert-base    |  43.53   |   47.16   |   33.50   |   36.53   |   28.99   |   33.85   |
| Macbert-large   |  46.17   |   49.35   |   37.65   |   39.38   |   32.36   |   35.33   |
| Erlangshen-1.3B |  45.72   |   49.60   |   40.56   |   44.26   |   29.33   |   36.48   |
| TCBert-base<sub>110M-Classification-Chinese</sub>     |  48.61   |   51.99   |   43.31   |   45.15   |   33.45   |   37.28   |
| TCBert-large<sub>330M-Classification-Chinese</sub>    |  50.50   |   52.79   |   52.89   |   53.89   |   34.93   |   38.31   |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub>     |  50.80   |   51.59   |   51.93   |   54.12   |   33.96   |   38.08   |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub>     |  45.82   |   47.06   |   42.91   |   43.87   |   33.28   |   34.76   |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub>    |  50.10   |   50.90   |   53.78   |   53.33   |   37.62   |   36.94   |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub>     |  50.70   |   53.48   |   52.66   |   54.40   |   36.88   |   38.48   |

第二个prompt模板:

For ***prompt template 2***:

| Dataset |     Prompt template 2    |
|---------|:------------------------:|
| TNEWS   | 接下来的新闻,是跟__相关的内容: |
| CSLDCP  | 接下来的学科,是跟__相关: |
| IFLYTEK |  接下来的生活内容,是跟__相关: |

第二个prompt模板的微调结果:

The **fine-tuning** results for prompt template 2:

| Model           |  TNEWS | CSLDCP | IFLYTEK |
|-----------------|:------:|:------:|:-------:|
| Macbert-base    | 54.78  | 58.38  |  50.83  |
| Macbert-large   | 56.77  | 60.22  |  51.63  |
| Erlangshen-1.3B | 57.81  | 62.80  |  52.77  |
| TCBert-base<sub>110M-Classification-Chinese</sub>     | 54.58  | 59.16  |  49.80  |
| TCBert-large<sub>330M-Classification-Chinese</sub>    | 56.22  | 61.23  |  50.77  |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub>     | 57.41  | 64.82  |  53.34  |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub>     | 54.68  | 59.78  |  49.40  |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub>    | 55.32  | 62.07  |  51.11  |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub>     | 56.87  | 65.83  |  52.94  |


第二个prompt模板的句子相似度结果:

The **sentence similarity** results for prompt template 2:

| Model           | TNEWS (reference) | TNEWS (whitening) | CSLDCP (reference) | CSLDCP (whitening) | IFLYTEK (reference) | IFLYTEK (whitening) |
|-----------------|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Macbert-base    |  42.29   |   45.22   |   34.23   |   37.48   |   29.62   |   34.13   |
| Macbert-large   |  46.22   |   49.60   |   40.11   |   44.26   |   32.36   |   35.16   |
| Erlangshen-1.3B |  46.17   |   49.10   |   40.45   |   45.88   |   30.36   |   36.88   |
| TCBert-base<sub>110M-Classification-Chinese</sub>     |  48.31   |   51.34   |   43.42   |   45.27   |   33.10   |   36.19   |
| TCBert-large<sub>330M-Classification-Chinese</sub>    |  51.19   |   51.69   |   52.55   |   53.28   |   34.31   |   37.45   |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub>     |  52.14   |   52.39   |   51.71   |   53.89   |   33.62   |   38.14   |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub>     |  46.72   |   48.86   |   43.19   |   43.53   |   34.08   |   35.79   |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub>    |  50.65   |   51.94   |   53.84   |   53.67   |   37.74   |   36.65   |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub>     |  50.75   |   54.78   |   51.43   |   54.34   |   36.48   |   38.36   |

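The *whitening* columns in the sentence-similarity tables above are not defined in this card. Assuming they refer to the usual whitening post-processing of sentence embeddings (centre the vectors, then rotate and rescale them so that their covariance becomes the identity), a minimal NumPy sketch could look like the following. The function names and the random toy data are illustrative and are not taken from the TCBert code.

```python
import numpy as np


def fit_whitening(embeddings: np.ndarray):
    """Estimate the mean and whitening matrix from an (n, d) embedding matrix."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings - mu, rowvar=False)
    # SVD of the symmetric covariance matrix: cov = U diag(S) U^T
    u, s, _ = np.linalg.svd(cov)
    # W = U diag(S^-1/2); the tiny constant guards against division by zero
    w = u @ np.diag(1.0 / np.sqrt(s + 1e-12))
    return mu, w


def apply_whitening(embeddings: np.ndarray, mu: np.ndarray, w: np.ndarray):
    """Centre the embeddings and map them through the whitening matrix."""
    return (embeddings - mu) @ w


# Toy data (hypothetical): 200 correlated 4-dimensional "sentence embeddings"
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

mu, w = fit_whitening(x)
x_white = apply_whitening(x, mu, w)

# After whitening, the covariance of the embeddings is close to the identity
cov_white = np.cov(x_white, rowvar=False)
print(np.round(cov_white, 3))
```

In practice the mean and whitening matrix would be estimated once on a set of sentence representations and then applied to every vector before cosine similarity is computed.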

更多关于TCBERTs的细节,请参考我们的技术报告。基于新的数据,我们会更新TCBERTs,请留意我们仓库的更新。

For more details about the TCBERT models, please refer to our technical report. We will update the models as new data becomes available, so please keep an eye on our repository.

## 使用 Usage
### 使用示例 Usage Examples

```python
# Prompt-based MLM fine-tuning
from transformers import BertForMaskedLM, BertTokenizer
import torch

# Loading models
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")
model = BertForMaskedLM.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")

# Prepare the data
inputs = tokenizer("下面是一则关于[MASK][MASK]的新闻:怎样的房子才算户型方正?", return_tensors="pt")
labels = tokenizer("下面是一则关于房产的新闻:怎样的房子才算户型方正?", return_tensors="pt")["input_ids"]
labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)

# Output the loss
outputs = model(**inputs, labels=labels)
loss = outputs.loss

```
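The fine-tuning example above hard-codes one masked prompt and its label sentence. As a minimal sketch, the pair can also be built from a prompt template, a label word, and the input text. The helper name `build_prompt_pair` and the `__` slot convention (borrowed from the template tables) are illustrative assumptions, not part of the released code, and the sketch assumes that the tokenizer maps each Chinese character of the label word to exactly one token.

```python
# Illustrative helper (not part of the released code): build the masked input
# and the label sentence for one prompt template.
MASK_TOKEN = "[MASK]"  # BERT's mask token, as used in the example above


def build_prompt_pair(template: str, label_word: str, text: str, slot: str = "__"):
    """Return (masked_sentence, label_sentence) for one training example.

    The template's slot is replaced by one [MASK] per character of the label
    word, which assumes one token per Chinese character.
    """
    if template.count(slot) != 1:
        raise ValueError("template must contain exactly one slot")
    masked = template.replace(slot, MASK_TOKEN * len(label_word)) + text
    labeled = template.replace(slot, label_word) + text
    return masked, labeled


masked, labeled = build_prompt_pair(
    template="下面是一则关于__的新闻:",
    label_word="房产",
    text="怎样的房子才算户型方正?",
)
print(masked)
print(labeled)
```

The two strings can then be passed to the tokenizer in place of the literal sentences used for `inputs` and `labels` above.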

```python
# Prompt-based Sentence Similarity
# To extract sentence representations.
from transformers import BertForMaskedLM, BertTokenizer
import torch

# Loading models
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")
model = BertForMaskedLM.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")

# Cosine similarity function
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-8)

with torch.no_grad():

    # Extract the sentence representation for the training example
    training_input = tokenizer("怎样的房子才算户型方正?", return_tensors="pt")
    training_output = model(**training_input, output_hidden_states=True)
    # Mean-pool the last hidden layer over the token dimension
    training_representation = torch.mean(training_output.hidden_states[-1].squeeze(0), dim=0)

    # Extract the sentence representation for the test example
    test_input = tokenizer("下面是一则关于[MASK][MASK]的新闻:股票放量下趺,大资金出逃谁在接盘?", return_tensors="pt")
    test_output = model(**test_input, output_hidden_states=True)
    test_representation = torch.mean(test_output.hidden_states[-1].squeeze(0), dim=0)

# Calculate the similarity score between the two representations
similarity_score = cos(training_representation, test_representation)
```
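The example above yields a single similarity score. This card does not state how such scores are turned into topic predictions; one plausible approach, shown here purely as an assumption, is nearest-neighbour lookup: the test sentence receives the label of the most similar training sentence. The sketch uses plain Python lists as stand-ins for the representation tensors so that it runs without downloading the model; the helper names, toy vectors, and labels are made up for illustration.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def predict_topic(test_vec, train_vecs, train_labels):
    """Return the label and score of the training vector most similar to test_vec."""
    scores = [cosine_similarity(test_vec, v) for v in train_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best], scores[best]


# Toy 3-dimensional "representations" (hypothetical; real ones have the
# model's hidden size and come from the mean-pooled hidden states above)
train_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
train_labels = ["房产", "股票"]
test_vec = [0.1, 0.1, 0.8]

label, score = predict_topic(test_vec, train_vecs, train_labels)
print(label, round(score, 3))
```

With real representations, `train_vecs` would hold one mean-pooled vector per labelled training sentence, and the same lookup could be run as a single batched matrix product in PyTorch.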

## 引用 Citation

如果您在您的工作中使用了我们的模型,可以引用我们的[技术报告](https://arxiv.org/abs/2211.11304):

If you use our model in your work, please cite our [technical report](https://arxiv.org/abs/2211.11304):

```
@article{han2022tcbert,
  title={TCBERT: A Technical Report for Chinese Topic Classification BERT},
  author={Han, Ting and Pan, Kunhao and Chen, Xinyu and Song, Dingjie and Fan, Yuchen and Gao, Xinyu and Gan, Ruyi and Zhang, Jiaxing},
  journal={arXiv preprint arXiv:2211.11304},
  year={2022}
}
```

如果您在您的工作中使用了我们的模型,可以引用我们的[网站](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

```text
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```