The TCBert with 330M parameters is pre-trained for sentence representation for Chinese.

  To improve the model performance on sentence representation for the topic classification task, we collected numerous topic classification datasets for contrastive pre-training based on general prompts.
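
The contrastive objective itself is left to the technical report. Purely as an illustration, a supervised contrastive (InfoNCE-style) loss over sentence representations, where sentences sharing a topic are treated as positives, might look like the sketch below; the function name, batch construction, and temperature are our assumptions, not the report's recipe:

```python
# Hedged sketch of an InfoNCE-style supervised contrastive loss over
# mean-pooled sentence representations; same-topic sentences are treated
# as positives. Batch construction and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(reps, labels, temperature=0.05):
    # reps: (batch, hidden) sentence representations
    # labels: (batch,) topic ids
    reps = F.normalize(reps, dim=-1)
    sim = reps @ reps.T / temperature               # pairwise cosine / T
    sim.fill_diagonal_(float("-inf"))               # exclude self-pairs
    log_prob = sim.log_softmax(dim=-1)
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask.fill_diagonal_(False)
    pos_counts = pos_mask.sum(dim=-1)
    valid = pos_counts > 0                          # anchors with >=1 positive
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    loss = -pos_log_prob[valid].sum(dim=-1) / pos_counts[valid]
    return loss.mean()
```

In practice the representation would be the mean-pooled last hidden state of the prompted sentence, as in the usage examples below.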
  ### 下游效果 Performance

我们为每个数据集设计了两个prompt模板。

We customize two prompt templates for each dataset.

第一个prompt模板:

For ***prompt template 1***:

| Dataset | Prompt template 1 |
|---------|:------------------------:|
| TNEWS | 下面是一则关于__的新闻: |
| CSLDCP | 这一句描述__的内容如下: |
| IFLYTEK | 这一句描述__的内容如下: |
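
During training, the blank (__) in a template is filled with the label word; at inference it is replaced by [MASK] tokens, one per label character, exactly as in the usage examples below. A small illustrative sketch of this templating (the helper names and the example label 房产 are ours, not from the report):

```python
# Illustrative sketch of filling prompt template 1 for TNEWS; the helper
# names and the example label "房产" are our own illustration.
TEMPLATE = "下面是一则关于{}的新闻:{}"

def build_training_text(label, sentence):
    # Training: write the label word into the blank
    return TEMPLATE.format(label, sentence)

def build_masked_text(sentence, label_len=2):
    # Inference: one [MASK] per label character
    return TEMPLATE.format("[MASK]" * label_len, sentence)

print(build_training_text("房产", "怎样的房子才算户型方正?"))
# 下面是一则关于房产的新闻:怎样的房子才算户型方正?
print(build_masked_text("怎样的房子才算户型方正?"))
# 下面是一则关于[MASK][MASK]的新闻:怎样的房子才算户型方正?
```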

第一个prompt模板的微调实验结果:

The **fine-tuning** results for prompt template 1:

| Model | TNEWS | CSLDCP | IFLYTEK |
|-----------------|:------:|:------:|:-------:|
| Macbert-base | 55.02 | 57.37 | 51.34 |
| Macbert-large | 55.77 | 58.99 | 50.31 |
| Erlangshen-1.3B | 57.36 | 62.35 | 53.23 |
| TCBert-base<sub>110M-Classification-Chinese</sub> | 55.57 | 58.60 | 49.63 |
| TCBert-large<sub>330M-Classification-Chinese</sub> | 56.17 | 60.06 | 51.34 |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub> | 57.41 | 65.10 | 53.75 |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub> | 54.68 | 59.78 | 49.40 |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub> | 55.32 | 62.07 | 51.11 |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub> | 57.46 | 65.04 | 53.06 |

第一个prompt模板的句子相似度结果:

The **sentence similarity** results for prompt template 1:

| | TNEWS | | CSLDCP | | IFLYTEK | |
|-----------------|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Model | reference | whitening | reference | whitening | reference | whitening |
| Macbert-base | 43.53 | 47.16 | 33.50 | 36.53 | 28.99 | 33.85 |
| Macbert-large | 46.17 | 49.35 | 37.65 | 39.38 | 32.36 | 35.33 |
| Erlangshen-1.3B | 45.72 | 49.60 | 40.56 | 44.26 | 29.33 | 36.48 |
| TCBert-base<sub>110M-Classification-Chinese</sub> | 48.61 | 51.99 | 43.31 | 45.15 | 33.45 | 37.28 |
| TCBert-large<sub>330M-Classification-Chinese</sub> | 50.50 | 52.79 | 52.89 | 53.89 | 34.93 | 38.31 |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub> | 50.80 | 51.59 | 51.93 | 54.12 | 33.96 | 38.08 |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub> | 45.82 | 47.06 | 42.91 | 43.87 | 33.28 | 34.76 |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub> | 50.10 | 50.90 | 53.78 | 53.33 | 37.62 | 36.94 |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub> | 50.70 | 53.48 | 52.66 | 54.40 | 36.88 | 38.48 |
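
The *reference* and *whitening* columns appear to report cosine similarity on the raw mean-pooled representations and on post-processed (whitened) representations, respectively. Assuming the standard BERT-whitening recipe (Su et al., 2021), which the README does not spell out, a minimal sketch:

```python
# Hedged sketch of BERT-whitening style post-processing (Su et al., 2021);
# whether the table's "whitening" column used exactly this recipe is an
# assumption. `embeddings` is (num_sentences, hidden_size).
import torch

def whiten(embeddings, k=None):
    mu = embeddings.mean(dim=0, keepdim=True)
    cov = torch.cov((embeddings - mu).T)         # feature covariance
    u, s, _ = torch.linalg.svd(cov)
    W = u @ torch.diag(1.0 / torch.sqrt(s))      # whitening matrix
    if k is not None:
        W = W[:, :k]                             # optional dim. reduction
    return (embeddings - mu) @ W
```

Cosine similarity is then computed on the whitened vectors.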

第二个prompt模板:

For ***prompt template 2***:

| Dataset | Prompt template 2 |
|---------|:------------------------:|
| TNEWS | 接下来的新闻,是跟__相关的内容: |
| CSLDCP | 接下来的学科,是跟__相关: |
| IFLYTEK | 接下来的生活内容,是跟__相关: |

第二个prompt模板的微调结果:

The **fine-tuning** results for prompt template 2:

| Model | TNEWS | CSLDCP | IFLYTEK |
|-----------------|:------:|:------:|:-------:|
| Macbert-base | 54.78 | 58.38 | 50.83 |
| Macbert-large | 56.77 | 60.22 | 51.63 |
| Erlangshen-1.3B | 57.81 | 62.80 | 52.77 |
| TCBert-base<sub>110M-Classification-Chinese</sub> | 54.58 | 59.16 | 49.80 |
| TCBert-large<sub>330M-Classification-Chinese</sub> | 56.22 | 61.23 | 50.77 |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub> | 57.41 | 64.82 | 53.34 |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub> | 54.68 | 59.78 | 49.40 |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub> | 55.32 | 62.07 | 51.11 |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub> | 56.87 | 65.83 | 52.94 |

第二个prompt模板的句子相似度结果:

The **sentence similarity** results for prompt template 2:

| | TNEWS | | CSLDCP | | IFLYTEK | |
|-----------------|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Model | reference | whitening | reference | whitening | reference | whitening |
| Macbert-base | 42.29 | 45.22 | 34.23 | 37.48 | 29.62 | 34.13 |
| Macbert-large | 46.22 | 49.60 | 40.11 | 44.26 | 32.36 | 35.16 |
| Erlangshen-1.3B | 46.17 | 49.10 | 40.45 | 45.88 | 30.36 | 36.88 |
| TCBert-base<sub>110M-Classification-Chinese</sub> | 48.31 | 51.34 | 43.42 | 45.27 | 33.10 | 36.19 |
| TCBert-large<sub>330M-Classification-Chinese</sub> | 51.19 | 51.69 | 52.55 | 53.28 | 34.31 | 37.45 |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub> | 52.14 | 52.39 | 51.71 | 53.89 | 33.62 | 38.14 |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub> | 46.72 | 48.86 | 43.19 | 43.53 | 34.08 | 35.79 |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub> | 50.65 | 51.94 | 53.84 | 53.67 | 37.74 | 36.65 |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub> | 50.75 | 54.78 | 51.43 | 54.34 | 36.48 | 38.36 |

更多关于TCBERTs的细节,请参考我们的技术报告。基于新的数据,我们会更新TCBERTs,请留意我们仓库的更新。

For more details about TCBERTs, please refer to our paper. We may update TCBERTs as new data becomes available, so please keep an eye on our repo!

## 使用 Usage

### 使用示例 Usage Examples

```python
# Prompt-based MLM fine-tuning
from transformers import BertForMaskedLM, BertTokenizer
import torch

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")
model = BertForMaskedLM.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")

# Prepare the data: the prompt carries [MASK] tokens where the label
# word ("房产") goes; the loss is computed only at the masked positions
inputs = tokenizer("下面是一则关于[MASK][MASK]的新闻:怎样的房子才算户型方正?", return_tensors="pt")
labels = tokenizer("下面是一则关于房产的新闻:怎样的房子才算户型方正?", return_tensors="pt")["input_ids"]
labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)

# Output the loss
outputs = model(**inputs, labels=labels)
loss = outputs.loss
```
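
The loss above covers a single example; actual fine-tuning would minimize it over the labeled dataset. A minimal sketch of one update step, reusing `model`, `inputs`, and `labels` from the block above (the optimizer choice and learning rate are our assumptions, not the report's settings):

```python
# Hedged sketch of a single fine-tuning step; AdamW and the learning
# rate are assumptions, not the report's hyperparameters.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
optimizer.zero_grad()
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # backpropagate the masked-LM loss
optimizer.step()
```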

```python
# Prompt-based sentence similarity: extract sentence representations
# and compare them with cosine similarity
from transformers import BertForMaskedLM, BertTokenizer
import torch

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")
model = BertForMaskedLM.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")

# Cosine similarity function
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-8)

with torch.no_grad():
    # Extract the sentence representation for the training sentence
    # by mean-pooling the last hidden states over all tokens
    training_input = tokenizer("怎样的房子才算户型方正?", return_tensors="pt")
    training_output = model(**training_input, output_hidden_states=True)
    training_representation = torch.mean(training_output.hidden_states[-1].squeeze(), dim=0)

    # Extract the sentence representation for the test sentence
    test_input = tokenizer("下面是一则关于[MASK][MASK]的新闻:股票放量下跌,大资金出逃谁在接盘?", return_tensors="pt")
    test_output = model(**test_input, output_hidden_states=True)
    test_representation = torch.mean(test_output.hidden_states[-1].squeeze(), dim=0)

# Calculate the similarity score between the two representations
similarity_score = cos(training_representation, test_representation)
```
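
At inference time, the same masked prompt can also classify a sentence directly: read the MLM logits at the [MASK] positions and score candidate label words. A sketch reusing the `tokenizer` and `model` loaded above (the candidate label set is an illustrative assumption):

```python
# Hedged sketch: rank candidate two-character labels at the [MASK]
# positions of prompt template 1; the label set here is illustrative.
import torch

candidate_labels = ["房产", "股票", "教育"]

inputs = tokenizer("下面是一则关于[MASK][MASK]的新闻:股票放量下跌,大资金出逃谁在接盘?", return_tensors="pt")
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    log_probs = model(**inputs).logits.squeeze(0).log_softmax(dim=-1)

# Sum the log-probabilities of each label's characters at the mask slots
scores = []
for label in candidate_labels:
    token_ids = tokenizer.convert_tokens_to_ids(list(label))
    scores.append(sum(log_probs[p, t].item() for p, t in zip(mask_positions.tolist(), token_ids)))

print(candidate_labels[scores.index(max(scores))])  # highest-scoring label
```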

## 引用 Citation

如果您在您的工作中使用了我们的模型,可以引用我们的[技术报告](https://arxiv.org/abs/2211.11304):

If you use our model in your work, please cite the following paper:

```
@article{han2022tcbert,
  title={TCBERT: A Technical Report for Chinese Topic Classification BERT},
  author={Han, Ting and Pan, Kunhao and Chen, Xinyu and Song, Dingjie and Fan, Yuchen and Gao, Xinyu and Gan, Ruyi and Zhang, Jiaxing},
  journal={arXiv preprint arXiv:2211.11304},
  year={2022}
}
```

如果您在您的工作中使用了我们的模型,可以引用我们的[网站](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):