|
--- |
|
language: |
|
- zh |
|
|
|
license: apache-2.0 |
|
|
|
tags: |
|
- classification |
|
|
|
inference: false |
|
|
|
--- |
|
|
|
# IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese |
|
|
|
- Main Page: [Fengshenbang](https://fengshenbang-lm.com/)
|
- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM) |
|
## 简介 Brief Introduction |
|
|
|
330M参数的句子表征Topic Classification BERT (TCBert)。 |
|
|
|
The 330M-parameter TCBert is pre-trained to produce sentence representations for Chinese topic classification tasks.
|
|
|
## 模型分类 Model Taxonomy |
|
|
|
| 需求 Demand | 任务 Task | 系列 Series | 模型 Model | 参数 Parameter | 额外 Extra |
| :----: | :----: | :----: | :----: | :----: | :----: |
| 通用 General | 句子表征 Sentence Representation | 二郎神 Erlangshen | TCBert (sentence representation) | 330M | Chinese |
|
|
|
## 模型信息 Model Information |
|
|
|
|
|
为了提高模型在话题分类上句子表征效果,我们收集了大量话题分类数据进行基于prompts的对比学习预训练。 |
|
|
|
To improve the model performance on sentence representation for the topic classification task, we collected numerous topic classification datasets for contrastive pre-training based on general prompts. |
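
The exact pre-training code lives in the Fengshenbang-LM repository; as a rough illustration only, here is a minimal InfoNCE-style sketch of a prompt-based contrastive objective. The pooling, batch construction, and temperature below are our assumptions, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style contrastive loss over a batch of (batch, dim) embeddings.

    anchor_emb[i] and positive_emb[i] are embeddings of two prompted views
    of the same sentence (or two sentences sharing a topic label); all other
    rows in the batch serve as negatives. The temperature is a placeholder,
    not TCBert's actual setting.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature  # pairwise cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```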
|
### 下游效果 Performance |
|
|
|
我们为每个数据集设计了两个prompt模板。 |
|
|
|
We customize two prompt templates for each dataset.
|
|
|
第一个prompt模板: |
|
|
|
For ***prompt template 1***: |
|
|
|
| Dataset | Prompt template 1 |
|---------|:------------------------:|
| TNEWS | 下面是一则关于__的新闻: |
| CSLDCP | 这一句描述__的内容如下: |
| IFLYTEK | 这一句描述__的内容如下: |
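
The `__` slot in each template holds the label words: during training it is filled with the gold label, and at inference it is replaced by one [MASK] token per label character (see the usage examples below). A small illustrative snippet, with a hypothetical label:

```python
# Illustrative only: build the prompted training and masked inputs
# from template 1 for TNEWS, with a hypothetical gold label "房产".
template = "下面是一则关于{}的新闻:{}"
label, sentence = "房产", "怎样的房子才算户型方正?"

train_text = template.format(label, sentence)                   # gold label filled in
masked_text = template.format("[MASK]" * len(label), sentence)  # one [MASK] per character
```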
|
|
|
|
|
第一个prompt模板的微调实验结果: |
|
|
|
The **fine-tuning** results for prompt template 1: |
|
|
|
| Model | TNEWS | CSLDCP | IFLYTEK |
|-----------------|:------:|:------:|:-------:|
| Macbert-base | 55.02 | 57.37 | 51.34 |
| Macbert-large | 55.77 | 58.99 | 50.31 |
| Erlangshen-1.3B | 57.36 | 62.35 | 53.23 |
| TCBert-base<sub>110M-Classification-Chinese</sub> | 55.57 | 58.60 | 49.63 |
| TCBert-large<sub>330M-Classification-Chinese</sub> | 56.17 | 60.06 | 51.34 |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub> | 57.41 | 65.10 | 53.75 |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub> | 54.68 | 59.78 | 49.40 |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub> | 55.32 | 62.07 | 51.11 |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub> | 57.46 | 65.04 | 53.06 |
|
|
|
|
|
第一个prompt模板的句子相似度结果: |
|
|
|
The **sentence similarity** results for prompt template 1: |
|
|
|
| | TNEWS | | CSLDCP | | IFLYTEK | |
|-----------------|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Model | reference | whitening | reference | whitening | reference | whitening |
| Macbert-base | 43.53 | 47.16 | 33.50 | 36.53 | 28.99 | 33.85 |
| Macbert-large | 46.17 | 49.35 | 37.65 | 39.38 | 32.36 | 35.33 |
| Erlangshen-1.3B | 45.72 | 49.60 | 40.56 | 44.26 | 29.33 | 36.48 |
| TCBert-base<sub>110M-Classification-Chinese</sub> | 48.61 | 51.99 | 43.31 | 45.15 | 33.45 | 37.28 |
| TCBert-large<sub>330M-Classification-Chinese</sub> | 50.50 | 52.79 | 52.89 | 53.89 | 34.93 | 38.31 |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub> | 50.80 | 51.59 | 51.93 | 54.12 | 33.96 | 38.08 |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub> | 45.82 | 47.06 | 42.91 | 43.87 | 33.28 | 34.76 |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub> | 50.10 | 50.90 | 53.78 | 53.33 | 37.62 | 36.94 |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub> | 50.70 | 53.48 | 52.66 | 54.40 | 36.88 | 38.48 |
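
In these tables, "reference" appears to score cosine similarity on the raw pooled embeddings, while "whitening" first applies the whitening post-processing popularized by Su et al. (2021) to make the embedding space more isotropic. A minimal sketch of that transform, under our assumptions about the evaluation setup:

```python
import torch

def fit_whitening(embeddings: torch.Tensor):
    """Fit a whitening transform on (n, dim) sentence embeddings.

    Returns (mu, W) such that (x - mu) @ W has zero mean and roughly
    identity covariance. An illustrative assumption, not the exact
    evaluation code.
    """
    mu = embeddings.mean(dim=0, keepdim=True)
    cov = torch.cov((embeddings - mu).T)            # (dim, dim) covariance
    u, s, _ = torch.linalg.svd(cov)
    W = u @ torch.diag(1.0 / torch.sqrt(s + 1e-8))  # eps guards tiny eigenvalues
    return mu, W

def apply_whitening(x: torch.Tensor, mu: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    return (x - mu) @ W
```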
|
|
|
第二个prompt模板: |
|
|
|
For ***prompt template 2***: |
|
| Dataset | Prompt template 2 |
|---------|:------------------------:|
| TNEWS | 接下来的新闻,是跟__相关的内容: |
| CSLDCP | 接下来的学科,是跟__相关: |
| IFLYTEK | 接下来的生活内容,是跟__相关: |
|
|
|
第二个prompt模板的微调结果: |
|
|
|
The **fine-tuning** results for prompt template 2: |
|
|
|
| Model | TNEWS | CSLDCP | IFLYTEK |
|-----------------|:------:|:------:|:-------:|
| Macbert-base | 54.78 | 58.38 | 50.83 |
| Macbert-large | 56.77 | 60.22 | 51.63 |
| Erlangshen-1.3B | 57.81 | 62.80 | 52.77 |
| TCBert-base<sub>110M-Classification-Chinese</sub> | 54.58 | 59.16 | 49.80 |
| TCBert-large<sub>330M-Classification-Chinese</sub> | 56.22 | 61.23 | 50.77 |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub> | 57.41 | 64.82 | 53.34 |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub> | 54.68 | 59.78 | 49.40 |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub> | 55.32 | 62.07 | 51.11 |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub> | 56.87 | 65.83 | 52.94 |
|
|
|
|
|
第二个prompt模板的句子相似度结果: |
|
|
|
The **sentence similarity** results for prompt template 2: |
|
|
|
| | TNEWS | | CSLDCP | | IFLYTEK | |
|-----------------|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Model | reference | whitening | reference | whitening | reference | whitening |
| Macbert-base | 42.29 | 45.22 | 34.23 | 37.48 | 29.62 | 34.13 |
| Macbert-large | 46.22 | 49.60 | 40.11 | 44.26 | 32.36 | 35.16 |
| Erlangshen-1.3B | 46.17 | 49.10 | 40.45 | 45.88 | 30.36 | 36.88 |
| TCBert-base<sub>110M-Classification-Chinese</sub> | 48.31 | 51.34 | 43.42 | 45.27 | 33.10 | 36.19 |
| TCBert-large<sub>330M-Classification-Chinese</sub> | 51.19 | 51.69 | 52.55 | 53.28 | 34.31 | 37.45 |
| TCBert-1.3B<sub>1.3B-Classification-Chinese</sub> | 52.14 | 52.39 | 51.71 | 53.89 | 33.62 | 38.14 |
| TCBert-base<sub>110M-Sentence-Embedding-Chinese</sub> | 46.72 | 48.86 | 43.19 | 43.53 | 34.08 | 35.79 |
| TCBert-large<sub>330M-Sentence-Embedding-Chinese</sub> | 50.65 | 51.94 | 53.84 | 53.67 | 37.74 | 36.65 |
| TCBert-1.3B<sub>1.3B-Sentence-Embedding-Chinese</sub> | 50.75 | 54.78 | 51.43 | 54.34 | 36.48 | 38.36 |
|
|
|
|
|
更多关于TCBERTs的细节,请参考我们的技术报告。基于新的数据,我们会更新TCBERTs,请留意我们仓库的更新。 |
|
|
|
For more details about TCBERTs, please refer to our paper. We may update TCBERTs as new data becomes available, so please keep an eye on our repository!
|
|
|
## 使用 Usage |
|
### 使用示例 Usage Examples |
|
|
|
```python
# Prompt-based MLM fine-tuning
from transformers import BertForMaskedLM, BertTokenizer
import torch

# Load the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")
model = BertForMaskedLM.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")

# Prepare the data: the two [MASK] tokens stand in for the label words ("房产")
inputs = tokenizer("下面是一则关于[MASK][MASK]的新闻:怎样的房子才算户型方正?", return_tensors="pt")
labels = tokenizer("下面是一则关于房产的新闻:怎样的房子才算户型方正?", return_tensors="pt")["input_ids"]
# Supervise only the [MASK] positions; -100 is ignored by the MLM loss
labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)

# Output the loss
outputs = model(**inputs, labels=labels)
loss = outputs.loss
```
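
Beyond computing a training loss, a common way to use the prompt at inference time is to rank candidate label words by their MLM log-probability at the [MASK] positions. A sketch reusing `tokenizer` and `model` from the block above; the candidate labels here are hypothetical, not the official TNEWS label set.

```python
import torch

# Hypothetical two-character candidate labels
candidate_labels = ["房产", "体育", "财经"]

text = "下面是一则关于[MASK][MASK]的新闻:怎样的房子才算户型方正?"
inputs = tokenizer(text, return_tensors="pt")
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    log_probs = model(**inputs).logits.log_softmax(dim=-1)  # (1, seq_len, vocab)

# Sum each label's per-character log-probabilities at the [MASK] slots
scores = {}
for label in candidate_labels:
    char_ids = tokenizer.convert_tokens_to_ids(list(label))
    scores[label] = sum(
        log_probs[0, pos, tok].item()
        for pos, tok in zip(mask_positions.tolist(), char_ids)
    )

print(max(scores, key=scores.get))  # highest-scoring label
```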
|
|
|
```python
# Prompt-based Sentence Similarity
# Extract sentence representations and compare them with cosine similarity.
from transformers import BertForMaskedLM, BertTokenizer
import torch

# Load the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")
model = BertForMaskedLM.from_pretrained("IDEA-CCNL/Erlangshen-TCBert-330M-Sentence-Embedding-Chinese")

# Cosine similarity function
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-8)

with torch.no_grad():
    # Sentence representation for the training sentence:
    # mean pooling over the last hidden layer
    training_input = tokenizer("怎样的房子才算户型方正?", return_tensors="pt")
    training_output = model(**training_input, output_hidden_states=True)
    training_representation = torch.mean(training_output.hidden_states[-1].squeeze(), dim=0)

    # Sentence representation for the test sentence
    test_input = tokenizer("下面是一则关于[MASK][MASK]的新闻:股票放量下跌,大资金出逃谁在接盘?", return_tensors="pt")
    test_output = model(**test_input, output_hidden_states=True)
    test_representation = torch.mean(test_output.hidden_states[-1].squeeze(), dim=0)

    # Calculate the similarity score
    similarity_score = cos(training_representation, test_representation)
```
|
|
|
## 引用 Citation |
|
|
|
如果您在您的工作中使用了我们的模型,可以引用我们的[技术报告](https://arxiv.org/abs/2211.11304): |
|
|
|
If you use our model in your work, please cite the following paper:
|
|
|
```text
|
@article{han2022tcbert, |
|
title={TCBERT: A Technical Report for Chinese Topic Classification BERT}, |
|
author={Han, Ting and Pan, Kunhao and Chen, Xinyu and Song, Dingjie and Fan, Yuchen and Gao, Xinyu and Gan, Ruyi and Zhang, Jiaxing}, |
|
journal={arXiv preprint arXiv:2211.11304}, |
|
year={2022} |
|
} |
|
``` |
|
|
|
如果您在您的工作中使用了我们的模型,可以引用我们的[网站](https://github.com/IDEA-CCNL/Fengshenbang-LM/): |
|
|
|
You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/): |
|
|
|
```text |
|
@misc{Fengshenbang-LM, |
|
title={Fengshenbang-LM}, |
|
author={IDEA-CCNL}, |
|
year={2021}, |
|
howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}}, |
|
} |
|
``` |