![icon](/DMetaSoul/Dmeta-embedding-zh-small/resolve/main/logo.png)
Dmeta-embedding-small
- Dmeta-embedding系列模型是跨领域、跨任务、开箱即用的中文 Embedding 模型,适用于搜索、问答、智能客服、LLM+RAG 等各种业务场景,支持使用 Transformers/Sentence-Transformers/Langchain 等工具加载推理。
- Dmeta-embedding-zh-small是开源模型Dmeta-embedding-zh的蒸馏版本(8层BERT),模型大小不到300M。相较于原始版本,Dmeta-embedding-zh-small模型大小减小三分之一,推理速度提升约30%,总体精度下降约1.4%。
Evaluation
这里主要跟蒸馏前对应的 teacher 模型作了对比:
性能:(基于1万条数据测试,GPU设备是V100)
Teacher | Student | Gap | |
---|---|---|---|
Model | Dmeta-Embedding-zh (411M) | Dmeta-Embedding-zh-small (297M) | 0.67x |
Cost | 127s | 89s | -30% |
Latency | 13ms | 9ms | -31% |
Throughput | 78 sentence/s | 111 sentence/s | 1.4x |
精度:(参考自MTEB榜单)
Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Avg | |
---|---|---|---|---|---|---|---|
Dmeta-Embedding-zh | 70 | 50.96 | 88.92 | 67.17 | 70.41 | 64.89 | 67.51 |
Dmeta-Embedding-zh-small | 69.89 | 50.8 | 87.57 | 66.92 | 67.7 | 62.13 | 66.1 |
Gap | -0.11 | -0.16 | -1.35 | -0.25 | -2.71 | -2.76 | -1.41 |
Usage
目前模型支持通过 Sentence-Transformers, Langchain, Huggingface Transformers 等主流框架进行推理,具体用法参考各个框架的示例。
Sentence-Transformers
Dmeta-embedding 模型支持通过 sentence-transformers 来加载推理:
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
model = SentenceTransformer('DMetaSoul/Dmeta-embedding-zh-small')
embs1 = model.encode(texts1, normalize_embeddings=True)
embs2 = model.encode(texts2, normalize_embeddings=True)
# 计算两两相似度
similarity = embs1 @ embs2.T
print(similarity)
# 获取 texts1[i] 对应的最相似 texts2[j]
for i in range(len(texts1)):
scores = []
for j in range(len(texts2)):
scores.append([texts2[j], similarity[i][j]])
scores = sorted(scores, key=lambda x:x[1], reverse=True)
print(f"查询文本:{texts1[i]}")
for text2, score in scores:
print(f"相似文本:{text2},打分:{score}")
print()
示例输出如下:
查询文本:胡子长得太快怎么办?
相似文本:胡子长得快怎么办?,打分:0.965681254863739
相似文本:怎样使胡子不浓密!,打分:0.7353651523590088
相似文本:香港买手表哪里好,打分:0.24928246438503265
相似文本:在杭州手机到哪里买,打分:0.2038613110780716
查询文本:在香港哪里买手表好
相似文本:香港买手表哪里好,打分:0.9916468262672424
相似文本:在杭州手机到哪里买,打分:0.498248815536499
相似文本:胡子长得快怎么办?,打分:0.2424771636724472
相似文本:怎样使胡子不浓密!,打分:0.21715955436229706
Langchain
Dmeta-embedding 模型支持通过 LLM 工具框架 langchain 来加载推理:
pip install -U langchain
import torch
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings
model_name = "DMetaSoul/Dmeta-embedding-zh-small"
model_kwargs = {'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
model = HuggingFaceEmbeddings(
model_name=model_name,
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
)
texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
embs1 = model.embed_documents(texts1)
embs2 = model.embed_documents(texts2)
embs1, embs2 = np.array(embs1), np.array(embs2)
# 计算两两相似度
similarity = embs1 @ embs2.T
print(similarity)
# 获取 texts1[i] 对应的最相似 texts2[j]
for i in range(len(texts1)):
scores = []
for j in range(len(texts2)):
scores.append([texts2[j], similarity[i][j]])
scores = sorted(scores, key=lambda x:x[1], reverse=True)
print(f"查询文本:{texts1[i]}")
for text2, score in scores:
print(f"相似文本:{text2},打分:{score}")
print()
HuggingFace Transformers
Dmeta-embedding 模型支持通过 HuggingFace Transformers 框架来加载推理:
pip install -U transformers
import torch
from transformers import AutoTokenizer, AutoModel
def mean_pooling(model_output, attention_mask):
token_embeddings = model_output[0] #First element of model_output contains all token embeddings
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def cls_pooling(model_output):
return model_output[0][:, 0]
texts1 = ["胡子长得太快怎么办?", "在香港哪里买手表好"]
texts2 = ["胡子长得快怎么办?", "怎样使胡子不浓密!", "香港买手表哪里好", "在杭州手机到哪里买"]
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/Dmeta-embedding-zh-small')
model = AutoModel.from_pretrained('DMetaSoul/Dmeta-embedding-zh-small')
model.eval()
with torch.no_grad():
inputs1 = tokenizer(texts1, padding=True, truncation=True, return_tensors='pt')
inputs2 = tokenizer(texts2, padding=True, truncation=True, return_tensors='pt')
model_output1 = model(**inputs1)
model_output2 = model(**inputs2)
embs1, embs2 = cls_pooling(model_output1), cls_pooling(model_output2)
embs1 = torch.nn.functional.normalize(embs1, p=2, dim=1).numpy()
embs2 = torch.nn.functional.normalize(embs2, p=2, dim=1).numpy()
# 计算两两相似度
similarity = embs1 @ embs2.T
print(similarity)
# 获取 texts1[i] 对应的最相似 texts2[j]
for i in range(len(texts1)):
scores = []
for j in range(len(texts2)):
scores.append([texts2[j], similarity[i][j]])
scores = sorted(scores, key=lambda x:x[1], reverse=True)
print(f"查询文本:{texts1[i]}")
for text2, score in scores:
print(f"相似文本:{text2},打分:{score}")
print()
Contact
您如果在使用过程中,遇到任何问题,欢迎前往讨论区建言献策。
您也可以联系我们:赵中昊 zhongh@dmetasoul.com, 肖文斌 xiaowenbin@dmetasoul.com, 孙凯 sunkai@dmetasoul.com
同时我们也开通了微信群,可扫码加入我们(人数超200了,先加管理员再拉进群),一起共建 AIGC 技术生态!
License
Dmeta-embedding 系列模型采用 Apache-2.0 License,开源模型可以进行免费商用私有部署。
- Downloads last month
- 1,253
Inference Providers
NEW
This model is not currently available via any of the supported third-party Inference Providers, and
the model is not deployed on the HF Inference API.
Model tree for DMetaSoul/Dmeta-embedding-zh-small
Spaces using DMetaSoul/Dmeta-embedding-zh-small 2
Evaluation results
- cos_sim_pearson on MTEB AFQMCvalidation set self-reported55.384
- cos_sim_spearman on MTEB AFQMCvalidation set self-reported59.543
- euclidean_pearson on MTEB AFQMCvalidation set self-reported58.186
- euclidean_spearman on MTEB AFQMCvalidation set self-reported59.543
- manhattan_pearson on MTEB AFQMCvalidation set self-reported58.142
- manhattan_spearman on MTEB AFQMCvalidation set self-reported59.479
- cos_sim_pearson on MTEB ATECtest set self-reported55.969
- cos_sim_spearman on MTEB ATECtest set self-reported58.633
- euclidean_pearson on MTEB ATECtest set self-reported62.784
- euclidean_spearman on MTEB ATECtest set self-reported58.633