Update README.md
README.md CHANGED
@@ -1059,7 +1059,43 @@ license: mit
library_name: sentence-transformers
---
<h2 align="left">ZPoint Large Embedding for Chinese</h2>

**[2024-06-04]** Release zpoint_large_embedding_zh and upload the model weights to Hugging Face

**[2024-06-05]** Add training details

### Training Details
**Base Model**
1) We chose [Stella](https://huggingface.co/infgrad/stella-mrl-large-zh-v3.5-1792d) as our base model.

**Training Data**
1) **Hard negative sampling** (see the sketch below)
- For retrieval tasks, we sampled 10 hard negative passages/answers from the top50-top200 related passages/answers for each query.
- For classification/clustering tasks, we sampled 5 hard negatives from other classes/clusters for each sample.
- For classification/clustering tasks, we also used the category names of each class and cluster as positive and negative samples.
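
The sampling logic above can be illustrated with a minimal sketch (not the actual training pipeline). It assumes you already have retriever-ranked candidate ids per query and labeled examples grouped by class; the function and variable names are hypothetical.

```python
import random

def mine_retrieval_negatives(ranked_ids, positive_ids, num_negatives=10,
                             window=(50, 200)):
    """Sample hard negatives for one query from the top50-top200 window of
    retriever-ranked passage ids, skipping known positives."""
    start, end = window
    candidates = [pid for pid in ranked_ids[start:end] if pid not in positive_ids]
    return random.sample(candidates, min(num_negatives, len(candidates)))

def mine_class_negatives(label, examples_by_label, num_negatives=5):
    """For classification/clustering data: draw negatives from other classes."""
    pool = [ex for other, exs in examples_by_label.items() if other != label
            for ex in exs]
    return random.sample(pool, min(num_negatives, len(pool)))

# Toy usage with made-up ids and labels.
ranked = [f"passage_{i}" for i in range(300)]
print(mine_retrieval_negatives(ranked, positive_ids={"passage_3"}))
print(mine_class_negatives("sports", {"sports": ["s1", "s2"],
                                      "finance": ["f1", "f2", "f3"]}))
```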
2) **Data synthesis by LLM (Qwen1.5-72B)** (see the sketch below)
- For retrieval tasks, we used the LLM to rewrite each query, generating five different rewrites.
- For retrieval tasks, we also used the LLM to generate five new queries for some documents.
- For non-retrieval tasks, we used the LLM to rewrite the queries, generating five rewrites for each query.
- Finally, the total amount of synthesized data is about 30 million samples.
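
A rough sketch of the query-rewriting step, shown here against an OpenAI-compatible endpoint (for example a locally served Qwen1.5-72B-Chat); the endpoint URL, model name, and prompt are illustrative assumptions rather than the actual synthesis setup.

```python
from openai import OpenAI

# Assumption: Qwen1.5-72B-Chat is served behind an OpenAI-compatible API
# (e.g. via vLLM); URL, model name, and prompt are illustrative only.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

REWRITE_PROMPT = (
    "Rewrite the following Chinese search query so that it keeps the same "
    "meaning but uses different wording. Output only the rewritten query.\n"
    "Query: {query}"
)

def rewrite_query(query: str, n: int = 5) -> list[str]:
    """Ask the LLM for n independent rewrites of one query."""
    response = client.chat.completions.create(
        model="Qwen1.5-72B-Chat",
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(query=query)}],
        n=n,
        temperature=0.8,
    )
    return [choice.message.content.strip() for choice in response.choices]

# Each rewrite inherits the original query's positive passages, turning one
# labeled pair into several synthetic training pairs.
```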
3) **Collect more data for retrieval-type tasks**
- We constructed a dataset of approximately 100 million training samples through collection, machine translation, and LLM synthesis. This dataset includes data from various fields such as healthcare, law, electricity, automotive, and 3C (Consumer Electronics).
- [miracl/miracl](https://huggingface.co/datasets/miracl/miracl)
- [FreedomIntelligence/Huatuo26M-Lite](https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite)
- [PaddlePaddle/dureader_robust](https://huggingface.co/datasets/PaddlePaddle/dureader_robust) **C-MTEB test filtered**
- [THUIR/T2Ranking](https://huggingface.co/datasets/THUIR/T2Ranking) **C-MTEB test filtered**
- [Shitao/bge-reranker-data](https://huggingface.co/datasets/Shitao/bge-reranker-data)
- [Shitao/MLDR](https://huggingface.co/datasets/Shitao/MLDR)
- ...

**Training Loss**
1) Multi-task loss similar to [Piccolo](https://huggingface.co/sensenova/piccolo-large-zh-v2)
2) Matryoshka Representation Learning (see the sketch below)
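
A compact sketch of how Matryoshka Representation Learning can be combined with an in-batch contrastive loss; the dimension list, temperature, and plain InfoNCE objective are simplifying assumptions and do not reproduce the full Piccolo-style multi-task weighting.

```python
import torch
import torch.nn.functional as F

def info_nce(q, p, temperature=0.05):
    """In-batch contrastive loss: the i-th query should match the i-th passage."""
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def matryoshka_loss(q_emb, p_emb, dims=(256, 512, 768, 1024, 1280, 1536, 1792)):
    """Average the contrastive loss over truncated embedding prefixes so each
    prefix length stays usable on its own (1792 is the full dimension of the
    Stella base model; the smaller sizes are illustrative)."""
    losses = [info_nce(q_emb[:, :d], p_emb[:, :d]) for d in dims]
    return torch.stack(losses).mean()

# Toy usage with random 1792-d query/passage embeddings for a batch of 8 pairs.
q, p = torch.randn(8, 1792), torch.randn(8, 1792)
print(matryoshka_loss(q, p))
```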
### Example
```python
from sentence_transformers import SentenceTransformer