akiFQCint committed on
Commit
e1356e8
1 Parent(s): de99cf6

update readme

Files changed (2)
  1. README.md +19 -16
  2. README_JA.md +19 -16
README.md CHANGED
@@ -20,6 +20,9 @@ datasets:
20
  - sentence-transformers/NQ-retrieval
21
  - sbintuitions/JSQuAD
22
  - SkelterLabsInc/JaQuAD
 
 
 
23
  ---
24
 
25
  # Sarashina-Embedding-v1-1B
@@ -102,17 +105,17 @@ To achieve generic text embedding performance across a wide range of domains, we
102
 
103
  |dataset|counts|
104
  |:-:|:-:|
105
- |AutoWikiQA|50,521,135|
106
- |web-crawled data|47,370,649|
107
- |MQA|12,941,472|
108
- |llm-japanese-dataset|9,074,340|
109
- |wikipedia|5,555,212|
110
- |Quiz dataset|988,478|
111
- |Natural Questions|132,796|
112
- |JSQuAD|62,859|
113
- |snow|62,758|
114
- |JaQuAD|31,746|
115
- |mkqa|3,318|
116
  |||
117
  |**total**|**126,744,763**|
118
 
@@ -124,10 +127,10 @@ To enable the model to learn a more accurate query-document similarity, we perfo
124
 
125
  |dataset|counts|
126
  |:-:|:-:|
127
- |JSNLI|141,388 |
128
- |NU-MNLI|67,987|
129
- |Mr. TyDi (only Japanese subset)| 3,697 |
130
- |Natural Question (sampled)| 20,000|
131
  |||
132
  |**total**|**233,072**|
133
 
@@ -135,7 +138,7 @@ To enable the model to learn a more accurate query-document similarity, we perfo
135
 
136
  Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
137
  |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
138
- | OpenAI/text-embedding-3-large | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
139
  | [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
140
  | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
141
  | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
 
20
  - sentence-transformers/NQ-retrieval
21
  - sbintuitions/JSQuAD
22
  - SkelterLabsInc/JaQuAD
23
+ - wikimedia/wikipedia
24
+ - cl-nagoya/nu-mnli
25
+ - castorini/mr-tydi
26
  ---
27
 
28
  # Sarashina-Embedding-v1-1B
 
105
 
106
  |dataset|counts|
107
  |:-:|:-:|
108
+ |[AutoWikiQA](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa)|50,521,135|
109
+ |web-crawled data (ours)|47,370,649|
110
+ |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
111
+ |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
112
+ |[wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
113
+ |Quiz dataset (ours)|988,478|
114
+ |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
115
+ |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
116
+ |[snow](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|62,758|
117
+ |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
118
+ |[mkqa](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|3,318|
119
  |||
120
  |**total**|**126,744,763**|
121
 
 
127
 
128
  |dataset|counts|
129
  |:-:|:-:|
130
+ |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
131
+ |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
132
+ |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi/tree/main) (only Japanese subset)| 3,697 |
133
+ |[Natural Question](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
134
  |||
135
  |**total**|**233,072**|
136
 
 
138
 
139
  Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
140
  |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
141
+ | [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)^1 | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
142
  | [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
143
  | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
144
  | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
README_JA.md CHANGED
@@ -18,6 +18,9 @@ datasets:
18
  - sentence-transformers/NQ-retrieval
19
  - sbintuitions/JSQuAD
20
  - SkelterLabsInc/JaQuAD
 
 
 
21
  ---
22
 
23
  # Sarashina-Embedding-v1-1B
@@ -99,17 +102,17 @@ print(similarities.shape)
99
 
100
  |dataset|counts|
101
  |:-:|:-:|
102
- |AutoWikiQA|50,521,135|
103
- |web-crawled data|47,370,649|
104
- |MQA|12,941,472|
105
- |llm-japanese-dataset|9,074,340|
106
- |wikipedia|5,555,212|
107
- |Quiz dataset|988,478|
108
- |Natural Questions|132,796|
109
- |JSQuAD|62,859|
110
- |snow|62,758|
111
- |JaQuAD|31,746|
112
- |mkqa|3,318|
113
  |||
114
  |**total**|**126,744,763**|
115
 
@@ -121,10 +124,10 @@ print(similarities.shape)
121
 
122
  |dataset|counts|
123
  |:-:|:-:|
124
- |JSNLI|141,388 |
125
- |NU-MNLI|67,987|
126
- |Mr. TyDi (only Japanese subset)| 3,697 |
127
- |Natural Question (sampled)| 20,000|
128
  |||
129
  |**total**|**233,072**|
130
 
@@ -132,7 +135,7 @@ print(similarities.shape)
132
 
133
  Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
134
  |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
135
- | OpenAI/text-embedding-3-large | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
136
  | [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
137
  | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
138
  | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
 
18
  - sentence-transformers/NQ-retrieval
19
  - sbintuitions/JSQuAD
20
  - SkelterLabsInc/JaQuAD
21
+ - wikimedia/wikipedia
22
+ - cl-nagoya/nu-mnli
23
+ - castorini/mr-tydi
24
  ---
25
 
26
  # Sarashina-Embedding-v1-1B
 
102
 
103
  |dataset|counts|
104
  |:-:|:-:|
105
+ |[AutoWikiQA](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa)|50,521,135|
106
+ |web-crawled data (ours)|47,370,649|
107
+ |[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
108
+ |[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
109
+ |[wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
110
+ |Quiz dataset (ours)|988,478|
111
+ |[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
112
+ |[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
113
+ |[snow](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|62,758|
114
+ |[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
115
+ |[mkqa](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|3,318|
116
  |||
117
  |**total**|**126,744,763**|
118
 
 
124
 
125
  |dataset|counts|
126
  |:-:|:-:|
127
+ |[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
128
+ |[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
129
+ |[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi/tree/main) (only Japanese subset)| 3,697 |
130
+ |[Natural Question](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
131
  |||
132
  |**total**|**233,072**|
133
 
 
135
 
136
  Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
137
  |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
138
+ | [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)^1 | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
139
  | [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
140
  | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
141
  | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |