pawasthy committed on
Commit
ef6db93
1 Parent(s): 0771942

Update README.md

Files changed (1): README.md (+852 −3)

README.md CHANGED
---
language:
- en
- ar
- cs
- de
- es
- fr
- it
- ja
- ko
- nl
- pt
- zh
license: apache-2.0
library_name: transformers
tags:
- language
- granite
- embeddings
- multilingual
model-index:
- name: ibm-granite/granite-embedding-107m-multilingual
  results:
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (en)
      config: en
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.41176
    - type: ndcg_at_10
      value: 0.46682
    - type: ndcg_at_100
      value: 0.54326
    - type: ndcg_at_1000
      value: 0.56567
    - type: ndcg_at_20
      value: 0.50157
    - type: ndcg_at_3
      value: 0.41197
    - type: ndcg_at_5
      value: 0.42086
    - type: recall_at_1
      value: 0.19322
    - type: recall_at_10
      value: 0.57721
    - type: recall_at_100
      value: 0.83256
    - type: recall_at_1000
      value: 0.95511
    - type: recall_at_20
      value: 0.6757
    - type: recall_at_3
      value: 0.37171
    - type: recall_at_5
      value: 0.44695
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (ar)
      config: ar
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.55559
    - type: ndcg_at_10
      value: 0.62541
    - type: ndcg_at_100
      value: 0.67101
    - type: ndcg_at_1000
      value: 0.6805
    - type: ndcg_at_20
      value: 0.64739
    - type: ndcg_at_3
      value: 0.56439
    - type: ndcg_at_5
      value: 0.59347
    - type: recall_at_1
      value: 0.37009
    - type: recall_at_10
      value: 0.73317
    - type: recall_at_100
      value: 0.90066
    - type: recall_at_1000
      value: 0.96272
    - type: recall_at_20
      value: 0.80205
    - type: recall_at_3
      value: 0.56903
    - type: recall_at_5
      value: 0.6518
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (bn)
      config: bn
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.56691
    - type: ndcg_at_10
      value: 0.65484
    - type: ndcg_at_100
      value: 0.70142
    - type: ndcg_at_1000
      value: 0.70994
    - type: ndcg_at_20
      value: 0.67838
    - type: ndcg_at_3
      value: 0.5988
    - type: ndcg_at_5
      value: 0.62718
    - type: recall_at_1
      value: 0.3605
    - type: recall_at_10
      value: 0.76854
    - type: recall_at_100
      value: 0.9285
    - type: recall_at_1000
      value: 0.97928
    - type: recall_at_20
      value: 0.83667
    - type: recall_at_3
      value: 0.61596
    - type: recall_at_5
      value: 0.69766
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (de)
      config: de
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.41967
    - type: ndcg_at_10
      value: 0.45141
    - type: ndcg_at_100
      value: 0.53461
    - type: ndcg_at_1000
      value: 0.55463
    - type: ndcg_at_20
      value: 0.49012
    - type: ndcg_at_3
      value: 0.39486
    - type: ndcg_at_5
      value: 0.41496
    - type: recall_at_1
      value: 0.19494
    - type: recall_at_10
      value: 0.53774
    - type: recall_at_100
      value: 0.83314
    - type: recall_at_1000
      value: 0.95045
    - type: recall_at_20
      value: 0.65659
    - type: recall_at_3
      value: 0.3556
    - type: recall_at_5
      value: 0.44448
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (es)
      config: es
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.54475
    - type: ndcg_at_10
      value: 0.46593
    - type: ndcg_at_100
      value: 0.58079
    - type: ndcg_at_1000
      value: 0.60656
    - type: ndcg_at_20
      value: 0.51858
    - type: ndcg_at_3
      value: 0.4578
    - type: ndcg_at_5
      value: 0.44321
    - type: recall_at_1
      value: 0.15966
    - type: recall_at_10
      value: 0.49343
    - type: recall_at_100
      value: 0.82684
    - type: recall_at_1000
      value: 0.95299
    - type: recall_at_20
      value: 0.62367
    - type: recall_at_3
      value: 0.2949
    - type: recall_at_5
      value: 0.37983
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (fa)
      config: fa
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.36709
    - type: ndcg_at_10
      value: 0.46961
    - type: ndcg_at_100
      value: 0.53262
    - type: ndcg_at_1000
      value: 0.55024
    - type: ndcg_at_20
      value: 0.49892
    - type: ndcg_at_3
      value: 0.40235
    - type: ndcg_at_5
      value: 0.42866
    - type: recall_at_1
      value: 0.22735
    - type: recall_at_10
      value: 0.59949
    - type: recall_at_100
      value: 0.83867
    - type: recall_at_1000
      value: 0.95007
    - type: recall_at_20
      value: 0.68947
    - type: recall_at_3
      value: 0.41781
    - type: recall_at_5
      value: 0.49374
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (fi)
      config: fi
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.59245
    - type: ndcg_at_10
      value: 0.65551
    - type: ndcg_at_100
      value: 0.6967
    - type: ndcg_at_1000
      value: 0.70521
    - type: ndcg_at_20
      value: 0.67552
    - type: ndcg_at_3
      value: 0.58876
    - type: ndcg_at_5
      value: 0.61779
    - type: recall_at_1
      value: 0.37669
    - type: recall_at_10
      value: 0.76529
    - type: recall_at_100
      value: 0.9156
    - type: recall_at_1000
      value: 0.96977
    - type: recall_at_20
      value: 0.82685
    - type: recall_at_3
      value: 0.60234
    - type: recall_at_5
      value: 0.67135
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (fr)
      config: fr
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.38776
    - type: ndcg_at_10
      value: 0.47589
    - type: ndcg_at_100
      value: 0.54641
    - type: ndcg_at_1000
      value: 0.5629
    - type: ndcg_at_20
      value: 0.51203
    - type: ndcg_at_3
      value: 0.38924
    - type: ndcg_at_5
      value: 0.42572
    - type: recall_at_1
      value: 0.22082
    - type: recall_at_10
      value: 0.61619
    - type: recall_at_100
      value: 0.87237
    - type: recall_at_1000
      value: 0.97449
    - type: recall_at_20
      value: 0.72689
    - type: recall_at_3
      value: 0.39527
    - type: recall_at_5
      value: 0.48983
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (hi)
      config: hi
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.33143
    - type: ndcg_at_10
      value: 0.42084
    - type: ndcg_at_100
      value: 0.48647
    - type: ndcg_at_1000
      value: 0.50712
    - type: ndcg_at_20
      value: 0.45399
    - type: ndcg_at_3
      value: 0.34988
    - type: ndcg_at_5
      value: 0.37938
    - type: recall_at_1
      value: 0.17852
    - type: recall_at_10
      value: 0.55217
    - type: recall_at_100
      value: 0.79929
    - type: recall_at_1000
      value: 0.93434
    - type: recall_at_20
      value: 0.65231
    - type: recall_at_3
      value: 0.33765
    - type: recall_at_5
      value: 0.43828
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (id)
      config: id
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.43854
    - type: ndcg_at_10
      value: 0.45459
    - type: ndcg_at_100
      value: 0.53643
    - type: ndcg_at_1000
      value: 0.56052
    - type: ndcg_at_20
      value: 0.48795
    - type: ndcg_at_3
      value: 0.41041
    - type: ndcg_at_5
      value: 0.42235
    - type: recall_at_1
      value: 0.19193
    - type: recall_at_10
      value: 0.5289
    - type: recall_at_100
      value: 0.79649
    - type: recall_at_1000
      value: 0.92937
    - type: recall_at_20
      value: 0.61813
    - type: recall_at_3
      value: 0.35431
    - type: recall_at_5
      value: 0.43348
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (ja)
      config: ja
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.53256
    - type: ndcg_at_10
      value: 0.59922
    - type: ndcg_at_100
      value: 0.65407
    - type: ndcg_at_1000
      value: 0.66484
    - type: ndcg_at_20
      value: 0.62596
    - type: ndcg_at_3
      value: 0.53717
    - type: ndcg_at_5
      value: 0.56523
    - type: recall_at_1
      value: 0.34555
    - type: recall_at_10
      value: 0.71476
    - type: recall_at_100
      value: 0.91152
    - type: recall_at_1000
      value: 0.97728
    - type: recall_at_20
      value: 0.79811
    - type: recall_at_3
      value: 0.53482
    - type: recall_at_5
      value: 0.62327
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (ko)
      config: ko
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.5493
    - type: ndcg_at_10
      value: 0.58413
    - type: ndcg_at_100
      value: 0.64374
    - type: ndcg_at_1000
      value: 0.65655
    - type: ndcg_at_20
      value: 0.61732
    - type: ndcg_at_3
      value: 0.53068
    - type: ndcg_at_5
      value: 0.55202
    - type: recall_at_1
      value: 0.32602
    - type: recall_at_10
      value: 0.68647
    - type: recall_at_100
      value: 0.87746
    - type: recall_at_1000
      value: 0.95524
    - type: recall_at_20
      value: 0.78089
    - type: recall_at_3
      value: 0.49173
    - type: recall_at_5
      value: 0.5827
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (ru)
      config: ru
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.43131
    - type: ndcg_at_10
      value: 0.48262
    - type: ndcg_at_100
      value: 0.56158
    - type: ndcg_at_1000
      value: 0.57929
    - type: ndcg_at_20
      value: 0.52023
    - type: ndcg_at_3
      value: 0.42808
    - type: ndcg_at_5
      value: 0.44373
    - type: recall_at_1
      value: 0.22018
    - type: recall_at_10
      value: 0.58034
    - type: recall_at_100
      value: 0.84074
    - type: recall_at_1000
      value: 0.93938
    - type: recall_at_20
      value: 0.68603
    - type: recall_at_3
      value: 0.39307
    - type: recall_at_5
      value: 0.47077
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (sw)
      config: sw
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.50415
    - type: ndcg_at_10
      value: 0.59111
    - type: ndcg_at_100
      value: 0.64312
    - type: ndcg_at_1000
      value: 0.65089
    - type: ndcg_at_20
      value: 0.61651
    - type: ndcg_at_3
      value: 0.5304
    - type: ndcg_at_5
      value: 0.56139
    - type: recall_at_1
      value: 0.33267
    - type: recall_at_10
      value: 0.72082
    - type: recall_at_100
      value: 0.91377
    - type: recall_at_1000
      value: 0.96152
    - type: recall_at_20
      value: 0.79943
    - type: recall_at_3
      value: 0.5548
    - type: recall_at_5
      value: 0.64302
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (te)
      config: te
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.64372
    - type: ndcg_at_10
      value: 0.78175
    - type: ndcg_at_100
      value: 0.79523
    - type: ndcg_at_1000
      value: 0.79774
    - type: ndcg_at_20
      value: 0.78826
    - type: ndcg_at_3
      value: 0.74856
    - type: ndcg_at_5
      value: 0.77128
    - type: recall_at_1
      value: 0.63688
    - type: recall_at_10
      value: 0.90358
    - type: recall_at_100
      value: 0.96558
    - type: recall_at_1000
      value: 0.9847
    - type: recall_at_20
      value: 0.92834
    - type: recall_at_3
      value: 0.81804
    - type: recall_at_5
      value: 0.87198
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (th)
      config: th
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.65484
    - type: ndcg_at_10
      value: 0.71774
    - type: ndcg_at_100
      value: 0.75362
    - type: ndcg_at_1000
      value: 0.75898
    - type: ndcg_at_20
      value: 0.73709
    - type: ndcg_at_3
      value: 0.66199
    - type: ndcg_at_5
      value: 0.68451
    - type: recall_at_1
      value: 0.45911
    - type: recall_at_10
      value: 0.82619
    - type: recall_at_100
      value: 0.95515
    - type: recall_at_1000
      value: 0.98854
    - type: recall_at_20
      value: 0.88447
    - type: recall_at_3
      value: 0.67437
    - type: recall_at_5
      value: 0.73786
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (yo)
      config: yo
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.46218
    - type: ndcg_at_10
      value: 0.64685
    - type: ndcg_at_100
      value: 0.66941
    - type: ndcg_at_1000
      value: 0.67361
    - type: ndcg_at_20
      value: 0.65548
    - type: ndcg_at_3
      value: 0.57609
    - type: ndcg_at_5
      value: 0.62021
    - type: recall_at_1
      value: 0.42787
    - type: recall_at_10
      value: 0.82913
    - type: recall_at_100
      value: 0.93277
    - type: recall_at_1000
      value: 0.96499
    - type: recall_at_20
      value: 0.85994
    - type: recall_at_3
      value: 0.65406
    - type: recall_at_5
      value: 0.7542
  - dataset:
      type: miracl/mmteb-miracl
      name: Miracl (zh)
      config: zh
      split: dev
    task:
      type: Retrieval
    metrics:
    - type: ndcg_at_1
      value: 0.41985
    - type: ndcg_at_10
      value: 0.4837
    - type: ndcg_at_100
      value: 0.55961
    - type: ndcg_at_1000
      value: 0.5762
    - type: ndcg_at_20
      value: 0.51595
    - type: ndcg_at_3
      value: 0.42094
    - type: ndcg_at_5
      value: 0.44273
    - type: recall_at_1
      value: 0.21446
    - type: recall_at_10
      value: 0.59695
    - type: recall_at_100
      value: 0.87388
    - type: recall_at_1000
      value: 0.96833
    - type: recall_at_20
      value: 0.69252
    - type: recall_at_3
      value: 0.40377
    - type: recall_at_5
      value: 0.4903
---
# Granite-Embedding-107m-multilingual

**Model Summary:**
Granite-Embedding-107M-Multilingual is a 107M-parameter dense biencoder embedding model from the Granite Embeddings suite that can be used to generate high-quality text embeddings. The model produces embedding vectors of size 384 and is trained on a combination of open-source relevance-pair datasets with permissive, enterprise-friendly licenses, together with IBM-collected and IBM-generated datasets. It is developed using contrastive finetuning, knowledge distillation, and model merging for improved performance.

- **Developers:** Granite Embedding Team, IBM
- **GitHub Repository:** [ibm-granite/granite-embedding-models](https://github.com/ibm-granite/granite-embedding-models)
- **Website**: [Granite Docs](https://www.ibm.com/granite/docs/)
- **Paper:** Coming Soon
- **Release Date**: December 18th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
**Supported Languages:**
English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite-Embedding-107M-Multilingual for languages beyond these 12.

**Intended use:**
The model is designed to produce fixed-length vector representations for a given text, which can be used for text similarity, retrieval, and search applications.
**Usage with Sentence Transformers:**
The model is compatible with the Sentence Transformers library and is easy to use:

First, install the Sentence Transformers library:
```shell
pip install sentence-transformers
```

The model can then be used to encode pairs of text and compute the similarity between their representations:

```python
from sentence_transformers import SentenceTransformer, util

model_path = "ibm-granite/granite-embedding-107m-multilingual"
# Load the Sentence Transformer model
model = SentenceTransformer(model_path)

input_queries = [
    'Who made the song My achy breaky heart?',
    'summit define'
]

input_passages = [
    "Achy Breaky Heart is a country song written by Don Von Tress. Originally titled Don't Tell My Heart and performed by The Marcy Brothers in 1991.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]

# encode queries and passages
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# calculate cosine similarity
print(util.cos_sim(query_embeddings, passage_embeddings))
```
**Usage with Huggingface Transformers:**
This is a simple example of how to use the Granite-Embedding-107m-Multilingual model with the Transformers library and PyTorch.

First, install the required libraries:
```shell
pip install transformers torch
```

The model can then be used to encode text:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "ibm-granite/granite-embedding-107m-multilingual"

# Load the model and tokenizer
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

input_queries = [
    'Who made the song My achy breaky heart?',
    'summit define'
]

# tokenize inputs
tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')

# encode queries
with torch.no_grad():
    model_output = model(**tokenized_queries)
    # Perform pooling. granite-embedding-107m-multilingual uses CLS pooling
    query_embeddings = model_output[0][:, 0]

# normalize the embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, dim=1)
```
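Because the embeddings are L2-normalized, scoring a query against a passage amounts to cosine similarity (a plain dot product on normalized vectors). As a stdlib-only sketch of the comparison that `util.cos_sim` performs, with toy 4-dimensional vectors standing in for real 384-dimensional embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings", purely illustrative
query = [0.1, 0.3, -0.2, 0.4]
passage = [0.1, 0.25, -0.1, 0.5]
print(round(cosine_similarity(query, passage), 4))
```

In practice you would encode passages with the same CLS-pooling-plus-normalization pipeline shown above and compare each query vector against every passage vector.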
**Evaluation:**
The average performance of Granite-Embedding-107M-Multilingual on Multilingual Miracl (across 18 languages), Mintaka Retrieval (across 8 languages) and MTEB Retrieval for English (across 15 tasks), German (across 4 tasks), Spanish (across 2 tasks), French (across 5 tasks), Japanese (across 2 tasks), Arabic (1 task), Korean (1 task) and Chinese (across 8 tasks) is reported below. Granite-Embedding-107M-Multilingual is twice as fast as other models with similar embedding dimensions.

| Model | Parameters (M) | Embedding Dimension | Miracl (18) | Mintaka Retrieval (8) | MTEB English (15) | MTEB German (4) | MTEB Spanish (2) | MTEB French (5) | MTEB Japanese (2) | MTEB Arabic (1) | MTEB Korean (1) | MTEB Chinese (8) |
|------------------------------------|:--------------:|:-------------------:|:-----------:|:---------------------:|:-----------------:|:---------------:|:----------------:|:---------------:|:-----------------:|:---------------:|:---------------:|:----------------:|
| granite-embedding-107m-multilingual | 107 | 384 | 55.9 | 22.6 | 45.3 | 70.3 | 48.7 | 51.1 | 59.0 | 63.2 | 70.5 | 40.8 |
**Model Architecture:**
Granite-Embedding-107m-Multilingual is based on an encoder-only, XLM-RoBERTa-like transformer architecture, trained internally at IBM Research.

| Model | granite-embedding-30m-english | granite-embedding-125m-english | granite-embedding-107m-multilingual | granite-embedding-278m-multilingual |
| :--------- | :-------: | :--------: | :---------: | :-----: |
| Embedding size | 384 | 768 | **384** | 768 |
| Number of layers | 6 | 12 | **6** | 12 |
| Number of attention heads | 12 | 12 | **12** | 12 |
| Intermediate size | 1536 | 3072 | **1536** | 3072 |
| Activation Function | GeLU | GeLU | **GeLU** | GeLU |
| Vocabulary Size | 50265 | 50265 | **250002** | 250002 |
| Max. Sequence Length | 512 | 512 | **512** | 512 |
| # Parameters | 30M | 125M | **107M** | 278M |
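A quick back-of-the-envelope check (counting only the token-embedding matrix, ignoring positional and transformer weights) suggests that most of the multilingual model's extra parameters over its 30M English sibling come from the larger vocabulary:

```python
# Token-embedding parameters = vocabulary size x embedding size (figures from the table above)
configs = {
    "granite-embedding-30m-english": (50265, 384),
    "granite-embedding-107m-multilingual": (250002, 384),
}
embedding_params = {name: vocab * dim for name, (vocab, dim) in configs.items()}
for name, n in embedding_params.items():
    print(f"{name}: {n / 1e6:.1f}M embedding parameters")
```

Roughly 96M of the 107M parameters sit in the token-embedding matrix, versus about 19M of 30M for the English model; the transformer stacks themselves are the same size.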

**Training Data:**
Overall, the training data consists of four key sources: (1) unsupervised title-body paired data scraped from the web, (2) publicly available paired data with permissive, enterprise-friendly licenses, (3) IBM-internal paired data targeting specific technical domains, and (4) IBM-generated synthetic data. The data is listed below:

| **Dataset** | **Num. Pairs** |
|:--------------------------------------------------------------------------|:--------------:|
| Multilingual MC4 | 52,823,484 |
| Multilingual Webhose | 12,369,322 |
| English Wikipedia | 20,745,403 |
| Multilingual Wikimedia | 2,911,090 |
| Miracl Corpus (Title-Body) | 10,120,398 |
| Stack Exchange Duplicate questions (titles) | 304,525 |
| Stack Exchange Duplicate questions (bodies) | 250,519 |
| Machine Translations of Stack Exchange Duplicate questions (titles) | 187,195 |
| Stack Exchange (Title, Answer) pairs | 4,067,139 |
| Stack Exchange (Title, Body) pairs | 23,978,013 |
| Machine Translations of Stack Exchange (Title+Body, Answer) pairs | 1,827,15 |
| SearchQA | 582,261 |
| S2ORC (Title, Abstract) | 41,769,185 |
| WikiAnswers Duplicate question pairs | 77,427,422 |
| CCNews | 614,664 |
| XSum | 226,711 |
| SimpleWiki | 102,225 |
| Machine Translated Cross Lingual Parallel Corpora | 28,376,115 |
| SPECTER citation triplets | 684,100 |
| Machine Translations of SPECTER citation triplets | 4,104,600 |
| Natural Questions (NQ) | 100,231 |
| SQuAD2.0 | 87,599 |
| HotpotQA | 85,000 |
| Fever | 109,810 |
| PubMed | 20,000,000 |
| Multilingual Miracl Triples | 81,409 |
| Multilingual MrTydi Triples | 48,715 |
| Sadeeem Question Answering | 4,037 |
| DBPedia Title-Body Pairs | 4,635,922 |
| Synthetic: English Query-Wikipedia Passage | 1,879,093 |
| Synthetic: English Fact Verification | 9,888 |
| Synthetic: Multilingual Query-Wikipedia Passage | 300,266 |
| Synthetic: Multilingual News Summaries | 37,489 |
| IBM Internal Triples | 40,290 |
| IBM Internal Title-Body Pairs | 1,524,586 |

Notably, we do not use the popular MS-MARCO retrieval dataset in our training corpus due to its non-commercial license; other open-source models train on it because of its high quality.
**Infrastructure:**
We train the Granite Embedding models on IBM's computing cluster, Cognitive Compute Cluster, which is outfitted with NVIDIA A100 80GB GPUs. This cluster provides a scalable and efficient infrastructure for training our models over multiple GPUs.

**Ethical Considerations and Limitations:**
The data used to train the base language model was filtered to remove text containing hate, abuse, and profanity. Granite-Embedding-107m-Multilingual is trained on the 12 languages listed above, and has a context length of 512 tokens (longer texts will be truncated to this size).
<!-- ## Citation
```
@misc{granite-embedding-models,
  author = {author 1, author2, ...},
  title = {},
  journal = {},
  volume = {},
  year = {2024},
  url = {https://arxiv.org/abs/0000.00000},
}
``` -->