---
inference: false
datasets:
- unicamp-dl/mmarco
pipeline_tag: sentence-similarity
tags:
- ColBERT
base_model:
- aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---
# Arabic-ColBERT-100k
This is the first version of Arabic ColBERT.
The model was trained on 100K random triplets from the [mMARCO dataset](https://huggingface.co/datasets/unicamp-dl/mmarco), which contains around 39M Arabic (translated) triplets.
mMARCO is the multilingual version of [Microsoft's MS MARCO dataset](https://microsoft.github.io/msmarco/).
Training used the [RAGatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) and was run on [Lightning AI](https://lightning.ai/), along the lines of the sketch below.
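Here is a minimal training sketch based on the linked RAGatouille notebook. The sample triplets, batch size, and output path are illustrative assumptions, not the exact training script:

```python
from ragatouille import RAGTrainer

# Illustrative triplets; actual training used 100K random Arabic
# triplets from mMARCO (sampled with seed 42).
triplets = [
    ("query text", "relevant passage", "irrelevant passage"),
    # ...
]

# Start from the AraBERT base checkpoint.
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",
    pretrained_model_name="aubmindlab/bert-base-arabertv02",
    language_code="ar",
)

# Convert raw triplets into the format ColBERT training expects.
trainer.prepare_training_data(raw_data=triplets, data_out_path="./data/")

trainer.train(batch_size=32)
```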
If you downloaded the model before July 15th, 1 pm (Jerusalem time), please switch to the current version.
To learn more, follow the [RAGatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb):
just replace the pretrained model name, use Arabic text, and split documents for best results, as in the sketch below.
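A minimal indexing-and-search sketch following that example; the repo id `akhooli/Arabic-ColBERT-100k`, the sample documents, and the query are assumptions for illustration:

```python
from ragatouille import RAGPretrainedModel

# Repo id assumed from the model name; adjust to the actual HF path.
RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

docs = [
    "نص الوثيقة العربية الأولى ...",
    "نص الوثيقة العربية الثانية ...",
]

# split_documents=True chunks long documents, which is recommended
# above for best results.
RAG.index(
    collection=docs,
    index_name="arabic_index",
    split_documents=True,
)

results = RAG.search(query="استعلام باللغة العربية", k=3)
print(results)
```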
You can train a better model if you have access to adequate compute: for example, fine-tune this model on more data (seed 42 was used to pick the 100K sample).
Model first announced: https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy