---
inference: false
datasets:
- unicamp-dl/mmarco
pipeline_tag: sentence-similarity
tags:
- ColBERT
base_model:
- aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---
# Arabic-ColBERT-100k
First version of Arabic ColBERT. This model was trained on 100K random triplets from the mMARCO dataset, which contains around 39M (machine-translated) Arabic triplets. mMARCO is the multilingual version of Microsoft's MS MARCO dataset.

Training used the RAGatouille library, running on Lightning AI.
If you downloaded the model before July 15th, 1 pm (Jerusalem time), please switch to the current version. See the RAGatouille examples to learn more: just replace the pretrained model name, and make sure you use Arabic text and split long documents into passages for best results.
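As a minimal sketch of the workflow above (the Hub model id, index name, and sample texts are illustrative assumptions, not taken from this card):

```python
from ragatouille import RAGPretrainedModel

# Load the ColBERT checkpoint from the Hugging Face Hub.
# The model id below is assumed for illustration.
RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

# Index a small Arabic collection; split long documents into
# passages before indexing for best results.
RAG.index(
    collection=[
        "القاهرة هي عاصمة جمهورية مصر العربية وأكبر مدنها.",
        "تقع مدينة القدس في قلب فلسطين التاريخية.",
    ],
    index_name="arabic_demo",
)

# Retrieve the top-k passages for an Arabic query.
results = RAG.search("ما هي عاصمة مصر؟", k=2)
for r in results:
    print(r["rank"], r["score"], r["content"])
```

The same `RAGPretrainedModel` object also exposes `rerank` if you only want to re-score passages retrieved by another system, without building an index.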
You can train a better model if you have access to adequate compute, for example by fine-tuning this model on more data (seed 42 was used to pick the 100K sample).
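A further fine-tuning run on more mMARCO triplets could be sketched with RAGatouille's `RAGTrainer`; the output model name, batch size, and the `triplets` variable are illustrative assumptions:

```python
from ragatouille import RAGTrainer

# Continue training from this checkpoint (model id assumed for illustration).
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-larger",  # name for the new model (hypothetical)
    pretrained_model_name="akhooli/Arabic-ColBERT-100k",
    language_code="ar",
)

# triplets: a list of (query, positive_passage, negative_passage) tuples,
# e.g. a larger random sample drawn from the Arabic split of mMARCO.
trainer.prepare_training_data(raw_data=triplets, data_out_path="./data/")
trainer.train(batch_size=32)
```

Sampling more than 100K triplets (and training for longer) is the most direct way to improve on this checkpoint, since the base corpus has ~39M triplets available.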
Model first announced: https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy