---
inference: false
datasets:
  - unicamp-dl/mmarco
pipeline_tag: sentence-similarity
tags:
  - ColBERT
base_model:
  - aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---

# Arabic-ColBERT-100k

First version of Arabic ColBERT. This model was trained on 100K random triplets from the Arabic portion of the mMARCO dataset, which contains around 39M (machine-translated) Arabic triplets. mMARCO is the multilingual version of Microsoft's MS MARCO dataset.

Training was done with the RAGatouille library on Lightning AI, using aubmindlab/bert-base-arabertv02 as the base model.

If you downloaded the model before July 15th, 1 pm (Jerusalem time), please switch to the current version. See the RAGatouille examples to learn more: just replace the pretrained model name, use Arabic text, and split long documents for best results.
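
As a minimal usage sketch (it assumes this model is published on the Hub as `akhooli/Arabic-ColBERT-100k`; adjust the name to the actual repository id), you can index a few Arabic documents and query them with RAGatouille:

```python
from ragatouille import RAGPretrainedModel

# Assumed Hub id for this model; replace with the actual repository name.
RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

# A couple of toy Arabic documents (use your own corpus in practice).
docs = [
    "القدس مدينة تاريخية تقع في فلسطين.",
    "تعلم الآلة فرع من فروع الذكاء الاصطناعي.",
]

# split_documents=True chunks long documents, which works best with ColBERT.
RAG.index(
    collection=docs,
    index_name="arabic_demo",
    split_documents=True,
    max_document_length=256,
)

results = RAG.search(query="ما هو تعلم الآلة؟", k=2)
print(results)
```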

You can train a better model if you have access to adequate compute (you can also fine-tune this model on more data; seed 42 was used to pick the 100K sample). A fine-tuning sketch follows the training script below.

## Training script

```python
from datasets import load_dataset
from ragatouille import RAGTrainer

sample_size = 100_000

# Stream the Arabic split of mMARCO and sample 100K triplets (seed 42).
ds = load_dataset("unicamp-dl/mmarco", "arabic", split="train", trust_remote_code=True, streaming=True)
sds = ds.shuffle(seed=42, buffer_size=10_000)
dsf = sds.take(sample_size)

triplets = [(item["query"], item["positive"], item["negative"]) for item in dsf]

trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",
    pretrained_model_name="aubmindlab/bert-base-arabertv02",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)

trainer.train(
    batch_size=32,
    nbits=2,                # How many bits the trained model uses when compressing indexes
    maxsteps=100000,        # Maximum steps (hard stop)
    use_ib_negatives=True,  # Use in-batch negatives to calculate loss
    dim=128,                # Dimensions per embedding; 128 is the default and works well
    learning_rate=5e-6,     # Small values (3e-6 to 3e-5) work best for BERT-like base models; 5e-6 is often the sweet spot
    doc_maxlen=256,         # Maximum document length; smaller chunks (128-256) work very well with ColBERT
    use_relu=False,         # Disable ReLU -- doesn't improve performance
    warmup_steps="auto",    # Defaults to 10% of total steps
)
```
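
To continue training from this checkpoint on more data, a hedged sketch (it assumes this model's Hub id is `akhooli/Arabic-ColBERT-100k` and that you have prepared a larger `triplets` list as above) is to pass it as the pretrained model:

```python
from ragatouille import RAGTrainer

# Hypothetical continuation: fine-tune the released checkpoint on a larger triplet sample.
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-more-data",                # illustrative name for the new model
    pretrained_model_name="akhooli/Arabic-ColBERT-100k",  # assumed Hub id of this model
    language_code="ar",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)
trainer.train(batch_size=32, maxsteps=200_000, warmup_steps="auto")
```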

Model first announced: https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy