---
inference: false
datasets:
  - akhooli/arabic-triplets-1m-curated-sims-len
pipeline_tag: sentence-similarity
tags:
  - ColBERT
base_model:
  - aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---

# Arabic-ColBERT-100k

First version of Arabic ColBERT. This model was trained on 100K filtered triplets from the akhooli/arabic-triplets-1m-curated-sims-len dataset, which contains 1 million Arabic (translated) triplets curated from different sources and enriched with similarity scores. More details on the dataset are available in its data card.

Training used the RAGatouille library on a 2-GPU Kaggle account, with aubmindlab/bert-base-arabertv02 as the base model.

If you downloaded the model before July 27th, 8 pm (Jerusalem time), please switch to the current version. To learn more, follow the RAGatouille examples, replacing the pretrained model name with this one; make sure you use Arabic text and split long documents into smaller chunks for best results.
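As a minimal sketch of the above, the following loads this model with RAGatouille's `RAGPretrainedModel`, indexes two short Arabic documents, and runs a query. The document strings, index name, and query are illustrative placeholders, not part of this model card:

```python
from ragatouille import RAGPretrainedModel

# Load this model as a ColBERT retriever.
RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

# Illustrative documents; replace with your own Arabic collection.
documents = [
    "القدس مدينة تاريخية تقع في قلب فلسطين.",
    "يعد البحر الميت أخفض نقطة على سطح الأرض.",
]

# Split documents into smaller chunks, matching the doc_maxlen=256 used in training.
RAG.index(
    collection=documents,
    index_name="arabic_demo",
    split_documents=True,
    max_document_length=256,
)

results = RAG.search(query="أين تقع القدس؟", k=2)
for r in results:
    print(r["rank"], r["score"], r["content"])
```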

You can train a better model if you have access to adequate compute, for example by fine-tuning this model on more data (seed 42 was used to pick the 100K sample); a sketch of one way to do this follows.
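The sketch below assumes your RAGatouille version accepts an existing ColBERT checkpoint as `pretrained_model_name` (verify this for your installed version); `more_triplets` is a placeholder for your additional (query, positive, negative) data:

```python
from ragatouille import RAGTrainer

# Placeholder data: replace with your own (query, positive, negative) tuples.
more_triplets = [("سؤال تجريبي", "فقرة ذات صلة", "فقرة غير ذات صلة")]

# Assumption: training can resume from this model's checkpoint.
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-finetuned",
    pretrained_model_name="akhooli/Arabic-ColBERT-100k",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=more_triplets, mine_hard_negatives=False)
trainer.train(batch_size=32, maxsteps=1000)
```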

## Training script

```python
from datasets import load_dataset
from ragatouille import RAGTrainer

sample_size = 100_000
ds = load_dataset(
    "akhooli/arabic-triplets-1m-curated-sims-len",
    split="train",
    trust_remote_code=True,
    streaming=True,
)

# Some processing is not shown in this script: the data was filtered based on
# similarity scores before the 100K sample was selected at random.
sds = ds.shuffle(seed=42, buffer_size=10_000)
dsf = sds.take(sample_size)

triplets = []
for item in dsf:
    triplets.append((item["query"], item["positive"], item["negative"]))

trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",
    pretrained_model_name="aubmindlab/bert-base-arabertv02",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)

trainer.train(
    batch_size=32,
    nbits=4,  # How many bits the trained model will use when compressing indexes
    maxsteps=3125,  # Hard stop after this many steps
    use_ib_negatives=True,  # Use in-batch negatives to calculate loss
    dim=128,  # Dimensions per embedding; 128 is the default and works well
    learning_rate=1e-5,  # Small values (3e-6 to 3e-5) work best for BERT-like base models; 5e-6 is often the sweet spot
    doc_maxlen=256,  # Maximum document length; smaller chunks (128-256) work very well with ColBERT
    use_relu=False,  # Disable ReLU; it doesn't improve performance
    warmup_steps="auto",  # Defaults to 10% of total steps
)
```
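The similarity-score filtering mentioned in the comment above is not part of the script. One way to do it with `datasets`' streaming `filter` is sketched below; the column name `sim` and the 0.8 threshold are assumptions for illustration, so check the data card for the actual field name and a sensible cutoff:

```python
# Assumption: the similarity score lives in a column named "sim" (see the data card);
# the 0.8 threshold is illustrative, not the value used for this model.
filtered = ds.filter(lambda row: row["sim"] >= 0.8)
```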

Install the datasets and ragatouille packages first (pip install datasets ragatouille). The last checkpoint is saved under .ragatouille/..../colbert.

Model first announced (July 14, 2024): https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy
Dataset published and model updated (July 27, 2024): https://www.linkedin.com/posts/akhooli_arabic-1-million-curated-triplets-dataset-activity-7222951839774699521-PZcw

## Citation

```bibtex
@online{AbedKhooli,
    author    = {Abed Khooli},
    title     = {Arabic ColBERT 100K},
    publisher = {Hugging Face},
    month     = jul,
    year      = {2024},
    url       = {https://huggingface.co/akhooli/Arabic-ColBERT-100k},
}
```