---
inference: false
datasets:
- akhooli/arabic-triplets-1m-curated-sims-len
pipeline_tag: sentence-similarity
tags:
- ColBERT
base_model:
- aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---
# Arabic-ColBERT-100k
First version of Arabic ColBERT (better models are available now; see the 250k and 711k versions).
This model was trained on 100K filtered triplets from [akhooli/arabic-triplets-1m-curated-sims-len](https://huggingface.co/datasets/akhooli/arabic-triplets-1m-curated-sims-len),
which contains 1 million (translated) Arabic triplets. The dataset was curated from several sources and enriched with similarity scores.
More details on the dataset are available in the data card.
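
For a quick look at the triplet format, you can stream a single example (the field names below match those used in the training script later in this card; the similarity scores are documented in the data card):

```python
from datasets import load_dataset

# Stream instead of downloading the full 1M-triplet dataset.
ds = load_dataset(
    "akhooli/arabic-triplets-1m-curated-sims-len",
    split="train",
    streaming=True,
)
example = next(iter(ds))
print(example["query"], example["positive"], example["negative"], sep="\n")
```
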
Training used the [Ragatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) on
a 2-GPU Kaggle account, with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as the base model.
If you downloaded the model before July 27 (8 pm, Jerusalem time), please switch to the current version.
Use the [Ragatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb) to learn more;
just replace the pretrained model name, and make sure you use Arabic text and split documents for best results, as in the sketch below.
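
A minimal indexing-and-search sketch along those lines (the index name and sample documents are illustrative, not from the original notebook):

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

docs = [
    "القدس مدينة تاريخية بناها الكنعانيون.",  # illustrative Arabic documents
    "تشتهر مدينة نابلس بصناعة الكنافة.",
]
RAG.index(
    collection=docs,
    index_name="arabic_demo",     # hypothetical index name
    max_document_length=256,      # matches doc_maxlen used in training
    split_documents=True,         # split long documents for best results
)
results = RAG.search(query="بماذا تشتهر نابلس؟", k=2)
print(results)
```
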
You can train a better model if you have access to adequate compute (you can also fine-tune this model on more data; seed 42 was used to pick the 100K sample).
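
A hedged fine-tuning sketch, assuming you have your own list of (query, positive, negative) triplets; compared with the training script below, only the pretrained model name changes:

```python
from ragatouille import RAGTrainer

my_triplets = [
    ("سؤال", "فقرة ذات صلة", "فقرة غير ذات صلة"),  # replace with real triplets
]
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-finetuned",               # hypothetical name
    pretrained_model_name="akhooli/Arabic-ColBERT-100k",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=my_triplets, mine_hard_negatives=False)
trainer.train(batch_size=32)
```
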
## Training script
```python
from datasets import load_dataset
from ragatouille import RAGTrainer
sample_size = 100000
ds = load_dataset('akhooli/arabic-triplets-1m-curated-sims-len', split="train", trust_remote_code=True, streaming=True)
# Filtering by similarity score was done outside this script;
# the 100K sample is drawn at random from the shuffled stream.
sds = ds.shuffle(seed=42, buffer_size=10_000)
dsf = sds.take(sample_size)  # take the first 100K items from the shuffled stream
triplets = []
for item in dsf:
    triplets.append((item["query"], item["positive"], item["negative"]))
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",
    pretrained_model_name="aubmindlab/bert-base-arabertv02",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)
trainer.train(batch_size=32,
nbits=4, # How many bits will the trained model use when compressing indexes
maxsteps=3125, # Maximum steps hard stop
        use_ib_negatives=True, # Use in-batch negatives to calculate loss
dim=128, # How many dimensions per embedding. 128 is the default and works well.
        learning_rate=1e-5, # Learning rate; small values in [3e-6, 3e-5] work best if the base model is BERT-like, and 5e-6 is often the sweet spot
doc_maxlen=256, # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
use_relu=False, # Disable ReLU -- doesn't improve performance
warmup_steps="auto", # Defaults to 10%
)
```
Install `datasets` and `ragatouille` first. The last checkpoint is saved under `.ragatouille/..../colbert`.
Model first announced (July 14, 2024): https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy
Dataset published and model updated (July 27, 2024): https://www.linkedin.com/posts/akhooli_arabic-1-million-curated-triplets-dataset-activity-7222951839774699521-PZcw
## Citation
```bibtex
@online{AbedKhooli,
  author    = {Abed Khooli},
  title     = {Arabic ColBERT 100K},
  publisher = {Hugging Face},
  month     = jul,
  year      = {2024},
  url       = {https://huggingface.co/akhooli/Arabic-ColBERT-100k},
}
``` |