---
inference: false
datasets:
- akhooli/arabic-triplets-1m-curated-sims-len
pipeline_tag: sentence-similarity
tags:
- ColBERT
base_model:
- aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---
# Arabic-ColBERT-100k
First version of Arabic ColBERT (better models are available now; see the 250k and 711k versions).
This model was trained on 100K filtered triplets from [akhooli/arabic-triplets-1m-curated-sims-len](https://huggingface.co/datasets/akhooli/arabic-triplets-1m-curated-sims-len),
which contains 1 million (translated) Arabic triplets. The dataset was curated from several sources and enriched with similarity scores.
More details on the dataset are available in the data card.
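
For a quick look at the triplet format, you can stream a single example (the field names below match those used in the training script later in this card; the similarity scores are documented in the data card):

```python
from datasets import load_dataset

# Stream instead of downloading the full 1M-triplet dataset.
ds = load_dataset(
    "akhooli/arabic-triplets-1m-curated-sims-len",
    split="train",
    streaming=True,
)
example = next(iter(ds))
print(example["query"], example["positive"], example["negative"], sep="\n")
```
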
Training used the [Ragatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) on
a 2-GPU Kaggle account, with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as the base model.
If you downloaded the model before July 27 (8 pm, Jerusalem time), please switch to the current version.
Use the [Ragatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb) to learn more;
just replace the pretrained model name, and make sure you use Arabic text and split documents for best results, as in the sketch below.
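
A minimal indexing-and-search sketch along those lines (the index name and sample documents are illustrative, not from the original notebook):

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

docs = [
    "القدس مدينة تاريخية بناها الكنعانيون.",  # illustrative Arabic documents
    "تشتهر مدينة نابلس بصناعة الكنافة.",
]
RAG.index(
    collection=docs,
    index_name="arabic_demo",     # hypothetical index name
    max_document_length=256,      # matches doc_maxlen used in training
    split_documents=True,         # split long documents for best results
)
results = RAG.search(query="بماذا تشتهر نابلس؟", k=2)
print(results)
```
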
You can train a better model if you have access to adequate compute (you can also fine-tune this model on more data; seed 42 was used to pick the 100K sample).
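
A hedged fine-tuning sketch, assuming you have your own list of (query, positive, negative) triplets; compared with the training script below, only the pretrained model name changes:

```python
from ragatouille import RAGTrainer

my_triplets = [
    ("سؤال", "فقرة ذات صلة", "فقرة غير ذات صلة"),  # replace with real triplets
]
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-finetuned",               # hypothetical name
    pretrained_model_name="akhooli/Arabic-ColBERT-100k",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=my_triplets, mine_hard_negatives=False)
trainer.train(batch_size=32)
```
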
## Training script
```python
from datasets import load_dataset
from ragatouille import RAGTrainer
sample_size = 100000
ds = load_dataset('akhooli/arabic-triplets-1m-curated-sims-len', split="train", trust_remote_code=True, streaming=True)
# Filtering by similarity score was done outside this script;
# the 100K sample is drawn at random from the shuffled stream.
sds = ds.shuffle(seed=42, buffer_size=10_000)
dsf = sds.take(sample_size)  # take the first 100K items from the shuffled stream
triplets = []
for item in dsf:
    triplets.append((item["query"], item["positive"], item["negative"]))
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",
    pretrained_model_name="aubmindlab/bert-base-arabertv02",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)
trainer.train(batch_size=32,
nbits=4, # How many bits will the trained model use when compressing indexes
maxsteps=3125, # Maximum steps hard stop
        use_ib_negatives=True, # Use in-batch negatives to calculate loss
dim=128, # How many dimensions per embedding. 128 is the default and works well.
        learning_rate=1e-5, # Learning rate; small values in [3e-6, 3e-5] work best if the base model is BERT-like, and 5e-6 is often the sweet spot
doc_maxlen=256, # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
use_relu=False, # Disable ReLU -- doesn't improve performance
warmup_steps="auto", # Defaults to 10%
)
```
Install `datasets` and `ragatouille` first. The last checkpoint is saved under `.ragatouille/..../colbert`.
Model first announced (July 14, 2024): https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy
Dataset published and model updated (July 27, 2024): https://www.linkedin.com/posts/akhooli_arabic-1-million-curated-triplets-dataset-activity-7222951839774699521-PZcw
## Citation
```bibtex
@online{AbedKhooli,
  author    = {Abed Khooli},
  title     = {Arabic ColBERT 100K},
  publisher = {Hugging Face},
  month     = jul,
  year      = {2024},
  url       = {https://huggingface.co/akhooli/Arabic-ColBERT-100k},
}
``` |