---
inference: false
datasets:
- unicamp-dl/mmarco
pipeline_tag: sentence-similarity
tags:
- ColBERT
base_model:
- aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---


# Arabic-ColBERT-100k 

First version of Arabic ColBERT. 
This model was trained on 100K random triplets from the [mMARCO dataset](https://huggingface.co/datasets/unicamp-dl/mmarco), which has around 39M Arabic (translated) triplets. 
mMARCO is the multilingual version of [Microsoft's MS MARCO dataset](https://microsoft.github.io/msmarco/). 

Training used the [RAGatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) on 
[Lightning AI](https://lightning.ai/), with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as the base model. 

If you downloaded the model before July 15th, 1 pm (Jerusalem time), please try the current version. 
See the [RAGatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb) to learn more; 
just replace the pretrained model name, use Arabic text, and split your documents for best results (see the sketch below). 
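
As a rough illustration, here is a minimal indexing-and-search sketch in the style of the RAGatouille quickstart. The repo ID, documents, and query below are placeholders (not taken from this card); substitute this model's actual Hugging Face repo ID and your own Arabic collection.

```python
from ragatouille import RAGPretrainedModel

# Placeholder repo ID -- replace with this model's actual Hugging Face repo ID.
RAG = RAGPretrainedModel.from_pretrained("<user>/Arabic-ColBERT-100k")

# Index a small Arabic collection; splitting keeps chunks short, which matches
# the doc_maxlen=256 used during training.
RAG.index(
    index_name="arabic_demo",
    collection=[
        "القدس مدينة تاريخية تقع في قلب فلسطين.",
        "تعد اللغة العربية من أكثر اللغات انتشارا في العالم.",
    ],
    split_documents=True,
    max_document_length=256,
)

# Retrieve the top-k passages for an Arabic query.
results = RAG.search(query="أين تقع مدينة القدس؟", k=2)
for r in results:
    print(r["rank"], r["score"], r["content"])
```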

You can train a better model if you have access to adequate compute (for example, fine-tune this model on more data; seed 42 was used to pick the 100K sample).

# Training script 
```python
from datasets import load_dataset
from ragatouille import RAGTrainer

sample_size = 100000

# Stream the Arabic split of mMARCO and take a shuffled 100K sample (seed 42).
ds = load_dataset('unicamp-dl/mmarco', 'arabic', split="train", trust_remote_code=True, streaming=True)
sds = ds.shuffle(seed=42, buffer_size=10_000)
dsf = sds.take(sample_size)

# Collect (query, positive passage, negative passage) triplets.
triplets = []
for item in dsf:
    triplets.append((item["query"], item["positive"], item["negative"]))

trainer = RAGTrainer(model_name="Arabic-ColBERT-100k",
                     pretrained_model_name="aubmindlab/bert-base-arabertv02",
                     language_code="ar")
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)

trainer.train(batch_size=32,
    nbits=2, # How many bits will the trained model use when compressing indexes
    maxsteps=100000, # Maximum steps hard stop
    use_ib_negatives=True, # Use in-batch negatives to calculate loss
    dim=128, # How many dimensions per embedding. 128 is the default and works well.
    learning_rate=5e-6, # Learning rate, small values ([3e-6,3e-5] work best if the base model is BERT-like, 5e-6 is often the sweet spot)
    doc_maxlen=256, # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
    use_relu=False, # Disable ReLU -- doesn't improve performance
    warmup_steps="auto", # Defaults to 10%
    )

```

Model first announced: https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy