---
inference: false
datasets:
- akhooli/arabic-triplets-1m-curated-sims-len
pipeline_tag: sentence-similarity
tags:
- ColBERT
library_name: RAGatouille
---
# Arabic-ColBERT-100k

First version of Arabic ColBERT.

This model was trained on 100K filtered triplets from [akhooli/arabic-triplets-1m-curated-sims-len](https://huggingface.co/datasets/akhooli/arabic-triplets-1m-curated-sims-len),
a dataset of 1 million (translated) Arabic triplets curated from different sources and enriched with similarity scores.
More details on the dataset are available in the dataset card.

Training used the [RAGatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) on
a 2-GPU Kaggle account, with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as the base model.

If you downloaded the model before July 27th, 8 pm (Jerusalem time), please try the current version.
Use the [RAGatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb) to learn more:
just replace the pretrained model name, and make sure you use Arabic text and split documents for best results.
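Splitting long documents into passage-sized chunks before indexing is what the note above refers to. A minimal sketch of such a splitter is below; the 180-word window is an illustrative choice, not a value from this model card, and RAGatouille can also handle splitting for you:

```python
# A minimal sketch of chunking a long document into passage-sized pieces
# before indexing. max_words=180 is an illustrative choice, not a value
# taken from this model card.
def split_document(text: str, max_words: int = 180) -> list:
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = split_document("كلمة " * 400)  # a 400-word dummy Arabic text
print(len(chunks))  # 3 chunks of at most 180 words each
```

Each chunk is then indexed as its own passage, which keeps inputs within ColBERT's effective document length.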

You can train a better model if you have access to adequate compute (e.g., finetune this model on more data; seed 42 was used to pick the 100K sample).

# Training script

```python
from datasets import load_dataset
from ragatouille import RAGTrainer

sample_size = 100000
# stream the dataset so the full 1M triplets are never loaded into memory
ds = load_dataset('akhooli/arabic-triplets-1m-curated-sims-len', 'arabic', split="train", trust_remote_code=True, streaming=True)
# some data processing not in this script
sds = ds.shuffle(seed=42, buffer_size=10_000)
dsf = sds.take(sample_size)
triplets = []
# ...
```
|
|
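The excerpt above stops where the `triplets` list is populated. As a rough sketch of that step (the column names below are assumptions for illustration, not taken from the dataset card), each streamed row is typically turned into a (query, positive, negative) tuple:

```python
# Hypothetical rows standing in for streamed dataset records; the column
# names ("query", "positive", "negative") are an assumption, not taken
# from the dataset card.
rows = [
    {"query": "ما هي عاصمة فرنسا؟",
     "positive": "باريس هي عاصمة فرنسا.",
     "negative": "برلين هي عاصمة ألمانيا."},
]

# Build (query, positive, negative) tuples for the trainer.
triplets = [(r["query"], r["positive"], r["negative"]) for r in rows]
print(len(triplets))  # 1
```

A list of such triplets is the shape RAGatouille's `RAGTrainer.prepare_training_data` expects before `trainer.train(...)` is called.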

Install `datasets` and `ragatouille` first. The last checkpoint is saved in `.ragatouille/..../colbert`.

Model first announced (July 14, 2024): https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy

Dataset published and model updated (July 27, 2024): https://www.linkedin.com/posts/akhooli_arabic-1-million-curated-triplets-dataset-activity-7222951839774699521-PZcw

## Citation