akhooli committed
Commit cb1fd98
1 Parent(s): 82a02df

Update README.md

Files changed (1)
  1. README.md +12 -7
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 inference: false
 datasets:
- - unicamp-dl/mmarco
+ - akhooli/arabic-triplets-1m-curated-sims-len
 pipeline_tag: sentence-similarity
 tags:
 - ColBERT
@@ -15,24 +15,28 @@ library_name: RAGatouille
 # Arabic-ColBERT-100k
 
 First version of Arabic ColBERT.
- This model was trained on 100K random triplets of the [mMARCO dataset](https://huggingface.co/datasets/unicamp-dl/mmarco) which has around 39M Arabic (translated) triplets.
- mMARCO is the multilingual version of [Microsoft's MARCO dataset](https://microsoft.github.io/msmarco/).
+ This model was trained on 100K filtered triplets of the [akhooli/arabic-triplets-1m-curated-sims-len](https://huggingface.co/datasets/akhooli/arabic-triplets-1m-curated-sims-len) dataset,
+ which has 1 million Arabic (translated) triplets. The dataset was curated from different sources and enriched with similarity scores.
+ More details on the dataset are available in the data card.
 
 Training used the [Ragatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) using
- [Lightning AI](https://lightning.ai/) with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as base model.
+ a 2-GPU Kaggle account with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as the base model.
 
- If you downloaded the model before July 15th 1 pm (Jerusalem time), please try the current version.
+ If you downloaded the model before July 27th, 8 pm (Jerusalem time), please try the current version.
 Use the [Ragatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb) to learn more;
 just replace the pretrained model name and make sure you use Arabic text and split documents for best results.
 
- You can train a better model if you have access to adequate compute (can fine tune this model on more data, seed 42 was used to pick the 100K sample).
+ You can train a better model if you have access to adequate compute (you can fine-tune this model on more data; seed 42 was used to pick the 100K sample).
 
 # Training script
+
 ```python
 from datasets import load_dataset
 from ragatouille import RAGTrainer
 sample_size = 100000
- ds = load_dataset('unicamp-dl/mmarco', 'arabic', split="train", trust_remote_code=True, streaming=True)
+ ds = load_dataset('akhooli/arabic-triplets-1m-curated-sims-len', 'arabic', split="train", trust_remote_code=True, streaming=True)
+
+ # some data processing not in this script
 sds = ds.shuffle(seed=42, buffer_size=10_000)
 dsf = sds.take(sample_size)
 triplets = []
@@ -56,6 +60,7 @@ trainer.train(batch_size=32,
 Install `datasets` and `ragatouille` first. Last checkpoint is saved in `.ragatouille/..../colbert`
 
 Model first announced (July 14, 2024): https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy
+ Dataset published and model updated (July 27, 2024): https://www.linkedin.com/posts/akhooli_arabic-1-million-curated-triplets-dataset-activity-7222951839774699521-PZcw
 
 ## Citation
 
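The training script shown in the diff stops at `triplets = []` and notes that some data processing is omitted; the last hunk header also references `trainer.train(batch_size=32,`. Below is a minimal sketch of how the remaining steps might look with RAGatouille's `RAGTrainer`. The triplet column names, the omitted dataset config, `data_out_path`, `language_code`, and everything beyond the `batch_size=32` visible in the hunk header are illustrative assumptions, not the exact code behind this checkpoint.

```python
from datasets import load_dataset
from ragatouille import RAGTrainer

sample_size = 100_000

# Stream the curated triplets dataset and take a fixed-seed sample, mirroring
# the script in the diff above (config argument omitted; adjust as needed).
ds = load_dataset("akhooli/arabic-triplets-1m-curated-sims-len", split="train", streaming=True)
dsf = ds.shuffle(seed=42, buffer_size=10_000).take(sample_size)

# Assumed column names -- check the dataset card for the actual schema.
triplets = [(row["query"], row["positive"], row["negative"]) for row in dsf]

trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",                         # name for the fine-tuned model
    pretrained_model_name="aubmindlab/bert-base-arabertv02",  # base model named in the card
    language_code="ar",
)

# RAGatouille converts raw (query, positive, negative) triplets into its
# ColBERT training files under data_out_path, then fine-tunes the base model.
trainer.prepare_training_data(
    raw_data=triplets,
    data_out_path="./data/",
    mine_hard_negatives=False,  # the triplets already come with negatives
)
trainer.train(batch_size=32)  # batch size from the hunk header; other settings left at defaults
```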
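For the indexing-and-search side ("just replace the pretrained model name"), a minimal usage sketch with RAGatouille could look like the following. The model ID is assumed from this card's title, and the documents, index name, and query are placeholders.

```python
from ragatouille import RAGPretrainedModel

# Model ID assumed from the card title; point this at the actual repo ID.
RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

# Placeholder Arabic documents; split long documents, as the card advises.
docs = [
    "الذكاء الاصطناعي هو فرع من علوم الحاسوب يهتم ببناء أنظمة ذكية.",
    "تعلم الآلة يمكّن الحواسيب من التعلم من البيانات دون برمجة صريحة.",
]

# Build an index; split_documents mirrors the card's advice to split documents.
RAG.index(collection=docs, index_name="arabic_demo", split_documents=True)

# Retrieve the top passages for an Arabic query ("What is machine learning?").
results = RAG.search(query="ما هو تعلم الآلة؟", k=2)
for r in results:
    print(r["rank"], round(r["score"], 2), r["content"])
```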