akhooli committed
Commit cb1fd98
1 Parent(s): 82a02df

Update README.md

Files changed (1)
  1. README.md +12 -7
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 inference: false
 datasets:
- - unicamp-dl/mmarco
+ - akhooli/arabic-triplets-1m-curated-sims-len
 pipeline_tag: sentence-similarity
 tags:
 - ColBERT
@@ -15,24 +15,28 @@ library_name: RAGatouille
 # Arabic-ColBERT-100k
 
 First version of Arabic ColBERT.
- This model was trained on 100K random triplets of the [mMARCO dataset](https://huggingface.co/datasets/unicamp-dl/mmarco) which has around 39M Arabic (translated) triplets.
- mMARCO is the multilingual version of [Microsoft's MARCO dataset](https://microsoft.github.io/msmarco/).
+ This model was trained on 100K filtered triplets of the [akhooli/arabic-triplets-1m-curated-sims-len](https://huggingface.co/datasets/akhooli/arabic-triplets-1m-curated-sims-len) dataset,
+ which has 1 million Arabic (translated) triplets. The dataset was curated from different sources and enriched with similarity scores.
+ More details on the dataset are available in the data card.
 
 Training used the [Ragatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) using
- [Lightning AI](https://lightning.ai/) with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as base model.
+ a 2-GPU Kaggle account with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as the base model.
 
- If you downloaded the model before July 15th 1 pm (Jerusalem time), please try the current version.
+ If you downloaded the model before July 27th, 8 pm (Jerusalem time), please try the current version.
 Use the [Ragatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb) to learn more;
 just replace the pretrained model name and make sure you use Arabic text and split documents for best results.
 
- You can train a better model if you have access to adequate compute (can fine tune this model on more data, seed 42 was used to pick the 100K sample).
+ You can train a better model if you have access to adequate compute (you can fine-tune this model on more data; seed 42 was used to pick the 100K sample).
 
 # Training script
+
 ```python
 from datasets import load_dataset
 from ragatouille import RAGTrainer
 sample_size = 100000
- ds = load_dataset('unicamp-dl/mmarco', 'arabic', split="train", trust_remote_code=True, streaming=True)
+ ds = load_dataset('akhooli/arabic-triplets-1m-curated-sims-len', 'arabic', split="train", trust_remote_code=True, streaming=True)
+
+ # some data processing not in this script
 sds = ds.shuffle(seed=42, buffer_size=10_000)
 dsf = sds.take(sample_size)
 triplets = []
@@ -56,6 +60,7 @@ trainer.train(batch_size=32,
 Install `datasets` and `ragatouille` first. Last checkpoint is saved in `.ragatouille/..../colbert`
 
 Model first announced (July 14, 2024): https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy
+ Dataset published and model updated (July 27, 2024): https://www.linkedin.com/posts/akhooli_arabic-1-million-curated-triplets-dataset-activity-7222951839774699521-PZcw
 
 ## Citation
 
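The training script shown in the diff stops at `triplets = []` and notes that some data processing is omitted; the last hunk header also references `trainer.train(batch_size=32,`. Below is a minimal sketch of how the remaining steps might look with RAGatouille's `RAGTrainer`. The triplet column names, the omitted dataset config, `data_out_path`, `language_code`, and everything beyond the `batch_size=32` visible in the hunk header are illustrative assumptions, not the exact code behind this checkpoint.

```python
from datasets import load_dataset
from ragatouille import RAGTrainer

sample_size = 100_000

# Stream the curated triplets dataset and take a fixed-seed sample, mirroring
# the script in the diff above (config argument omitted; adjust as needed).
ds = load_dataset("akhooli/arabic-triplets-1m-curated-sims-len", split="train", streaming=True)
dsf = ds.shuffle(seed=42, buffer_size=10_000).take(sample_size)

# Assumed column names -- check the dataset card for the actual schema.
triplets = [(row["query"], row["positive"], row["negative"]) for row in dsf]

trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",                         # name for the fine-tuned model
    pretrained_model_name="aubmindlab/bert-base-arabertv02",  # base model named in the card
    language_code="ar",
)

# RAGatouille converts raw (query, positive, negative) triplets into its
# ColBERT training files under data_out_path, then fine-tunes the base model.
trainer.prepare_training_data(
    raw_data=triplets,
    data_out_path="./data/",
    mine_hard_negatives=False,  # the triplets already come with negatives
)
trainer.train(batch_size=32)  # batch size from the hunk header; other settings left at defaults
```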
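For the indexing-and-search side ("just replace the pretrained model name"), a minimal usage sketch with RAGatouille could look like the following. The model ID is assumed from this card's title, and the documents, index name, and query are placeholders.

```python
from ragatouille import RAGPretrainedModel

# Model ID assumed from the card title; point this at the actual repo ID.
RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

# Placeholder Arabic documents; split long documents, as the card advises.
docs = [
    "الذكاء الاصطناعي هو فرع من علوم الحاسوب يهتم ببناء أنظمة ذكية.",
    "تعلم الآلة يمكّن الحواسيب من التعلم من البيانات دون برمجة صريحة.",
]

# Build an index; split_documents mirrors the card's advice to split documents.
RAG.index(collection=docs, index_name="arabic_demo", split_documents=True)

# Retrieve the top passages for an Arabic query ("What is machine learning?").
results = RAG.search(query="ما هو تعلم الآلة؟", k=2)
for r in results:
    print(r["rank"], round(r["score"], 2), r["content"])
```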