Mavkif
/

urdu-mt5-mmarco

Safetensors

mt5

Information Retrieval

Natural Language Processing

Question Answering


Mavkif committed on Nov 2

Commit

2f43805

•

1 Parent(s): eb8e2d8

Update README.md


README.md +123 -4


@@ -1,7 +1,7 @@
 ---
-metrics: null
-Recall @10: 0.438
-MRR @10: 0.247
 base_model:
 - unicamp-dl/mt5-base-mmarco-v2
 tags:
@@ -9,4 +9,123 @@ tags:
 - Natural Language Processing
 - Question Answering
 license: apache-2.0
----

 ---
+metrics:
+- Recall @10 0.438
+- MRR @10 0.247
 base_model:
 - unicamp-dl/mt5-base-mmarco-v2
 tags:
 - Natural Language Processing
 - Question Answering
 license: apache-2.0
+---
+# Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval
+As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu.
+We built the training data by translating the MS MARCO dataset into Urdu with the IndicTrans2 model.
+To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR,
+and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology.
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** Umer Butt
+- **Model type:** MT5ForConditionalGeneration
+- **Language(s) (NLP):** Urdu (implemented in Python/PyTorch)
+## Uses
+### Direct Use
+## Bias, Risks, and Limitations
+Although this model performs well and is currently state-of-the-art, it is fine-tuned from an mMARCO model on a translated dataset (created using the IndicTrans2 model), so the limitations of both apply here too.
+### Recommendations
+## How to Get Started with the Model
+Use the code below to get started with the model.
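A minimal sketch of reranking passages with this model, offered as an illustration rather than the author's own usage code. It assumes the model keeps the monoT5-style interface of its base model: the input template `Query: ... Document: ... Relevant:` and the `▁yes` / `▁no` target tokens are assumptions carried over from how pygaggle handles `unicamp-dl/mt5-base-mmarco-v2`, and the helper names `build_prompt` and `rerank` are hypothetical.

```python
def build_prompt(query: str, passage: str) -> str:
    """Format one query-passage pair (assumed monoT5-style template)."""
    return f"Query: {query} Document: {passage} Relevant:"


def rerank(query, passages, model_name="Mavkif/urdu-mt5-mmarco"):
    """Return (passage, score) pairs sorted from most to least relevant."""
    # Heavy dependencies are imported lazily so build_prompt stays lightweight.
    import torch
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = MT5ForConditionalGeneration.from_pretrained(model_name).eval()

    # Assumption: relevance is read from the "yes"/"no" tokens,
    # as for the base mMARCO reranker.
    yes_id = tokenizer.convert_tokens_to_ids("▁yes")
    no_id = tokenizer.convert_tokens_to_ids("▁no")

    inputs = tokenizer(
        [build_prompt(query, p) for p in passages],
        padding=True, truncation=True, max_length=512, return_tensors="pt",
    )
    # Run a single decoder step from the decoder start token.
    decoder_input_ids = torch.full(
        (len(passages), 1), model.config.decoder_start_token_id, dtype=torch.long
    )
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[:, 0, :]

    # Score = log-probability of "yes" when only "no" and "yes" are considered.
    scores = torch.log_softmax(logits[:, [no_id, yes_id]], dim=-1)[:, 1]
    order = scores.argsort(descending=True).tolist()
    return [(passages[i], scores[i].item()) for i in order]
```

Calling `rerank(query, candidate_passages)` with an Urdu query and a list of candidate passages (for example, the top BM25 hits) downloads the model on first use and returns the candidates in reranked order.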
+## Evaluation
+The evaluation was done using the following scripts from the pygaggle library:
+- `evaluate_monot5_reranker.py`
+- `ms_marco_eval.py`
+#### Metrics
+Following the mMARCO work, the same two metrics were used:
+- Recall@10: 0.438
+- MRR@10: 0.247
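For readers unfamiliar with these two metrics, the sketch below shows one common way to compute them. It is not the pygaggle implementation: the function names and the input layout (a dict of ranked passage ids per query, and a dict of relevant passage ids per query) are assumptions made for illustration, and Recall@10 is taken here as the fraction of a query's relevant passages found in the top 10, averaged over queries.

```python
def mrr_at_k(ranked, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, rel in relevant.items():
        for rank, pid in enumerate(ranked.get(qid, [])[:k], start=1):
            if pid in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevant)


def recall_at_k(ranked, relevant, k=10):
    """Average fraction of each query's relevant passages found in the top k."""
    total = 0.0
    for qid, rel in relevant.items():
        top = set(ranked.get(qid, [])[:k])
        total += len(top & rel) / len(rel)
    return total / len(relevant)


# Toy data: two queries with hypothetical passage ids.
relevant = {"q1": {"a"}, "q2": {"b", "c"}}
ranked = {"q1": ["x", "a"], "q2": ["b", "y"]}

print(mrr_at_k(ranked, relevant))     # (1/2 + 1/1) / 2 = 0.75
print(recall_at_k(ranked, relevant))  # (1/1 + 1/2) / 2 = 0.75
```

Queries with no relevant passage in the top 10 contribute zero to both averages.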
+### Results
+#### Detailed Results
+| Model                         | Description                           | Data         | Recall@10 | MRR@10 | Queries Ranked |
+|-------------------------------|---------------------------------------|--------------|-----------|--------|----------------|
+| BM25 (k = 1000)               | BM25 - Baseline from mMARCO paper     | English data | 0.391     | 0.187  | 6980           |
+| unicamp-dl/mt5-base-mmarco-v2 | mMARCO reranker - Baseline from paper | English data |           | 0.370  | 6980           |
+| BM25 (k = 1000)               | BM25                                  | Urdu data    | 0.2675    | 0.129  | 6980           |
+| unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mMARCO                      | Urdu data    | 0.408     | 0.204  | 6980           |
+| Mavkif/urdu-mt5-mmarco        | This work                             | Urdu data    | 0.438     | 0.247  | 6980           |
+#### Summary
+### Model Architecture and Objective
+From `config.json`:
+{
+  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
+  "architectures": [
+    "MT5ForConditionalGeneration"
+  ],
+  "classifier_dropout": 0.0,
+  "d_ff": 2048,
+  "d_kv": 64,
+  "d_model": 768,
+  "decoder_start_token_id": 0,
+  "dense_act_fn": "gelu_new",
+  "dropout_rate": 0.1,
+  "eos_token_id": 1,
+  "feed_forward_proj": "gated-gelu",
+  "initializer_factor": 1.0,
+  "is_encoder_decoder": true,
+  "is_gated_act": true,
+  "layer_norm_epsilon": 1e-06,
+  "model_type": "mt5",
+  "num_decoder_layers": 12,
+  "num_heads": 12,
+  "num_layers": 12,
+  "output_past": true,
+  "pad_token_id": 0,
+  "relative_attention_max_distance": 128,
+  "relative_attention_num_buckets": 32,
+  "tie_word_embeddings": false,
+  "tokenizer_class": "T5Tokenizer",
+  "torch_dtype": "float32",
+  "transformers_version": "4.38.2",
+  "use_cache": true,
+  "vocab_size": 250112
+}
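To make the configuration above more concrete, the following sketch estimates the model's parameter count from a handful of its fields. It is a back-of-the-envelope calculation, not output from the model itself, and it assumes the usual T5-style layout: bias-free attention and feed-forward projections, weight-only layer norms, a gated feed-forward block with three matrices, and one relative attention bias table per stack. The helper name `count_parameters` is hypothetical.

```python
# Fields copied from the config.json shown above.
config = {
    "d_model": 768,
    "d_kv": 64,
    "d_ff": 2048,
    "num_heads": 12,
    "num_layers": 12,
    "num_decoder_layers": 12,
    "relative_attention_num_buckets": 32,
    "vocab_size": 250112,
    "tie_word_embeddings": False,
}


def count_parameters(cfg):
    """Estimate the parameter count under the T5-style assumptions stated above."""
    d_model = cfg["d_model"]
    inner = cfg["num_heads"] * cfg["d_kv"]   # attention inner dimension
    attention = 4 * d_model * inner          # q, k, v, o projections (no biases)
    ffn = 3 * d_model * cfg["d_ff"]          # gated FFN: wi_0, wi_1, wo
    norm = d_model                           # layer norm has a weight vector only

    encoder_layer = attention + ffn + 2 * norm
    decoder_layer = 2 * attention + ffn + 3 * norm  # self- plus cross-attention

    # One relative attention bias table and one final layer norm per stack.
    rel_bias = cfg["relative_attention_num_buckets"] * cfg["num_heads"]
    encoder = cfg["num_layers"] * encoder_layer + rel_bias + norm
    decoder = cfg["num_decoder_layers"] * decoder_layer + rel_bias + norm

    # Untied embeddings: input embedding and output head are separate matrices.
    embeddings = cfg["vocab_size"] * d_model
    if not cfg["tie_word_embeddings"]:
        embeddings *= 2
    return encoder + decoder + embeddings


total = count_parameters(config)
print(f"{total:,} parameters")
```

Under these assumptions the estimate comes to roughly 582 million parameters, about two thirds of which sit in the two 250112 x 768 vocabulary matrices.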
+## Model Card Authors [optional]
+Umer Butt
+## Model Card Contact
+mumertbutt@gmail.com