---
metrics:
- Recall@10: 0.438
- MRR@10: 0.247
base_model:
- unicamp-dl/mt5-base-mmarco-v2
tags:
- Information Retrieval
- Natural Language Processing
- Question Answering
license: apache-2.0
---
# Urdu mT5 msmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval
As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu.
We created this model by translating the MS MARCO dataset into Urdu using the IndicTrans2 model.
To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR,
and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology, achieving state-of-the-art results for Urdu IR.
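For context, a minimal sketch of what a fine-tuning example may have looked like, mirroring the `[SEP]` input format and the "ہاں" ("yes") relevance token used by the inference code further down; the exact training format and the negative token are assumptions, not confirmed by this card:
```
# Hypothetical training pair (names `query`, `passage`, `is_relevant` are
# illustrative). The "نہیں" ("no") negative token is an assumption.
source = f"{query} [SEP] {passage}"        # query + candidate passage
target = "ہاں" if is_relevant else "نہیں"  # "yes" / "no" relevance label
```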
## Model Details
### Model Description
- **Developed by:** Umer Butt
- **Model type:** IR model for reranking
- **Language(s) (NLP):** Urdu
- **Framework:** PyTorch (Hugging Face Transformers)
## Uses
### Direct Use
The model scores the relevance of a candidate passage to a query, so it can be used directly to rerank candidate passages (e.g., the top results from a BM25 retriever) for Urdu queries, as shown in the example code below.
## Bias, Risks, and Limitations
Although this model is currently state-of-the-art for Urdu IR, it was fine-tuned from the mMARCO model on a machine-translated dataset (created with the IndicTrans2 model), so the limitations of both the base model and the translation pipeline apply here too.
## Evaluation
The evaluation was done using the scripts in the pygaggle library, specifically these files:
- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`
#### Metrics
Following the approach in the mMARCO work, the same two metrics were used:
- Recall@10: 0.438
- MRR@10: 0.247
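For reference, a minimal sketch of how these two metrics are computed, assuming `run` maps each query ID to its ranked list of passage IDs and `qrels` maps each query ID to its set of relevant passage IDs (both names are hypothetical):
```
def recall_and_mrr_at_10(run, qrels):
    """Compute mean Recall@10 and MRR@10 over all judged queries."""
    recall_sum, rr_sum, n = 0.0, 0.0, 0
    for qid, relevant in qrels.items():
        if not relevant:
            continue
        top10 = run.get(qid, [])[:10]
        # Recall@10: fraction of this query's relevant passages in the top 10
        recall_sum += len(relevant.intersection(top10)) / len(relevant)
        # MRR@10: reciprocal rank of the first relevant passage (0 if none)
        for rank, pid in enumerate(top10, start=1):
            if pid in relevant:
                rr_sum += 1.0 / rank
                break
        n += 1
    return recall_sum / n, rr_sum / n
```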
### Results
| Model                         | Description                              | Data         | Recall@10 | MRR@10 | Queries Ranked |
|-------------------------------|------------------------------------------|--------------|-----------|--------|----------------|
| BM25 (k = 1000)               | BM25 baseline from the mMARCO paper      | English data | 0.391     | 0.187  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2 | mMARCO reranker baseline from the paper  | English data |           | 0.370  | 6980           |
| BM25 (k = 1000)               | BM25                                     | Urdu data    | 0.2675    | 0.129  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mMARCO                         | Urdu data    | 0.408     | 0.204  | 6980           |
| This work                     | Mavkif/urdu-mt5-mmarco                   | Urdu data    | 0.438     | 0.247  | 6980           |
### Model Architecture and Objective
```
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.38.2"
}
```
For more details on how to customize the decoding parameters (such as max_length, num_beams, and early_stopping), refer to the Hugging Face documentation.
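As an illustration, a minimal sketch of passing those decoding parameters to the standard Hugging Face `generate` API (the input text here is arbitrary):
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")

inputs = tokenizer("پاکستان کی معیشت کیسی ہے؟", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=32,       # cap the generated sequence length
    num_beams=4,         # beam search width
    early_stopping=True  # stop once all beams are finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```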
## How to Get Started with the Model
Example Code for Scoring Query-Document Pairs:
In an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, which can be used for ranking.
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import torch.nn.functional as F

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def rank_documents(query, documents):
    # Create input pairs of query and documents
    query_document_pairs = [f"{query} [SEP] {doc}" for doc in documents]

    # Tokenize the input pairs
    inputs = tokenizer(query_document_pairs, padding=True, truncation=True, return_tensors="pt", max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate decoder input ids (starting with the decoder start token)
    decoder_input_ids = torch.full(
        (inputs["input_ids"].shape[0], 1), model.config.decoder_start_token_id, dtype=torch.long, device=device
    )

    # Perform inference to get the logits
    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)

    # Logits over the vocabulary at each decoded position
    logits = outputs.logits

    # ID of the "ہاں" ("yes") token, used as the relevance signal
    token_true_id = tokenizer.convert_tokens_to_ids("ہاں")

    # Extract the probability of "ہاں" for each document
    scores = []
    for idx in range(len(documents)):
        # Softmax over the entire vocabulary for each decoded position
        doc_probs = F.softmax(logits[idx], dim=-1)
        # Sum the probability of "ہاں" over the output sequence
        sum_prob = doc_probs[:, token_true_id].sum().item()
        scores.append((documents[idx], sum_prob))

    # Min-max normalize scores to be between 0 and 1
    max_score = max(score for _, score in scores)
    min_score = min(score for _, score in scores)
    normalized_scores = [
        (score - min_score) / (max_score - min_score) if max_score > min_score else 0.5
        for _, score in scores
    ]

    # Pair each document with its normalized score and sort in descending order
    ranked_documents = sorted(zip(documents, normalized_scores), key=lambda x: x[1], reverse=True)
    return ranked_documents

# Example query and documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"  # "What is the current state of Pakistan's economy?"
documents = [
    "پاکستان کی معیشت میں بہتری کے اشارے ہیں۔",   # "There are signs of improvement in Pakistan's economy."
    "زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔",   # "A decline in foreign exchange reserves has been observed."
    "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"  # "Football is rapidly gaining popularity in Pakistan."
]

# Get and print the ranked documents
ranked_docs = rank_documents(query, documents)
for idx, (doc, score) in enumerate(ranked_docs):
    print(f"Rank {idx + 1}: Score: {score}, Document: {doc}")
```
Expected output:
```
Rank 1: Score: 1.0, Document: پاکستان کی معیشت میں بہتری کے اشارے ہیں۔
Rank 2: Score: 0.547, Document: فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔
Rank 3: Score: 0.0, Document: زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔
```
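A note on the scoring above: it uses the probability of the single "ہاں" ("yes") token under a softmax over the full vocabulary. A common monoT5-style variant instead takes a softmax over only the yes/no token logits. A minimal sketch, reusing `model`, `tokenizer`, `inputs`, and `decoder_input_ids` from the code above, and assuming "نہیں" ("no") is the negative relevance token (that token choice is an assumption, not confirmed by this card):
```
# Hedged monoT5-style variant: softmax over only the "yes"/"no" logits.
token_true_id = tokenizer.convert_tokens_to_ids("ہاں")    # "yes"
token_false_id = tokenizer.convert_tokens_to_ids("نہیں")  # "no" (assumed)

with torch.no_grad():
    outputs = model(**inputs, decoder_input_ids=decoder_input_ids)

# Keep only the two relevance logits at the first decoded position
yes_no_logits = outputs.logits[:, 0, [token_false_id, token_true_id]]
# Probability of "yes" is the relevance score for each document
probs = F.softmax(yes_no_logits, dim=-1)[:, 1]
```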
## Model Card Authors
Umer Butt
## Model Card Contact
mumertbutt@gmail.com