---
metrics:
- Recall @10 0.438
- MRR @10 0.247
base_model:
- unicamp-dl/mt5-base-mmarco-v2
tags:
- Information Retrieval
- Natural Language Processing
- Question Answering
license: apache-2.0
---

# Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval

As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu.
We created the training data for this model by translating the MS MARCO dataset into Urdu using the IndicTrans2 model.
To establish baseline performance, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR, and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology.

## Model Details

### Model Description

- **Developed by:** Umer Butt
- **Model type:** MT5ForConditionalGeneration
- **Language(s) (NLP):** Urdu (implemented in Python/PyTorch)

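## Uses

### Direct Use

The model scores candidate Urdu documents for relevance to an Urdu query, so it can be used directly to rerank the results of a first-stage retriever such as BM25.
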
## Bias, Risks, and Limitations

Although this model performs well and is currently state-of-the-art for Urdu IR, it is fine-tuned from the mMARCO model on a machine-translated dataset (created with the IndicTrans2 model). The limitations of both the base model and the translation model therefore apply here too.

## Evaluation

The evaluation was done using the scripts in the pygaggle library, specifically these files:

- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`

### Metrics

Following the approach in the mMARCO work, the same two metrics were used:

- Recall@10: 0.438
- MRR@10: 0.247

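
For readers unfamiliar with the two metrics, the following is a minimal sketch of their standard definitions. It is not the pygaggle implementation, which may differ in details such as input parsing; the function names and the toy `run` and `qrels` data are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant passage in the top-k, or 0 if none is found."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> Dict[str, float]:
    """Average both metrics over all queries that have relevance judgments."""
    query_ids = [qid for qid in qrels if qid in run]
    n = len(query_ids)
    recall = sum(recall_at_k(run[q], qrels[q], k) for q in query_ids) / n
    mrr = sum(reciprocal_rank_at_k(run[q], qrels[q], k) for q in query_ids) / n
    return {f"recall@{k}": recall, f"mrr@{k}": mrr}


# Toy data (hypothetical): ranked passage ids per query, and the relevant ids.
run = {"q1": ["d3", "d1", "d7"], "q2": ["d5", "d2", "d9"]}
qrels = {"q1": {"d1"}, "q2": {"d4"}}

print(evaluate(run, qrels))
```

In the toy data, the relevant passage for `q1` is at rank 2 and `q2` has no relevant passage in its list, so both averages are taken over one hit and one miss.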

### Results

| Model | Name | Data | Recall@10 | MRR@10 | Queries Ranked |
|-------------------------------|---------------------------------------|--------------|-----------|--------|----------------|
| bm25 (k = 1000) | BM25 - Baseline from mmarco paper | English data | 0.391 | 0.187 | 6980 |
| unicamp-dl/mt5-base-mmarco-v2 | mmarco reranker - Baseline from paper | English data | | 0.370 | 6980 |
| bm25 (k = 1000) | BM25 | Urdu data | 0.2675 | 0.129 | 6980 |
| unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mmarco | Urdu data | 0.408 | 0.204 | 6980 |
| Mavkif/urdu-mt5-mmarco | This work | Urdu data | 0.438 | 0.247 | 6980 |

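
To put the Urdu rows in perspective, the short sketch below computes the relative improvement of the fine-tuned model over the two Urdu baselines. The numbers are copied from the table; the helper name `relative_gain` and the dictionary layout are illustrative choices, not part of the evaluation scripts.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline` (0.10 means +10%)."""
    return (new - baseline) / baseline


# Urdu results copied from the table above.
bm25_urdu = {"recall@10": 0.2675, "mrr@10": 0.129}
zero_shot = {"recall@10": 0.408, "mrr@10": 0.204}
fine_tuned = {"recall@10": 0.438, "mrr@10": 0.247}

gains_vs_zero_shot = {m: relative_gain(zero_shot[m], fine_tuned[m]) for m in fine_tuned}
gains_vs_bm25 = {m: relative_gain(bm25_urdu[m], fine_tuned[m]) for m in fine_tuned}

for metric in fine_tuned:
    print(
        f"{metric}: {gains_vs_zero_shot[metric]:+.1%} vs zero-shot, "
        f"{gains_vs_bm25[metric]:+.1%} vs BM25"
    )
```

Fine-tuning raises MRR@10 by about 21% and Recall@10 by about 7% relative to the zero-shot reranker, which suggests the larger benefit is in placing the relevant passage nearer the top rather than in bringing more relevant passages into the top 10.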

### Model Architecture and Objective

The model configuration includes the following values:

    {
      "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
      "architectures": ["MT5ForConditionalGeneration"],
      "d_model": 768,
      "num_heads": 12,
      "num_layers": 12,
      "dropout_rate": 0.1,
      "vocab_size": 250112,
      "model_type": "mt5",
      "transformers_version": "4.38.2"
    }

For more details on how to customize the decoding parameters (such as `max_length`, `num_beams`, and `early_stopping`), refer to the Hugging Face documentation.

## How to Get Started with the Model

Example code for scoring query-document pairs:

In an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, and the scores can be used for ranking.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import torch.nn.functional as F

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # make sure dropout is disabled for inference


def rank_documents(query, documents):
    # Create input pairs of query and documents
    query_document_pairs = [f"{query} [SEP] {doc}" for doc in documents]

    # Tokenize the input pairs
    inputs = tokenizer(query_document_pairs, padding=True, truncation=True, return_tensors="pt", max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Build decoder input ids: a single decoder start token per pair
    decoder_input_ids = torch.full(
        (inputs["input_ids"].shape[0], 1), model.config.decoder_start_token_id, dtype=torch.long, device=device
    )

    # Run one decoder step to get the logits, shape (batch, 1, vocab_size)
    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)
    logits = outputs.logits

    # Id of the "ہاں" ("yes") token, whose probability is used as the relevance score
    token_true_id = tokenizer.convert_tokens_to_ids("ہاں")

    scores = []
    for idx, doc in enumerate(documents):
        # Softmax over the entire vocabulary for each position in the output sequence
        doc_probs = F.softmax(logits[idx], dim=-1)

        # Probability of "ہاں", summed over the (length-1) output sequence
        token_probs = doc_probs[:, token_true_id]
        sum_prob = token_probs.sum().item()
        scores.append((doc, sum_prob))

    # Min-max normalize the scores to the range [0, 1]
    max_score = max(score for _, score in scores)
    min_score = min(score for _, score in scores)
    normalized_scores = [
        (score - min_score) / (max_score - min_score) if max_score > min_score else 0.5
        for _, score in scores
    ]

    # Pair each document with its normalized score
    ranked_documents = [(documents[idx], normalized_scores[idx]) for idx in range(len(documents))]

    # Sort documents by score in descending order
    ranked_documents = sorted(ranked_documents, key=lambda x: x[1], reverse=True)
    return ranked_documents


# Example query and documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
documents = [
    "پاکستان کی معیشت میں بہتری کے اشارے ہیں۔",
    "زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔",
    "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"
]

# Get ranked documents
ranked_docs = rank_documents(query, documents)

# Print the ranked documents
for idx, (doc, score) in enumerate(ranked_docs):
    print(f"Rank {idx + 1}: Score: {score}, Document: {doc}")

# Example output (scores rounded):
# Rank 1: Score: 1.0, Document: پاکستان کی معیشت میں بہتری کے اشارے ہیں۔
# Rank 2: Score: 0.547, Document: فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔
# Rank 3: Score: 0.0, Document: زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔
```
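
The example above rescales the raw scores with min-max normalization before sorting. The standalone sketch below isolates that step to make its effect easier to see. The helper name `min_max_normalize` and the raw probabilities are hypothetical, chosen only for illustration.

```python
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores so the lowest maps to 0.0 and the highest to 1.0."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tied: no ordering information, so use a neutral 0.5.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical raw "ہاں" probabilities for three candidate documents.
raw_scores = [0.80, 0.10, 0.45]
normalized = min_max_normalize(raw_scores)
print(normalized)
```

Because the best candidate always maps to 1.0 and the worst to 0.0, the normalized scores only express relative order within one candidate list. They are not comparable across queries, and a score of 1.0 does not by itself mean the document is relevant. If absolute relevance matters, working with the raw probabilities may be the safer choice.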

## Model Card Authors

Umer Butt

## Model Card Contact

mumertbutt@gmail.com