---
metrics: 
- Recall @10 0.438
- MRR @10 0.247
base_model:
- unicamp-dl/mt5-base-mmarco-v2
tags:
- Information Retrieval
- Natural Language Processing
- Question Answering
license: apache-2.0
---

# Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval

As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu.
We built the training data by translating the MS MARCO dataset into Urdu with the IndicTrans2 model.
To establish a baseline, we first evaluated zero-shot Urdu IR with the unicamp-dl/mt5-base-mmarco-v2 model,
then fine-tuned that model on the translated dataset following the mMARCO multilingual IR methodology. The fine-tuned model achieves state-of-the-art results for Urdu IR.

## Model Details

### Model Description



- **Developed by:** Umer Butt
- **Model type:** IR model for reranking
- **Language(s) (NLP):** Urdu
- **Framework:** Python/PyTorch



## Uses



### Direct Use




## Bias, Risks, and Limitations

This model performs well and is currently state-of-the-art for Urdu IR. However, it is fine-tuned from an mMARCO model on a machine-translated dataset (created with the IndicTrans2 model), so the limitations of both the base model and the translation model apply here too.



## Evaluation

The evaluation was done using scripts from the pygaggle library, specifically:

- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`

#### Metrics
Following the approach of the mMARCO work, the same two metrics were used:

- Recall@10: 0.438
- MRR@10: 0.247

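The two metrics can be illustrated with a minimal sketch. This is not the pygaggle implementation; it assumes the common definitions (MRR@k as the reciprocal rank of the first relevant passage within the top k, Recall@k as the fraction of relevant passages found within the top k), and the passage and query IDs are hypothetical toy data.

```python
from typing import Dict, List, Set


def mrr_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranking[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> Dict[str, float]:
    """Average both metrics over all queries that have relevance judgments."""
    qids = list(qrels)
    mrr = sum(mrr_at_k(runs.get(q, []), qrels[q], k) for q in qids) / len(qids)
    rec = sum(recall_at_k(runs.get(q, []), qrels[q], k) for q in qids) / len(qids)
    return {f"MRR@{k}": mrr, f"Recall@{k}": rec}


# Toy data (hypothetical IDs, not taken from the actual evaluation)
runs = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4", "p2"]}
qrels = {"q1": {"p1"}, "q2": {"p5"}}
metrics = evaluate(runs, qrels)
print(metrics)
```

In the toy run, `q1` finds its relevant passage at rank 2 and `q2` finds none, so both averages are taken over two queries.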

### Results

| Model                                 | Name                                  | Data         | Recall@10 | MRR@10 | Queries Ranked |
|---------------------------------------|---------------------------------------|--------------|-----------|--------|----------------|
| bm25 (k = 1000)                       | BM25 - Baseline from mmarco paper     | English data | 0.391     | 0.187  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2         | mmarco reranker - Baseline from paper | English data |           | 0.370  | 6980           |
| bm25 (k = 1000)                       | BM25                                  | Urdu data    | 0.2675    | 0.129  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2         | Zero-shot mmarco                      | Urdu data    | 0.408     | 0.204  | 6980           |
| Mavkif/urdu-mt5-mmarco                | This work                             | Urdu data    | 0.438     | 0.247  | 6980           |

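To make the comparison between the Urdu rows easier to read, the relative gains of the fine-tuned model can be worked out directly from the table. The sketch below only does arithmetic on the reported numbers; the dictionary keys and the `relative_gain` helper are illustrative names, not part of any library.

```python
# Values copied from the Urdu-data rows of the results table
results = {
    "bm25": {"recall@10": 0.2675, "mrr@10": 0.129},
    "zero_shot": {"recall@10": 0.408, "mrr@10": 0.204},
    "fine_tuned": {"recall@10": 0.438, "mrr@10": 0.247},
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Gain of the fine-tuned model over each baseline, per metric
gains = {
    (baseline, metric): relative_gain(results["fine_tuned"][metric], results[baseline][metric])
    for baseline in ("bm25", "zero_shot")
    for metric in ("recall@10", "mrr@10")
}

for (baseline, metric), gain in gains.items():
    print(f"fine_tuned vs {baseline:9s} {metric}: +{gain:.1f}%")
```

On these numbers, fine-tuning improves MRR@10 by roughly 21% and Recall@10 by roughly 7% relative to the zero-shot reranker.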




### Model Architecture and Objective
```json
{
    "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
    "architectures": ["MT5ForConditionalGeneration"],
    "d_model": 768,
    "num_heads": 12,
    "num_layers": 12,
    "dropout_rate": 0.1,
    "vocab_size": 250112,
    "model_type": "mt5",
    "transformers_version": "4.38.2"
}
```

For more details on how to customize the decoding parameters (such as max_length, num_beams, and early_stopping), refer to the Hugging Face documentation.



## How to Get Started with the Model

The example below scores query-document pairs. In an IR setting, you provide a query and one or more candidate documents; the model scores each document for relevance to the query, and the scores are used for ranking.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import torch.nn.functional as F


# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def rank_documents(query, documents):
    # Create input pairs of query and documents
    query_document_pairs = [f"{query} [SEP] {doc}" for doc in documents]
    
    # Tokenize the input pairs
    inputs = tokenizer(query_document_pairs, padding=True, truncation=True, return_tensors="pt", max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Generate decoder input ids (starting with the decoder start token)
    decoder_input_ids = torch.full(
        (inputs["input_ids"].shape[0], 1), model.config.decoder_start_token_id, dtype=torch.long, device=device
    )
    
    # Perform inference to get the logits
    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)
    
    # Get the logits for the sequence output
    logits = outputs.logits
    
    # Look up the id of the "ہاں" ("yes") token once, outside the loop
    token_true_id = tokenizer.convert_tokens_to_ids("ہاں")

    # The decoder input holds only the start token, so logits[idx] has a single position
    scores = []
    for idx, doc in enumerate(documents):
        # Softmax over the entire vocabulary at each decoder position
        doc_probs = F.softmax(logits[idx], dim=-1)

        # Probability of "ہاں", summed over decoder positions (only one here)
        sum_prob = doc_probs[:, token_true_id].sum().item()
        scores.append((doc, sum_prob))  # Use the summed probability directly as the raw score
    
    # Min-max normalize within this candidate list (scores are relative, not calibrated probabilities)
    max_score = max(score for _, score in scores)
    min_score = min(score for _, score in scores)
    normalized_scores = [((score - min_score) / (max_score - min_score) if max_score > min_score else 0.5) for _, score in scores]
    
    # Create a list of documents with normalized scores
    ranked_documents = [(documents[idx], normalized_scores[idx]) for idx in range(len(documents))]
    
    # Sort documents based on scores (descending order)
    ranked_documents = sorted(ranked_documents, key=lambda x: x[1], reverse=True)
    return ranked_documents


# Example query and documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
documents = [
    "پاکستان کی معیشت میں بہتری کے اشارے ہیں۔",
    "زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔",
    "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"
]

# Get ranked documents
ranked_docs = rank_documents(query, documents)

# Print the ranked documents
for idx, (doc, score) in enumerate(ranked_docs):
    print(f"Rank {idx + 1}: Score: {score}, Document: {doc}")
```

Example output:

```
Rank 1: Score: 1.0, Document: پاکستان کی معیشت میں بہتری کے اشارے ہیں۔
Rank 2: Score: 0.547, Document: فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔
Rank 3: Score: 0.0, Document: زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔
```
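
The results table reranks BM25 candidate lists (k = 1000), which may be too many passages to score in a single forward pass. A minimal sketch of batched reranking follows, assuming a hypothetical `score_fn` that returns one raw relevance score per document; `rerank_in_batches` and the `overlap_score` stand-in are illustrative names, not part of this model or of any library.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: returns one raw relevance score per document.
ScoreFn = Callable[[str, Sequence[str]], List[float]]


def rerank_in_batches(
    query: str,
    candidates: Sequence[str],
    score_fn: ScoreFn,
    batch_size: int = 32,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score candidates batch by batch, then sort once over all raw scores."""
    scored: List[Tuple[str, float]] = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        batch_scores = score_fn(query, batch)
        if len(batch_scores) != len(batch):
            raise ValueError("score_fn must return one score per document")
        scored.extend(zip(batch, batch_scores))
    # Sort globally so scores from different batches are compared on one scale
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stand-in scorer for demonstration only: counts words shared with the query.
def overlap_score(query: str, docs: Sequence[str]) -> List[float]:
    q_terms = set(query.split())
    return [float(len(q_terms & set(doc.split()))) for doc in docs]


candidates = ["a b c", "a x y", "x y z", "a b z", "q r s"]
top = rerank_in_batches("a b c", candidates, overlap_score, batch_size=2, top_k=3)
print(top)
```

Note that `rank_documents` above min-max normalizes within each call, so a scorer plugged into this loop would need to return the unnormalized probabilities; otherwise scores from different batches are not comparable.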



## Model Card Authors

Umer Butt 


## Model Card Contact

mumertbutt@gmail.com