---
metrics:
- Recall@10: 0.438
- MRR@10: 0.247
base_model:
- unicamp-dl/mt5-base-mmarco-v2
tags:
- Information Retrieval
- Natural Language Processing
- Question Answering
license: apache-2.0
---

# Urdu mT5 msmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval

As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu. We created this model by translating the MS MARCO dataset into Urdu using the IndicTrans2 model. To establish baseline performance, we first tested zero-shot IR in Urdu with the unicamp-dl/mt5-base-mmarco-v2 model, then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology, achieving state-of-the-art results for Urdu IR.

## Model Details

### Model Description

- **Developed by:** Umer Butt
- **Model type:** Reranking model for information retrieval
- **Language(s) (NLP):** Urdu
- **Framework:** Python / PyTorch
- **Finetuned from:** unicamp-dl/mt5-base-mmarco-v2

## Uses

### Direct Use

The model can be used directly as a reranker: given a query and one or more candidate documents, it scores each document for relevance to the query (see the example under "How to Get Started with the Model" below).

## Bias, Risks, and Limitations

Although this model achieves state-of-the-art results for Urdu IR, it was fine-tuned from the mMARCO model on a machine-translated dataset (created with the IndicTrans2 model). The limitations of both the base model and the translation model therefore carry over to this one.

## Evaluation

The evaluation was done using the scripts in the pygaggle library, specifically `evaluate_monot5_reranker.py` and `ms_marco_eval.py`.

### Metrics

Following the approach in the mMARCO work, the same two metrics were used (a simplified sketch of how they are computed appears after the architecture section below):

- Recall@10: 0.438
- MRR@10: 0.247

### Results

| Model                         | Name                                  | Data         | Recall@10 | MRR@10 | Queries Ranked |
|-------------------------------|---------------------------------------|--------------|-----------|--------|----------------|
| BM25 (k = 1000)               | BM25 - baseline from mMARCO paper     | English data | 0.391     | 0.187  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2 | mMARCO reranker - baseline from paper | English data |           | 0.370  | 6980           |
| BM25 (k = 1000)               | BM25                                  | Urdu data    | 0.2675    | 0.129  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mMARCO                      | Urdu data    | 0.408     | 0.204  | 6980           |
| This work                     | Mavkif/urdu-mt5-mmarco                | Urdu data    | 0.438     | 0.247  | 6980           |

### Model Architecture and Objective

```
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.38.2"
}
```

For more details on how to customize the decoding parameters (such as `max_length`, `num_beams`, and `early_stopping`), refer to the Hugging Face documentation; a short sketch follows.
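Below is a minimal sketch of passing these decoding parameters to `generate()`. It assumes the same `query [SEP] document` input format used in the ranking example further down; the parameter values are illustrative, not tuned recommendations.

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")

query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
document = "پاکستان کی معیشت میں بہتری کے اشارے ہیں۔"
inputs = tokenizer(f"{query} [SEP] {document}", return_tensors="pt")

# max_length, num_beams, and early_stopping are standard generate() arguments;
# the values below are illustrative only.
output_ids = model.generate(
    **inputs,
    max_length=2,         # the reranker only needs to emit a single relevance token
    num_beams=2,
    early_stopping=True,  # stop the beam search as soon as candidates finish
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```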
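For reference, the Recall@10 and MRR@10 figures reported under Evaluation follow the standard definitions. The sketch below is a simplified stand-in for the pygaggle scripts named above, not the code that produced the reported numbers; `rankings` and `qrels` are hypothetical input structures.

```
# Simplified Recall@10 / MRR@10 over per-query rankings.
# rankings: query ID -> ordered list of ranked document IDs
# qrels:    query ID -> set of relevant document IDs
def recall_and_mrr_at_10(rankings, qrels):
    recall_sum, mrr_sum = 0.0, 0.0
    for qid, ranked_docs in rankings.items():
        relevant = qrels[qid]
        top10 = ranked_docs[:10]
        # Recall@10: fraction of the relevant documents found in the top 10
        recall_sum += len(relevant.intersection(top10)) / len(relevant)
        # MRR@10: reciprocal rank of the first relevant document in the top 10
        for rank, doc_id in enumerate(top10, start=1):
            if doc_id in relevant:
                mrr_sum += 1.0 / rank
                break
    n = len(rankings)
    return recall_sum / n, mrr_sum / n


# Example: one query whose single relevant document is ranked 4th
print(recall_and_mrr_at_10({"q1": ["d9", "d3", "d7", "d2"]}, {"q1": {"d2"}}))
# -> (1.0, 0.25)
```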
## How to Get Started with the Model

Example code for scoring query-document pairs: in an IR setting, you provide a query and one or more candidate documents, and the model scores each document for relevance to the query. The scores can then be used for ranking.

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import torch.nn.functional as F

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def rank_documents(query, documents):
    # Create input pairs of query and documents
    query_document_pairs = [f"{query} [SEP] {doc}" for doc in documents]

    # Tokenize the input pairs
    inputs = tokenizer(
        query_document_pairs,
        padding=True,
        truncation=True,
        return_tensors="pt",
        max_length=512,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate decoder input IDs (a single decoder start token per pair)
    decoder_input_ids = torch.full(
        (inputs["input_ids"].shape[0], 1),
        model.config.decoder_start_token_id,
        dtype=torch.long,
        device=device,
    )

    # Perform inference to get the logits for the first decoding step
    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)
    logits = outputs.logits

    # ID of the "ہاں" ("yes") token, whose probability serves as the relevance score
    token_true_id = tokenizer.convert_tokens_to_ids("ہاں")

    scores = []
    for idx, doc in enumerate(documents):
        # Softmax over the vocabulary at each decoded position
        doc_probs = F.softmax(logits[idx], dim=-1)
        # Sum the probability of "ہاں" over the (here, single-token) sequence
        sum_prob = doc_probs[:, token_true_id].sum().item()
        scores.append((doc, sum_prob))

    # Min-max normalize the scores to [0, 1]
    max_score = max(score for _, score in scores)
    min_score = min(score for _, score in scores)
    normalized_scores = [
        (score - min_score) / (max_score - min_score) if max_score > min_score else 0.5
        for _, score in scores
    ]

    # Pair documents with their normalized scores and sort in descending order
    return sorted(zip(documents, normalized_scores), key=lambda x: x[1], reverse=True)


# Example query and documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"  # "What is the current state of Pakistan's economy?"
documents = [
    "پاکستان کی معیشت میں بہتری کے اشارے ہیں۔",  # "There are signs of improvement in Pakistan's economy."
    "زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔",  # "A decline in foreign-exchange reserves has been observed."
    "فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔",  # "Football is rapidly gaining popularity in Pakistan."
]

# Rank and print the documents
ranked_docs = rank_documents(query, documents)
for idx, (doc, score) in enumerate(ranked_docs):
    print(f"Rank {idx + 1}: Score: {score}, Document: {doc}")
```

Expected output:

```
Rank 1: Score: 1.0, Document: پاکستان کی معیشت میں بہتری کے اشارے ہیں۔
Rank 2: Score: 0.547, Document: فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔
Rank 3: Score: 0.0, Document: زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔
```

## Model Card Authors

Umer Butt

## Model Card Contact

mumertbutt@gmail.com