---
metrics:
- Recall@10: 0.438
- MRR@10: 0.247
base_model:
- unicamp-dl/mt5-base-mmarco-v2
tags:
- Information Retrieval
- Natural Language Processing
- Question Answering
license: apache-2.0
---
# Urdu mT5 mMARCO: Fine-Tuned mT5 Model for Urdu Information Retrieval
As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu.
We created this model by translating the MS MARCO dataset into Urdu using the IndicTrans2 model.
To establish baseline performance, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot for Urdu IR,
and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology, resulting in state-of-the-art results for Urdu IR.
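For illustration, the sketch below shows how fine-tuning pairs in this style can be constructed. The `[SEP]` input format and the positive token "ہاں" (yes) are taken from the scoring example further down; the negative token "نہیں" (no) and the truncation lengths are assumptions, not the verified training recipe.
```python
# Sketch of training-pair construction (assumptions noted above)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unicamp-dl/mt5-base-mmarco-v2")

def make_training_example(query: str, document: str, is_relevant: bool):
    # Input: query and candidate document joined by a separator,
    # matching the scoring example below
    source = f"{query} [SEP] {document}"
    # Target: a single relevance token; "ہاں" (yes) matches the scoring
    # code, "نہیں" (no) is an assumed negative token
    target = "ہاں" if is_relevant else "نہیں"
    example = tokenizer(source, truncation=True, max_length=512)
    example["labels"] = tokenizer(text_target=target, truncation=True, max_length=4)["input_ids"]
    return example
```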
## Model Details
### Model Description
- **Developed by:** Umer Butt
- **Model type:** IR model for reranking
- **Language(s) (NLP):** Urdu
- **Framework:** PyTorch (Transformers)
## Uses
### Direct Use
The model is intended for direct use as a reranker: given an Urdu query and a set of candidate documents (e.g., from a first-stage retriever such as BM25), it scores each document for relevance to the query. See the example under "How to Get Started with the Model" below.
## Bias, Risks, and Limitations
Although this model currently achieves state-of-the-art results for Urdu IR, it is fine-tuned from the mMARCO model on a machine-translated dataset produced with the IndicTrans2 model. The limitations and biases of both the base model and the translation pipeline therefore carry over to this model.
## Evaluation
The evaluation was done using the scripts in the pygaggle library, specifically:
- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`
#### Metrics
Following the approach in the mMARCO work, the same two metrics were used:
- Recall@10: 0.438
- MRR@10: 0.247
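For reference, here is a self-contained sketch of how these two metrics can be computed from a ranked run; this is illustrative and not the pygaggle implementation used above.
```python
# Illustrative MRR@10 / Recall@10 computation; not the pygaggle scripts
def mrr_at_10(run, qrels):
    # run: {qid: [docid, ...]} ranked best-first
    # qrels: {qid: set of relevant docids}
    total = 0.0
    for qid, ranking in run.items():
        for rank, docid in enumerate(ranking[:10], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank  # reciprocal rank of first hit
                break
    return total / len(run)

def recall_at_10(run, qrels):
    # Fraction of each query's relevant documents found in the top 10,
    # averaged over all queries
    total = 0.0
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        if relevant:
            total += len(set(ranking[:10]) & relevant) / len(relevant)
    return total / len(run)
```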
### Results
| Model                         | Description                             | Data         | Recall@10 | MRR@10 | Queries Ranked |
|-------------------------------|-----------------------------------------|--------------|-----------|--------|----------------|
| BM25 (k = 1000)               | BM25 baseline from the mMARCO paper     | English data | 0.391     | 0.187  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2 | mMARCO reranker baseline from the paper | English data |           | 0.370  | 6980           |
| BM25 (k = 1000)               | BM25                                    | Urdu data    | 0.2675    | 0.129  | 6980           |
| unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mMARCO                        | Urdu data    | 0.408     | 0.204  | 6980           |
| This work                     | Mavkif/urdu-mt5-mmarco                  | Urdu data    | 0.438     | 0.247  | 6980           |
### Model Architecture and Objective
```json
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.38.2"
}
```
For more details on how to customize the decoding parameters (such as max_length, num_beams, and early_stopping), refer to the Hugging Face documentation.
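As a minimal illustration of those parameters (the reranking workflow below scores logits directly and does not call `generate`), decoding options can be passed like this:
```python
# Hypothetical decoding example using the query/document pair from the
# reranking example below; not part of the scoring workflow itself
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")

inputs = tokenizer(
    "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟ [SEP] پاکستان کی معیشت میں بہتری کے اشارے ہیں۔",
    return_tensors="pt",
)
outputs = model.generate(
    **inputs,
    max_length=8,         # cap on the number of generated tokens
    num_beams=4,          # beam search width
    early_stopping=True,  # stop once all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```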
## How to Get Started with the Model
Example Code for Scoring Query-Document Pairs:
In an IR setting, you provide a query and one or more candidate documents. The model scores each document for relevance to the query, which can be used for ranking.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import torch.nn.functional as F
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mavkif/urdu-mt5-mmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("Mavkif/urdu-mt5-mmarco")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
def rank_documents(query, documents):
    # Create input pairs of query and documents
    query_document_pairs = [f"{query} [SEP] {doc}" for doc in documents]

    # Tokenize the input pairs
    inputs = tokenizer(query_document_pairs, padding=True, truncation=True, return_tensors="pt", max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate decoder input ids (a single decoder start token per pair)
    decoder_input_ids = torch.full(
        (inputs["input_ids"].shape[0], 1), model.config.decoder_start_token_id, dtype=torch.long, device=device
    )

    # Perform inference to get the logits
    with torch.no_grad():
        outputs = model(**inputs, decoder_input_ids=decoder_input_ids)

    # Logits over the vocabulary for the decoded position(s)
    logits = outputs.logits

    # The probability of the "ہاں" (yes) token is used as the relevance score
    token_true_id = tokenizer.convert_tokens_to_ids("ہاں")
    scores = []
    for idx, doc in enumerate(documents):
        # Softmax over the vocabulary for each decoded token
        doc_probs = F.softmax(logits[idx], dim=-1)
        # Sum the probability of the "ہاں" token over the sequence
        sum_prob = doc_probs[:, token_true_id].sum().item()
        scores.append((doc, sum_prob))

    # Min-max normalize scores to [0, 1] within this candidate list
    max_score = max(score for _, score in scores)
    min_score = min(score for _, score in scores)
    normalized_scores = [
        ((score - min_score) / (max_score - min_score) if max_score > min_score else 0.5)
        for _, score in scores
    ]

    # Pair documents with normalized scores and sort in descending order
    ranked_documents = sorted(zip(documents, normalized_scores), key=lambda x: x[1], reverse=True)
    return ranked_documents
# Example query and documents
query = "پاکستان کی معیشت کی موجودہ صورتحال کیا ہے؟"
documents = [
"پاکستان کی معیشت میں بہتری کے اشارے ہیں۔",
"زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔",
"فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔"
]
# Get ranked documents
ranked_docs = rank_documents(query, documents)
# Print the ranked documents
for idx, (doc, score) in enumerate(ranked_docs):
    print(f"Rank {idx + 1}: Score: {score}, Document: {doc}")

# Expected output:
# Rank 1: Score: 1.0, Document: پاکستان کی معیشت میں بہتری کے اشارے ہیں۔
# Rank 2: Score: 0.547, Document: فٹبال پاکستان میں تیزی سے مقبول ہو رہا ہے۔
# Rank 3: Score: 0.0, Document: زر مبادلہ کے ذخائر میں کمی دیکھی گئی ہے۔
```
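Note that the scores are min-max normalized within a single candidate list, so they are only comparable across documents ranked for the same query; use the raw summed "ہاں" probabilities (computed before normalization) if you need scores comparable across queries.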
## Model Card Authors
Umer Butt
## Model Card Contact
mumertbutt@gmail.com