AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels
Abstract
Medical information retrieval (MIR) is essential for retrieving relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, achieving effective zero-shot dense retrieval in the medical domain poses substantial challenges due to the lack of relevance-labeled data. In this paper, we introduce a novel approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to tackle this issue. SL-HyDE leverages large language models (LLMs) as generators to generate hypothetical documents based on a given query. These generated documents encapsulate key medical context, guiding a dense retriever in identifying the most relevant documents. The self-learning framework progressively refines both pseudo-document generation and retrieval, utilizing unlabeled medical corpora without requiring any relevance-labeled data. Additionally, we present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios, encompassing five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results demonstrate that SL-HyDE significantly surpasses existing methods in retrieval accuracy while showcasing strong generalization and scalability across various LLM and retriever configurations. CMIRB data and evaluation code are publicly available at: https://github.com/CMIRB-benchmark/CMIRB.
Community
- We propose Self-Learning Hypothetical Document Embeddings (SL-HyDE) for zero-shot medical information retrieval, eliminating the need for relevance-labeled data.
- We develop a comprehensive Chinese Medical Information Retrieval Benchmark (CMIRB) and evaluate the performance of various retrieval models using this benchmark.
- CMIRB data and evaluation code are publicly available at: https://github.com/CMIRB-benchmark/CMIRB
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback (2024)
- REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models (2024)
- SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains (2024)
- UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers (2024)
- Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 10
Browse 10 datasets citing this paperSpaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper