RadLIT-CrossEncoder: Radiology Reranking Model
A cross-encoder model fine-tuned for reranking radiology document retrieval results. Designed to work as the second stage of the RadLITE pipeline, it delivers up to +30% MRR improvement on complex clinical queries.
Model Description
RadLIT-CrossEncoder takes a query-document pair and outputs a relevance score. Unlike bi-encoders that encode queries and documents separately, cross-encoders process them jointly, enabling more nuanced relevance judgments at the cost of higher latency.
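A minimal illustration of the joint input format, using a generic BERT tokenizer as a stand-in for the model's own (which CrossEncoder loads automatically): the query and document are packed into a single sequence and scored in one forward pass.
from transformers import AutoTokenizer

# Any BERT-style tokenizer illustrates the pair packing
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("CT findings in PE?", "CTPA shows filling defects...",
          truncation=True, max_length=512)
print(tok.decode(enc["input_ids"]))
# [CLS] ct findings in pe? [SEP] ctpa shows filling defects... [SEP]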
Architecture
- Base Model: BERT architecture (medical-initialized)
- Hidden Size: 384
- Layers: 12
- Attention Heads: 12
- Parameters: ~33M (optimized for inference speed)
- Max Sequence Length: 512 tokens
- Output: Single relevance score (regression)
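As a rough consistency check, the ~33M figure matches back-of-envelope arithmetic for a 12-layer, hidden-size-384 BERT encoder (a ~30k vocabulary is assumed; biases and LayerNorm are ignored):
vocab, d, layers, max_pos = 30522, 384, 12, 512
embeddings = (vocab + max_pos + 2) * d     # token + position + segment tables
per_layer = 4 * d * d + 2 * d * (4 * d)    # attention (Q, K, V, O) + FFN (4x expansion)
total = embeddings + layers * per_layer
print(f"{total / 1e6:.1f}M parameters")    # ~33.2M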
Training
The model was fine-tuned on radiology query-document pairs with relevance labels:
- Training Objective: Binary Cross-Entropy with soft labels
- Training Data: Expert-labeled query-document pairs from radiology education
- Hard Negatives: Mined from bi-encoder retrieval failures
- Batch Size: 16
- Learning Rate: 2e-5
- Epochs: 3
Note: Training data sources are not disclosed due to variable licensing. The model is released under Apache 2.0.
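A minimal sketch of this fine-tuning setup, using the classic sentence-transformers CrossEncoder.fit API. The training pairs below are hypothetical placeholders (the real data is not released, per the note above), and warmup_steps is an assumption not stated in this card. With num_labels=1, fit defaults to BCEWithLogitsLoss, matching the soft-label objective.
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Hypothetical soft-labeled pairs; hard negatives get low (not zero) labels
train_samples = [
    InputExample(texts=["query ...", "relevant passage ..."], label=0.9),
    InputExample(texts=["query ...", "hard negative passage ..."], label=0.1),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)

model = CrossEncoder("bert-base-uncased", num_labels=1, max_length=512)
model.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    warmup_steps=100,                  # assumed; not stated in the card
    optimizer_params={"lr": 2e-5},
)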
Performance
Impact on RadLITE Pipeline
When combined with RadLIT-BiEncoder:
| Configuration | MRR | Improvement |
|---|---|---|
| Bi-encoder only | 0.698 | baseline |
| + Cross-encoder reranking | 0.782 | +12.0% |
| + BM25 fusion (RadLITE) | 0.829 | +18.8% |
Performance on Complex Queries
The cross-encoder shows the largest improvements on complex clinical reasoning queries:
| Query Type | Improvement |
|---|---|
| Board exam questions | +30.3% |
| Differential diagnosis | +22.5% |
| Staging/classification | +18.0% |
| Simple factual | +5.0% |
Subspecialty Impact
Gains are greatest in subspecialties that require clinical reasoning:
| Subspecialty | Improvement with CE |
|---|---|
| Physics | +33.9% |
| Genitourinary | +20.1% |
| Neuroradiology | +18.0% |
| Gastrointestinal | +16.6% |
Usage
Installation
pip install sentence-transformers
Basic Usage
from sentence_transformers import CrossEncoder
# Load model
model = CrossEncoder('matulichpt/radlit-crossencoder')
# Score query-document pairs
pairs = [
    ["What are the CT findings in pulmonary embolism?",
     "CT pulmonary angiography shows filling defects in the pulmonary arteries..."],
    ["What are the CT findings in pulmonary embolism?",
     "MRI of the knee shows ACL tear with bone bruise pattern..."]
]
scores = model.predict(pairs)
print(scores)  # e.g. [0.92, 0.08] - higher score = more relevant
Reranking Pipeline
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import numpy as np
import torch

# Load models
biencoder = SentenceTransformer('matulichpt/radlit-biencoder')
crossencoder = CrossEncoder('matulichpt/radlit-crossencoder')

def retrieve_and_rerank(query, corpus, corpus_embeddings, top_k=10, rerank_k=50):
    # Stage 1: Bi-encoder retrieval
    query_embedding = biencoder.encode(query, convert_to_tensor=True)
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_indices = torch.topk(cos_scores, k=rerank_k)[1].tolist()

    # Stage 2: Cross-encoder reranking
    candidates = [corpus[i] for i in top_indices]
    pairs = [[query, doc] for doc in candidates]
    ce_scores = crossencoder.predict(pairs)

    # Apply temperature calibration (IMPORTANT: use T=1.5)
    calibrated_scores = ce_scores / 1.5

    # Sort and return top-k
    sorted_indices = np.argsort(calibrated_scores)[::-1][:top_k]
    return [(candidates[i], calibrated_scores[i]) for i in sorted_indices]

# Example
results = retrieve_and_rerank(
    "What are the imaging features of hepatocellular carcinoma?",
    corpus, corpus_embeddings
)
Demo: Cross-Encoder Reranking
from sentence_transformers import CrossEncoder

model = CrossEncoder('matulichpt/radlit-crossencoder')
query = "What causes ring-enhancing brain lesions in AIDS patients?"

# Candidates from bi-encoder retrieval (simulated)
candidates = [
    "In AIDS, toxoplasmosis shows ring-enhancing lesions in basal ganglia. CNS lymphoma is typically periventricular.",
    "Brain metastases occur at gray-white junction and may show ring enhancement.",
    "Glioblastoma is the most common primary brain malignancy.",
]

# Score each candidate
pairs = [[query, doc] for doc in candidates]
scores = model.predict(pairs)

# Rank by relevance
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(f"Top result: {ranked[0][0][:80]}...")
print(f"Score: {ranked[0][1]:.2f}")
# The AIDS-specific answer ranks first despite shorter text
The cross-encoder correctly prioritizes the clinically relevant answer about AIDS-specific differentials.
Temperature Calibration
Important: For optimal performance in score fusion, apply temperature scaling:
# Raw CE scores have higher variance than bi-encoder scores
raw_scores = crossencoder.predict(pairs)
# Temperature calibration aligns score distributions
# T=1.5 found optimal through grid search
calibrated_scores = raw_scores / 1.5
This is critical when combining cross-encoder scores with bi-encoder scores.
Full RadLITE Fusion
from sentence_transformers import util

def radlite_score(query, document, biencoder, crossencoder, bm25_score):
    """
    Full RadLITE scoring with optimal weights.

    Optimal weights (found via grid search on RadLIT-9):
    - Bi-encoder: 0.5
    - Cross-encoder: 0.2
    - BM25: 0.3
    """
    # Bi-encoder score
    q_emb = biencoder.encode(query, convert_to_tensor=True)
    d_emb = biencoder.encode(document, convert_to_tensor=True)
    biencoder_score = float(util.cos_sim(q_emb, d_emb)[0][0])

    # Cross-encoder score (calibrated, T=1.5)
    ce_score = crossencoder.predict([[query, document]])[0] / 1.5

    # Weighted fusion
    final_score = (
        0.5 * biencoder_score +
        0.2 * ce_score +
        0.3 * bm25_score  # BM25 score must be normalized to [0, 1]
    )
    return final_score
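The card does not show how the BM25 score is normalized before fusion; one reasonable choice is min-max scaling over the candidate set. A sketch, using rank_bm25 as a stand-in BM25 backend:
import numpy as np
from rank_bm25 import BM25Okapi  # assumed backend; any BM25 scorer works

def normalized_bm25(query, candidates):
    # Min-max scale raw BM25 scores to [0, 1] so the 0.3 fusion weight applies
    bm25 = BM25Okapi([doc.lower().split() for doc in candidates])
    raw = np.array(bm25.get_scores(query.lower().split()))
    span = raw.max() - raw.min()
    return (raw - raw.min()) / span if span > 0 else np.zeros_like(raw)

# Hypothetical usage, reusing query/candidates/models from the examples above
bm25_scores = normalized_bm25(query, candidates)
score = radlite_score(query, candidates[0], biencoder, crossencoder, bm25_scores[0])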
Technical Details
Why Temperature Calibration?
Cross-encoder scores tend to be more extreme than bi-encoder similarity scores:
| Score Type | Typical Range | Variance |
|---|---|---|
| Bi-encoder cosine | [0.3, 0.9] | Low |
| Raw CE score | [-2, 3] | High |
| Calibrated CE (T=1.5) | [-1.3, 2] | Medium |
Without calibration, the CE dominates the fusion and degrades overall performance. Temperature 1.5 achieves ~0.7 correlation between score distributions.
Latency Considerations
| Operation | Latency |
|---|---|
| Single pair scoring | ~4ms |
| 50 pairs (batch) | ~200-300ms |
| Bi-encoder (50 docs) | ~80-120ms |
For production use, consider:
- Limiting rerank candidates (50 is optimal)
- Batch processing
- GPU acceleration
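These figures vary with hardware; a minimal timing sketch for checking the 50-pair batch latency yourself (batch size 32 is an arbitrary choice):
import time
from sentence_transformers import CrossEncoder

model = CrossEncoder('matulichpt/radlit-crossencoder')
pairs = [["example query", "example radiology passage"]] * 50

model.predict(pairs)                      # warm-up (first-pass overhead)
start = time.perf_counter()
model.predict(pairs, batch_size=32)
print(f"50 pairs: {(time.perf_counter() - start) * 1e3:.0f} ms")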
Intended Use
Primary Use Cases
- Second-stage reranking for radiology retrieval
- Relevance scoring for radiology Q&A
- Fine-grained document ranking
Out-of-Scope Uses
- First-stage retrieval (too slow for large corpora)
- Non-radiology content
- Clinical diagnosis
Limitations
- Latency: ~4ms per pair; not suitable for first-stage retrieval
- Domain: Optimized for radiology; limited generalization
- Context Length: 512 tokens max; longer documents must be truncated or chunked (see the sketch after this list)
- Score Interpretation: Requires calibration for fusion
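For documents beyond the 512-token limit, one common workaround (a heuristic, not part of the released pipeline) is to score overlapping chunks and take the maximum:
def score_long_document(model, query, document, chunk_words=300, stride=150):
    # Split on words as a cheap proxy for tokens; overlap preserves context
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, max(len(words) - stride, 1), stride)]
    scores = model.predict([[query, chunk] for chunk in chunks])
    return max(scores)  # max-pool over chunk scores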
Ethical Considerations
- Not a diagnostic tool
- Should be used to surface relevant educational content, not replace clinical judgment
- May reflect biases in radiology literature
Citation
@software{radlit_crossencoder_2026,
  title  = {RadLIT-CrossEncoder: Radiology Reranking Model},
  author = {Grai Team},
  year   = {2026},
  url    = {https://huggingface.co/matulichpt/radlit-crossencoder},
  note   = {+30% improvement on complex radiology queries}
}
Related Models
- RadLIT-BiEncoder - First-stage retrieval
- RadLITE Pipeline - Full retrieval system documentation
License
Apache 2.0 - Free for research and commercial use.
Contact
For questions or collaboration: Open an issue on the model repository