RadLIT-CrossEncoder: Radiology Reranking Model

A cross-encoder model fine-tuned for reranking radiology document retrieval results. Designed to work as the second stage of the RadLITE pipeline, where it delivers its largest gains on complex clinical queries (up to +30% on board-exam-style questions).

Model Description

RadLIT-CrossEncoder takes a query-document pair and outputs a relevance score. Unlike bi-encoders that encode queries and documents separately, cross-encoders process them jointly, enabling more nuanced relevance judgments at the cost of higher latency.

Architecture

  • Base Model: BERT architecture (medical-initialized)
  • Hidden Size: 384
  • Layers: 12
  • Attention Heads: 12
  • Parameters: ~33M (optimized for inference speed)
  • Max Sequence Length: 512 tokens
  • Output: Single relevance score (regression)

Training

The model was fine-tuned on radiology query-document pairs with relevance labels:

  • Training Objective: Binary Cross-Entropy with soft labels
  • Training Data: Expert-labeled query-document pairs from radiology education
  • Hard Negatives: Mined from bi-encoder retrieval failures
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Epochs: 3
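The soft-label objective above can be sketched in plain NumPy (illustrative only; the actual training would use a framework loss such as PyTorch's BCE-with-logits). Soft labels are graded relevance values in [0, 1] rather than hard 0/1 judgments:

```python
import numpy as np

def soft_label_bce(logits, soft_labels):
    """Binary cross-entropy against soft (non-binary) relevance labels.

    logits: raw model outputs, shape (n,)
    soft_labels: graded relevance in [0, 1], shape (n,)
    """
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))  # sigmoid
    y = np.asarray(soft_labels, dtype=float)
    eps = 1e-12  # avoid log(0)
    losses = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return losses.mean()

# A clearly relevant pair scored high and an irrelevant pair scored low
# both incur small loss; mid-range logits against soft labels are penalized
# in proportion to how far the predicted probability sits from the label.
print(soft_label_bce([2.0, -1.0], [1.0, 0.0]))
```

Unlike hard binary labels, soft labels let partially relevant documents contribute a graded training signal.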

Note: Training data sources are not disclosed due to variable licensing. The model is released under Apache 2.0.

Performance

Impact on RadLITE Pipeline

When combined with RadLIT-BiEncoder:

| Configuration | MRR | Improvement |
|---|---|---|
| Bi-encoder only | 0.698 | baseline |
| + Cross-encoder reranking | 0.782 | +12.0% |
| + BM25 fusion (RadLITE) | 0.829 | +18.8% |
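The improvement column is the relative gain over the bi-encoder baseline; a quick check against the MRR values:

```python
# Relative gains are computed against the bi-encoder-only MRR.
BASELINE_MRR = 0.698

def relative_gain(mrr, base=BASELINE_MRR):
    """Percentage improvement of an MRR value over the baseline."""
    return 100 * (mrr - base) / base

print(f"{relative_gain(0.782):+.1f}%")  # cross-encoder reranking -> +12.0%
print(f"{relative_gain(0.829):+.1f}%")  # full RadLITE fusion    -> +18.8%
```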

Performance on Complex Queries

The cross-encoder shows its largest improvements on queries that require complex clinical reasoning:

| Query Type | Improvement |
|---|---|
| Board exam questions | +30.3% |
| Differential diagnosis | +22.5% |
| Staging/classification | +18.0% |
| Simple factual | +5.0% |

Subspecialty Impact

The gains are greatest in subspecialties that demand clinical reasoning:

| Subspecialty | Improvement with CE |
|---|---|
| Physics | +33.9% |
| Genitourinary | +20.1% |
| Neuroradiology | +18.0% |
| Gastrointestinal | +16.6% |

Usage

Installation

pip install sentence-transformers

Basic Usage

from sentence_transformers import CrossEncoder

# Load model
model = CrossEncoder('matulichpt/radlit-crossencoder')

# Score query-document pairs
pairs = [
    ["What are the CT findings in pulmonary embolism?",
     "CT pulmonary angiography shows filling defects in the pulmonary arteries..."],
    ["What are the CT findings in pulmonary embolism?",
     "MRI of the knee shows ACL tear with bone bruise pattern..."]
]

scores = model.predict(pairs)
print(scores)  # e.g. [0.92, 0.08]; higher score = more relevant

Reranking Pipeline

from sentence_transformers import SentenceTransformer, CrossEncoder, util
import numpy as np
import torch

# Load models
biencoder = SentenceTransformer('matulichpt/radlit-biencoder')
crossencoder = CrossEncoder('matulichpt/radlit-crossencoder')

def retrieve_and_rerank(query, corpus, corpus_embeddings, top_k=10, rerank_k=50):
    # Stage 1: Bi-encoder retrieval
    query_embedding = biencoder.encode(query, convert_to_tensor=True)
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_indices = torch.topk(cos_scores, k=min(rerank_k, len(corpus)))[1].tolist()

    # Stage 2: Cross-encoder reranking
    candidates = [corpus[i] for i in top_indices]
    pairs = [[query, doc] for doc in candidates]
    ce_scores = crossencoder.predict(pairs)

    # Temperature calibration (T=1.5): monotonic, so it leaves this ranking
    # unchanged, but it keeps scores comparable for downstream score fusion
    calibrated_scores = ce_scores / 1.5

    # Sort and return top-k
    sorted_indices = np.argsort(calibrated_scores)[::-1][:top_k]
    return [(candidates[i], calibrated_scores[i]) for i in sorted_indices]

# Example
results = retrieve_and_rerank(
    "What are the imaging features of hepatocellular carcinoma?",
    corpus, corpus_embeddings
)

Demo: Cross-Encoder Reranking

from sentence_transformers import CrossEncoder
import numpy as np

model = CrossEncoder('matulichpt/radlit-crossencoder')

query = "What causes ring-enhancing brain lesions in AIDS patients?"

# Candidates from bi-encoder retrieval (simulated)
candidates = [
    "In AIDS, toxoplasmosis shows ring-enhancing lesions in basal ganglia. CNS lymphoma is typically periventricular.",
    "Brain metastases occur at gray-white junction and may show ring enhancement.",
    "Glioblastoma is the most common primary brain malignancy.",
]

# Score each candidate
pairs = [[query, doc] for doc in candidates]
scores = model.predict(pairs)

# Rank by relevance
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(f"Top result: {ranked[0][0][:80]}...")
print(f"Score: {ranked[0][1]:.2f}")
# Expected: the AIDS-specific differential ranks first

The cross-encoder correctly prioritizes the clinically relevant answer about AIDS-specific differentials.

Temperature Calibration

Important: For optimal performance in score fusion, apply temperature scaling:

# Raw CE scores have higher variance than bi-encoder scores
raw_scores = crossencoder.predict(pairs)

# Temperature calibration aligns score distributions
# T=1.5 found optimal through grid search
calibrated_scores = raw_scores / 1.5

This is critical when combining cross-encoder scores with bi-encoder scores.

Full RadLITE Fusion

from sentence_transformers import util

def radlite_score(query, document, biencoder, crossencoder, bm25_score):
    """
    Full RadLITE scoring with optimal weights.

    Optimal weights (found via grid search on RadLIT-9):
    - Bi-encoder: 0.5
    - Cross-encoder: 0.2
    - BM25: 0.3
    """
    # Bi-encoder score
    q_emb = biencoder.encode(query, convert_to_tensor=True)
    d_emb = biencoder.encode(document, convert_to_tensor=True)
    biencoder_score = float(util.cos_sim(q_emb, d_emb)[0][0])

    # Cross-encoder score (calibrated)
    ce_score = crossencoder.predict([[query, document]])[0] / 1.5

    # Fusion
    final_score = (
        0.5 * biencoder_score +
        0.2 * ce_score +
        0.3 * bm25_score  # Normalized BM25
    )

    return final_score
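The fusion above expects `bm25_score` already normalized. The model card does not state the normalization scheme, so as one plausible choice (an assumption, not the confirmed RadLITE method), per-query min-max scaling maps raw BM25 scores onto the same [0, 1] range as the other signals:

```python
import numpy as np

def minmax_normalize(scores):
    """Map raw per-query BM25 scores to [0, 1].

    One plausible scheme only; the exact RadLITE normalization
    is not specified in this model card.
    """
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    if span == 0:
        return np.zeros_like(s)  # all candidates tied
    return (s - s.min()) / span

raw_bm25 = [12.4, 7.1, 0.0, 3.3]
print(minmax_normalize(raw_bm25))  # best doc -> 1.0, worst -> 0.0
```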

Technical Details

Why Temperature Calibration?

Cross-encoder scores tend to be more extreme than bi-encoder similarity scores:

| Score Type | Typical Range | Variance |
|---|---|---|
| Bi-encoder cosine | [0.3, 0.9] | Low |
| Raw CE score | [-2, 3] | High |
| Calibrated CE (T=1.5) | [-1.3, 2] | Medium |

Without calibration, the higher-variance CE scores dominate the fusion and degrade overall performance. T=1.5 brings the two distributions into closer alignment (~0.7 correlation between the score sets).
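A quick simulation (using assumed uniform distributions over the typical ranges above) shows how dividing by T shrinks the CE spread toward the bi-encoder's:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated score distributions matching the typical ranges (assumed shapes)
ce_raw = rng.uniform(-2.0, 3.0, size=1000)   # high-variance raw CE scores
bi = rng.uniform(0.3, 0.9, size=1000)        # bi-encoder cosine scores

T = 1.5
ce_cal = ce_raw / T  # temperature scaling shrinks the spread by exactly 1/T

print(round(ce_raw.std() / bi.std(), 2))  # raw CE spread dominates
print(round(ce_cal.std() / bi.std(), 2))  # closer after calibration
```

Because division by a positive constant is monotonic, calibration never changes the CE's own ranking; it only rebalances the weighted-sum fusion.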

Latency Considerations

| Operation | Latency |
|---|---|
| Single pair scoring | ~4ms |
| 50 pairs (batch) | ~200-300ms |
| Bi-encoder (50 docs) | ~80-120ms |

For production use, consider:

  • Limiting rerank candidates (50 is optimal)
  • Batch processing
  • GPU acceleration
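Batch processing can be as simple as chunking the candidate pairs. Note that `CrossEncoder.predict` already accepts a `batch_size` argument; an explicit helper like this is only needed when you manage batching yourself (e.g. to interleave scoring with other work):

```python
def batched(items, batch_size=32):
    """Yield fixed-size chunks so query-document pairs can be scored
    one batch at a time instead of one pair per forward pass."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

pairs = [("query", f"doc {i}") for i in range(50)]
batches = list(batched(pairs, batch_size=16))
print([len(b) for b in batches])  # [16, 16, 16, 2]
```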

Intended Use

Primary Use Cases

  • Second-stage reranking for radiology retrieval
  • Relevance scoring for radiology Q&A
  • Fine-grained document ranking

Out-of-Scope Uses

  • First-stage retrieval (too slow for large corpora)
  • Non-radiology content
  • Clinical diagnosis

Limitations

  1. Latency: ~4ms per pair; not suitable for first-stage retrieval
  2. Domain: Optimized for radiology; limited generalization
  3. Context Length: 512 tokens max; long documents need truncation
  4. Score Interpretation: Requires calibration for fusion
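One common workaround for the 512-token limit (a sketch, not part of the released pipeline) is to score a long document in overlapping windows and keep the maximum. Word-level windows are used here as a rough proxy for tokens; production code should count actual tokenizer tokens:

```python
def window_scores(query, document, score_fn, window=200, stride=150):
    """Score a long document in overlapping word windows, keep the max.

    score_fn(pairs) -> list of floats, e.g. crossencoder.predict.
    Word counts approximate the token limit (an assumption).
    """
    words = document.split()
    if len(words) <= window:
        chunks = [document]
    else:
        chunks = [" ".join(words[i:i + window])
                  for i in range(0, len(words) - window + stride, stride)]
    pairs = [[query, chunk] for chunk in chunks]
    return max(score_fn(pairs))
```

Max-pooling assumes a document is relevant if any passage of it is relevant, which fits the retrieval use case here; mean-pooling would instead reward uniformly relevant documents.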

Ethical Considerations

  • Not a diagnostic tool
  • Should be used to surface relevant educational content, not replace clinical judgment
  • May reflect biases in radiology literature

Citation

@software{radlit_crossencoder_2026,
  title = {RadLIT-CrossEncoder: Radiology Reranking Model},
  author = {Grai Team},
  year = {2026},
  url = {https://huggingface.co/matulichpt/radlit-crossencoder},
  note = {+30% improvement on complex radiology queries}
}

Related Models

  • RadLIT-BiEncoder - First-stage retrieval
  • RadLITE Pipeline - Full retrieval system documentation

License

Apache 2.0 - Free for research and commercial use.

Contact

For questions or collaboration: Open an issue on the model repository
