Model Card: MedWhisper Large ITA

MedWhisper Large ITA is a domain-adapted variant of OpenAI Whisper Large v3 Turbo, fine-tuned with LoRA on a carefully curated synthetic corpus of Italian outpatient specialty visit recordings. The corpus emphasizes segments rich in clinical jargon, abbreviations, and formulaic expressions (e.g., Anamnesi, Holter delle 24 ore...), prioritizing terminology that ReportAId has identified as most frequent in Italian healthcare settings.

The objective is to boost robustness on real-world clinical audio, ensuring that medical terms and rare multi-word expressions are transcribed more faithfully. On the held-out test set, MedWhisper Large ITA reduces the Word Error Rate (WER) from 7.9% to 4.5% compared to the base model, with the largest improvements observed on domain-specific vocabulary.


Model Details

Model Description

  • Developed by: ReportAId AI Team
  • Model type: Automatic Speech Recognition (ASR)
  • Language(s): Italian
  • License: model released for research and experimentation purposes (the training dataset is private)
  • Finetuned from: openai/whisper-large-v3-turbo

Dataset

The model was trained on a proprietary ReportAId dataset, consisting of synthetic Italian outpatient specialty visit recordings. The dataset was specifically designed to emphasize clinical language and specialized terminology typical of Italian healthcare settings.

To build the textual corpus, we used an LLM (gpt-4.1) to generate 7 example sentences for each of 800 prioritized clinical terms, focusing on terminology that ReportAId has identified as most frequent in Italian healthcare contexts. This process was guided by ReportAId's clinical team to ensure coverage of the most relevant medical expressions in real-world practice.
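For illustration, a generation loop along these lines could be used with the OpenAI Python client; the prompt wording, term list, and output handling are hypothetical, not the exact pipeline.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical excerpt of the ~800 prioritized clinical terms
terms = ["anamnesi", "holter delle 24 ore", "ecocardiogramma"]

corpus = {}
for term in terms:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": (
                "Genera 7 frasi realistiche in italiano che un medico potrebbe "
                f"dettare durante una visita, usando il termine: {term}. "
                "Restituisci una frase per riga."
            ),
        }],
    )
    corpus[term] = response.choices[0].message.content.strip().splitlines()

print(json.dumps(corpus, ensure_ascii=False, indent=2))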

The generated sentences were subsequently reviewed and validated by ReportAId’s clinical team to ensure medical plausibility, domain relevance, and linguistic accuracy.

Once validated, the sentences were converted into audio using the Eleven Flash v2.5 speech synthesis model by ElevenLabs.
Where necessary, human corrections were applied to fix mispronunciations or unnatural prosody, ensuring consistent audio quality.
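For reference, speech synthesis with Eleven Flash v2.5 can be invoked through the ElevenLabs REST text-to-speech endpoint roughly as follows; the voice ID, API key placeholder, and file naming are hypothetical.

import requests

VOICE_ID = "YOUR_VOICE_ID"  # hypothetical voice selection
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

response = requests.post(
    url,
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={
        "text": "Anamnesi negativa per cardiopatie. Holter delle 24 ore nella norma.",
        "model_id": "eleven_flash_v2_5",  # Eleven Flash v2.5
    },
)
response.raise_for_status()

with open("sentence_0001.mp3", "wb") as f:
    f.write(response.content)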

Total audio duration:

Split        Minutes    Hours (≈)
Train        3416.20    ~57 h
Validation   388.50     ~6.5 h
Test         197.72     ~3.3 h

The test set is the same one used to evaluate openai/whisper-large-v3-turbo, ensuring a fair comparison.


Performance

WER

The following table summarizes the WER achieved by MedWhisper Large ITA compared to Whisper Large v3 Turbo, evaluated on the same held-out clinical speech test set.

WER on held-out clinical test set

Model                     WER ↓
Whisper Large v3 Turbo    7.9%
MedWhisper Large ITA      4.5%

This represents a ~43% relative error reduction, with the most substantial improvements observed on domain-specific medical terminology β€” a crucial factor for reliable clinical transcription.
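As a point of reference, the comparison can be reproduced with the jiwer library; the sketch below uses toy transcripts, not the actual test data.

import jiwer

# Hypothetical reference and hypothesis transcripts (normalized text)
references = ["anamnesi negativa per cardiopatie", "holter delle 24 ore nella norma"]
hypotheses = ["anamnesi negativa per cardiopatia", "holter delle 24 ore nella norma"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.1%}")

# Relative error reduction between the base and fine-tuned WERs reported above
base_wer, ft_wer = 0.079, 0.045
print(f"Relative reduction: {(base_wer - ft_wer) / base_wer:.0%}")  # ~43%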

Semantic Equivalence Evaluation

To complement WER, we introduced a custom qualitative metric that uses GPT-5 as a judge of semantic equivalence between sentences. This accounts for cases where the wording differs (e.g., numbers spelled out vs. digits) while the meaning remains identical.

Examples

  • β€œsento male alla vertebra l2” ≑ β€œsento male alla vertebra l due”
  • β€œho preso tachipirina 500mg” ≑ β€œho preso tachipirina cinquecento mg”
  • β€œvia garibaldi 12” ≑ β€œvia garibaldi dodici”
  • β€œesame il 03/05/2025” ≑ β€œesame il tre maggio 2025”
  • β€œpressione 120/80” ≑ β€œpressione centoventi su ottanta”

Traditional WER does not capture such equivalences, while GPT-5 successfully judges them as semantically identical.
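Below is a minimal sketch of how such a judge could be queried, assuming GPT-5 is reachable through the standard OpenAI Chat Completions interface; the prompt wording and YES/NO parsing are illustrative, not the exact evaluation harness.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_equivalent(reference: str, hypothesis: str) -> bool:
    """Asks the judge model whether two transcripts mean the same thing."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": (
                "Do these two Italian transcripts have exactly the same meaning, "
                "ignoring surface differences such as digits vs. spelled-out "
                f"numbers? Answer YES or NO.\n1: {reference}\n2: {hypothesis}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

print(judge_equivalent("pressione 120/80", "pressione centoventi su ottanta"))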

Quantitative results

Model                     % Sentences with WER = 0    % Semantically equivalent (WER > 0)    Avg. WER (semantically non-equivalent)
Whisper Large v3 Turbo    39.3%                       96.63%                                  14.01%
MedWhisper Large ITA      55.8%                       97.42%                                  10.36%

Here, β€œAvg. WER (semantically non-equivalent)” refers to the mean WER computed only on sentences that were not perfectly transcribed (WER>0) and also judged non-equivalent by GPT-5.


πŸ“Œ Summary of results

MedWhisper Large ITA achieves substantial gains in both WER and semantic equivalence (as judged by GPT-5), yielding fewer non-equivalent outputs and more faithful transcriptions of domain-specific medical terms.


Uses

Direct Use

  • Automatic transcription in Italian
  • Use in reporting systems, meeting transcription, voice-to-text

Recommended Preprocessing

Whisper models are susceptible to hallucinations, especially during extended silence or low-SNR segments. To achieve the best performance, we recommend normalizing and cleaning the audio files before inference.

Example functions:

import io
import numpy as np
import soundfile as sf
from pydub import AudioSegment
from pydub.silence import split_on_silence
from noisereduce import reduce_noise

def normalize_audio(audio_bytes: bytes) -> bytes:
    """Converts an audio chunk to 16kHz, mono, WAV PCM"""
    audio_segment = AudioSegment.from_file(io.BytesIO(audio_bytes))
    audio_segment = audio_segment.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    buffer = io.BytesIO()
    audio_segment.export(buffer, format="wav")
    return buffer.getvalue()

def normalize_volume(audio_bytes: bytes) -> bytes:
    """Normalizes the volume to -1 dBFS avoiding clipping"""
    audio_segment = AudioSegment.from_wav(io.BytesIO(audio_bytes))
    normalized_segment = audio_segment.normalize(headroom=0.1)
    buffer = io.BytesIO()
    normalized_segment.export(buffer, format="wav")
    return buffer.getvalue()

def reduce_background_noise(audio_bytes: bytes) -> bytes:
    """Background noise reduction (beta)"""
    buffer_read = io.BytesIO(audio_bytes)
    rate, data = sf.read(buffer_read)
    if data.ndim > 1:
        data = np.mean(data, axis=1)
    reduced_noise_data = reduce_noise(y=data, sr=rate)
    buffer_write = io.BytesIO()
    sf.write(buffer_write, reduced_noise_data, rate, format='wav')
    return buffer_write.getvalue()

def remove_silence(audio_bytes: bytes) -> bytes:
    """Removes silent segments while maintaining fluidity"""
    audio_segment = AudioSegment.from_wav(io.BytesIO(audio_bytes))
    chunks = split_on_silence(audio_segment, min_silence_len=100, silence_thresh=-35, keep_silence=80)
    if not chunks:
        return b''
    processed_segment = sum(chunks, AudioSegment.empty())
    buffer = io.BytesIO()
    processed_segment.export(buffer, format="wav")
    return buffer.getvalue()
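The helpers can be chained into a simple cleanup pipeline before inference; the file names below are illustrative.

# Hypothetical end-to-end cleanup before transcription
with open("visit_recording.mp3", "rb") as f:
    audio = f.read()

audio = normalize_audio(audio)          # resample to 16 kHz mono PCM WAV
audio = normalize_volume(audio)         # peak normalization with headroom
audio = reduce_background_noise(audio)  # optional, still in beta
audio = remove_silence(audio)           # trim silences that trigger hallucinations

with open("clean.wav", "wb") as f:
    f.write(audio)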

Training Details

  • Technique: LoRA fine-tuning
  • Hardware: 4Γ— NVIDIA A6000
  • Precision: mixed fp16
  • Optimizer: AdamW

Evaluation

  • Test set: ReportAId dataset (197.7 min, domain-specific)
  • Metric: WER
  • Result: 4.5%

Inference speed

On a machine with 4 CPU cores using Faster-Whisper, MedWhisper Large ITA processes audio in 50–70% of the input duration (RTF β‰ˆ 0.5–0.7), which is roughly 2Γ— more efficient than running the original model with Transformers. This means that a 1-minute audio file can be transcribed in ~30–42 seconds, making the model suitable for near real-time transcription even on CPU-only setups. The equivalent version of MedWhisper Large ITA supported by Faster-Whisper is available as ReportAId/medwhisper-large-v3-ita-ct2.
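A rough RTF measurement can be reproduced with a few lines of Faster-Whisper; the device and quantization settings below are illustrative.

import time
from faster_whisper import WhisperModel

model = WhisperModel("ReportAId/medwhisper-large-v3-ita-ct2",
                     device="cpu", compute_type="int8")

start = time.perf_counter()
segments, info = model.transcribe("sample.wav")
text = " ".join(s.text for s in segments)  # decoding is lazy: consuming the
                                           # generator runs the actual inference
elapsed = time.perf_counter() - start

print(f"RTF: {elapsed / info.duration:.2f}")  # < 1.0 means faster than real time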



How to Get Started with the Model

import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model_id = "ReportAId/medwhisper-large-v3-ita"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Load an audio file (16kHz, mono)
speech_array, sampling_rate = torchaudio.load("sample.wav")
if sampling_rate != 16000:
    speech_array = torchaudio.functional.resample(speech_array, sampling_rate, 16000)

# Preprocess
inputs = processor(speech_array.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

How to use the model with Faster-Whisper (~2× more efficient)

from faster_whisper import WhisperModel

model = WhisperModel("ReportAId/medwhisper-large-v3-ita-ct2")

segments, info = model.transcribe("audio.mp3")

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

πŸš€ Try it out

You can try MedWhisper Large ITA directly in your browser through an interactive demo.

Open in Spaces

Upload or record audio to experience near real-time transcription in Italian β€” no installation required.
