Model Card: MedWhisper Large ITA

MedWhisper Large ITA is a domain-adapted variant of OpenAI Whisper Large v3 Turbo, fine-tuned with LoRA on a carefully curated synthetic corpus of Italian outpatient specialty visit recordings. The corpus emphasizes segments rich in clinical jargon, abbreviations, and formulaic expressions (e.g., Anamnesi, Holter delle 24 ore...), prioritizing terminology that ReportAId has identified as most frequent in Italian healthcare settings.

The objective is to boost robustness on real-world clinical audio, ensuring that medical terms and rare multi-word expressions are transcribed more faithfully. On the held-out test set, MedWhisper Large ITA reduces the Word Error Rate (WER) from 7.9% to 4.5% compared to the base model, with the largest improvements observed on domain-specific vocabulary.


Model Details

Model Description

  • Developed by: ReportAId AI Team
  • Model type: Automatic Speech Recognition (ASR)
  • Language(s): Italian
  • License: model released for research and experimentation purposes (the training dataset is private)
  • Finetuned from: openai/whisper-large-v3-turbo

Dataset

The model was trained on a proprietary ReportAId dataset, consisting of synthetic Italian outpatient specialty visit recordings. The dataset was specifically designed to emphasize clinical language and specialized terminology typical of Italian healthcare settings.

To build the textual corpus, we used an LLM (gpt-4.1) to generate 7 example sentences for each of 800 prioritized clinical terms, focusing on terminology that ReportAId has identified as most frequent in Italian healthcare contexts. This process was guided by ReportAId's clinical team to ensure coverage of the most relevant medical expressions in real-world practice.
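For illustration, a generation loop along these lines could be used with the OpenAI Python client; the prompt wording, term list, and output handling are hypothetical, not the exact pipeline.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical excerpt of the ~800 prioritized clinical terms
terms = ["anamnesi", "holter delle 24 ore", "ecocardiogramma"]

corpus = {}
for term in terms:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": (
                "Genera 7 frasi realistiche in italiano che un medico potrebbe "
                f"dettare durante una visita, usando il termine: {term}. "
                "Restituisci una frase per riga."
            ),
        }],
    )
    corpus[term] = response.choices[0].message.content.strip().splitlines()

print(json.dumps(corpus, ensure_ascii=False, indent=2))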

The generated sentences were subsequently reviewed and validated by ReportAId’s clinical team to ensure medical plausibility, domain relevance, and linguistic accuracy.

Once validated, the sentences were converted into audio using the Eleven Flash v2.5 speech synthesis model by ElevenLabs.
Where necessary, human corrections were applied to fix mispronunciations or unnatural prosody, ensuring consistent audio quality.
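For reference, speech synthesis with Eleven Flash v2.5 can be invoked through the ElevenLabs REST text-to-speech endpoint roughly as follows; the voice ID, API key placeholder, and file naming are hypothetical.

import requests

VOICE_ID = "YOUR_VOICE_ID"  # hypothetical voice selection
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

response = requests.post(
    url,
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={
        "text": "Anamnesi negativa per cardiopatie. Holter delle 24 ore nella norma.",
        "model_id": "eleven_flash_v2_5",  # Eleven Flash v2.5
    },
)
response.raise_for_status()

with open("sentence_0001.mp3", "wb") as f:
    f.write(response.content)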

Total audio duration:

Split        Minutes    Hours (≈)
Train        3416.20    ~57 h
Validation   388.50     ~6.5 h
Test         197.72     ~3.3 h

The test set is the same one used to evaluate openai/whisper-large-v3-turbo, ensuring a fair comparison.


Performance

WER

The following table summarizes the WER achieved by MedWhisper Large ITA compared to Whisper Large v3 Turbo, evaluated on the same held-out clinical speech test set.

WER on held-out clinical test set

Model                     WER ↓
Whisper Large v3 Turbo    7.9%
MedWhisper Large ITA      4.5%

This represents a ~43% relative error reduction, with the most substantial improvements observed on domain-specific medical terminology β€” a crucial factor for reliable clinical transcription.
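As a point of reference, the comparison can be reproduced with the jiwer library; the sketch below uses toy transcripts, not the actual test data.

import jiwer

# Hypothetical reference and hypothesis transcripts (normalized text)
references = ["anamnesi negativa per cardiopatie", "holter delle 24 ore nella norma"]
hypotheses = ["anamnesi negativa per cardiopatia", "holter delle 24 ore nella norma"]

wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.1%}")

# Relative error reduction between the base and fine-tuned WERs reported above
base_wer, ft_wer = 0.079, 0.045
print(f"Relative reduction: {(base_wer - ft_wer) / base_wer:.0%}")  # ~43%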

Semantic Equivalence Evaluation

To complement WER, we introduced a custom qualitative metric that uses GPT-5 as a judge of semantic equivalence between sentences. This accounts for cases where the wording differs (e.g., numbers spelled out vs. digits) while the meaning remains identical.

Examples

  • β€œsento male alla vertebra l2” ≑ β€œsento male alla vertebra l due”
  • β€œho preso tachipirina 500mg” ≑ β€œho preso tachipirina cinquecento mg”
  • β€œvia garibaldi 12” ≑ β€œvia garibaldi dodici”
  • β€œesame il 03/05/2025” ≑ β€œesame il tre maggio 2025”
  • β€œpressione 120/80” ≑ β€œpressione centoventi su ottanta”

Traditional WER does not capture such equivalences, while GPT-5 successfully judges them as semantically identical.
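Below is a minimal sketch of how such a judge could be queried, assuming GPT-5 is reachable through the standard OpenAI Chat Completions interface; the prompt wording and YES/NO parsing are illustrative, not the exact evaluation harness.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_equivalent(reference: str, hypothesis: str) -> bool:
    """Asks the judge model whether two transcripts mean the same thing."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": (
                "Do these two Italian transcripts have exactly the same meaning, "
                "ignoring surface differences such as digits vs. spelled-out "
                f"numbers? Answer YES or NO.\n1: {reference}\n2: {hypothesis}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

print(judge_equivalent("pressione 120/80", "pressione centoventi su ottanta"))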

Quantitative results

Model                     % Sentences with WER = 0    % Semantically equivalent (WER > 0)    Avg. WER (semantically non-equivalent)
Whisper Large v3 Turbo    39.3%                       96.63%                                  14.01%
MedWhisper Large ITA      55.8%                       97.42%                                  10.36%

Here, β€œAvg. WER (semantically non-equivalent)” refers to the mean WER computed only on sentences that were not perfectly transcribed (WER>0) and also judged non-equivalent by GPT-5.


πŸ“Œ Summary of results

MedWhisper Large ITA achieves substantial gains in both WER and semantic equivalence (as judged by GPT-5), yielding fewer non-equivalent outputs and more faithful transcriptions of domain-specific medical terms.


Uses

Direct Use

  • Automatic transcription in Italian
  • Use in reporting systems, meeting transcription, voice-to-text

Recommended Preprocessing

Whisper models are susceptible to hallucinations, especially during extended silence or low-SNR segments. To achieve the best performance, we recommend normalizing and cleaning the audio files before inference.

Example functions:

import io
import numpy as np
import soundfile as sf
from pydub import AudioSegment
from pydub.silence import split_on_silence
from noisereduce import reduce_noise

def normalize_audio(audio_bytes: bytes) -> bytes:
    """Converts an audio chunk to 16kHz, mono, WAV PCM"""
    audio_segment = AudioSegment.from_file(io.BytesIO(audio_bytes))
    audio_segment = audio_segment.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    buffer = io.BytesIO()
    audio_segment.export(buffer, format="wav")
    return buffer.getvalue()

def normalize_volume(audio_bytes: bytes) -> bytes:
    """Normalizes the volume to -1 dBFS avoiding clipping"""
    audio_segment = AudioSegment.from_wav(io.BytesIO(audio_bytes))
    normalized_segment = audio_segment.normalize(headroom=0.1)
    buffer = io.BytesIO()
    normalized_segment.export(buffer, format="wav")
    return buffer.getvalue()

def reduce_background_noise(audio_bytes: bytes) -> bytes:
    """Background noise reduction (beta)"""
    buffer_read = io.BytesIO(audio_bytes)
    rate, data = sf.read(buffer_read)
    if data.ndim > 1:
        data = np.mean(data, axis=1)
    reduced_noise_data = reduce_noise(y=data, sr=rate)
    buffer_write = io.BytesIO()
    sf.write(buffer_write, reduced_noise_data, rate, format='wav')
    return buffer_write.getvalue()

def remove_silence(audio_bytes: bytes) -> bytes:
    """Removes silent segments while maintaining fluidity"""
    audio_segment = AudioSegment.from_wav(io.BytesIO(audio_bytes))
    chunks = split_on_silence(audio_segment, min_silence_len=100, silence_thresh=-35, keep_silence=80)
    if not chunks:
        return b''
    processed_segment = sum(chunks, AudioSegment.empty())
    buffer = io.BytesIO()
    processed_segment.export(buffer, format="wav")
    return buffer.getvalue()
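The helpers can be chained into a simple cleanup pipeline before inference; the file names below are illustrative.

# Hypothetical end-to-end cleanup before transcription
with open("visit_recording.mp3", "rb") as f:
    audio = f.read()

audio = normalize_audio(audio)          # resample to 16 kHz mono PCM WAV
audio = normalize_volume(audio)         # peak normalization with headroom
audio = reduce_background_noise(audio)  # optional, still in beta
audio = remove_silence(audio)           # trim silences that trigger hallucinations

with open("clean.wav", "wb") as f:
    f.write(audio)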

Training Details

  • Technique: LoRA fine-tuning
  • Hardware: 4Γ— NVIDIA A6000
  • Precision: mixed fp16
  • Optimizer: AdamW

Evaluation

  • Test set: ReportAId dataset (197.7 min, domain-specific)
  • Metric: WER
  • Result: 4.5%

Inference speed

On a machine with 4 CPU cores using Faster-Whisper, MedWhisper Large ITA processes audio in 50–70% of the input duration (RTF β‰ˆ 0.5–0.7), which is roughly 2Γ— more efficient than running the original model with Transformers. This means that a 1-minute audio file can be transcribed in ~30–42 seconds, making the model suitable for near real-time transcription even on CPU-only setups. The equivalent version of MedWhisper Large ITA supported by Faster-Whisper is available as ReportAId/medwhisper-large-v3-ita-ct2.
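A rough RTF measurement can be reproduced with a few lines of Faster-Whisper; the device and quantization settings below are illustrative.

import time
from faster_whisper import WhisperModel

model = WhisperModel("ReportAId/medwhisper-large-v3-ita-ct2",
                     device="cpu", compute_type="int8")

start = time.perf_counter()
segments, info = model.transcribe("sample.wav")
text = " ".join(s.text for s in segments)  # decoding is lazy: consuming the
                                           # generator runs the actual inference
elapsed = time.perf_counter() - start

print(f"RTF: {elapsed / info.duration:.2f}")  # < 1.0 means faster than real time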



How to Get Started with the Model

import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model_id = "ReportAId/medwhisper-large-v3-ita"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Load an audio file (16kHz, mono)
speech_array, sampling_rate = torchaudio.load("sample.wav")
if sampling_rate != 16000:
    speech_array = torchaudio.functional.resample(speech_array, sampling_rate, 16000)

# Preprocess
inputs = processor(speech_array.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

How to use the model with Faster-Whisper (~2× more efficient)

from faster_whisper import WhisperModel

model = WhisperModel("ReportAId/medwhisper-large-v3-ita-ct2")

segments, info = model.transcribe("audio.mp3")

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

πŸš€ Try it out

You can try MedWhisper Large ITA directly in your browser through an interactive demo.

Open in Spaces

Upload or record audio to experience near real-time transcription in Italian β€” no installation required.
