metadata

library_name: transformers
tags:
  - automatic-speech-recognition
  - contrastive-learning
  - synthetic-data-filtering
license: apache-2.0
datasets:
  - mozilla-foundation/common_voice_17_0
  - facebook/multilingual_librispeech
language:
  - pt
metrics:
  - wer
  - cer
pipeline_tag: automatic-speech-recognition

Model Card for Finetuned Version of Whisper-Small

This model was fine-tuned on a filtered subset of synthetically generated data in order to improve the performance of Whisper. The filtering approach aligns representations of synthetic audio with the corresponding text transcripts to identify and remove low-quality samples, improving the overall quality of the training data.

For this specific model, we used 82.32% of the synthetic data generated by SeamlessM4T Large v2; the remainder was removed by the filtering model. The training set also contained the Common Voice dataset, Multilingual LibriSpeech, and Bracarense (a fully Portuguese dialect).
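The filtering idea described above can be illustrated with a minimal sketch. It assumes each synthetic sample already has an audio embedding and a text embedding in a shared space, and it keeps the samples whose cosine similarity reaches a cutoff. The function names, the toy embeddings, and the threshold value are hypothetical and are not taken from the authors' repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_alignment(samples, threshold):
    """Keep the ids of samples whose audio and text embeddings agree.

    samples: list of (sample_id, audio_embedding, text_embedding).
    threshold: hypothetical cutoff, not a value from this model card.
    """
    kept = []
    for sample_id, audio_emb, text_emb in samples:
        if cosine_similarity(audio_emb, text_emb) >= threshold:
            kept.append(sample_id)
    return kept


# Toy embeddings (hypothetical); sample "b" is deliberately misaligned.
samples = [
    ("a", [1.0, 0.0], [0.9, 0.1]),
    ("b", [1.0, 0.0], [0.0, 1.0]),
    ("c", [0.5, 0.5], [0.4, 0.6]),
]
kept = filter_by_alignment(samples, threshold=0.8)
retained_fraction = len(kept) / len(samples)
print(kept, retained_fraction)
```

In this picture, the 82.32% figure corresponds to `retained_fraction`: the share of synthetic samples that pass the filter.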

Model Details

Developed by: Yuriy Perezhohin, Tiago Santos, Victor Costa, Fernando Peres, and Mauro Castelli.
Funded by: MyNorth AI Research
Shared by: MyNorth AI Research
Model type: ASR with contrastive learning-based synthetic data filtering
Language: Portuguese
License: Apache 2.0
Finetuned from model: Whisper Small

Model Sources

Repository: https://github.com/my-north-ai/semantic_audio_filtering
Paper: Coming soon

Uses

This model can be directly used for improving ASR systems in Portuguese, particularly in scenarios with limited real-world data or unique linguistic characteristics.

Out-of-Scope Use

The model is not suitable for tasks involving languages other than Portuguese without additional fine-tuning and data adjustments.

Bias, Risks, and Limitations

Users should be aware of potential biases introduced by synthetic data and ensure the quality of the data aligns with the target application's requirements. It is recommended to evaluate the model's performance on diverse datasets to identify and mitigate biases.

How to Get Started with the Model


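from transformers import pipeline

# Load the fine-tuned checkpoint as a speech-recognition pipeline
asr = pipeline("automatic-speech-recognition", model="my-north-ai/semantic_audio_filtering")

# Transcribe a local audio file; the pipeline returns a dict with a "text" key
result = asr("path_to_audio_file.wav")
print(result["text"])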

Training Details

Training Data

The training data includes 140 hours of synthetically generated Portuguese speech and transcripts, along with real data from the Multilingual LibriSpeech (MLS) corpus, Common Voice (CV) 16.1, and the Perfil Sociolinguístico da Fala Bracarense (PSFB) dataset.

Training Procedure

The model was fine-tuned using Distributed Data Parallel (DDP) training across 4 NVIDIA A10G GPUs.

Preprocessing

The preprocessing steps include text normalization, removal of special characters, and ensuring consistent formatting for TTS generation.
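A minimal sketch of what such a preprocessing step could look like is shown below. The exact rules used by the authors are not stated in this card, so the set of allowed characters and the function name are assumptions made purely for illustration.

```python
import re
import unicodedata


def normalize_transcript(text):
    """Clean a transcript before TTS generation (illustrative rules only)."""
    # Unicode normalization so accented letters use a single composed form
    text = unicodedata.normalize("NFKC", text)
    # Drop anything that is not a word character, whitespace, or basic punctuation
    # (ASSUMPTION: this allowed set is hypothetical, not the authors' rule)
    text = re.sub(r"[^\w\s.,!?'-]", "", text)
    # Collapse runs of whitespace and trim both ends for consistent formatting
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(normalize_transcript("  Olá,   mundo! #teste @2024 "))
```

Because `\w` is Unicode-aware for Python strings, Portuguese letters such as `á`, `ç`, and `õ` survive the special-character filter.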

Training Hyperparameters

Training regime: fp16 mixed precision
Learning Rate: 1e-5
Batch Size: 32
Epochs: 3

Evaluation

Testing Data, Factors & Metrics

Testing Data

The testing data includes subsets from the FLEURS dataset and PSFB, chosen for their linguistic diversity and unique speech patterns.

Metrics

Evaluation Results

Word Error Rate (WER) Comparison

WER results on FLEURS for the fine-tuned models versus the pretrained models, with and without text normalization.

Model Size	Model Type	WER (Normalized)	WER (Non-Normalized)
Small	Pretrained	10.87	15.43
Small	FS-17.68%	10.45	18.57
Small	FS-3.92%	10.34	18.53
Small	FS-0.24%	10.58	18.90
Small	Zero Synthetic	10.90	19.32
Medium	Pretrained	8.62	12.65
Medium	FS-17.68%	6.58	14.46
Medium	FS-3.92%	6.57	14.44
Medium	FS-0.24%	6.58	14.54
Medium	Zero Synthetic	6.97	14.74
Large V3	Pretrained	7.70	11.78
Large V3	FS-17.68%	4.73	10.83
Large V3	FS-3.92%	4.65	11.09
Large V3	FS-0.24%	4.80	11.28
Large V3	Zero Synthetic	4.86	10.92
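To make the two WER columns concrete, the sketch below computes WER as the word-level edit distance divided by the number of reference words, once on raw text and once after normalization. The normalizer here (lowercasing and stripping punctuation) is a hypothetical stand-in; the card does not say which normalizer produced the reported numbers, and the example sentences are invented.

```python
import re


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def simple_normalize(text):
    """Hypothetical normalizer: lowercase and strip punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


reference = "Olá, como está?"
hypothesis = "olá como esta"
raw_wer = word_error_rate(reference, hypothesis)
norm_wer = word_error_rate(simple_normalize(reference), simple_normalize(hypothesis))
print(raw_wer, norm_wer)
```

Casing and punctuation mismatches count as errors on raw text, which is consistent with the non-normalized column being higher than the normalized one in every row of the table.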

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: NVIDIA A10G
Hours used: 15
Cloud Provider: AWS
Compute Region: US EAST
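A rough estimate in the spirit of that calculator multiplies energy use by the carbon intensity of the grid. The sketch below is only an illustration: the per-GPU power draw and the carbon intensity are placeholder assumptions, not values from this card, and it further assumes that the 15 hours are wall-clock hours on all 4 GPUs.

```python
def estimate_emissions_kg(gpu_hours, gpu_power_watts, carbon_intensity_kg_per_kwh, pue=1.0):
    """Estimate CO2-equivalent emissions in kilograms.

    energy (kWh) = GPU-hours * power (kW) * PUE
    emissions    = energy * grid carbon intensity (kg CO2eq per kWh)
    """
    energy_kwh = gpu_hours * (gpu_power_watts / 1000.0) * pue
    return energy_kwh * carbon_intensity_kg_per_kwh


# From the card: 15 hours and 4 GPUs.
# ASSUMPTION: the 15 hours are wall-clock time with all 4 GPUs busy.
gpu_hours = 15 * 4

# ASSUMPTION: 150 W per GPU and 0.4 kg CO2eq/kWh are placeholders only;
# substitute measured power draw and the regional grid intensity.
estimate = estimate_emissions_kg(
    gpu_hours,
    gpu_power_watts=150,
    carbon_intensity_kg_per_kwh=0.4,
)
print(f"Estimated emissions: {estimate:.2f} kg CO2eq")
```

The result scales linearly with each input, so replacing any placeholder with a measured value rescales the estimate proportionally.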