wav2vec2-emotion-recognition
This model is a fine-tuned Wav2Vec2 model for speech emotion recognition. It classifies speech into 8 emotions and returns a confidence score for each.
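For a quick check, the model can be called through the `transformers` audio-classification pipeline. This is a minimal sketch, assuming the hosted checkpoint works with the pipeline out of the box; `path_to_audio.wav` is a placeholder path and decoding the file requires ffmpeg:

```python
from transformers import pipeline

# Sketch: load the checkpoint through the audio-classification pipeline.
# The pipeline decodes the file and resamples it to the model's expected 16 kHz.
classifier = pipeline(
    "audio-classification",
    model="Dpngtm/wav2vec2-emotion-recognition",
)

# Returns a list of {"label": ..., "score": ...} dicts sorted by confidence.
# top_k=8 asks for scores for all 8 supported emotions.
results = classifier("path_to_audio.wav", top_k=8)
print(results)
```

The Usage section below shows the equivalent steps without the pipeline, including explicit resampling and mono conversion.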
Model Description
- Model Architecture: Wav2Vec2 with sequence classification head
- Language: English
- Task: Speech Emotion Recognition
- Fine-tuned from: facebook/wav2vec2-base
- Datasets: Combined emotion datasets
  - TESS
  - CREMA-D
  - SAVEE
  - RAVDESS
Performance Metrics
- Accuracy: 79.57%
- F1 Score: 79.43%
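These metrics can be reproduced on a held-out split with scikit-learn. A sketch only: the card does not state the F1 averaging method, so a weighted average is assumed, and `y_true`/`y_pred` below are placeholder label arrays, not the actual evaluation data:

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder integer emotion labels from a held-out split (not provided in this card).
y_true = [0, 3, 5, 5, 6]
y_pred = [0, 3, 5, 4, 6]

accuracy = accuracy_score(y_true, y_pred)
# Averaging method is an assumption; the card only reports a single F1 number.
f1 = f1_score(y_true, y_pred, average="weighted")
print(f"Accuracy: {accuracy:.4f}, F1: {f1:.4f}")
```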
Supported Emotions
- Angry
- Calm
- Disgust
- Fearful
- Happy
- Neutral
- Sad
- Surprised
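These labels map to output logit indices in the order used by the `emotion_labels` list in the Usage section below. A sketch of the assumed index-to-label mapping:

```python
# Assumed mapping from output logit index to emotion label,
# consistent with the emotion_labels list in the Usage section.
id2label = {
    0: "angry",
    1: "calm",
    2: "disgust",
    3: "fearful",
    4: "happy",
    5: "neutral",
    6: "sad",
    7: "surprised",
}

label2id = {label: idx for idx, label in id2label.items()}
```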
Training Details
The model was trained with the following configuration (a `TrainingArguments` sketch with these values follows the list):
- Epochs: 15
- Batch Size: 16
- Learning Rate: 5e-5
- Optimizer: AdamW
- Weight Decay: 0.03
- Gradient Accumulation Steps: 2
- Mixed Precision: fp16
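These hyperparameters map directly onto Hugging Face `TrainingArguments`. The sketch below assumes the Trainer API was used, which the card does not state, and the output directory name is a placeholder:

```python
from transformers import TrainingArguments

# Sketch of the reported configuration; output_dir is a placeholder and
# the use of the Trainer API itself is an assumption.
training_args = TrainingArguments(
    output_dir="wav2vec2-emotion-recognition",
    num_train_epochs=15,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.03,
    gradient_accumulation_steps=2,
    fp16=True,                # mixed precision
    optim="adamw_torch",      # AdamW optimizer
)
```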
For the detailed training process, see the Fine-tuning Notebook.
Limitations
Audio Requirements:
- Sampling rate: 16 kHz (audio will be resampled automatically)
- Maximum duration: 1 minute (see the preprocessing sketch after this list)
- Clear speech with minimal background noise recommended
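A sketch of how these requirements can be enforced before inference with `torchaudio`; the helper name and the hard 60-second cap are illustrative, not part of the model repo:

```python
import torch
import torchaudio

def prepare_audio(path: str, max_seconds: int = 60) -> torch.Tensor:
    """Load audio, resample to 16 kHz, downmix to mono, and trim to max_seconds.

    Illustrative helper only; the card states a 1-minute maximum and 16 kHz input,
    but this exact function is not part of the model repository.
    """
    waveform, sampling_rate = torchaudio.load(path)
    if sampling_rate != 16000:
        waveform = torchaudio.transforms.Resample(sampling_rate, 16000)(waveform)
    if waveform.shape[0] > 1:                      # stereo -> mono
        waveform = waveform.mean(dim=0, keepdim=True)
    return waveform[:, : max_seconds * 16000]      # cap at 1 minute of samples
```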
Performance Considerations:
- Best results with clear speech audio
- Performance may vary with different accents
- Background noise can affect accuracy
Demo
https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition
Contact
- GitHub: DGautam11
- LinkedIn: Deepan Gautam
- Hugging Face: @Dpngtm
For issues and questions, feel free to:
- Open an issue on the Model Repository
- Comment on the Demo Space
Usage
```python
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")

# Load and preprocess audio
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")

# Resample to 16 kHz if needed
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)

# Convert to mono if stereo
if speech_array.shape[0] > 1:
    speech_array = torch.mean(speech_array, dim=0, keepdim=True)
speech_array = speech_array.squeeze().numpy()

# Process through model
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted emotion
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_emotion = emotion_labels[predictions.argmax().item()]
```
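To inspect the per-emotion confidence scores mentioned above, the softmax output can be paired with the label list. This is a continuation of the snippet above and uses only names already defined there:

```python
# Pair each emotion label with its softmax confidence score, highest first.
scores = predictions.squeeze().tolist()
for label, score in sorted(zip(emotion_labels, scores), key=lambda x: x[1], reverse=True):
    print(f"{label}: {score:.3f}")
```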