Emotion Detection From Speech

This model is a fine-tuned version of DistilHuBERT that classifies emotions from audio input.

Approach

Dataset: The RAVDESS dataset, comprising 1,440 audio files labeled with one of 8 emotions: calm, happy, sad, angry, fearful, surprise, neutral, and disgust.
Model Fine-Tuning: The DistilHuBERT model was fine-tuned for 7 epochs with a learning rate of 5e-5, achieving an accuracy of 98% on the test dataset.
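
As an illustration of how the 8 labels can be obtained, RAVDESS file names encode the emotion in their third hyphen-separated field. The sketch below reads that field and builds a label-to-index mapping. The helper name `emotion_from_filename`, the example path, and the index order are assumptions made for illustration; the published model's own label mapping may differ.

```python
from pathlib import Path

# RAVDESS emotion codes (third field of the file name), using the
# label spellings from this model card.
RAVDESS_EMOTIONS = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprise",
}

def emotion_from_filename(path: str) -> str:
    """Return the emotion name encoded in a RAVDESS file name (hypothetical helper)."""
    fields = Path(path).stem.split("-")
    return RAVDESS_EMOTIONS[fields[2]]

# Hypothetical label-to-index mapping; the order is an assumption.
label2id = {name: idx for idx, name in enumerate(RAVDESS_EMOTIONS.values())}
id2label = {idx: name for name, idx in label2id.items()}

print(emotion_from_filename("Actor_12/03-01-06-01-02-01-12.wav"))
```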

Data Preprocessing

Sampling Rate: Audio files were resampled to 16 kHz to match the model's requirements.
Padding: Audio clips shorter than 30 seconds were zero-padded.
Train-Test Split: 80% of the samples were used for training, and 20% for testing.
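
The padding and split steps above can be sketched with NumPy. This is a minimal illustration, not the original training code: it assumes each waveform is already a 1-D array resampled to 16 kHz, that padding is applied on the right up to 30 seconds, that longer clips are truncated, and that the split is a seeded random shuffle. The function names are hypothetical.

```python
import numpy as np

TARGET_SR = 16_000                      # sampling rate stated above
MAX_SECONDS = 30                        # padding threshold stated above
MAX_SAMPLES = TARGET_SR * MAX_SECONDS   # 480,000 samples

def pad_or_truncate(waveform: np.ndarray, max_samples: int = MAX_SAMPLES) -> np.ndarray:
    """Right-pad a 1-D waveform with zeros to max_samples.

    Truncating longer clips is an assumption; the card only mentions padding.
    """
    if len(waveform) >= max_samples:
        return waveform[:max_samples]
    return np.pad(waveform, (0, max_samples - len(waveform)))

def train_test_indices(n_items: int, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle item indices and split them into train and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_test = int(round(n_items * test_fraction))
    return order[n_test:], order[:n_test]

# One second of dummy audio padded to the fixed length.
padded = pad_or_truncate(np.ones(TARGET_SR))

# 80/20 split over the 1,440 RAVDESS files.
train_idx, test_idx = train_test_indices(1440)
print(padded.shape, len(train_idx), len(test_idx))
```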

Model Architecture

DistilHuBERT: A lightweight variant of HuBERT, fine-tuned for emotion classification.
Fine-Tuning Setup:
- Optimizer: AdamW
- Loss Function: Cross-Entropy
- Learning Rate: 5e-5
- Warm-up Ratio: 0.1
- Epochs: 7
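
To make the warm-up ratio concrete, the sketch below computes the learning rate at a given training step. It assumes a linear warm-up followed by a linear decay to zero, which is a common choice but is not confirmed by this card; the function name and the total step count in the example are hypothetical.

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate under linear warm-up and linear decay (assumed schedule)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from the peak down to 0 at the final step.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Hypothetical run of 1,000 optimizer steps: the first 100 are warm-up.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

With a warm-up ratio of 0.1, the peak rate of 5e-5 is reached after the first 10% of steps, which avoids large updates while the newly initialized classification head is still untrained.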

Results

Accuracy: 0.98 on the test dataset
Loss: 0.10 on the test dataset
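
For reference, the two metrics can be computed from raw model outputs as sketched below. This is an illustrative NumPy version, assuming `logits` has shape (samples, classes) and `labels` holds integer class indices; the function name and the dummy inputs are hypothetical and the values are not the model's actual outputs.

```python
import numpy as np

def accuracy_and_cross_entropy(logits: np.ndarray, labels: np.ndarray):
    """Return (accuracy, mean cross-entropy loss) for a batch of logits."""
    # Numerically stable log-softmax over the class dimension.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross-entropy is the negative log-probability of the true class.
    loss = -log_probs[np.arange(len(labels)), labels].mean()
    accuracy = (logits.argmax(axis=1) == labels).mean()
    return float(accuracy), float(loss)

# Dummy example: 2 samples, 8 classes, uniform logits.
acc, loss = accuracy_and_cross_entropy(np.zeros((2, 8)), np.array([0, 1]))
print(acc, loss)
```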

Usage

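from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
pipe = pipeline(
    "audio-classification",
    model="BilalHasan/distilhubert-finetuned-ravdess",
)

# Replace with the path to your own audio file.
path_to_your_audio = "path/to/audio.wav"

# Returns a list of {"label", "score"} dicts, one per predicted emotion.
emotion = pipe(path_to_your_audio)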
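
The pipeline returns a list of dictionaries, each with a `label` and a `score`. A small helper like the one below can pick the most likely emotion. The helper name and the mock predictions are hypothetical and only illustrate the output shape; they are not real model results.

```python
def top_emotion(predictions):
    """Return (label, score) for the highest-scoring prediction."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]

# Mock output in the shape produced by an audio-classification pipeline.
mock_output = [
    {"score": 0.05, "label": "calm"},
    {"score": 0.91, "label": "happy"},
    {"score": 0.04, "label": "neutral"},
]

label, score = top_emotion(mock_output)
print(f"{label} ({score:.2f})")
```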

Demo

You can access the live demo of the app on Hugging Face Spaces.
