Audio Emotion Detection

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53.

It achieves the following results on the evaluation set:

Loss: 0.9555
Accuracy: 0.6262

Model description

A model that classifies audio into one of seven emotion labels: Angry, Disgusted, Fearful, Happy, Neutral, Sad, and Surprised. All training audio was sampled at 16,000 Hz (16 kHz), so input audio should be resampled to 16 kHz for the model to work properly.
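The two requirements above (16 kHz input and a seven-way label set) can be sketched in plain Python. This is a minimal illustration, not this model's actual preprocessing, and it rests on several assumptions. `resample_linear`, `softmax`, and `decode` are hypothetical helpers. The linear-interpolation resampler skips the anti-aliasing filter that a real resampler (for example, the ones in torchaudio or librosa) applies. The label order is taken from the list above rather than read from the model's `id2label` config.

```python
import math

# ASSUMPTION: label order copied from the model description; the real
# index-to-label mapping lives in the model config (id2label) and may differ.
LABELS = ["Angry", "Disgusted", "Fearful", "Happy", "Neutral", "Sad", "Surprised"]

TARGET_RATE = 16_000  # sampling rate the model was trained on, in Hz


def resample_linear(samples, orig_rate, target_rate=TARGET_RATE):
    """Naive linear-interpolation resampler (hypothetical helper).

    No anti-aliasing filter is applied, so this is for illustration only.
    """
    if orig_rate == target_rate or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * target_rate / orig_rate))
    step = orig_rate / target_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, labels=LABELS):
    """Map one row of classifier logits to (label, probability)."""
    if len(logits) != len(labels):
        raise ValueError("expected exactly one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    # One second of a 440 Hz tone recorded at 48 kHz becomes 16,000 samples.
    tone = [math.sin(2 * math.pi * 440 * t / 48_000) for t in range(48_000)]
    print(len(resample_linear(tone, 48_000)))  # 16000
    # Made-up logits: the largest score is at index 3.
    print(decode([0.1, -1.0, 0.3, 2.5, 0.0, -0.2, 0.4])[0])  # Happy
```

In practice, the logits would come from the fine-tuned classifier. The point of the sketch is the order of operations: bring audio to 16 kHz first, then turn the seven output scores into a single label.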

Training and evaluation data

mozilla-foundation/common_voice_6_0
speech-recognition-community-v2/dev_data

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0005
train_batch_size: 32
eval_batch_size: 32
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 256
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.01
num_epochs: 4
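The batch-size and scheduler settings interact as shown in the plain-Python sketch below. It assumes a single training device, which is what 32 × 8 = 256 implies. It also assumes that warmup steps are rounded up, as Hugging Face `TrainingArguments` does. The 160 total optimizer steps come from the training results table. `linear_lr` is a hypothetical helper that mirrors a linear warmup-then-decay schedule; it is not the training code itself.

```python
import math

LEARNING_RATE = 5e-4
TRAIN_BATCH_SIZE = 32   # per device; 32 x 8 = 256 is consistent with one device
GRAD_ACCUM_STEPS = 8
WARMUP_RATIO = 0.01
TOTAL_STEPS = 160       # last step in the training results table (4 epochs x 40 steps)

# Gradients from 8 micro-batches of 32 are accumulated before each optimizer step.
EFFECTIVE_BATCH = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS  # 256

# ASSUMPTION: warmup steps are rounded up, as in Hugging Face TrainingArguments.
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)  # ceil(1.6) = 2


def linear_lr(step, base_lr=LEARNING_RATE, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


if __name__ == "__main__":
    for s in (0, 1, 2, 40, 81, 120, 160):
        print(s, linear_lr(s))
```

With these numbers, the rate climbs to 5e-4 over the first two steps. It then falls linearly, reaching half the peak (2.5e-4) at step 81 and 0 at step 160.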

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy
1.5875	1.0	40	1.2574	0.5133
1.1637	2.0	80	1.0852	0.5590
0.9827	3.0	120	1.0048	0.6090
0.8683	4.0	160	0.9555	0.6262

Safetensors

Model size: 0.3B params
Tensor type: F32

Model tree for Hatman/audio-emotion-detection

Base model: facebook/wav2vec2-large-xlsr-53 (this model is fine-tuned from it)