Embedding model

#3
by sieucun - opened

Hi author, I have been using your audio embedding code. The problem is that after I embed two WAV files, the cosine similarity between the two embedding vectors is always above 0.98, regardless of how different the two audio files are. Even when I compare random noise against a real audio file, the score is still 0.98. I'm a newbie in speech processing. Can you explain the problem? Is there anything I have misunderstood?

Thank you!!

Personally, I find that the AudioMAE models are not very good. My checkpoint here should reproduce the exact original checkpoint from https://github.com/facebookresearch/AudioMAE; I will double-check it again. Also make sure the preprocessing is correct: the input to kaldi.fbank() should be audio normalized to the [-1, 1] range. See https://github.com/facebookresearch/AudioMAE/blob/bd60e29651285f80d32a6405082835ad26e6f19f/dataset.py#L176-L228.
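For reference, here is a minimal preprocessing sketch along the lines of the dataset code linked above. The fbank parameters follow that file; the 1024-frame target length and the AudioSet normalization stats (-4.2677393, 4.5689974) are assumptions you should verify against the checkpoint you are actually using:

```python
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def load_fbank(path, num_mel_bins=128, target_len=1024):
    # torchaudio.load already returns float32 audio in [-1, 1] for common formats;
    # if you decode raw int16 PCM yourself, divide by 32768.0 first.
    waveform, sr = torchaudio.load(path)
    waveform = waveform - waveform.mean()  # DC offset removal, as in the linked dataset code

    fbank = kaldi.fbank(
        waveform,
        htk_compat=True,
        sample_frequency=sr,
        use_energy=False,
        window_type="hanning",
        num_mel_bins=num_mel_bins,
        dither=0.0,
        frame_shift=10,
    )

    # Pad or crop to a fixed number of frames (1024 is the AudioSet setting
    # used by AudioMAE; adjust if your checkpoint expects something else).
    n_frames = fbank.shape[0]
    if n_frames < target_len:
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, target_len - n_frames))
    else:
        fbank = fbank[:target_len]

    # AudioSet normalization stats used by the original AudioMAE code;
    # treat these values as an assumption and double-check them.
    norm_mean, norm_std = -4.2677393, 4.5689974
    return (fbank - norm_mean) / (norm_std * 2)
```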

It is also possible that this model is simply not good for cosine-similarity retrieval. For speech data, you can try the speaker embedding models from https://github.com/wenet-e2e/wespeaker; I find them pretty good for this purpose. A usage sketch follows below.
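A minimal sketch of the wespeaker Python package, assuming you installed it via pip; the model name "english" is an assumption, so check the WeSpeaker README for the models that are actually available:

```python
import wespeaker

# Load a pretrained speaker embedding model
# (model name is an assumption; see the WeSpeaker README).
model = wespeaker.load_model("english")

# Extract a speaker embedding for one utterance
embedding = model.extract_embedding("audio1.wav")

# Or directly compute the cosine similarity between two utterances
score = model.compute_similarity("audio1.wav", "audio2.wav")
print(score)
```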

In my opinion, if the encoder gives a 0.99 cosine similarity between two random audio files, and even between random noise and a real audio file, that means the model is quite bad. I would be very grateful if you could double-check my result; I am not sure I did it right.

Have you tried AudioMAE on any downstream tasks?

I tried Wenet; it works and is easy to use. Thank you.
