amiriparian committed
Commit 919bfb2 • 1 Parent(s): f4f5ec5
Update README.md

README.md CHANGED
@@ -39,14 +39,14 @@ Further details are available in the corresponding [**paper**](https://arxiv.org

 | | | | | |
 | :---: | :---: | :---: | :---: | :---: |
-| ABC | AD
-| Crema-D | DES
-| EA-WSJ | EMO-DB
-| eNTERFACE | ESD
-| GEMEP | GVESS
-| MELD | PPMMK
-| SmartKom | SIMIS | SUSAS
-| TurkishEmo | Urdu
+| ABC [[1]](#1) | AD [[2]](#2) | BES [[3]](#3) | CASIA [[4]](#4) | CVE [[5]](#5) |
+| Crema-D [[6]](#6) | DES [[7]](#7) | DEMoS [[8]](#8) | EA-ACT [[9]](#9) | EA-BMW [[9]](#9) |
+| EA-WSJ [[9]](#9) | EMO-DB [[10]](#10) | EmoFilm [[11]](#11) | EmotiW-2014 [[12]](#12) | EMOVO [[13]](#13) |
+| eNTERFACE [[14]](#14) | ESD [[15]](#15) | EU-EmoSS [[16]](#16) | EU-EV [[17]](#17) | FAU Aibo [[18]](#18) |
+| GEMEP [[19]](#19) | GVESS [[20]](#20) | IEMOCAP [[21]](#21) | MES [[3]](#3) | MESD [[22]](#22) |
+| MELD [[23]](#23) | PPMMK [[2]](#2) | RAVDESS [[24]](#24) | SAVEE [[25]](#25) | ShEMO [[26]](#26) |
+| SmartKom [[27]](#27) | SIMIS [[28]](#28) | SUSAS [[29]](#29) | SUBESCO [[30]](#30) | TESS [[31]](#31) |
+| TurkishEmo [[2]](#2) | Urdu [[32]](#32) | | | |

@@ -60,9 +60,11 @@ from transformers import AutoModelForAudioClassification, Wav2Vec2FeatureExtractor
 # CONFIG and MODEL SETUP
-model_name = 'amiriparian/
+model_name = 'amiriparian/ExHuBERT'
 feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
 model = AutoModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True,revision="b158d45ed8578432468f3ab8d46cbe5974380812")
+
+# Freezing half of the encoder
 model.freeze_og_encoder()

 sampling_rate=16000
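The hunk above only loads ExHuBERT and freezes the pre-trained half of its encoder. For orientation, the sketch below (not part of this commit) shows how the objects configured in that snippet (`feature_extractor`, `model`, `sampling_rate`) might be exercised on a dummy 16 kHz waveform. The three-second dummy input and the handling of the model output are assumptions; the remote-code ExHuBERT class may return logits directly rather than a standard Hugging Face output object.

```python
# Hypothetical smoke test for the setup shown above -- not taken from the repository.
# Assumes `feature_extractor`, `model`, and `sampling_rate` from the README snippet are in scope.
import torch

dummy_audio = torch.randn(3 * sampling_rate).numpy()  # 3 s of random noise as a stand-in for speech

# Wav2Vec2FeatureExtractor normalises the raw waveform and returns PyTorch tensors.
inputs = feature_extractor(dummy_audio, sampling_rate=sampling_rate, return_tensors="pt")

model.eval()
with torch.no_grad():
    output = model(inputs["input_values"])

# Assumption: the custom head returns either a bare logits tensor or an object with .logits.
logits = output.logits if hasattr(output, "logits") else output
print(logits.shape)
```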
@@ -88,4 +90,205 @@ model = model.to(device)
 month = {September},
 publisher = {ISCA},
 }
-
+
+
+```
+
+### References
+
+<a id="1">[1]</a> B. Schuller, D. Arsic, G. Rigoll, M. Wimmer, and B. Radig. Audiovisual Behavior Modeling by Combined Feature Spaces. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, volume 2, pages II-733–II-736, Apr. 2007.
+
+<a id="2">[2]</a> M. Gerczuk, S. Amiriparian, S. Ottl, and B. W. Schuller. EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition. IEEE Transactions on Affective Computing, 14(2):1472–1487, Apr. 2023.
+
+<a id="3">[3]</a> T. L. Nwe, S. W. Foo, and L. C. De Silva. Speech emotion recognition using hidden Markov models. Speech Communication, 41(4):603–623, Nov. 2003.
+
+<a id="4">[4]</a> The selected speech emotion database of the Institute of Automation, Chinese Academy of Sciences (CASIA). http://www.chineseldc.org/resource_info.php?rid=76. Accessed March 2024.
+
+<a id="5">[5]</a> P. Liu and M. D. Pell. Recognizing vocal emotions in Mandarin Chinese: A validated database of Chinese vocal emotional stimuli. Behavior Research Methods, 44(4):1042–1051, Dec. 2012.
+
+<a id="6">[6]</a> H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma. CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset. IEEE Transactions on Affective Computing, 5(4):377–390, 2014.
+
+<a id="7">[7]</a> I. S. Engberg, A. V. Hansen, O. K. Andersen, and P. Dalsgaard. Design, Recording and Verification of a Danish Emotional Speech Database. In EUROSPEECH '97: 5th European Conference on Speech Communication and Technology, Patras, Rhodes, Greece, 22-25 September 1997, vol. 4, pages 1695–1698.
+
+<a id="8">[8]</a> E. Parada-Cabaleiro, G. Costantini, A. Batliner, M. Schmitt, and B. W. Schuller. DEMoS: An Italian emotional speech corpus. Language Resources and Evaluation, 54(2):341–383, June 2020.
+
+<a id="9">[9]</a> B. Schuller. Automatische Emotionserkennung aus sprachlicher und manueller Interaktion. PhD thesis, Technische Universität München, 2006.
+
+<a id="10">[10]</a> F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. A database of German emotional speech. In Interspeech 2005, pages 1517–1520. ISCA, Sept. 2005.
+
+<a id="11">[11]</a> E. Parada-Cabaleiro, G. Costantini, A. Batliner, A. Baird, and B. Schuller. Categorical vs Dimensional Perception of Italian Emotional Speech. In Interspeech 2018, pages 3638–3642. ISCA, Sept. 2018.
+
+<a id="12">[12]</a> A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon. Emotion Recognition In The Wild Challenge 2014: Baseline, Data and Protocol. In Proceedings of the 16th International Conference on Multimodal Interaction, ICMI '14, pages 461–466, New York, NY, USA, Nov. 2014. Association for Computing Machinery.
+
+<a id="13">[13]</a> G. Costantini, I. Iaderola, A. Paoloni, and M. Todisco. EMOVO Corpus: An Italian Emotional Speech Database. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3501–3504, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).
+
+<a id="14">[14]</a> O. Martin, I. Kotsia, B. Macq, and I. Pitas. The eNTERFACE'05 Audio-Visual Emotion Database. In 22nd International Conference on Data Engineering Workshops (ICDEW'06), pages 8–8, Apr. 2006.
+
+<a id="15">[15]</a> K. Zhou, B. Sisman, R. Liu, and H. Li. Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset, Feb. 2021.
+
+<a id="16">[16]</a> H. O'Reilly, D. Pigat, S. Fridenson, S. Berggren, S. Tal, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist. The EU-Emotion Stimulus Set: A validation study. Behavior Research Methods, 48(2):567–576, June 2016.
+
+<a id="17">[17]</a> A. Lassalle, D. Pigat, H. O'Reilly, S. Berggren, S. Fridenson-Hayo, S. Tal, S. Elfström, A. Råde, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist. The EU-Emotion Voice Database. Behavior Research Methods, 51(2):493–506, Apr. 2019.
+
+<a id="18">[18]</a> A. Batliner, S. Steidl, and E. Nöth. Releasing a thoroughly annotated and processed spontaneous emotional database: The FAU Aibo Emotion Corpus. 2008.
+
+<a id="19">[19]</a> K. R. Scherer, T. Bänziger, and E. Roesch. A Blueprint for Affective Computing: A Sourcebook and Manual. OUP Oxford, Sept. 2010.
+
+<a id="20">[20]</a> R. Banse and K. R. Scherer. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70(3):614–636, 1996.
+
+<a id="21">[21]</a> C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, Dec. 2008.
+
+<a id="22">[22]</a> M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate. The Mexican Emotional Speech Database (MESD): Elaboration and assessment based on machine learning. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021:1644–1647, Nov. 2021.
+
+<a id="23">[23]</a> S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations, June 2019.
+
+<a id="24">[24]</a> S. R. Livingstone and F. A. Russo. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLOS ONE, 13(5):e0196391, May 2018.
+
+<a id="25">[25]</a> S. Haq and P. J. B. Jackson. Speaker-dependent audio-visual emotion recognition. In Proc. AVSP 2009, pages 53–58, 2009.
+
+<a id="26">[26]</a> O. Mohamad Nezami, P. Jamshid Lou, and M. Karami. ShEMO: A large-scale validated database for Persian speech emotion detection. Language Resources and Evaluation, 53(1):1–16, Mar. 2019.
+
+<a id="27">[27]</a> F. Schiel, S. Steininger, and U. Türk. The SmartKom Multimodal Corpus at BAS. In M. González Rodríguez and C. P. Suarez Araujo, editors, Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands - Spain, May 2002. European Language Resources Association (ELRA).
+
+<a id="28">[28]</a> B. Schuller, F. Eyben, S. Can, and H. Feußner. Speech in Minimal Invasive Surgery - Towards an Affective Language Resource of Real-life Medical Operations. 2010.
+
+<a id="29">[29]</a> J. H. L. Hansen and S. E. Bou-Ghazale. Getting started with SUSAS: A speech under simulated and actual stress database. In Proc. Eurospeech 1997, pages 1743–1746, 1997.
+
+<a id="30">[30]</a> S. Sultana, M. S. Rahman, M. R. Selim, and M. Z. Iqbal. SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla. PLOS ONE, 16(4):e0250173, Apr. 2021.
+
+<a id="31">[31]</a> M. K. Pichora-Fuller and K. Dupuis. Toronto emotional speech set (TESS), Feb. 2020.
+
+<a id="32">[32]</a> S. Latif, A. Qayyum, M. Usman, and J. Qadir. Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages. In 2018 International Conference on Frontiers of Information Technology (FIT), pages 88–93, Dec. 2018.