amiriparian committed
Commit 919bfb2 • 1 Parent(s): f4f5ec5
Update README.md

README.md CHANGED
@@ -39,14 +39,14 @@ Further details are available in the corresponding [**paper**](https://arxiv.org

 | | | | | |
 | :---: | :---: | :---: | :---: | :---: |
-| ABC | AD
-| Crema-D | DES
-| EA-WSJ | EMO-DB
-| eNTERFACE | ESD
-| GEMEP | GVESS
-| MELD | PPMMK
-| SmartKom | SIMIS | SUSAS
-| TurkishEmo | Urdu
+| ABC [[1]](#1) | AD [[2]](#2) | BES [[3]](#3) | CASIA [[4]](#4) | CVE [[5]](#5) |
+| Crema-D [[6]](#6) | DES [[7]](#7) | DEMoS [[8]](#8) | EA-ACT [[9]](#9) | EA-BMW [[9]](#9) |
+| EA-WSJ [[9]](#9) | EMO-DB [[10]](#10) | EmoFilm [[11]](#11) | EmotiW-2014 [[12]](#12) | EMOVO [[13]](#13) |
+| eNTERFACE [[14]](#14) | ESD [[15]](#15) | EU-EmoSS [[16]](#16) | EU-EV [[17]](#17) | FAU Aibo [[18]](#18) |
+| GEMEP [[19]](#19) | GVESS [[20]](#20) | IEMOCAP [[21]](#21) | MES [[3]](#3) | MESD [[22]](#22) |
+| MELD [[23]](#23) | PPMMK [[2]](#2) | RAVDESS [[24]](#24) | SAVEE [[25]](#25) | ShEMO [[26]](#26) |
+| SmartKom [[27]](#27) | SIMIS [[28]](#28) | SUSAS [[29]](#29) | SUBESCO [[30]](#30) | TESS [[31]](#31) |
+| TurkishEmo [[2]](#2) | Urdu [[32]](#32) | | | |

@@ -60,9 +60,11 @@ from transformers import AutoModelForAudioClassification, Wav2Vec2FeatureExtractor
 # CONFIG and MODEL SETUP
-model_name = 'amiriparian/
+model_name = 'amiriparian/ExHuBERT'
 feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
 model = AutoModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True,revision="b158d45ed8578432468f3ab8d46cbe5974380812")
+
+# Freezing half of the encoder
 model.freeze_og_encoder()

 sampling_rate=16000
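The hunk above only loads ExHuBERT and freezes the pre-trained half of its encoder. For orientation, the sketch below (not part of this commit) shows how the objects configured in that snippet (`feature_extractor`, `model`, `sampling_rate`) might be exercised on a dummy 16 kHz waveform. The three-second dummy input and the handling of the model output are assumptions; the remote-code ExHuBERT class may return logits directly rather than a standard Hugging Face output object.

```python
# Hypothetical smoke test for the setup shown above -- not taken from the repository.
# Assumes `feature_extractor`, `model`, and `sampling_rate` from the README snippet are in scope.
import torch

dummy_audio = torch.randn(3 * sampling_rate).numpy()  # 3 s of random noise as a stand-in for speech

# Wav2Vec2FeatureExtractor normalises the raw waveform and returns PyTorch tensors.
inputs = feature_extractor(dummy_audio, sampling_rate=sampling_rate, return_tensors="pt")

model.eval()
with torch.no_grad():
    output = model(inputs["input_values"])

# Assumption: the custom head returns either a bare logits tensor or an object with .logits.
logits = output.logits if hasattr(output, "logits") else output
print(logits.shape)
```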
@@ -88,4 +90,205 @@ model = model.to(device)
 month = {September},
 publisher = {ISCA},
 }
-
+
+
+```
+
+### References
+
+<a id="1">[1]</a> B. Schuller, D. Arsic, G. Rigoll, M. Wimmer, and B. Radig. Audiovisual Behavior Modeling by Combined Feature Spaces. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, volume 2, pages II-733–II-736, Apr. 2007.
+
+<a id="2">[2]</a> M. Gerczuk, S. Amiriparian, S. Ottl, and B. W. Schuller. EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition. IEEE Transactions on Affective Computing, 14(2):1472–1487, Apr. 2023.
+
+<a id="3">[3]</a> T. L. Nwe, S. W. Foo, and L. C. De Silva. Speech emotion recognition using hidden Markov models. Speech Communication, 41(4):603–623, Nov. 2003.
+
+<a id="4">[4]</a> The selected speech emotion database of the Institute of Automation, Chinese Academy of Sciences (CASIA). http://www.chineseldc.org/resource_info.php?rid=76. Accessed March 2024.
+
+<a id="5">[5]</a> P. Liu and M. D. Pell. Recognizing vocal emotions in Mandarin Chinese: A validated database of Chinese vocal emotional stimuli. Behavior Research Methods, 44(4):1042–1051, Dec. 2012.
+
+<a id="6">[6]</a> H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma. CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset. IEEE Transactions on Affective Computing, 5(4):377–390, 2014.
+
+<a id="7">[7]</a> I. S. Engberg, A. V. Hansen, O. K. Andersen, and P. Dalsgaard. Design, Recording and Verification of a Danish Emotional Speech Database. In EUROSPEECH '97: 5th European Conference on Speech Communication and Technology, Patras, Rhodes, Greece, 22-25 September 1997, vol. 4, pages 1695–1698.
+
+<a id="8">[8]</a> E. Parada-Cabaleiro, G. Costantini, A. Batliner, M. Schmitt, and B. W. Schuller. DEMoS: An Italian emotional speech corpus. Language Resources and Evaluation, 54(2):341–383, June 2020.
+
+<a id="9">[9]</a> B. Schuller. Automatische Emotionserkennung aus sprachlicher und manueller Interaktion. PhD thesis, Technische Universität München, 2006.
+
+<a id="10">[10]</a> F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. A database of German emotional speech. In Interspeech 2005, pages 1517–1520. ISCA, Sept. 2005.
+
+<a id="11">[11]</a> E. Parada-Cabaleiro, G. Costantini, A. Batliner, A. Baird, and B. Schuller. Categorical vs Dimensional Perception of Italian Emotional Speech. In Interspeech 2018, pages 3638–3642. ISCA, Sept. 2018.
+
+<a id="12">[12]</a> A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon. Emotion Recognition In The Wild Challenge 2014: Baseline, Data and Protocol. In Proceedings of the 16th International Conference on Multimodal Interaction, ICMI '14, pages 461–466, New York, NY, USA, Nov. 2014. Association for Computing Machinery.
+
+<a id="13">[13]</a> G. Costantini, I. Iaderola, A. Paoloni, and M. Todisco. EMOVO Corpus: An Italian Emotional Speech Database. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3501–3504, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).
+
+<a id="14">[14]</a> O. Martin, I. Kotsia, B. Macq, and I. Pitas. The eNTERFACE'05 Audio-Visual Emotion Database. In 22nd International Conference on Data Engineering Workshops (ICDEW'06), pages 8–8, Apr. 2006.
+
+<a id="15">[15]</a> K. Zhou, B. Sisman, R. Liu, and H. Li. Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset, Feb. 2021.
+
+<a id="16">[16]</a> H. O'Reilly, D. Pigat, S. Fridenson, S. Berggren, S. Tal, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist. The EU-Emotion Stimulus Set: A validation study. Behavior Research Methods, 48(2):567–576, June 2016.
+
+<a id="17">[17]</a> A. Lassalle, D. Pigat, H. O'Reilly, S. Berggren, S. Fridenson-Hayo, S. Tal, S. Elfström, A. Råde, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist. The EU-Emotion Voice Database. Behavior Research Methods, 51(2):493–506, Apr. 2019.
+
+<a id="18">[18]</a> A. Batliner, S. Steidl, and E. Nöth. Releasing a thoroughly annotated and processed spontaneous emotional database: The FAU Aibo Emotion Corpus. 2008.
+
+<a id="19">[19]</a> K. R. Scherer, T. Bänziger, and E. Roesch. A Blueprint for Affective Computing: A Sourcebook and Manual. OUP Oxford, Sept. 2010.
+
+<a id="20">[20]</a> R. Banse and K. R. Scherer. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70(3):614–636, 1996.
+
+<a id="21">[21]</a> C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, Dec. 2008.
+
+<a id="22">[22]</a> M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate. The Mexican Emotional Speech Database (MESD): Elaboration and assessment based on machine learning. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021:1644–1647, Nov. 2021.
+
+<a id="23">[23]</a> S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations, June 2019.
+
+<a id="24">[24]</a> S. R. Livingstone and F. A. Russo. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLOS ONE, 13(5):e0196391, May 2018.
+
+<a id="25">[25]</a> S. Haq and P. J. B. Jackson. Speaker-dependent audio-visual emotion recognition. In Proc. AVSP 2009, pages 53–58, 2009.
+
+<a id="26">[26]</a> O. Mohamad Nezami, P. Jamshid Lou, and M. Karami. ShEMO: A large-scale validated database for Persian speech emotion detection. Language Resources and Evaluation, 53(1):1–16, Mar. 2019.
+
+<a id="27">[27]</a> F. Schiel, S. Steininger, and U. Türk. The SmartKom Multimodal Corpus at BAS. In M. González Rodríguez and C. P. Suarez Araujo, editors, Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands - Spain, May 2002. European Language Resources Association (ELRA).
+
+<a id="28">[28]</a> B. Schuller, F. Eyben, S. Can, and H. Feußner. Speech in Minimal Invasive Surgery - Towards an Affective Language Resource of Real-life Medical Operations. 2010.
+
+<a id="29">[29]</a> J. H. L. Hansen and S. E. Bou-Ghazale. Getting started with SUSAS: A speech under simulated and actual stress database. In Proc. Eurospeech 1997, pages 1743–1746, 1997.
+
+<a id="30">[30]</a> S. Sultana, M. S. Rahman, M. R. Selim, and M. Z. Iqbal. SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla. PLOS ONE, 16(4):e0250173, Apr. 2021.
+
+<a id="31">[31]</a> M. K. Pichora-Fuller and K. Dupuis. Toronto emotional speech set (TESS), Feb. 2020.
+
+<a id="32">[32]</a> S. Latif, A. Qayyum, M. Usman, and J. Qadir. Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages. In 2018 International Conference on Frontiers of Information Technology (FIT), pages 88–93, Dec. 2018.