{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6a5c0357",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Ensure datasets is installed from main. Uncomment the following line if you face issues running this script:\n",
    "# !pip install git+https://github.com/huggingface/datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "794aaced",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from datasets import Audio, interleave_datasets, IterableDataset, load_dataset\n",
    "from typing import List, Optional"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f210ca9a-486b-46a2-a675-2526a9bd83f5",
   "metadata": {},
   "source": [
    "### Define the dataset attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc07293f-3ba4-4e89-a4ca-8e39409a8373",
   "metadata": {},
   "source": [
    "In this example, we'll show to combine the Common Voice 11, VoxPopuli, Mulitlingual LibriSpeech and FLEURS datasets for Spanish, giving a training corpus equal to the sum of the individual datasets. This is particularly beneficial in low-resource settings, where any one of the datasets alone might have insufficient data to train a model.\n",
    "\n",
    "We need to specify the dataset names on the Hub, the corresponding configs and finally the text column names for the transcriptions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c53344f3-c315-430a-a2f3-57aea6bb0e17",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_names = [\"mozilla-foundation/common_voice_11_0\", \"facebook/voxpopuli\", \"facebook/multilingual_librispeech\", \"google/fleurs\"]\n",
    "dataset_config_names = [\"es\", \"es\", \"spanish\", \"es_419\"]\n",
    "text_column_names = [\"sentence\", \"normalized_text\", \"text\", \"transcription\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "215541f6-ee1c-4104-b43c-fa3f7fce0494",
   "metadata": {},
   "source": [
    "### Define the merging function"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b722a48b-c576-4a63-b2a2-3c264890a75f",
   "metadata": {},
   "source": [
    "We define a function, `load_multiple_streaming_datasets`, that takes as argument a list of datasets, configs, splits (optional) and text column names (optional). It sets them to a specified sampling rate and interleaves them together, giving one merged dataset. This is all \n",
    "done in _streaming mode_: as we iterate over the merged dataset we load samples one-by-one on the fly. No data is\n",
    "saved to disk.\n",
    "\n",
    "We can also specify our strategy for interleaving datasets. The default strategy, `all_exhausted` is an oversampling \n",
    "strategy. In this case, the dataset construction is stopped as soon as every samples in every dataset \n",
    "has been added at least once. In practice, it means that if a dataset is exhausted, it will return to the \n",
    "beginning of this dataset until the stop criterion has been reached. You can specify `stopping_strategy=first_exhausted` \n",
    "for a subsampling strategy, i.e the dataset construction is stopped as soon one of the dataset runs out of samples. "
   ]
  },
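  {
   "cell_type": "markdown",
   "id": "interleave-strategies-sketch-md",
   "metadata": {},
   "source": [
    "Before defining the full function, here is a minimal sketch of the two stopping strategies using two toy in-memory datasets of unequal length (`long_ds` and `short_ds` are illustrative only, not part of our training corpus):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "interleave-strategies-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import Dataset, interleave_datasets\n",
    "\n",
    "# two toy datasets of unequal length\n",
    "long_ds = Dataset.from_dict({\"sentence\": [\"a\", \"b\", \"c\", \"d\"]})\n",
    "short_ds = Dataset.from_dict({\"sentence\": [\"x\", \"y\"]})\n",
    "\n",
    "# subsampling: stop as soon as the shorter dataset runs out of samples\n",
    "subsampled = interleave_datasets([long_ds, short_ds], stopping_strategy=\"first_exhausted\")\n",
    "print([sample[\"sentence\"] for sample in subsampled])  # expected: ['a', 'x', 'b', 'y']\n",
    "\n",
    "# oversampling: the shorter dataset restarts until every sample in every\n",
    "# dataset has been added at least once\n",
    "oversampled = interleave_datasets([long_ds, short_ds], stopping_strategy=\"all_exhausted\")\n",
    "print([sample[\"sentence\"] for sample in oversampled])  # expected: ['a', 'x', 'b', 'y', 'c', 'x', 'd', 'y']"
   ]
  },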
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "61eb4cb1-ee27-4270-a474-1bb33e1df65f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_multiple_streaming_datasets(\n",
    "    dataset_names: List,\n",
    "    dataset_config_names: List,\n",
    "    splits: Optional[List] = None,\n",
    "    text_column_names: Optional[List] = None,\n",
    "    sampling_rate: Optional[int] = 16000,\n",
    "    stopping_strategy: Optional[str] = \"all_exhausted\",\n",
    "    **kwargs\n",
    ") -> IterableDataset:\n",
    "\n",
    "    if len(dataset_names) != len(dataset_config_names):\n",
    "        raise ValueError(\n",
    "            f\"Ensure one config is passed for each dataset, got {len(dataset_names)} datasets and\"\n",
    "            f\" {len(dataset_config_names)} configs.\"\n",
    "        )\n",
    "\n",
    "    if splits is not None and len(splits) != len(dataset_names):\n",
    "        raise ValueError(\n",
    "            f\"Ensure one split is passed for each dataset, got {len(dataset_names)} datasets and {len(splits)} splits.\"\n",
    "        )\n",
    "\n",
    "    if text_column_names is not None and len(text_column_names) != len(dataset_names):\n",
    "        raise ValueError(\n",
    "            f\"Ensure one text column name is passed for each dataset, got {len(dataset_names)} datasets and\"\n",
    "            f\" {len(text_column_names)} text column names.\"\n",
    "        )\n",
    "\n",
    "    splits = splits if splits is not None else [\"train\" for i in range(len(dataset_names))]\n",
    "    text_column_names = (\n",
    "        text_column_names if text_column_names is not None else [\"text\" for i in range(len(dataset_names))]\n",
    "    )\n",
    "\n",
    "    all_datasets = []\n",
    "    # iterate over the datasets we want to interleave\n",
    "    for i, dataset_name in enumerate(dataset_names):\n",
    "        dataset = load_dataset(dataset_name, dataset_config_names[i], split=splits[i], streaming=True, **kwargs)\n",
    "        # resample to specified sampling rate\n",
    "        dataset = dataset.cast_column(\"audio\", Audio(sampling_rate))\n",
    "        #  normalise columns to [\"audio\", \"sentence\"]\n",
    "        if text_column_names[i] != \"sentence\":\n",
    "            dataset = dataset.rename_column(text_column_names[i], \"sentence\")\n",
    "        dataset = dataset.remove_columns(set(dataset.features.keys()) - set([\"audio\", \"sentence\"]))\n",
    "        all_datasets.append(dataset)\n",
    "\n",
    "    interleaved_dataset = interleave_datasets(all_datasets, stopping_strategy=stopping_strategy)\n",
    "    return interleaved_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29bc228b-ce9b-4cee-9092-1223ddfa51ad",
   "metadata": {},
   "source": [
    "Let's apply this function to load and merge our four datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8ae90f83-4ecd-46a3-98be-bd75706e0d88",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = load_multiple_streaming_datasets(dataset_names, dataset_config_names=dataset_config_names, text_column_names=text_column_names, use_auth_token=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6056a693-1fb0-45f4-ad43-be5f1812c1a5",
   "metadata": {},
   "source": [
    "### Iterate over the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ffe011f-f905-4027-ab67-5c9c3b2b5ac0",
   "metadata": {},
   "source": [
    "We iterate over the dataset, loading and merging samples on the fly. Let's print the transcriptions for the first 10 samples of our merged dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "75b3355a-3c06-4d23-af43-2b93b1ad70b2",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Reading metadata...: 230467it [00:41, 5545.80it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 ¿ Qué tal a tres de cinco ?\n",
      "1 y desde luego esa razón no puede tener que ver con la explicación surrealista que hemos escuchado más de una vez de que se trata de una conspiración izquierdista.\n",
      "2 para exclamar con voz de acción de gracias y para contar todas tus maravillas jehová la habitación de tu casa he amado y el lugar del tabernáculo de tu gloria no juntes con los pecadores mi alma ni con los hombres de sangres mi vida\n",
      "3 el uso de internet y de la red informática mundial permite que los estudiantes tengan acceso a la información en todo momento\n",
      "4 vamos , quiero decir , que no soy de citas especiales .\n",
      "5 si bien esta lista no es perfecta sí que resulta necesario que las entidades financieras refuercen sus controles.\n",
      "6 oye oh jehová mi voz con que á ti clamo y ten misericordia de mí respóndeme mi corazón ha dicho de ti buscad mi rostro tu rostro buscaré oh jehová\n",
      "7 los deportes de nieve en descenso como el esquí y la tablanieve son disciplinas populares que consisten en deslizarse con esquís o una tabla fijada a los pies sobre un terreno nevado\n",
      "8 fray Lope , en aquel momento , colmaba otro vaso igual :\n",
      "9 señora presidenta la competitividad es importante pero no puede ser el único criterio.\n"
     ]
    }
   ],
   "source": [
    "for i, sample in enumerate(ds):\n",
    "    print(i, sample[\"sentence\"])\n",
    "    if i == 9:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42d5ad08-b20e-4cba-a1a9-909fdbf030d4",
   "metadata": {},
   "source": [
    "We can see that the transcriptions take several different formats. Those from Common Voice 11 are cased and punctuated. Those from VoxPopuli are punctuated only. Those from Multilingual LibriSpeech and FLEURS are neither cased not punctuated. We need to normalise the transcriptions to a uniform format before training our model. \n",
    "\n",
    "The following code cell is lifted from the Whisper training notebook: https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine-tune-whisper-streaming.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ed20e9cd-31c2-44cb-872b-333378a92fd1",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/sanchitgandhi/venv/lib/python3.8/site-packages/jax/_src/lib/__init__.py:33: UserWarning: JAX on Mac ARM machines is experimental and minimally tested. Please see https://github.com/google/jax/issues/5501 in the event of problems.\n",
      "  warnings.warn(\"JAX on Mac ARM machines is experimental and minimally tested. \"\n"
     ]
    }
   ],
   "source": [
    "from transformers.models.whisper.english_normalizer import BasicTextNormalizer\n",
    "\n",
    "do_lower_case = True\n",
    "do_remove_punctuation = True\n",
    "\n",
    "normalizer = BasicTextNormalizer()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01d13029-c24f-4a51-aff2-9251a2ceb4ce",
   "metadata": {},
   "source": [
    "Now we define a function to normalise our transcriptions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "26e42417-4bd2-46f8-914e-3a6f9f3471ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "def normalize_transcriptions(batch):\n",
    "    # optional pre-processing steps\n",
    "    transcription = batch[\"sentence\"]\n",
    "    if do_lower_case:\n",
    "        transcription = transcription.lower()\n",
    "    if do_remove_punctuation:\n",
    "        transcription = normalizer(transcription).strip()\n",
    "    batch[\"sentence\"] = transcription\n",
    "    return batch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b1c67fe-be4b-4ee5-9a1f-0d444f2b5c62",
   "metadata": {},
   "source": [
    "Let's apply the data pre-processing steps to our dataset and view the first 10 samples again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "0babac71-9157-4d0f-a8a8-184547bdf501",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Reading metadata...: 230467it [00:32, 6984.59it/s] \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0  qué tal a tres de cinco \n",
      "1 y desde luego esa razón no puede tener que ver con la explicación surrealista que hemos escuchado más de una vez de que se trata de una conspiración izquierdista \n",
      "2 para exclamar con voz de acción de gracias y para contar todas tus maravillas jehová la habitación de tu casa he amado y el lugar del tabernáculo de tu gloria no juntes con los pecadores mi alma ni con los hombres de sangres mi vida\n",
      "3 el uso de internet y de la red informática mundial permite que los estudiantes tengan acceso a la información en todo momento\n",
      "4 vamos quiero decir que no soy de citas especiales \n",
      "5 si bien esta lista no es perfecta sí que resulta necesario que las entidades financieras refuercen sus controles \n",
      "6 oye oh jehová mi voz con que á ti clamo y ten misericordia de mí respóndeme mi corazón ha dicho de ti buscad mi rostro tu rostro buscaré oh jehová\n",
      "7 los deportes de nieve en descenso como el esquí y la tablanieve son disciplinas populares que consisten en deslizarse con esquís o una tabla fijada a los pies sobre un terreno nevado\n",
      "8 fray lope en aquel momento colmaba otro vaso igual \n",
      "9 señora presidenta la competitividad es importante pero no puede ser el único criterio \n"
     ]
    }
   ],
   "source": [
    "ds = ds.map(normalize_transcriptions)\n",
    "\n",
    "for i, sample in enumerate(ds):\n",
    "    print(i, sample[\"sentence\"])\n",
    "    if i == 9:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d135627a-a7aa-458c-94b8-57ddeae74a72",
   "metadata": {},
   "source": [
    "This time the transcriptions are in a consistent format. We can use this data to fine-tune our Whisper model. Note that since we've removed punctuation and casing, the Whisper model won't learn to predict these features."
   ]
  }
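,
  {
   "cell_type": "markdown",
   "id": "whisper-prepare-sketch-md",
   "metadata": {},
   "source": [
    "To close, here is a minimal sketch of how each merged sample could be converted into Whisper model inputs, assuming the `openai/whisper-small` checkpoint and mirroring the pre-processing in the linked fine-tuning notebook (the function name `prepare_dataset` is our own choice):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "whisper-prepare-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import WhisperProcessor\n",
    "\n",
    "# assumption: the small multilingual checkpoint; any Whisper checkpoint works\n",
    "processor = WhisperProcessor.from_pretrained(\"openai/whisper-small\", language=\"spanish\", task=\"transcribe\")\n",
    "\n",
    "def prepare_dataset(batch):\n",
    "    # compute log-Mel input features from the (already resampled) audio array\n",
    "    audio = batch[\"audio\"]\n",
    "    batch[\"input_features\"] = processor.feature_extractor(audio[\"array\"], sampling_rate=audio[\"sampling_rate\"]).input_features[0]\n",
    "    # encode the normalised transcription to label ids\n",
    "    batch[\"labels\"] = processor.tokenizer(batch[\"sentence\"]).input_ids\n",
    "    return batch\n",
    "\n",
    "vectorized_ds = ds.map(prepare_dataset)"
   ]
  }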
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}