Cretan XLS-R model
Cretan is a variety of Modern Greek spoken predominantly on the island of Crete and in the Cretan diaspora. The diaspora includes communities of Cretan origin that were relocated to the village of Hamidieh in Syria and to Western Asia Minor, following the population exchange between Greece and Turkey in 1923. The historical and geographical factors that have shaped the development and preservation of the dialect include the long-term isolation of Crete from the mainland and the successive domination of the island by foreign powers, such as the Arabs, the Venetians, and the Turks, over a period of seven centuries. On the basis of its phonological, phonetic, morphological, and lexical characteristics, Cretan is divided into two major dialect groups: western and eastern. The boundary between these groups coincides with the administrative boundary between the prefectures of Rethymno and Heraklion. Kontosopoulos (2008) argues that the eastern dialect group is more homogeneous than the western one, which shows more variation across all levels of linguistic analysis. Unlike other Modern Greek dialects, Cretan does not face the threat of extinction, as it remains the sole means of communication for a large number of speakers in various parts of the island.
This is the first automatic speech recognition (ASR) model for Cretan. To train the model, we fine-tuned a Greek XLS-R model (jonatasgrosman/wav2vec2-large-xlsr-53-greek) on the Cretan resources (see below).
Resources
For the compilation of the Cretan corpus, we gathered 32 tapes containing material from
radio broadcasts in digital format, with permission from the Audiovisual Department of the
Vikelaia Municipal Library of Heraklion, Crete. These broadcasts were recorded and
aired by Radio Mires, in the Messara region of Heraklion, during the period 1998-2001,
totaling 958 minutes and 47 seconds. These recordings primarily consist of narratives
by one speaker, Ioannis Anagnostakis, who is responsible for their composition. In terms
of textual genre, the linguistic content of the broadcasts consists of folklore
narratives expressed in the local linguistic variety. Out of the total volume of material
collected, we utilized nine tapes. Criteria for material selection included, on the one hand,
maximizing digital clarity of speech and, on the other hand, ensuring representative sampling
across the entire three-year period of radio recordings. To obtain an initial transcription,
we used Whisper Large-v2, the largest Whisper model available at the time. The transcripts
were then manually corrected in collaboration with the local community.
The transcription follows the Greek alphabet and orthography, and the annotation
was carried out in Praat.
To prepare the dataset, the texts were normalized (see greek_dialects_asr/ for scripts), and all audio files were converted into a 16 kHz mono format.
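As a rough illustration of what transcript normalization can look like, here is a minimal sketch. The specific rules (Unicode NFC composition, lowercasing, replacing punctuation with spaces, collapsing whitespace) and the function name `normalize_transcript` are assumptions made for this example; the scripts in greek_dialects_asr/ define the normalization that was actually used and may differ.

```python
import re
import unicodedata

# Assumption: keep letters, digits and whitespace; everything else becomes a space.
_NON_WORD = re.compile(r"[^\w\s]", flags=re.UNICODE)
_SPACES = re.compile(r"\s+")


def normalize_transcript(text: str) -> str:
    """Hypothetical normalizer, not the project's own script."""
    # Compose accented Greek letters into single code points first, so that
    # combining accents are not stripped by the punctuation filter below.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    text = _NON_WORD.sub(" ", text)
    # Collapse runs of whitespace and trim the ends.
    return _SPACES.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_transcript("  Καλημέρα,   Κρήτη! "))
```

Keeping accents and stripping only punctuation is one possible choice; a different pipeline might, for instance, also remove diacritics.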
We split the Praat annotations into audio-transcription segments, which resulted in a dataset of a total duration of 1h 21m 12s. Note that the removal of music, long pauses, and non-transcribed segments leads to a reduction of the total audio duration (compared to the initial 2h recordings of the 9 tapes).
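The reduction in duration follows from keeping only the annotated intervals that carry a transcription. The sketch below shows that bookkeeping on a few made-up intervals; the `Interval` type, the helper names, and the sample values are hypothetical, and reading the intervals out of a Praat TextGrid file is left out.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str     # transcription; empty for music, pauses, untranscribed audio


def keep_transcribed(intervals: List[Interval]) -> List[Interval]:
    """Drop intervals whose transcription is empty or whitespace only."""
    return [iv for iv in intervals if iv.text.strip()]


def total_duration(intervals: List[Interval]) -> float:
    """Sum of interval lengths in seconds."""
    return sum(iv.end - iv.start for iv in intervals)


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as 'Hh MMm SSs'."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    tier = [
        Interval(0.0, 2.5, "καλημέρα"),
        Interval(2.5, 10.0, ""),      # e.g. music, not transcribed
        Interval(14.0, 15.5, "κρήτη"),
    ]
    segments = keep_transcribed(tier)
    print(len(segments), format_hms(total_duration(segments)))
```

Each kept interval would then be cut out of the 16 kHz audio and paired with its text to form one training example.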
Metrics
We evaluated the model on the test split, which consists of 10% of the dataset recordings.
Model | WER | CER |
---|---|---|
pre-trained | 104.83% | 91.73% |
fine-tuned | 28.27% | 7.88% |
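For readers unfamiliar with the metrics, word error rate (WER) and character error rate (CER) are edit distances divided by the length of the reference. Because insertions count as errors, the rate can exceed 100%, as in the pre-trained row above. The following is a minimal per-utterance sketch with hypothetical function names; it is not the evaluation code behind the table, and how scores were aggregated over the test split is not stated here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Two substitutions plus one insertion against a two-word reference.
    print(wer("a b", "x y z"))
```

In the usage example, three errors against two reference words give a WER of 1.5, that is, 150%.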
Training hyperparameters
We fine-tuned the baseline model (wav2vec2-large-xlsr-53-greek) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:
arg | value |
---|---|
per_device_train_batch_size | 8 |
gradient_accumulation_steps | 2 |
num_train_epochs | 35 |
learning_rate | 3e-4 |
warmup_steps | 500 |
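The argument names above match those of the Hugging Face `TrainingArguments` class. As a small worked example, the sketch below stores the values in a plain dictionary and derives the effective batch size per optimizer step. The helper `effective_batch_size` is hypothetical, and the single-device default is an assumption based on the one GPU mentioned above.

```python
# Values copied from the hyperparameter table.
hyperparameters = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 35,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}


def effective_batch_size(params: dict, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step.

    Assumption: num_devices=1, since a single RTX 3090 is mentioned.
    """
    return (params["per_device_train_batch_size"]
            * params["gradient_accumulation_steps"]
            * num_devices)


if __name__ == "__main__":
    print(effective_batch_size(hyperparameters))
```

Under that assumption, each optimizer step accumulates gradients from 8 × 2 = 16 examples.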
Citation
To cite this work or read more about the training pipeline, see:
S. Vakirtzian, C. Tsoukala, S. Bompolas, K. Mouzou, V. Stamou, G. Paraskevopoulos, A. Dimakis, S. Markantonatou, A. Ralli, A. Anastasopoulos, Speech Recognition for Greek Dialects: A Challenging Benchmark, Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2024.