How to use the ASR on Llama 3.1
From the paper:
8.1 Data
8.1.1 Speech Understanding
The training data can be categorized into two types. The pre-training data includes a large amount of
unlabeled speech, which is used to initialize the speech encoder in a self-supervised manner. The supervised
finetuning data includes speech recognition, speech translation, and spoken dialogue data; this data is used to
unlock specific abilities when integrated with the large language model.
Pre-training data. To pre-train the speech encoder, we curate a dataset of approximately 15M hours of speech
recordings encompassing a large number of languages. We filter our audio data using a voice activity detection
(VAD) model and select audio samples with a VAD threshold above 0.7 for pre-training. In speech pre-training
data, we also focus on ensuring the absence of PII. We use the Presidio Analyzer to identify such PII.
Speech recognition and translation data. Our ASR training data contains 230K hours of manually transcribed
speech recordings that span 34 languages. Our AST training data contains 90K hours of translations in
two directions: from 33 languages to English and from English to 33 languages. This data contains both
supervised and synthetic data generated using the NLLB toolkit (NLLB Team et al., 2022). The use of
synthetic AST data enables us to increase model quality for low-resource languages. The speech segments in
our data have a maximum length of 60 seconds.
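To make the synthetic AST generation concrete, here is a rough sketch that machine-translates ASR transcripts with an NLLB-200 checkpoint via Hugging Face transformers. The checkpoint name, the FLORES-200 language codes, and the pipeline API are assumptions about tooling; the paper only states that the NLLB toolkit was used.

```python
# Sketch: turn (speech, transcript) ASR pairs into (speech, translation) AST pairs
# by translating the transcript with NLLB-200 (distilled 600M checkpoint assumed).
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="deu_Latn",  # FLORES-200 code of the source language (German here)
    tgt_lang="eng_Latn",  # translate into English
)

def synth_ast_target(transcript: str) -> str:
    return translator(transcript, max_length=400)[0]["translation_text"]

print(synth_ast_target("Das Wetter ist heute sehr schön."))
```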
Spoken dialogue data. To finetune the speech adapter for spoken dialogue, we synthetically generate responses for speech prompts by asking the language model to respond to transcriptions of those prompts (Fathullah et al., 2024). We generate synthetic data this way using a subset of the ASR dataset with 60K hours of speech. In addition, we generate 25K hours of synthetic data by running the Voicebox TTS system (Le et al., 2024) on subsets of the data used to finetune Llama 3. We used several heuristics to select a subset of finetuning data that matches the distribution of speech. These heuristics include focusing on relatively short prompts with a simple structure and without non-text symbols.

Footnote 18: The speech interface supports the following 34 languages: Arabic, Bengali, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Gujarati, Hindi, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Malayalam, Marathi, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Urdu, Vietnamese.
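The prompt-selection heuristics are only described loosely, so the filter below is an illustrative guess: it keeps short, single-paragraph prompts and rejects anything with symbols a TTS system would not read out naturally. The word limit and character set are placeholder assumptions, not values from the paper.

```python
import re

MAX_WORDS = 40  # assumed cutoff for "relatively short"

def looks_speakable(prompt: str) -> bool:
    # Reject prompts containing non-text symbols (markup, code, URLs).
    if re.search(r"[<>{}\[\]`#*_|\\=~^]", prompt):
        return False
    if "http://" in prompt or "https://" in prompt:
        return False
    # Favor a simple structure: a single short paragraph, not a list or multi-part prompt.
    if "\n" in prompt.strip():
        return False
    return len(prompt.split()) <= MAX_WORDS
```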
8.1.2 Speech Generation
The speech generation datasets mainly consist of those used to train the text normalization (TN) model and the prosody model (PM). Both training sets are augmented with an additional input feature, the Llama 3 embeddings, which provide contextual information.
Text normalization data. Our TN training dataset includes 55K samples that cover a wide range of semiotic
classes (e.g., number, date, time) that require non-trivial normalization. Each sample is a pair of written-form
text and the corresponding normalized spoken-form text, with an inferred sequence of handcrafted TN rules
that carry out the normalization.
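For concreteness, one TN training sample might look like the following; the field names and rule identifiers are invented for illustration, since the paper does not publish the sample schema.

```python
# Hypothetical shape of one written-form / spoken-form TN pair with its
# inferred rule sequence (names are made up for illustration).
tn_sample = {
    "written": "Call me on 3/14 at 5:30 PM.",
    "spoken":  "call me on march fourteenth at five thirty p m",
    "rules":   ["DATE_MDY_TO_WORDS", "TIME_12H_TO_WORDS"],
}
```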
Prosody model data. The PM training data includes linguistic and prosodic features extracted from a 50K-hour TTS dataset consisting of transcripts paired with audio recorded by professional voice actors in studio settings.
Llama 3 embedding. The Llama 3 embeddings are taken as the output of the 16th decoder layer. We work exclusively with the Llama 3 8B model and extract the embeddings for a given text (i.e., the written-form input text for TN or the audio transcript for PM) as if they were generated by the Llama 3 model with an empty user prompt. In a given sample, each chunk in the Llama 3 token sequence is explicitly aligned with the corresponding chunks in the native input sequence for TN or PM, i.e., TN-specific text tokens (demarcated by Unicode category) or phone-rate features, respectively. This allows the TN and PM modules to be trained with streaming input of Llama 3 tokens and embeddings.
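A minimal sketch of extracting such embeddings with Hugging Face transformers, assuming the public Llama 3 8B checkpoint: `hidden_states[0]` is the embedding-layer output, so index 16 corresponds to the output of the 16th decoder layer. How the "empty user prompt" context is constructed is not spelled out, so the text is simply fed on its own here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

@torch.no_grad()
def llama3_layer16_embeddings(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[16]  # (1, seq_len, 4096) for the 8B model
```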
8.2 Model Architecture
8.2.1 Speech Understanding
On the input side, the speech module consists of two successive modules: a speech encoder and an adapter.
The output of the speech module is directly fed into the language model as token representation, enabling
direct interaction between speech and text tokens. Furthermore, we incorporate two new special tokens
to enclose the sequence of speech representations. The speech module differs substantially from the vision
module (see Section 7), which feeds multi-modal information into the language model via cross-attention
layers. By contrast, the speech module generates embeddings that can be seamlessly integrated with text
tokens, enabling the speech interface to leverage all the capabilities of the Llama 3 language model.
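Conceptually, the integration amounts to splicing the adapter's output embeddings between two boundary tokens in the language model's input embedding sequence. The sketch below is not the released implementation, and the token names are invented; the paper only says that two new special tokens enclose the speech representations.

```python
import torch

def build_inputs_embeds(text_ids: torch.Tensor,       # (T_text,) prompt token ids
                        speech_embeds: torch.Tensor,  # (T_speech, d_model) adapter output
                        embed_tokens,                 # the LM's token embedding table (nn.Embedding)
                        begin_id: int,                # id of an assumed <|begin_speech|> token
                        end_id: int) -> torch.Tensor: # id of an assumed <|end_speech|> token
    begin = embed_tokens(torch.tensor([begin_id]))
    end = embed_tokens(torch.tensor([end_id]))
    text = embed_tokens(text_ids)
    # <begin> [speech embeddings] <end> [text tokens], fed to the LM as inputs_embeds
    return torch.cat([begin, speech_embeds, end, text], dim=0).unsqueeze(0)
```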
Speech encoder. Our speech encoder is a Conformer (Gulati et al., 2020) model with 1B parameters. The
input to the model consists of 80-dimensional mel-spectrogram features, which are first processed by a stride-4
stacking layer followed by a linear projection to reduce the frame length to 40 ms. The resulting features are
processed by an encoder with 24 Conformer layers. Each Conformer layer has a latent dimension of 1536,
and consists of two Macaron-net-style feed-forward networks with dimension 4096, a convolution module with
kernel size 7, and a rotary attention module (Su et al., 2024) with 24 attention heads.
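To make the front-end arithmetic concrete: assuming a standard 10 ms mel hop (not stated in the text), stacking four 80-dimensional frames yields 320-dimensional vectors at a 40 ms rate, which a linear layer projects to the 1536-dimensional Conformer width. The sketch below uses torchaudio's Conformer as a stand-in for the 24-layer encoder; note that it does not use rotary attention as described above.

```python
import torch
import torchaudio

def stack_frames(mels: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """(B, T, 80) mel frames at 10 ms -> (B, T//stride, 80*stride) at 40 ms."""
    b, t, d = mels.shape
    t = (t // stride) * stride
    return mels[:, :t].reshape(b, t // stride, d * stride)

proj = torch.nn.Linear(80 * 4, 1536)              # 320 -> 1536
encoder = torchaudio.models.Conformer(            # stand-in; no rotary attention
    input_dim=1536, num_heads=24, ffn_dim=4096,
    num_layers=24, depthwise_conv_kernel_size=7,
)

mels = torch.randn(1, 400, 80)                    # ~4 s of audio at a 10 ms hop
x = proj(stack_frames(mels))                      # (1, 100, 1536), one vector per 40 ms
out, out_lengths = encoder(x, torch.tensor([x.shape[1]]))
print(out.shape)                                  # torch.Size([1, 100, 1536])
```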
Speech adapter. The speech adapter contains about 100M parameters. It is composed of a convolution layer,
a rotary Transformer layer, and a linear layer. The convolution layer has a kernel size of 3 and a stride of
2, which reduces the speech frame length to 80 ms. This allows the model to provide more coarse-grained features to the language model. The Transformer layer has a latent dimension of 3072 and a feed-forward network with a dimension of 4096, which further processes the speech information with context after the convolutional downsampling. Finally, the linear layer maps the output dimension to match
that of the language-model embedding layer.
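A rough PyTorch sketch of an adapter with these shapes, for orientation: the strided convolution halves the frame rate from 40 ms to 80 ms, one Transformer layer operates at a 3072-dim latent with a 4096-dim feed-forward, and a final linear layer maps to the LM embedding width (4096 for Llama 3 8B). The rotary attention layer is replaced here by a standard nn.TransformerEncoderLayer, and the point where the width changes from 1536 to 3072 is an assumption.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    def __init__(self, enc_dim=1536, d_model=3072, ffn_dim=4096, lm_dim=4096, n_heads=24):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, d_model, kernel_size=3, stride=2, padding=1)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim, batch_first=True)
        self.out = nn.Linear(d_model, lm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, enc_dim) encoder outputs at a 40 ms frame rate
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (B, T//2, d_model), 80 ms rate
        x = self.transformer(x)
        return self.out(x)                                # (B, T//2, lm_dim)

adapter = SpeechAdapter()
print(adapter(torch.randn(1, 100, 1536)).shape)           # torch.Size([1, 50, 4096])
```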
I agree, this is not obvious. Can somebody clarify? If Llama3.1 can perform better than a loosely integrated Whisper+TTS approach, this has a lot of potential.
EDIT: I see that the description now says "Llama3.1 (text-only)", but would still be good to know when/if the speech stack will be released.