ASR Inference
Hi,
I've managed to fine-tune the ASR model, the last step in this recipe. However, I'm struggling to understand how I can easily use an inference class to transcribe enhanced audio files with the model. It seems like the pretrained ASR model may use an encoder-decoder interface, but the modules produced in the recipe's final hyperparameters are different from what that interface expects.
Clearly, based on the 'test_stats', a model capable of ASR is produced, but which interface to use for inference is a bit unclear - whether this needs to be something custom, or whether it's simpler than that. If you could provide some clarity on this, that would be helpful.
Thanks.
Hi, thanks for your interest!
Unfortunately, we haven't gotten around to writing an inference class for the robust-ASR model. You can do it yourself, however, with code similar to this (untested):
# Enhance the noisy waveforms, then feed the cleaned audio to the ASR pipeline
noisy_wavs, wav_lens = batch.noisy_sig
cleaned_wavs, _ = self.modules.enhance_model(noisy_wavs)
# Compute filterbank features and normalize them
asr_feats = self.hparams.fbank(cleaned_wavs)
asr_feats = self.hparams.normalizer(asr_feats, wav_lens)
# Encode the features and run beam search over the decoder
embed = self.modules.src_embedding(asr_feats)
hypotheses, _ = self.hparams.beam_searcher(embed.detach(), wav_lens)
# Map the predicted token ids back to words
pred_words = [self.token_encoder.decode_ids(token_seq) for token_seq in hypotheses]
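If you want to go from a file on disk to words, here is a minimal sketch (untested) of how the same pipeline could be wrapped in a helper method on the recipe's Brain subclass. The method name transcribe_file is hypothetical, and the module/hparams keys are assumed to match the recipe's YAML:

import torch
from speechbrain.dataio.dataio import read_audio

def transcribe_file(self, wav_path):
    # Load one noisy file and treat it as a batch of size 1
    noisy_wavs = read_audio(wav_path).unsqueeze(0).to(self.device)
    wav_lens = torch.ones(1, device=self.device)  # relative lengths
    with torch.no_grad():
        cleaned_wavs, _ = self.modules.enhance_model(noisy_wavs)
        asr_feats = self.hparams.fbank(cleaned_wavs)
        asr_feats = self.hparams.normalizer(asr_feats, wav_lens)
        embed = self.modules.src_embedding(asr_feats)
        hypotheses, _ = self.hparams.beam_searcher(embed, wav_lens)
    return [self.token_encoder.decode_ids(seq) for seq in hypotheses]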
Hope this helps!
I suppose it might also work to use the EncoderDecoderASR inference class with a custom YAML file. Try copying https://huggingface.co/speechbrain/asr-crdnn-rnnlm-librispeech/blob/main/hyperparams.yaml and adding the enhance model in two places:
+ enhance_model: # ... copy from the file in this repo
...
# We compose the inference (encoder) pipeline.
encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null]
+   enhance_features: !ref <enhance_model>
    compute_features: !ref <compute_features>
    normalize: !ref <normalizer>
    model: !ref <enc>
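If the modified hyperparams.yaml and the checkpoints it references are saved in a local folder, loading it could then look roughly like this (untested; paths are placeholders, and in recent SpeechBrain versions the class lives in speechbrain.inference.ASR rather than speechbrain.pretrained):

from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="path/to/model_dir",      # placeholder: folder with the custom YAML + checkpoints
    hparams_file="hyperparams.yaml",
    savedir="pretrained_robust_asr",
)

# transcribe_file runs the encoder pipeline (enhance -> features -> encode)
# and the beam-search decoder defined in the YAML
print(asr_model.transcribe_file("example_noisy.wav"))  # placeholder path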
The first bit of code worked without issues. Thanks for that one!