DeCRED-base
This is a 39M-parameter encoder-decoder E-Branchformer model trained on 6,000 hours of open-source, normalised English speech data.
Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.
Disclaimer: The model currently hallucinates on silence-only segments, as it was not trained on such data. A fix will be added soon.
The model can be used with the `pipeline` class to transcribe audio files of arbitrary length.
```python
from transformers import pipeline

model_id = "BUT-FIT/ED-small"
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    feature_extractor=model_id,
    trust_remote_code=True,
)
pipe.type = "seq2seq"

# Transcribe with the default generation config (beam search)
result_beam = pipe("audio.wav")

# Disable joint CTC scoring and switch to greedy decoding
pipe.model.generation_config.ctc_weight = 0.0
pipe.model.generation_config.num_beams = 1
result_greedy = pipe("audio.wav")
```