gemma3n-audio-encoder-whisper-decoder

This model combines the mesolitica/gemma-3n-e4b-it-audio-encoder encoder plus a projection layer with the openai/whisper-large-v3-turbo decoder.
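
A minimal sketch of how the pieces might be wired together, assuming the projection is a single linear layer that maps the audio encoder width to the Whisper decoder's d_model. The class name, the encoder's forward signature, and the config attribute names are assumptions for illustration only; the released checkpoint loads its actual implementation through trust_remote_code.

import torch.nn as nn
from transformers import AutoModel, WhisperForConditionalGeneration

class AudioEncoderWhisperDecoder(nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        # Gemma 3n audio encoder (assumed to expose last_hidden_state and config.hidden_size)
        self.encoder = AutoModel.from_pretrained(
            'mesolitica/gemma-3n-e4b-it-audio-encoder', trust_remote_code = True)
        whisper = WhisperForConditionalGeneration.from_pretrained('openai/whisper-large-v3-turbo')
        self.decoder = whisper.model.decoder
        self.lm_head = whisper.proj_out
        # assumed: a single linear projection into the decoder width (1280 for large-v3-turbo)
        self.projection = nn.Linear(self.encoder.config.hidden_size, whisper.config.d_model)

    def forward(self, input_features, input_features_mask, decoder_input_ids):
        # the encoder forward signature is assumed; the real remote code may differ
        audio = self.encoder(input_features, input_features_mask).last_hidden_state
        audio = self.projection(audio)
        hidden = self.decoder(
            input_ids = decoder_input_ids,
            encoder_hidden_states = audio,
        ).last_hidden_state
        return self.lm_head(hidden)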

This model will later be used to introduce VQ (vector quantization) in the projection layer.

WandB logs at https://wandb.ai/huseinzol05/gemma3n-audio-whisper-decoder-v2

Training datasets

  1. malaysia-ai/common_voice_17_0
  2. mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_segments
  3. mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_manglish_segments
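
A hedged sketch for pulling these with the 🤗 datasets library; the config, data_dir, and split arguments for the Stage2 subsets are assumptions based on the paths above, not verified names.

from datasets import load_dataset

# Common Voice 17.0 mirror; config/split names are assumptions
cv = load_dataset('malaysia-ai/common_voice_17_0', split = 'train')

# the two Stage2 subsets look like data directories inside one repo; data_dir is an assumption
chat = load_dataset(
    'mesolitica/Malaysian-STT-Whisper-Stage2',
    data_dir = 'malaysian_multiturn_chat_assistants_segments',
    split = 'train',
)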

How to use

from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer
import librosa

model_id = "mesolitica/gemma3n-audio-encoder-whisper-decoder"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code = True, torch_dtype = 'auto').cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load audio at the feature extractor's expected sampling rate
y, sr = librosa.load('common_voice_ba_26517811.mp3', sr = feature_extractor.sampling_rate)

# Whisper-style decoder prompt: task tokens plus the language tag
input_ids = tokenizer(
    '<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>', 
    add_special_tokens = False, return_tensors = 'pt')['input_ids']

# prepare encoder inputs and move everything to GPU
features = feature_extractor([y], return_tensors = 'pt')
features['input_features'] = features['input_features'].cuda()
features['input_features_mask'] = features['input_features_mask'].cuda()
features['attention_mask'] = features['input_features_mask']
features['decoder_input_ids'] = input_ids.cuda()

generate_kwargs = dict(
    **features,
    max_new_tokens=1024,
    temperature=0.1,
    do_sample=True,
)
generation_output = model.generate(**generate_kwargs)
print(tokenizer.decode(generation_output[0]))

Output,

<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Кубы сыраохта был халя гешенең битарафлыгы сәпәпсем.<|endoftext|>
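
To print only the transcription, the standard tokenizer option drops the task tokens, assuming they are registered as special tokens in this tokenizer (as they are in the Whisper tokenizer):

print(tokenizer.decode(generation_output[0], skip_special_tokens = True))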

Evaluation

Evaluated on malaysia-ai/common_voice_17_0/test across up to 115 languages, with the following conditions (a scoring sketch follows the list):

  1. Lowercase the text.
  2. Remove punctuation.
  3. Provide the language tag in the decoder input ids, <|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>.
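
A minimal sketch of the normalization and per-sample scoring implied by these conditions, assuming CER is computed as character-level edit distance with jiwer; the actual evaluation script lives in the source-code link below and may differ (for example in how punctuation is handled for non-Latin scripts).

import string
from jiwer import cer

def normalize(text):
    # conditions 1 and 2: lowercase and remove punctuation
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation)).strip()

def decoder_prompt(lang):
    # condition 3: language tag inside the Whisper-style decoder prompt
    return f'<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>'

# toy example
print(decoder_prompt('ru'))
print(cer(normalize('Hello, world!'), normalize('hello world')))  # 0.0

Results per language,
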
lang: gl, samples: 9949, CER: 0.10001307096111269
lang: en, samples: 16379, CER: 0.17855522231488422
lang: ar, samples: 10458, CER: 0.3920973287847606
lang: kab, samples: 14972, CER: 0.2890434778362997
lang: ml, samples: 703, CER: 0.53111034888622
lang: kk, samples: 514, CER: 0.26197896016576977
lang: ltg, samples: 2904, CER: 0.2302614426752208
lang: fr, samples: 16145, CER: 0.14363098867403024
lang: de, samples: 16170, CER: 0.10203276511545173
lang: fi, samples: 1554, CER: 0.19801679406159686
lang: pt, samples: 9432, CER: 0.19866563306611487
lang: ia, samples: 1816, CER: 0.10785148694066107
lang: eu, samples: 13621, CER: 0.0816159607087014
lang: ro, samples: 3896, CER: 0.1749313437242837
lang: sw, samples: 12086, CER: 0.18174559387398373
lang: sv-SE, samples: 5247, CER: 0.24699414948855683
lang: ta, samples: 8263, CER: 0.1591554747060857
lang: et, samples: 2653, CER: 0.28433640204000193
lang: lg, samples: 11902, CER: 0.19873733001302288
lang: it, samples: 15154, CER: 0.09142970979792102
lang: mhr, samples: 15107, CER: 0.1460395233055186
lang: sr, samples: 1539, CER: 0.18619033837797466
lang: mr, samples: 1437, CER: 0.2693224678860725
lang: ka, samples: 12608, CER: 0.11379174726810976
lang: es, samples: 15848, CER: 0.08456472808944927
lang: be, samples: 15878, CER: 0.07999612214738314
lang: lt, samples: 4753, CER: 0.21739870113959847
lang: ca, samples: 16389, CER: 0.07843294405976182
lang: eo, samples: 14773, CER: 0.07092754124459882
lang: tr, samples: 11235, CER: 0.13840346899649097
lang: hu, samples: 11435, CER: 0.14113964126152187
lang: ja, samples: 6033, CER: 0.6701012647100653
lang: br, samples: 2202, CER: 0.37293224449143
lang: ne-NP, samples: 217, CER: 0.38921455930467563
lang: uz, samples: 12006, CER: 0.16131074778565763
lang: ru, samples: 10184, CER: 0.1717562392034892
lang: dv, samples: 2213, CER: 0.5041609913222977
lang: tt, samples: 4953, CER: 0.17021703446818562
lang: rw, samples: 14797, CER: 0.20083010501670184
lang: bn, samples: 9327, CER: 0.28723572975468753
lang: ug, samples: 6108, CER: 0.18680496552495066
lang: rm-sursilv, samples: 1361, CER: 0.25612906269681673
lang: bg, samples: 3201, CER: 0.19931258488649384
lang: ab, samples: 9108, CER: 0.20407915408179322
lang: uk, samples: 9915, CER: 0.1507872570822808
lang: mt, samples: 1662, CER: 0.2914987670344507
lang: fa, samples: 10292, CER: 0.20739419466433526
lang: pl, samples: 9186, CER: 0.18724216754108106
lang: bas, samples: 541, CER: 0.34955661423354883
lang: nl, samples: 11255, CER: 0.1417839093645191
lang: zh-CN, samples: 10335, CER: 0.5528945143870808
lang: tok, samples: 2175, CER: 0.06793887604702466
lang: ur, samples: 4052, CER: 0.21689643521492455
lang: sk, samples: 2593, CER: 0.211394177390933
lang: oc, samples: 254, CER: 0.3122240568793787
lang: yue, samples: 2585, CER: 0.6512225359702296
lang: mrj, samples: 7102, CER: 0.19361595158107917
lang: fy-NL, samples: 3167, CER: 0.22319058088514185
lang: cs, samples: 9055, CER: 0.16735999160107332
lang: th, samples: 10982, CER: 0.25952886326322894
lang: ckb, samples: 5262, CER: 0.19870016286852615
lang: mn, samples: 1896, CER: 0.3460300138649441
lang: ky, samples: 1604, CER: 0.2485977401511545
lang: skr, samples: 1006, CER: 0.3679916888713186
lang: hy-AM, samples: 4281, CER: 0.18672960280946826
lang: sl, samples: 1242, CER: 0.19342800359225812
lang: vi, samples: 1077, CER: 0.3215907676168487
lang: hi, samples: 3151, CER: 0.2159702205037038
lang: nan-tw, samples: 2317, CER: 0.6581131657053445
lang: id, samples: 3633, CER: 0.11272396366324638
lang: cy, samples: 5371, CER: 0.27364002764427275
lang: yo, samples: 999, CER: 0.5129866240233623
lang: sah, samples: 1455, CER: 0.21852319472052653
lang: mk, samples: 1097, CER: 0.1904789445208563
lang: cv, samples: 1288, CER: 0.2960106892111324
lang: myv, samples: 479, CER: 0.21331218745348024
lang: da, samples: 2405, CER: 0.3030715126105987
lang: lv, samples: 6738, CER: 0.17671911982030888
lang: kmr, samples: 3900, CER: 0.20755413256986208
lang: tk, samples: 545, CER: 0.38378755119423247
lang: nn-NO, samples: 370, CER: 0.30825107532406326
lang: ha, samples: 661, CER: 0.2820795064549202
lang: he, samples: 260, CER: 0.8540917719794076
lang: dyu, samples: 59, CER: 0.3676472621477804
lang: gn, samples: 855, CER: 0.36521987550483853
lang: lij, samples: 694, CER: 0.32638969844903387
lang: hsb, samples: 444, CER: 0.2978765899428805
lang: pa-IN, samples: 487, CER: 0.6379148851920289
lang: el, samples: 1696, CER: 0.23668647924320704
lang: zgh, samples: 159, CER: 1.0
lang: as, samples: 551, CER: 0.4089405271640912
lang: sq, samples: 472, CER: 0.3082234872203625
lang: ko, samples: 338, CER: 1.0
lang: ga-IE, samples: 517, CER: 0.5036573978122563
lang: cnh, samples: 763, CER: 0.2821869658617442
lang: sat, samples: 147, CER: 1.0
lang: rm-vallader, samples: 462, CER: 0.31260688644939844
lang: or, samples: 670, CER: 0.9344857917784284
lang: mdf, samples: 104, CER: 0.28625694258580164
lang: af, samples: 62, CER: 0.34808710360162043
lang: ig, samples: 4, CER: 0.6830073016564953
lang: sc, samples: 232, CER: 0.33633577407382353
lang: tig, samples: 169, CER: 1.0
lang: te, samples: 49, CER: 0.6987145525464048
lang: ps, samples: 199, CER: 0.3551799159800677
lang: am, samples: 205, CER: 0.8710092686406888
lang: ast, samples: 162, CER: 0.21987822338723598
lang: os, samples: 50, CER: 0.5566219928475798
lang: lo, samples: 33, CER: 1.0
lang: az, samples: 33, CER: 0.3339418766132571
lang: ti, samples: 4, CER: 1.0
lang: vot, samples: 6, CER: 0.347936293302426
lang: nhi, samples: 5, CER: 0.48875994972769166
lang: yi, samples: 6, CER: 0.8944217107490507
lang: tw, samples: 9, CER: 0.40325320135071413

average CER: 0.33489257308696924

Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/gemma3n-audio-whisper-decoder
