---
license: gemma
---

# Background: USM Encoder extracted from Gemma 3n model
Gemma 3n is able to process audio inputs. That is achieved by encoding audio with a Universal Speech Model (USM, https://arxiv.org/abs/2303.01037) encoder.
This encoder operates at roughly 6.25 frames per second (one frame per 160 ms of audio), and each frame is a continuous embedding with a dimensionality of 1536.
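For reference, a rough estimate of the output shape for a clip of a given length, using only the numbers quoted above (the exact frame count can differ slightly because of padding and subsampling inside the encoder):
```
# Back-of-the-envelope estimate of the USM output shape for a clip,
# using the ~6.25 frames/s and 1536-dim figures quoted above.
FRAMES_PER_SECOND = 6.25
EMBED_DIM = 1536

def approx_output_shape(duration_s: float) -> tuple[int, int]:
    return (int(duration_s * FRAMES_PER_SECOND), EMBED_DIM)

print(approx_output_shape(10.0))  # (62, 1536) -- close to the (63, 1536) produced in the example below
```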
# This repo
To facilitate experimentation with this encoder, I've extracted the audio encoder weights from the full Gemma 3n model,
so that the encoder can be used on its own. The weights come from this [HF Gemma3n repo](https://huggingface.co/google/gemma-3n-E4B-it).
Some imports:
```
import torch
from transformers.models.gemma3n.feature_extraction_gemma3n import Gemma3nAudioFeatureExtractor
import sphn
import librosa
from transformers import Gemma3nAudioConfig, Gemma3nAudioEncoder
from huggingface_hub import hf_hub_download
```
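Gemma 3n support only landed in recent transformers releases; if the imports above fail, a quick check like this (just a sketch, not tied to a specific minimum version) tells you whether your install is new enough:
```
# Sanity check: the Gemma 3n classes only exist in recent transformers versions.
import transformers

print(transformers.__version__)
assert hasattr(transformers, "Gemma3nAudioConfig"), "transformers is too old for Gemma 3n; upgrade it"
```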
Loading the model:
```
configuration = Gemma3nAudioConfig()
repo_id = "n0mad-0/gemma3n-usm-rip"
filename = "usm.th"
# Download the extracted encoder weights from the Hub
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
# Build the encoder with the default Gemma 3n audio config and load the weights
encoder = Gemma3nAudioEncoder(configuration).cuda()
encoder.load_state_dict(
    torch.load(model_path, weights_only=True, map_location='cuda')
)
```
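The encoder comes up in training mode (PyTorch's default), so for plain feature extraction you probably want eval mode and no gradient tracking; a minimal sketch:
```
# Disable dropout and gradient tracking for inference-only use.
encoder.eval()

# and wrap forward passes like this:
# with torch.inference_mode():
#     emb, mask = encoder(audio_mel, ~audio_mel_mask)
```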
Now we load the audio, build the feature extractor (which prepares mel spectrograms), and run the USM encoder:
```
feature_extractor = Gemma3nAudioFeatureExtractor()  # operates on 30s chunks, expects 16_000 sampling rate
audio, sample_rate = sphn.read("bria.mp3")  # shape: (channels, samples)
audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=feature_extractor.sampling_rate)
audio = audio[:, : 10 * feature_extractor.sampling_rate]  # keep the first 10 seconds
features = feature_extractor(audio)
audio_mel = torch.stack(
    [torch.from_numpy(x) for x in features['input_features']]
).cuda()
audio_mel_mask = torch.stack(
    [torch.from_numpy(x) for x in features['input_features_mask']]
).cuda()
# The feature extractor's mask is True for valid frames, while the encoder expects
# a padding mask (True for padded frames), hence the inversion.
emb, mask = encoder(audio_mel, ~audio_mel_mask)
emb.shape  # torch.Size([1, 63, 1536])
```
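As a quick usage example, the frame embeddings can be pooled into a single clip-level vector, e.g. for rough similarity comparisons between clips (a sketch of my own, not anything prescribed by Gemma 3n; it assumes all returned frames are valid, as in this 10-second clip):
```
import torch.nn.functional as F

# Mean-pool the per-frame embeddings into one clip-level vector and L2-normalize it.
clip_emb = F.normalize(emb.mean(dim=1), dim=-1)  # shape: (1, 1536)

# Cosine similarity between two clips encoded this way:
# similarity = (clip_emb_a * clip_emb_b).sum(dim=-1)
```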