---
license: gemma
---

# Background: USM encoder extracted from the Gemma 3n model

Gemma 3n can process audio inputs. It does so by encoding audio with a Universal Speech Model (USM, https://arxiv.org/abs/2303.01037) encoder. The encoder operates at 6.25 frames per second (one frame per 160 ms of audio), and each frame is a continuous embedding with a dimensionality of 1536.

# This repo

To facilitate experimentation with this encoder, I've extracted the audio encoder's weights from the full Gemma 3n model, so that the encoder can be used on its own. The weights come from this [HF Gemma 3n repo](https://huggingface.co/google/gemma-3n-E4B-it).

Some imports:

```
import torch
import sphn
import librosa
from huggingface_hub import hf_hub_download
from transformers import Gemma3nAudioConfig, Gemma3nAudioEncoder
from transformers.models.gemma3n.feature_extraction_gemma3n import Gemma3nAudioFeatureExtractor
```

Loading the model:

```
configuration = Gemma3nAudioConfig()
repo_id = "n0mad-0/gemma3n-usm-rip"
filename = "usm.th"
model_path = hf_hub_download(repo_id=repo_id, filename=filename)

encoder = Gemma3nAudioEncoder(configuration).cuda()
encoder.load_state_dict(
    torch.load(model_path, weights_only=True, map_location='cuda')
)
encoder.eval()  # inference only: disable dropout
```

Now we load the audio, build the feature extractor (which prepares the mel spectrograms), and run the USM encoder:

```
feature_extractor = Gemma3nAudioFeatureExtractor()  # operates on 30s chunks, expects 16_000 sampling rate

audio, sample_rate = sphn.read("bria.mp3")  # (channels, samples)
audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=feature_extractor.sampling_rate)
audio = audio[:, : 10 * feature_extractor.sampling_rate]  # keep the first 10 seconds

features = feature_extractor(audio)
audio_mel = torch.stack(
    [torch.from_numpy(x) for x in features['input_features']]
).cuda()
audio_mel_mask = torch.stack(
    [torch.from_numpy(x) for x in features['input_features_mask']]
).cuda()

with torch.no_grad():
    emb, mask = encoder(audio_mel, ~audio_mel_mask)  # the encoder expects a padding mask, hence the inversion
emb.shape  # torch.Size([1, 63, 1536])
```

Note the inverted mask: the feature extractor returns a validity mask (`True` = real frame), while `Gemma3nAudioEncoder` expects a padding mask (`True` = padded frame), so the mask has to be inverted before the call; the full Gemma 3n model does the same inversion when calling its audio tower. The output shape matches the frame rate: 10 s of audio yields 63 ≈ 10 × 6.25 frames.
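As a quick sanity check on that frame rate, the arithmetic below reproduces the observed 63 frames. The 10 ms mel hop and the overall 16x time reduction (two stride-2 subsampling convolutions, then a conformer reduction factor of 4) are my reading of the default `Gemma3nAudioConfig`, so treat them as assumptions:

```
import math

seconds = 10
sampling_rate = 16_000
hop_length = 160             # 10 ms mel hop (assumed default)
time_reduction = 2 * 2 * 4   # two stride-2 convs, then 4x conformer reduction (assumed)

mel_frames = seconds * sampling_rate // hop_length       # 1000 mel frames
encoder_frames = math.ceil(mel_frames / time_reduction)  # ceil(62.5) = 63, matching emb.shape[1]
frames_per_second = mel_frames / time_reduction / seconds  # 6.25
```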
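For audio longer than the feature extractor's 30 s window, a minimal sketch is to encode 30 s chunks independently and concatenate the results; chunk boundaries lose cross-chunk context, and `encode_long_audio` is a hypothetical helper, not part of this repo. I'm also assuming the mask returned by the encoder marks padded frames (mirroring the inverted input mask), which is used here to drop padding from the final, shorter chunk:

```
import numpy as np

def encode_long_audio(audio: np.ndarray, chunk_seconds: int = 30) -> torch.Tensor:
    """Encode (channels, samples) audio of arbitrary length, 30 s at a time."""
    sr = feature_extractor.sampling_rate
    chunk_len = chunk_seconds * sr
    pieces = []
    for start in range(0, audio.shape[-1], chunk_len):
        feats = feature_extractor(audio[..., start : start + chunk_len])
        mel = torch.stack([torch.from_numpy(x) for x in feats['input_features']]).cuda()
        mel_mask = torch.stack([torch.from_numpy(x) for x in feats['input_features_mask']]).cuda()
        with torch.no_grad():
            emb, out_mask = encoder(mel, ~mel_mask)
        pieces.append(emb[:, ~out_mask[0]])  # assumed: True in out_mask marks padded frames
    return torch.cat(pieces, dim=1)  # (1, total_frames, 1536)
```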
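For completeness, here's a sketch of how the extracted weights could be checked against the audio tower of the full model. It requires downloading the full checkpoint (and accepting its license), and it assumes the tower is reachable as `model.model.audio_tower`; attribute paths may differ across `transformers` versions:

```
from transformers import Gemma3nForConditionalGeneration

full = Gemma3nForConditionalGeneration.from_pretrained("google/gemma-3n-E4B-it")
ref = full.model.audio_tower.state_dict()  # assumed attribute path
mine = encoder.state_dict()

assert ref.keys() == mine.keys()
# cast to float32 in case the two checkpoints were stored in different dtypes
assert all(torch.equal(ref[k].float(), mine[k].cpu().float()) for k in ref)
```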