---
license: gemma
---

# Background: USM Encoder extracted from Gemma 3n model
Gemma 3n is able to process audio inputs. That is achieved by encoding audio with a Universal Speech Model (USM, https://arxiv.org/abs/2303.01037) encoder.
This encoder operates at roughly 6.25 frames per second (one frame per 160 ms of audio), and each frame is a continuous embedding with a dimensionality of 1536.
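For reference, a rough estimate of the output shape for a clip of a given length, using only the numbers quoted above (the exact frame count can differ slightly because of padding and subsampling inside the encoder):
```
# Back-of-the-envelope estimate of the USM output shape for a clip,
# using the ~6.25 frames/s and 1536-dim figures quoted above.
FRAMES_PER_SECOND = 6.25
EMBED_DIM = 1536

def approx_output_shape(duration_s: float) -> tuple[int, int]:
    return (int(duration_s * FRAMES_PER_SECOND), EMBED_DIM)

print(approx_output_shape(10.0))  # (62, 1536) -- close to the (63, 1536) produced in the example below
```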
# This repo
To facilitate experimentation with this encoder, I've extracted the audio encoder weights from the full Gemma 3n model,
so that the encoder can be used on its own. The weights come from this [HF Gemma3n repo](https://huggingface.co/google/gemma-3n-E4B-it).
Some imports:
```
import torch
from transformers.models.gemma3n.feature_extraction_gemma3n import Gemma3nAudioFeatureExtractor
import sphn
import librosa
from transformers import Gemma3nAudioConfig, Gemma3nAudioEncoder
from huggingface_hub import hf_hub_download
```
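Gemma 3n support only landed in recent transformers releases; if the imports above fail, a quick check like this (just a sketch, not tied to a specific minimum version) tells you whether your install is new enough:
```
# Sanity check: the Gemma 3n classes only exist in recent transformers versions.
import transformers

print(transformers.__version__)
assert hasattr(transformers, "Gemma3nAudioConfig"), "transformers is too old for Gemma 3n; upgrade it"
```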
Loading the model:
```
configuration = Gemma3nAudioConfig()
repo_id = "n0mad-0/gemma3n-usm-rip"
filename = "usm.th"
# Download the extracted encoder weights from the Hub
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
# Build the encoder with the default Gemma 3n audio config and load the weights
encoder = Gemma3nAudioEncoder(configuration).cuda()
encoder.load_state_dict(
    torch.load(model_path, weights_only=True, map_location='cuda')
)
```
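The encoder comes up in training mode (PyTorch's default), so for plain feature extraction you probably want eval mode and no gradient tracking; a minimal sketch:
```
# Disable dropout and gradient tracking for inference-only use.
encoder.eval()

# and wrap forward passes like this:
# with torch.inference_mode():
#     emb, mask = encoder(audio_mel, ~audio_mel_mask)
```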
Now we load the audio, build the feature extractor (which prepares mel spectrograms), and run the USM encoder:
```
feature_extractor = Gemma3nAudioFeatureExtractor()  # operates on 30s chunks, expects 16_000 sampling rate
audio, sample_rate = sphn.read("bria.mp3")  # shape: (channels, samples)
audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=feature_extractor.sampling_rate)
audio = audio[:, : 10 * feature_extractor.sampling_rate]  # keep the first 10 seconds
features = feature_extractor(audio)
audio_mel = torch.stack(
    [torch.from_numpy(x) for x in features['input_features']]
).cuda()
audio_mel_mask = torch.stack(
    [torch.from_numpy(x) for x in features['input_features_mask']]
).cuda()
# The feature extractor's mask is True for valid frames, while the encoder expects
# a padding mask (True for padded frames), hence the inversion.
emb, mask = encoder(audio_mel, ~audio_mel_mask)
emb.shape  # torch.Size([1, 63, 1536])
```
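As a quick usage example, the frame embeddings can be pooled into a single clip-level vector, e.g. for rough similarity comparisons between clips (a sketch of my own, not anything prescribed by Gemma 3n; it assumes all returned frames are valid, as in this 10-second clip):
```
import torch.nn.functional as F

# Mean-pool the per-frame embeddings into one clip-level vector and L2-normalize it.
clip_emb = F.normalize(emb.mean(dim=1), dim=-1)  # shape: (1, 1536)

# Cosine similarity between two clips encoded this way:
# similarity = (clip_emb_a * clip_emb_b).sum(dim=-1)
```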