---
license: mit
tags:
- Audio
- SSL
- SSLAM
- AudioEncoder
library_name: transformers
---

# 🔊 [ICLR 2025] SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

🚀 **SSLAM** is a self-supervised learning framework designed to enhance audio representation quality for both **polyphonic (multiple overlapping sounds)** and monophonic soundscapes. Unlike traditional SSL models that focus on monophonic data, SSLAM introduces a novel **source retention loss** and **audio mixture training**, significantly improving performance on real-world polyphonic audio.
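
As a rough intuition for the "audio mixture training" idea, the sketch below combines two mono clips at a random signal-to-noise ratio to form a polyphonic input. This is purely illustrative and is not the exact mixing recipe or loss from the SSLAM paper; the file names and SNR range are placeholders.

```python
import numpy as np
import soundfile as sf

# Illustrative only: sum two mono clips into a polyphonic mixture.
# The file paths and the SNR range are placeholders, not the paper's settings.
wav_a, sr_a = sf.read("clip_a.wav")   # placeholder path
wav_b, sr_b = sf.read("clip_b.wav")   # placeholder path
assert sr_a == sr_b, "clips are assumed to share a sample rate"

# Trim both clips to the same length.
n = min(len(wav_a), len(wav_b))
wav_a, wav_b = wav_a[:n], wav_b[:n]

# Scale clip B to a random SNR relative to clip A, then mix.
snr_db = np.random.uniform(-5, 5)
power_a = np.mean(wav_a ** 2)
power_b = np.mean(wav_b ** 2) + 1e-12
scale = np.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
mixture = wav_a + scale * wav_b
mixture = mixture / (np.abs(mixture).max() + 1e-9)  # avoid clipping
```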

🔗 **[Paper](https://openreview.net/pdf?id=odU59TxdiB) | [ICLR 2025 Poster: Video & Slides](https://iclr.cc/virtual/2025/poster/28347) | [Open Review](https://openreview.net/forum?id=odU59TxdiB) | [🤗 Models](https://huggingface.co/ta012/SSLAM_pretrain)**

# SSLAM Pretrain (ViT Base, 15 epochs)

This repository provides an SSLAM checkpoint formatted for use with Hugging Face Transformers. It is intended for feature extraction in audio LLMs, sound event detection, and general-purpose audio representation learning. The implementation follows the [EAT](https://arxiv.org/abs/2401.03497) code path while swapping in SSLAM pretrained weights.

## 🔧 Usage

You can load and use the model for feature extraction directly via Hugging Face Transformers:

```python
import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoModel

model_id = "ta012/SSLAM_pretrain"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

source_file = "/path/to/input.wav"
target_length = 1024  # recommended: 1024 frames for 10 s of audio
norm_mean = -4.268
norm_std = 4.569

# Load audio, downmix to mono if needed, and resample to 16 kHz
wav, sr = sf.read(source_file)
if wav.ndim > 1:
    wav = wav.mean(axis=1)
waveform = torch.tensor(wav).float().cuda()
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# Remove DC offset and compute a 128-bin log-mel spectrogram
waveform = waveform - waveform.mean()
mel = torchaudio.compliance.kaldi.fbank(
    waveform.unsqueeze(0),
    htk_compat=True,
    sample_frequency=16000,
    use_energy=False,
    window_type='hanning',
    num_mel_bins=128,
    dither=0.0,
    frame_shift=10
).unsqueeze(0)

# Pad or truncate to a fixed number of frames
n_frames = mel.shape[1]
if n_frames < target_length:
    mel = torch.nn.ZeroPad2d((0, 0, 0, target_length - n_frames))(mel)
else:
    mel = mel[:, :target_length, :]

# Normalize and reshape to [1, 1, T, F]
mel = (mel - norm_mean) / (norm_std * 2)
mel = mel.unsqueeze(0).cuda()

# Extract features
with torch.no_grad():
    feat = model.extract_features(mel)

feat = feat.squeeze(0).cpu().numpy()
print(f"Feature shape: {feat.shape}")
```
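
The snippet above returns patch-/frame-level features. For retrieval or a lightweight downstream classifier, a single clip-level embedding is often enough. A minimal sketch, assuming `feat` has shape `[num_patches, hidden_dim]` as printed above (whether the first token is a CLS-style utterance embedding depends on the checkpoint's remote code, so check the printed shape):

```python
import numpy as np

# Clip-level embedding via mean pooling over the patch/frame axis.
# Assumes `feat` from the snippet above, with shape [num_patches, hidden_dim].
clip_embedding = feat.mean(axis=0)

# Example: cosine similarity between two clips embedded this way.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
```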

## 📌 Notes

See the [feature extraction guide](https://github.com/cwx-worst-one/EAT/tree/main/feature_extract) for more instructions.
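
If you need features for many files, a thin wrapper around the steps above is usually enough. The sketch below assumes a hypothetical `wav_to_mel` helper that implements the load/resample/fbank/pad/normalize steps from the usage example and returns a `[1, 1, T, F]` tensor; adjust the paths to your data.

```python
from pathlib import Path

import numpy as np
import torch

def extract_dir(model, wav_dir: str, out_dir: str) -> None:
    """Run SSLAM feature extraction over every .wav file in a directory."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        mel = wav_to_mel(str(wav_path))  # hypothetical helper, see note above
        with torch.no_grad():
            feat = model.extract_features(mel).squeeze(0).cpu().numpy()
        np.save(out / f"{wav_path.stem}.npy", feat)
```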

## 🙌 Acknowledgments

This repository builds on the [EAT](https://github.com/cwx-worst-one/EAT) implementation for Hugging Face models; we simply remap the SSLAM weights to that interface. We are not affiliated with the EAT authors, and all credit for the original implementation belongs to them.

## 📚 Citation

If you find our work useful, please cite it as:

```bibtex
@inproceedings{alex2025sslam,
  title={{SSLAM}: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes},
  author={Tony Alex and Sara Atito and Armin Mustafa and Muhammad Awais and Philip J B Jackson},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=odU59TxdiB}
}
```

Please also cite EAT:

```bibtex
@article{chen2024eat,
  title={EAT: Self-supervised pre-training with efficient audio transformer},
  author={Chen, Wenxi and Liang, Yuzhe and Ma, Ziyang and Zheng, Zhisheng and Chen, Xie},
  journal={arXiv preprint arXiv:2401.03497},
  year={2024}
}
```