How to Expand Your 30-Second AI Music Generations to Several Minutes
Imagine creating a symphony from a simple 30-second audio snippet or turning a brief melody into an entire song. With AI-powered tools like Facebook's MusicGen, this is now possible. In this tutorial, you'll learn how to build an API that can take a short audio file, extend it to several minutes of cohesive music, and process it to professional-grade quality.
What You’ll Learn
- Uploading and Processing Audio: Handle multiple formats such as MP3, FLAC, and WAV.
- AI-Powered Music Expansion: Extend tracks seamlessly using Facebook’s MusicGen.
- Ensuring Cohesion: Use the same description (prompt) for the initial and extended audio for better consistency.
- Post-Processing for Audio Quality: Clean up the generated audio with normalization and filters.
- Deployment Options: Deploy locally or on RunPod for scalable GPU hosting.
Why Use the Same Prompt for Expansion?
The prompt (or description) steers the style of everything the model generates. When expanding a track, reusing the original prompt supports:
- Musical Cohesion: The extended segments match the theme, mood, and style of the original audio.
- Natural Transitions: Overlapping and blending become smoother with similar soundscapes.
- Creative Integrity: Avoids jarring changes in tone or genre between the original and generated sections.
Full Code Implementation
Below is the full implementation for your Music Generation API:
from fastapi import FastAPI, HTTPException, UploadFile, File, Form
import uvicorn
import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
from audiocraft.data.audio_utils import convert_audio
from tempfile import NamedTemporaryFile
import logging
import os
import soundfile as sf
from pydub import AudioSegment
from pydub.effects import normalize, high_pass_filter, low_pass_filter  # for optional post-processing
import threading

# Initialize logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

# Lock for thread-safe model access
model_lock = threading.Lock()


# Function to initialize the model
def get_musicgen_model():
    logger.info("Loading MusicGen model...")
    model = MusicGen.get_pretrained("facebook/musicgen-large")
    model.set_generation_params(use_sampling=True, top_k=250)
    return model


# Global model instance
model = get_musicgen_model()


def load_audio(path):
    """Read an audio file as a float tensor of shape [channels, samples]."""
    data, sample_rate = sf.read(path, dtype="float32", always_2d=True)
    return torch.from_numpy(data).t().contiguous(), sample_rate


@app.post("/extend-audio/")
async def extend_audio(
    total_duration: int = Form(..., gt=0, le=300, description="Desired total duration of the audio in seconds (1-300)."),
    description: str = Form(...),
    segment_duration: int = Form(30, gt=0, description="Duration of generated segments in seconds (default: 30)."),
    overlap: int = Form(7, gt=0, description="Overlap duration in seconds for smoother transitions (default: 7)."),
    file: UploadFile = File(...),
):
    # Each pass adds (segment_duration - overlap) seconds, so the overlap must be shorter
    if overlap >= segment_duration:
        raise HTTPException(status_code=422, detail="overlap must be shorter than segment_duration.")
    try:
        logger.info(f"Extending audio: {total_duration}s with description '{description}'")

        # Save the uploaded file temporarily
        with NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(await file.read())
            input_audio_path = temp_file.name

        # Read and convert audio if necessary
        try:
            input_audio, sample_rate = load_audio(input_audio_path)
        except RuntimeError:
            # soundfile could not decode the upload, so let pydub (ffmpeg) convert it
            logger.info("Converting unsupported format to WAV...")
            wav_temp_path = f"{input_audio_path}.wav"  # unique per request
            AudioSegment.from_file(input_audio_path).export(wav_temp_path, format="wav")
            try:
                input_audio, sample_rate = load_audio(wav_temp_path)
            finally:
                os.remove(wav_temp_path)
        finally:
            os.remove(input_audio_path)

        # MusicGen returns audio at its own sample rate, so convert the upload to the
        # model's rate and channel count first; shape becomes [1, channels, samples]
        model_rate = model.sample_rate
        segment = convert_audio(input_audio, sample_rate, model_rate, model.audio_channels).unsqueeze(0)
        overlap_samples = overlap * model_rate
        target_samples = total_duration * model_rate

        # Generate additional segments until the requested length is reached
        while segment.shape[-1] < target_samples:
            last_sec = segment[:, :, -overlap_samples:]
            with model_lock:  # generate in a thread-safe manner
                model.set_generation_params(use_sampling=True, top_k=250, duration=segment_duration)
                next_segment = model.generate_continuation(
                    last_sec, model_rate, descriptions=[description], progress=True
                )
            # The continuation starts with its prompt, so drop the overlap before joining
            next_segment = next_segment.detach().cpu().float()
            segment = torch.cat([segment[:, :, :-overlap_samples], next_segment], dim=2)

        # Trim to the requested length and save the final audio
        final_audio = segment[0, :, :target_samples]
        stem = f"extended_audio_{torch.randint(0, 100000, (1,)).item()}"
        output_path = audio_write(stem, final_audio, model_rate)  # audio_write appends ".wav"
        return {"file_path": str(output_path)}
    except Exception as e:
        logger.error(f"Error: {e}")
        raise HTTPException(status_code=500, detail="Audio generation failed.")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
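The arithmetic behind the extension loop is easy to check by hand. The following is a rough sketch, not part of the API: it assumes every continuation pass returns segment_duration seconds of audio that begin with the overlap-second prompt, so each pass contributes segment_duration minus overlap new seconds. The function name `continuation_passes` and its parameter names are hypothetical.

```python
import math


def continuation_passes(input_seconds, total_seconds, segment_seconds=30, overlap_seconds=7):
    """Estimate how many continuation passes are needed to reach total_seconds.

    Assumption: each pass yields segment_seconds of audio whose first
    overlap_seconds repeat the prompt, so only the remainder is new material.
    """
    if not 0 < overlap_seconds < segment_seconds:
        raise ValueError("overlap must be positive and shorter than the segment")
    remaining = total_seconds - input_seconds
    if remaining <= 0:
        return 0  # the input is already long enough
    gained_per_pass = segment_seconds - overlap_seconds
    return math.ceil(remaining / gained_per_pass)


# Extending a 30-second clip to 120 seconds with the default 30/7 settings:
# 90 seconds are missing and each pass adds 23, so 4 passes are needed.
print(continuation_passes(30, 120))
```

A larger overlap gives the model more context for each transition but increases the number of passes, and therefore the generation time.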
Key Features of the API
- Multi-Format Upload: Handles MP3, FLAC, WAV, and more by converting to WAV when necessary.
- Seamless Expansion: Generates additional segments with overlapping transitions for cohesion.
- Customizable Output: Set segment duration, overlap, and total length.
- Post-Processing: pydub's normalize, high_pass_filter, and low_pass_filter are imported so the output can be cleaned up with normalization and frequency filtering.
- Thread-Safe Model Access: Serializes generation with a lock so concurrent requests do not interfere with each other.
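To make the normalization step concrete, here is a minimal, library-free sketch of peak normalization. It is an illustration only, not the code the API runs: `peak_normalize` and `headroom_db` are hypothetical names, and the samples are assumed to be floats in the range -1.0 to 1.0. In the API itself, pydub's `normalize` would play this role on an `AudioSegment`.

```python
def peak_normalize(samples, headroom_db=0.1):
    """Scale samples so the loudest one sits headroom_db below full scale (1.0)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    target_peak = 10 ** (-headroom_db / 20)  # convert dB of headroom to linear gain
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet_clip = [0.0, 0.1, -0.25, 0.2]
louder_clip = peak_normalize(quiet_clip, headroom_db=1.0)
print(max(abs(s) for s in louder_clip))  # close to 10 ** (-1 / 20)
```

Every sample is multiplied by the same gain, so the relative dynamics of the clip are preserved; only the overall level changes.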
How to Use the API Locally
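Install Dependencies
pip install fastapi uvicorn python-multipart torch soundfile pydub audiocraft
FastAPI needs python-multipart to parse form fields and file uploads, and pydub needs ffmpeg installed on the system to decode formats such as MP3.
Run the Server
Save the code above as main.py, then start it with uvicorn:
uvicorn main:app --host 127.0.0.1 --port 8000
Send a Test Request
Use curl to send a POST request:
curl -X POST "http://127.0.0.1:8000/extend-audio/" \
  -F "total_duration=120" \
  -F "description=Calm piano with ambient strings" \
  -F "file=@path_to_audio.wav"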
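The same request can also be sent from Python. The sketch below uses only the standard library and builds the multipart/form-data body by hand; the helper names `build_multipart` and `request_extension` are hypothetical, and the URL and form field names are taken from the endpoint shown earlier.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode())
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def request_extension(url, audio_path, description, total_duration=120):
    """POST an audio file to the extend-audio endpoint and return the response text."""
    with open(audio_path, "rb") as audio_file:
        file_bytes = audio_file.read()
    body, content_type = build_multipart(
        {"total_duration": total_duration, "description": description},
        "file", os.path.basename(audio_path), file_bytes,
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

With the server running locally, a call would look like `request_extension("http://127.0.0.1:8000/extend-audio/", "path_to_audio.wav", "Calm piano with ambient strings")`. Generation can take a long time, so expect the call to block until the server has finished.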
Deployment on RunPod
Why RunPod?
RunPod is an excellent platform for GPU-powered deployments. It offers affordable, scalable GPU hosting for AI models like MusicGen.
Steps to Deploy
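Create a GPU Instance
Visit RunPod and set up a GPU environment.
Prepare a Dockerfile
The requirements.txt file should list the same packages as the local install step.
FROM python:3.9-slim
# pydub relies on ffmpeg to decode formats such as MP3
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Build and Run the Docker Image
docker build -t musicgen-api .
docker run -p 8000:8000 musicgen-api
Access the API
Use the public IP provided by RunPod to interact with your API.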
Best Practices for Expanding Audio
- Re-use the Same Prompt: Consistency in prompts ensures the generated audio aligns seamlessly with the original.
- Adjust Overlap for Smooth Transitions: Experiment with overlap values (default: 7 seconds) to minimize artifacts during transitions.
- Pre-Process Input Audio: Ensure your input audio is clean and normalized for the best output quality.
- Monitor Model Parameters: Fine-tune MusicGen's parameters, like top_k, to balance creativity and coherence.
Final Thoughts
This API empowers creators, musicians, and developers to extend their short audio tracks into beautiful, cohesive compositions. Whether you're producing a full soundtrack or exploring AI's creative potential, this tutorial equips you with the tools to get started.
For more AI innovations, check out my projects on Hugging Face. Feel free to connect with me on LinkedIn and share your creations! 🎶✨