How to Expand Your AI Music Generations from 30 Seconds to Several Minutes

Community Article Published December 13, 2024

Imagine creating a symphony from a simple 30-second audio snippet or turning a brief melody into an entire song. With AI-powered tools like Facebook's MusicGen, this is now possible. In this tutorial, you'll learn how to build an API that takes a short audio file, extends it to several minutes of cohesive music, and post-processes the result for cleaner, more consistent sound.


What You’ll Learn

  • Uploading and Processing Audio: Handle multiple formats such as MP3, FLAC, and WAV.
  • AI-Powered Music Expansion: Extend tracks seamlessly using Facebook’s MusicGen.
  • Ensuring Cohesion: Use the same description (prompt) for the initial and extended audio for better consistency.
  • Post-Processing for Audio Quality: Clean up the generated audio with normalization and filters.
  • Deployment Options: Deploy locally or on RunPod for scalable GPU hosting.

Why Use the Same Prompt for Expansion?

The prompt (or description) plays a crucial role in generating consistent music. When expanding a track, using the same prompt ensures:

  1. Musical Cohesion: The extended segments match the theme, mood, and style of the original audio.
  2. Natural Transitions: Overlapping and blending become smoother with similar soundscapes.
  3. Creative Integrity: Avoids jarring changes in tone or genre between the original and generated sections.
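
In code, keeping cohesion is as simple as passing the same descriptions=[description] list to the initial generate_continuation call and to every extension call that follows. Here is a tiny, self-contained sketch of the idea; the facebook/musicgen-small checkpoint, the 10-second duration, and the silent placeholder prompt are illustrative stand-ins for your real setup:

import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained("facebook/musicgen-small")  # small checkpoint for a quick test
model.set_generation_params(duration=10)

description = "Calm piano with ambient strings"      # one prompt, reused for every segment
prompt = torch.zeros(1, 1, 5 * model.sample_rate)    # 5-second placeholder clip; use your own audio

# The same description on the first pass and on every extension keeps theme, mood, and style aligned
first = model.generate_continuation(prompt, model.sample_rate, descriptions=[description])
tail = first[:, :, -2 * model.sample_rate:]           # last 2 seconds become the next prompt
second = model.generate_continuation(tail, model.sample_rate, descriptions=[description])

The full API below applies exactly this pattern, with the user's uploaded audio as the first prompt.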

Full Code Implementation

Below is the full implementation for your Music Generation API:

from fastapi import FastAPI, HTTPException, UploadFile, File, Form
import uvicorn
import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
from tempfile import NamedTemporaryFile, gettempdir
import logging
import os
import soundfile as sf
from pydub import AudioSegment
from pydub.effects import normalize, high_pass_filter, low_pass_filter
import threading

# Initialize logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

# Lock for thread-safe model access
model_lock = threading.Lock()

# Load the MusicGen model once at startup
def get_musicgen_model():
    logger.info("Loading MusicGen model...")
    model = MusicGen.get_pretrained("facebook/musicgen-large")
    model.set_generation_params(use_sampling=True, top_k=250)
    return model

# Global model instance
model = get_musicgen_model()

@app.post("/extend-audio/")
async def extend_audio(
    total_duration: int = Form(..., gt=0, le=300, description="Desired total duration of the audio in seconds (1-300)."),
    description: str = Form(...),
    segment_duration: int = Form(30, description="Duration of generated segments in seconds (default: 30)."),
    overlap: int = Form(7, description="Overlap duration in seconds for smoother transitions (default: 7)."),
    file: UploadFile = File(...),
):
    try:
        logger.info(f"Extending audio: {total_duration}s with description '{description}'")

        # Save the uploaded file temporarily, keeping its extension so the format can be detected
        suffix = os.path.splitext(file.filename or "")[1]
        with NamedTemporaryFile(delete=False, suffix=suffix) as temp_file:
            temp_file.write(await file.read())
            input_audio_path = temp_file.name

        # Read the audio, converting with pydub if libsndfile can't decode it
        def to_prompt_tensor(audio_array):
            # MusicGen expects a mono waveform shaped (batch, channels, time);
            # downmix stereo (frames, channels) arrays before building the tensor
            if audio_array.ndim == 2:
                audio_array = audio_array.mean(axis=1)
            return torch.tensor(audio_array).float().unsqueeze(0).unsqueeze(0)

        try:
            input_audio, sample_rate = sf.read(input_audio_path)
            input_audio = to_prompt_tensor(input_audio)
        except RuntimeError:
            logger.info("Converting unsupported format to WAV...")
            audio = AudioSegment.from_file(input_audio_path)
            wav_temp_path = f"{gettempdir()}/converted_audio.wav"
            audio.export(wav_temp_path, format="wav")
            input_audio, sample_rate = sf.read(wav_temp_path)
            input_audio = to_prompt_tensor(input_audio)
            os.remove(wav_temp_path)
        finally:
            os.remove(input_audio_path)

        # Generate the first segment (prompt + continuation) in a thread-safe manner
        with model_lock:
            model.set_generation_params(use_sampling=True, top_k=250, duration=segment_duration)
            segment = model.generate_continuation(
                input_audio, sample_rate, descriptions=[description], progress=True
            )

        # Generated audio is at the model's sample rate, not the upload's
        out_sr = model.sample_rate
        remaining = total_duration - segment_duration

        # Generate additional segments, feeding the last `overlap` seconds back in as the prompt
        while remaining > 0:
            last_sec = segment[:, :, -overlap * out_sr:]
            with model_lock:
                next_segment = model.generate_continuation(
                    last_sec, out_sr, descriptions=[description], progress=True
                )
            # Trim the overlapping tail before appending so that region isn't duplicated
            segment = torch.cat([segment[:, :, :-overlap * out_sr], next_segment], dim=2)
            remaining -= (segment_duration - overlap)

        # Save the final audio; audio_write adds the .wav extension and normalizes loudness
        final_audio = segment.detach().cpu().float()[0]
        output_stem = f"extended_audio_{torch.randint(0, 100000, (1,)).item()}"
        output_path = audio_write(output_stem, final_audio, out_sr, strategy="loudness")

        return {"file_path": str(output_path)}

    except Exception as e:
        logger.exception(f"Audio generation failed: {e}")
        raise HTTPException(status_code=500, detail="Audio generation failed.")


if __name__ == "__main__":
    # Lets you start the server with `python main.py`, as described below
    uvicorn.run(app, host="0.0.0.0", port=8000)
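
The endpoint already imports pydub's normalize, high_pass_filter, and low_pass_filter for the post-processing step mentioned earlier, but they are easiest to apply in a small helper of your own. A minimal sketch is shown below; the post_process name and the cutoff frequencies are illustrative choices, not fixed values:

from pydub import AudioSegment
from pydub.effects import normalize, high_pass_filter, low_pass_filter

def post_process(wav_path: str, low_cut_hz: int = 40, high_cut_hz: int = 15000) -> str:
    # Load the generated file, trim rumble and hiss, then normalize the level
    audio = AudioSegment.from_wav(wav_path)
    audio = high_pass_filter(audio, low_cut_hz)   # remove sub-bass rumble below low_cut_hz
    audio = low_pass_filter(audio, high_cut_hz)   # soften harsh high-frequency artifacts
    audio = normalize(audio)                      # raise the peak level close to 0 dBFS
    audio.export(wav_path, format="wav")
    return wav_path

Calling post_process(str(output_path)) just before the return statement in /extend-audio/ keeps the response format unchanged while cleaning up the file on disk.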

Key Features of the API

  1. Multi-Format Upload
    Handles MP3, FLAC, WAV, and more by converting to WAV when necessary.

  2. Seamless Expansion
    Generates additional segments with overlapping transitions for cohesion.

  3. Customizable Output
    Set segment duration, overlap, and total length.

  4. Post-Processing
    Normalizes loudness when writing the output file; the pydub sketch above adds optional frequency filtering.

  5. Thread-Safe Model Access
    A global lock serializes access to the shared MusicGen model, so concurrent requests don't interfere with each other.


How to Use the API Locally

  1. Install Dependencies

    pip install fastapi uvicorn torch soundfile pydub audiocraft
    # pydub also needs ffmpeg installed on your system to convert MP3/FLAC uploads
    
  2. Run the Server

    python main.py
    
  3. Send a Test Request
    Use curl to send a POST request:

    curl -X POST "http://127.0.0.1:8000/extend-audio/" \
    -F "total_duration=120" \
    -F "description='Calm piano with ambient strings'" \
    -F "file=@path_to_audio.wav"
    

Deployment on RunPod

Why RunPod?

RunPod is an excellent platform for GPU-powered deployments. It offers affordable, scalable GPU hosting for AI models like MusicGen.

Steps to Deploy

  1. Create a GPU Instance
    Visit RunPod and set up a GPU environment.

  2. Prepare a Dockerfile

    FROM python:3.9-slim
    
    # ffmpeg is required by pydub to decode MP3/FLAC uploads
    RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg && \
        rm -rf /var/lib/apt/lists/*
    
    WORKDIR /app
    
    # A minimal requirements.txt is sketched after these steps
    COPY requirements.txt requirements.txt
    RUN pip install --no-cache-dir -r requirements.txt
    
    COPY . .
    
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
    
  3. Build and Run the Docker Image

    docker build -t musicgen-api .
    docker run --gpus all -p 8000:8000 musicgen-api
    
  4. Access the API
    Use the public IP provided by RunPod to interact with your API.
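
The Dockerfile copies a requirements.txt that isn't shown above; a minimal one that mirrors the pip install command from the local setup could be:

fastapi
uvicorn
torch
soundfile
pydub
audiocraft

Pin exact versions once you have a combination that works on your GPU image, so rebuilds stay reproducible.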


Best Practices for Expanding Audio

  1. Re-use the Same Prompt
    Consistency in prompts ensures the generated audio aligns seamlessly with the original.

  2. Adjust Overlap for Smooth Transitions
    Experiment with overlap values (default: 7 seconds) to minimize artifacts during transitions.

  3. Pre-Process Input Audio
    Ensure your input audio is clean and normalized for the best output quality.

  4. Monitor Model Parameters
    Fine-tune MusicGen's sampling parameters, such as top_k and temperature, to balance creativity and coherence; see the sketch after this list.
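
All of these knobs are set through set_generation_params. Here is a short sketch of the parameters MusicGen exposes; the values shown are illustrative starting points rather than recommendations from this article:

from audiocraft.models import MusicGen

model = MusicGen.get_pretrained("facebook/musicgen-large")

model.set_generation_params(
    use_sampling=True,   # sample instead of greedy decoding (as in the API above)
    top_k=250,           # consider only the 250 most likely tokens at each step
    top_p=0.0,           # 0.0 disables nucleus sampling; e.g. 0.9 enables it
    temperature=1.0,     # above 1.0 is more adventurous, below 1.0 more conservative
    cfg_coef=3.0,        # classifier-free guidance: higher sticks closer to the prompt
    duration=30,         # seconds generated per call
)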


Final Thoughts

This API empowers creators, musicians, and developers to extend their short audio tracks into beautiful, cohesive compositions. Whether you're producing a full soundtrack or exploring AI's creative potential, this tutorial equips you with the tools to get started.

For more AI innovations, check out my projects on Hugging Face. Feel free to connect with me on LinkedIn and share your creations! 🎶✨