Running Your Custom LoRA Fine-Tuned MusicGen Large Locally

Community Article · Published December 6, 2024

Fine-tuning MusicGen Large with LoRA (Low-Rank Adaptation) is a practical way to generate music in your own style without retraining the full model. This guide walks you through setting up your environment, fine-tuning your own LoRA adapter, running the resulting model locally, and deploying it behind an API for broader accessibility. Alternatively, you can use a pre-trained LoRA adapter from a platform like Hugging Face.


1. Setting Up Your Environment

Before starting, ensure you have the following:

  • A CUDA-compatible GPU for acceleration.
  • Python 3.8 or later installed.
  • Necessary Python libraries: torch, transformers, peft, soundfile, fastapi, uvicorn.

Install the required libraries using pip:

pip install torch transformers peft soundfile fastapi uvicorn
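
MusicGen Large is very slow on CPU, so before going further it is worth confirming that PyTorch can actually see your GPU:

import torch

# Quick sanity check: confirm CUDA is visible before loading the model.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))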

2. Fine-Tuning Your Own LoRA Model

If you'd like to fine-tune your own LoRA adapter, follow these steps:

  1. Prepare Your Dataset:

    • Collect high-quality audio and corresponding text prompts.
    • Preprocess the data to align audio-text pairs.
  2. Use the PEFT Framework:

    • Train a LoRA adapter using the PEFT library.
    • Save the adapter configuration and weights in a directory.
  3. Test Your Fine-Tuned Model:

    • Load the fine-tuned adapter into a base model for evaluation.

For detailed instructions, refer to the PEFT documentation; a minimal sketch of the adapter setup follows below.
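
The sketch attaches a LoRA adapter to MusicGen Large with PEFT and saves it in the layout the serving script in Section 4 expects. The target_modules names are an assumption based on MusicGen's decoder attention projections, and the actual training loop (audio tokenization, loss computation, optimizer steps) is omitted, so treat this as scaffolding rather than a complete recipe:

import torch
from transformers import AutoModelForTextToWaveform
from peft import LoraConfig, get_peft_model

# Expected dataset layout (one JSON object per line), e.g. train.jsonl:
#   {"audio": "clips/track_001.wav", "text": "upbeat jazz with brushed drums"}

base = AutoModelForTextToWaveform.from_pretrained(
    "facebook/musicgen-large",
    torch_dtype=torch.float16,
)

# Assumed target modules: the decoder's attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank matrices are trainable

# ... your training loop over aligned audio-text pairs goes here ...

# Writes adapter_config.json plus the adapter weights, which Section 4 loads.
model.save_pretrained("/path/to/your/lora")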


3. Using a Pre-Trained LoRA from Hugging Face

If you'd prefer to use a pre-trained LoRA adapter:

  1. Download the Adapter: Visit Hugging Face and search for a MusicGen-compatible LoRA. Download or clone the repository.

  2. Set the Local Repository Path: Update the local_repo_path in the code to point to the directory containing your LoRA adapter files.
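
Both steps can be scripted with huggingface_hub, which is installed as a dependency of transformers. The repo_id below is a placeholder, not a real adapter, so substitute the repository you actually picked:

from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with a real MusicGen-compatible LoRA repository.
local_repo_path = snapshot_download(repo_id="your-username/musicgen-large-lora")
print("Adapter downloaded to:", local_repo_path)  # use this as local_repo_path in Section 4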


4. Running the Model Locally

Below is a Python script that loads your custom LoRA fine-tuned MusicGen Large model locally and exposes it through a FastAPI endpoint.


The Code

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForTextToWaveform, AutoProcessor
import soundfile as sf
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn
import os
import time
import gc
from contextlib import asynccontextmanager

class MusicRequest(BaseModel):
    prompt: str
    duration: int

# Configuration
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
local_repo_path = "/path/to/your/lora"  # Update this path
base_model_name = "facebook/musicgen-large"
model, processor = None, None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model, processor
    try:
        adapter_config_path = os.path.join(local_repo_path, "adapter_config.json")
        if os.path.exists(adapter_config_path):
            # Load the LoRA adapter on top of the base model it was trained from.
            adapter_config = PeftConfig.from_pretrained(local_repo_path)
            base_path = adapter_config.base_model_name_or_path
            base_model = AutoModelForTextToWaveform.from_pretrained(
                base_path,
                torch_dtype=torch.float16,  # half precision to fit the large model on one GPU
            )
            model = PeftModel.from_pretrained(base_model, local_repo_path).to(device)
        else:
            # No adapter found: fall back to the stock MusicGen Large weights.
            base_path = base_model_name
            model = AutoModelForTextToWaveform.from_pretrained(
                base_path,
                torch_dtype=torch.float16,
            ).to(device)

        # The processor (text tokenizer + audio feature extractor) comes from the
        # base model; LoRA only changes the model weights.
        processor = AutoProcessor.from_pretrained(base_path)
        yield
    finally:
        # Release GPU memory on shutdown.
        model, processor = None, None
        gc.collect()
        torch.cuda.empty_cache()
app = FastAPI(lifespan=lifespan)

@app.post("/generate-music/")
async def generate_music(request: MusicRequest):
    if model is None or processor is None:
        raise HTTPException(status_code=500, detail="Model not loaded")
    if request.duration <= 0:
        raise HTTPException(status_code=400, detail="Invalid duration")

    # Preprocess the prompt: prepend structured tags (see Section 6).
    processed_prompt = f"Genre: upbeat; Description: {request.prompt}"

    inputs = processor(text=[processed_prompt], return_tensors="pt").to(device)
    max_new_tokens = int(request.duration * 50)  # MusicGen generates roughly 50 tokens per second
    audio_values = model.generate(
        **inputs,
        do_sample=True,
        guidance_scale=3,
        max_new_tokens=max_new_tokens
    )

    # Post-process audio: the output has shape (batch, channels, samples), so take [0, 0].
    sampling_rate = model.config.audio_encoder.sampling_rate
    audio = audio_values[0, 0].to(torch.float32).cpu().numpy()
    peak = abs(audio).max()
    if peak > 0:
        audio = audio / peak * 0.9  # peak-normalize to 90% of full scale to prevent clipping
    output_path = f"song_{int(time.time())}.wav"
    sf.write(output_path, audio, sampling_rate)
    return {"song": output_path}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

5. Deploying Your API

  1. Run the Script: Save the code from Section 4 as script.py, then start the server:

    python script.py
    
  2. Test the API: Use tools like Postman, curl, or a short Python client (sketched below):

    curl -X POST "http://localhost:8000/generate-music/" -H "Content-Type: application/json" -d '{"prompt": "upbeat jazz", "duration": 10}'
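
If you prefer Python over curl, a minimal client might look like this (it assumes the requests library is installed, e.g. via pip install requests):

import requests

# Minimal client for the /generate-music/ endpoint started above.
resp = requests.post(
    "http://localhost:8000/generate-music/",
    json={"prompt": "upbeat jazz", "duration": 10},
    timeout=600,  # generation can take a while on a single GPU
)
resp.raise_for_status()
print("Saved:", resp.json()["song"])  # path to the .wav file on the server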
    

6. Suggestions for Improving Music Quality

1. Preprocess Prompts

  • Use structured and descriptive prompts to provide clear instructions to the model (e.g., include genres, tempos, or instruments).
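
For example, a small helper that assembles a prompt from structured fields; the field names and wording here are just an illustration, not something MusicGen requires:

def build_prompt(genre, tempo, instruments, mood):
    # Structured descriptors (genre, tempo, instruments, mood) give the model
    # clearer instructions than a bare free-form sentence.
    return (
        f"Genre: {genre}; Tempo: {tempo}; "
        f"Instruments: {', '.join(instruments)}; Mood: {mood}"
    )

print(build_prompt("jazz", "120 bpm", ["upright bass", "brushed drums"], "laid-back"))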

2. Post-Process Audio

  • Normalize the audio output to prevent clipping and balance the volume.
  • Apply low-pass filters or denoise the audio with libraries like librosa or scipy (a sketch follows below).
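
A rough post-processing sketch, using scipy.signal for the filter; the 12 kHz cutoff, filter order, and file names are arbitrary examples, and mono audio (as produced by the script above) is assumed:

import numpy as np
import soundfile as sf
from scipy.signal import butter, filtfilt

def postprocess(path_in, path_out, cutoff_hz=12000):
    audio, sr = sf.read(path_in)  # mono audio assumed
    # Peak-normalize to 90% of full scale so the output never clips.
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio / peak * 0.9
    # 4th-order Butterworth low-pass to soften harsh high-frequency artifacts.
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
    audio = filtfilt(b, a, audio)
    sf.write(path_out, audio, sr)

postprocess("song_1733500000.wav", "song_clean.wav")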

3. Fine-Tune with High-Quality Data

  • Curate datasets that align with your target music style.
  • Focus on high-fidelity recordings with minimal noise for fine-tuning.

4. Experiment with Parameters

  • Adjust guidance_scale to control creativity vs. coherence.
  • Increase max_new_tokens for more complex outputs.

5. Validate Audio Quality

  • Use audio metrics like Signal-to-Noise Ratio (SNR) to ensure the generated music meets quality thresholds.
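
Note that a true SNR needs a clean reference signal, which generated music does not have; a simpler gate on peak level and RMS loudness already catches silent or clipped outputs. A rough sketch, with thresholds that are only illustrative:

import numpy as np
import soundfile as sf

def basic_quality_check(path, min_rms=0.01, max_peak=0.99):
    audio, _ = sf.read(path)
    rms = float(np.sqrt(np.mean(audio ** 2)))
    peak = float(np.abs(audio).max())
    # Reject near-silent clips and clips that slam against full scale.
    ok = rms >= min_rms and peak <= max_peak
    return ok, {"rms": rms, "peak": peak}

print(basic_quality_check("song_clean.wav"))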

6. Add Style Control

  • Fine-tune or use LoRA adapters that allow control over specific styles or instruments for more diverse outputs.

These improvements help produce cleaner, more refined music while giving you greater creative control over the output.