This repo contains a model for music generation from images. The generated music is returned in ABC notation and can be played back, for example, in an online ABC player. Note that you may need to correct the BPM (tempo) to make the music sound more logical and natural. The model is a fine-tuned concatenation of two pre-trained models: google/vit-base-patch16-224 as the encoder and sander-wood/text-to-music as the decoder.
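For reference, here is a minimal sketch of how such an encoder-decoder pair can be assembled from the two base checkpoints. This step is illustrative only; the released AnyaSchen/image2music checkpoint already contains the fine-tuned weights:

```python
from transformers import VisionEncoderDecoderModel

# Illustrative only: combine the ViT image encoder with the ABC-notation
# text decoder into a single encoder-decoder model, ready for fine-tuning.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224",  # image encoder
    "sander-wood/text-to-music",    # music (ABC) decoder
)
```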
To use the fine-tuned model:

```python
import requests
import torch
from PIL import Image
from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTImageProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def generate_music(model, feature_extractor, tokenizer, image):
    # Convert the image into the pixel values the ViT encoder expects.
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)
    # Sample ABC tokens from the decoder.
    generated_tokens = model.generate(
        pixel_values,
        max_length=300,
        num_beams=5,
        top_p=0.8,
        temperature=2.0,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Decode the token ids into an ABC-notation string.
    return tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

path = 'AnyaSchen/image2music'
fine_tuned_model = VisionEncoderDecoderModel.from_pretrained(path).to(device)
feature_extractor = ViTImageProcessor.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

url = 'https://anandaindia.org/wp-content/uploads/2018/12/happy-man.jpg'
image = Image.open(requests.get(url, stream=True).raw)
generated_music = generate_music(fine_tuned_model, feature_extractor, tokenizer, image)
print(generated_music)
```
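Since the model does not always emit a sensible tempo, one way to correct the BPM is to rewrite the `Q:` (tempo) header of the generated ABC string. This is a minimal sketch; the `set_tempo` helper is not part of the repo:

```python
import re

def set_tempo(abc: str, bpm: int) -> str:
    # Hypothetical helper: replace an existing Q: (tempo) header, or insert
    # one after the K: (key) line, which is the last line of the ABC header.
    if re.search(r"^Q:", abc, flags=re.MULTILINE):
        return re.sub(r"^Q:.*$", f"Q:1/4={bpm}", abc, flags=re.MULTILINE)
    return re.sub(r"^(K:.*)$", rf"\1\nQ:1/4={bpm}", abc, count=1, flags=re.MULTILINE)

print(set_tempo(generated_music, 120))  # force 120 quarter notes per minute
```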
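To hear the result outside of an online ABC player, the ABC string can be converted to MIDI. One option (not a dependency of this repo) is the third-party music21 library, which can parse ABC data:

```python
from music21 import converter

# Assumes `generated_music` holds a valid ABC string; music21 is installed
# separately (pip install music21) and is not part of this repo.
score = converter.parse(generated_music, format='abc')
score.write('midi', fp='generated.mid')
```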