Do you support any audio+text->audio+text models?

#4
by TimeLordRaps - opened

Looking for a model that is instruction-tuned to edit audio. An easy training set would be programmatic edits on a standard seed bank via agentic audio editing, which I set up for myself 8 months ago with Claude 3.5 Sonnet as the driver, so I assume a model trained with this methodology should exist by now... if not, there aren't enough musicians in AI, and if that's the case, I'm already consciously working on the next cohort. I just don't have the editing skills to generate a high-quality dataset... if I did, I would have.

Simple process for those looking down the line:
1. Give the LM tools to edit audio through existing Python audio-editing libraries, or programmatic DAWs.
2. Ask it questions in natural language like "make this smoother", "make this faster", you name it.
3. Save in-audio, in-text, out-audio, and out-text, where out-text is the text generated as part of the AI's message, not including the tool use.
4. Extend an audio+text->text architecture to predict audio tokens as well.
5. SFT the new architecture on the audio-edit dataset.
6. Preference-optimize the new model with a regular audio+text->text model like the one I'm posting this on.
7. Voilà: you've made an AI-aligned, music-preference-optimized audio+text->audio+text model.
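The data-collection part of the process above can be sketched in a few lines. Everything here is illustrative: the tool functions (`speed_up`, `smooth`), the `record_edit` helper, and the toy sample-list "audio" are all made up for this sketch; a real setup would wrap pydub, librosa, or a programmatic DAW and store actual waveforms or audio tokens.

```python
import json
import math

# Hypothetical tools the LM can call; raw sample lists stand in for audio.
def speed_up(samples, factor=2):
    """Naive 'make this faster': keep every `factor`-th sample."""
    return samples[::factor]

def smooth(samples, window=3):
    """Naive 'make this smoother': moving average over a small window."""
    half = window // 2
    return [
        sum(samples[max(0, i - half): i + half + 1])
        / len(samples[max(0, i - half): i + half + 1])
        for i in range(len(samples))
    ]

TOOLS = {"speed_up": speed_up, "smooth": smooth}

def record_edit(in_audio, in_text, tool_name, out_text, **tool_kwargs):
    """Run one tool call and save the four-field training record:
    (in-audio, in-text, out-audio, out-text). The tool-use details are
    deliberately excluded from out_text, per step 3."""
    out_audio = TOOLS[tool_name](in_audio, **tool_kwargs)
    return {
        "in_audio": in_audio,
        "in_text": in_text,
        "out_audio": out_audio,
        "out_text": out_text,
    }

# Toy waveform standing in for a seed-bank clip.
clip = [math.sin(i / 4) for i in range(16)]
rec = record_edit(clip, "make this smoother", "smooth",
                  "I applied gentle smoothing to reduce harshness.")
print(json.dumps({k: (v if isinstance(v, str) else len(v))
                  for k, v in rec.items()}))
```

The point is only the record shape: each natural-language request plus one tool invocation yields one (in-audio, in-text, out-audio, out-text) tuple for the SFT stage.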

Other possible training data sequences:
- audio (no vocals) + text (lyrics) -> audio (with the text as vocals) + text describing next possible edit-task choices. For audio compilation.
- audio (with vocals) + text ("remove the lyrics") -> audio (without vocals) + text containing the transcribed lyrics and the style of the music. (Just for back-translation of the prior task; easily invertible.)
- no audio + text (lyrics + style) -> audio (with lyrics and style) + text describing the model's thoughts on the generated tokens. This involves having the model listen to the audio after it has generated it, which is a very interesting concept architecturally when you really think about it...
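Since the vocal-removal sequence is just a back-translation of the vocal-addition one, every forward example can be inverted mechanically into a second training example. A minimal sketch, with made-up field names and placeholder string values standing in for real audio:

```python
def invert_vocal_add(record):
    """Back-translate an 'add vocals' example into a 'remove the lyrics'
    example. Field names are illustrative, not a fixed schema: the forward
    record's output mix becomes the backward record's input, and the
    lyrics/style metadata becomes the backward text target."""
    return {
        "in_audio": record["out_audio"],   # mix with vocals
        "in_text": "remove the lyrics",
        "out_audio": record["in_audio"],   # original instrumental
        "out_text": (
            f"Transcribed lyrics: {record['in_text']}. "
            f"Style: {record.get('style', 'unknown')}."
        ),
    }

# Placeholder forward example (strings stand in for audio clips/tokens).
forward = {
    "in_audio": "instrumental.wav",
    "in_text": "la la la",                 # lyrics prompt
    "out_audio": "mix_with_vocals.wav",
    "out_text": "Next possible edits: add reverb, raise tempo.",
    "style": "lo-fi",
}
backward = invert_vocal_add(forward)
```

One forward generation thus yields two supervised pairs for free, which is the whole appeal of picking easily invertible tasks.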
