Trying to download the model locally and use it

#2 by lakshmiu - opened

Can you add some example code of how all the audio features could be used?

Fixie.ai org
edited Jul 12

Trying to download the model locally and use it

We recently added a pipeline; try using that:

# pip install transformers peft librosa

import transformers
import numpy as np
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_2', trust_remote_code=True)

path = "<path-to-input-audio>"  # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)

pipe({'audio': audio, 'prompt': '<|audio|>', 'sampling_rate': sr}, max_new_tokens=30)

how all the audio features could be used?

I'm not sure what you mean by this.

Updated: audio_array -> audio

OK, I will give it a try with the pipeline.

I wanted to achieve something like what you had with the Messages API:

{
  "model": "fixie-ai/ultravox-v0.2",
  "messages": [{
    "role": "user",
    "content": [{
      "type": "text",
      "text": "Listen to the following audio and respond accordingly:"
    }, {
      "type": "image_url",
      "image_url": {
        "url": "data:audio/wav;base64,{base64_wav}"
      }
    }]
  }],
  "stream": true
}
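
For reference, here is roughly how that request body could be built in Python; a hedged sketch, assuming a WAV file at a placeholder path ("input.wav") and some server that accepts this schema:

import base64
import json

# "input.wav" is a placeholder for the audio file to send
with open("input.wav", "rb") as f:
    base64_wav = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "fixie-ai/ultravox-v0.2",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Listen to the following audio and respond accordingly:"},
            {"type": "image_url",
             "image_url": {"url": f"data:audio/wav;base64,{base64_wav}"}},
        ],
    }],
    "stream": True,
}
body = json.dumps(payload)  # POST this to whatever endpoint hosts the model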

Secondly, I wanted to find out if I can query for other features, like:

instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]"

Thanks for the quick response

I have added some comments to your code snippet:
import transformers
import numpy as np
import librosa

# Had to pass "feature-extraction" in the pipeline
pipe = transformers.pipeline("feature-extraction", model='fixie-ai/ultravox-v0_2', trust_remote_code=True)

path = "" # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)

# Do I need to convert the audio into audio_array?
pipe({'audio': audio_array, 'prompt': '<|audio|>', 'sampling_rate': sr}, max_new_tokens=30)

Fixie.ai org

# Do I need to convert the audio into audio_array?

No need. I hastily changed the code on the fly and missed this. If you use librosa.load, the first element of the returned tuple is the audio array.

# Had to pass "feature-extraction" in the pipeline

That should not be required. The feature-extraction pipeline will not work correctly. What is the error you get? And what version of the transformers library are you using?
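
You can print the installed version like this:

import transformers
print(transformers.__version__)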

instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]"

Ultravox doesn't support any paralinguistic information in this version.

"stream": true

The pipeline only works for batch inference. We'll explore later whether HF pipeline supports this feature. That ability is only available through the repo at the moment (see infer_tool.py).
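
For reference, here is what token streaming generally looks like with the transformers library; a generic text-only sketch using a small stand-in model (gpt2), not Ultravox-specific:

from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt")
# generate() blocks, so run it in a thread and consume tokens as they arrive
thread = Thread(target=model.generate, kwargs={**inputs, "streamer": streamer, "max_new_tokens": 30})
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()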

All that said, it seems to me like you want a hosted version of Ultravox API. We are working with Baseten to provide easier access to the API through a hosted service: https://www.baseten.co/library/ultravox/

No, I don't need a hosted version. I want to run the model locally for now.

import transformers
import numpy as np
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_2', trust_remote_code=True, device=0)

path = "blue.mp4" # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)

output = pipe({'audio': audio, 'prompt':'<|audio|>', 'sampling_rate': sr}, max_new_tokens=30)
print(output)

I don't see any output. The audio file blue.mp4 asks: "Why is the sky blue?"

Fixie.ai org

Not sure why you're having issues. I ran basically the exact same code inside a Google Colab (though I needed to beef up the default machine) and it worked as expected:

[Screenshot of the Colab run showing the pipeline's output]

Fixie.ai org
edited Jul 12

Btw, you can define conversation turns instead of a prompt too:

turns = [
    {"role": "system", "content": "You are a friendly and helpful character. You love to answer questions for people."},
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
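
To carry the conversation forward, something like this should work; an untested sketch that assumes the pipeline returns plain text and accepts prior assistant turns, with "followup.wav" as a placeholder path:

first = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)

# Assumes the output is a plain string; append it as an assistant turn, then follow up
turns.append({"role": "assistant", "content": first})
audio2, sr2 = librosa.load("followup.wav", sr=16000)  # placeholder second clip
second = pipe({'audio': audio2, 'turns': turns, 'sampling_rate': sr2}, max_new_tokens=30)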

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/home/ubuntu/ultravox/test.py:10: UserWarning: PySoundFile failed. Trying audioread instead.
audio, sr = librosa.load(path, sr=16000)
/home/ubuntu/ultravox/venv_u/lib/python3.10/site-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
Deprecated as of librosa version 0.10.0.
It will be removed in librosa version 1.0.
y, sr_native = __audioread_load(path, offset, duration, dtype)
Hi there! I'm so happy to help you with any questions or concerns you may have. Please feel free to ask me anything, and I'll

I don't get the expected output.

Can you list the versions of the libraries you are using?

Fixie.ai org
edited Jul 14

Okay, so you are getting an output; it's just not a good one.

Can you folks list the exact versions of the libraries that you have used?

I'm getting an error trying to use the model:
ValueError: Whisper expects the mel input features to be of length 3000 or less, but found 25603. Make sure to pad the input mel features to 3000.

Fixie.ai org
edited Jul 30

What's the code you are using to download the model?
That error hints that our whisper_model_modified.py is not being used.

Actually, 25603 is huge! The model can currently only handle audio that is less than 30 seconds long. (Whisper's mel features run at 100 frames per second, so 25603 frames is about 256 seconds of audio.) It's not hard to increase that limit a bit, but a 256-second clip doesn't fit the use-case that Ultravox was designed for.
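
If you really need to process a long recording, one rough workaround is to split it into sub-30-second chunks and run each one separately; a sketch, with "long_call.wav" as a placeholder path, and note that chunks are answered independently, so no context is shared across them:

import transformers
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_2', trust_remote_code=True)

audio, sr = librosa.load("long_call.wav", sr=16000)  # placeholder path
chunk_len = 30 * sr  # 30 seconds of samples at 16 kHz
for start in range(0, len(audio), chunk_len):
    chunk = audio[start:start + chunk_len]
    output = pipe({'audio': chunk, 'prompt': '<|audio|>', 'sampling_rate': sr},
                  max_new_tokens=30)
    print(output)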

Ah, OK, that makes sense. I was inputting a full call transcription to have it respond with data about it.
Thanks.

Fixie.ai org

Oh, that's interesting. How did you turn the call transcript into audio? Human-generated or TTS? I'm asking because most TTS providers don't handle code-related transcripts very well.
