Post

One speech model with seven voices, streamlined with multimodal capabilities for vision tasks. It performs vision (image-text) to audio inference with Qwen2.5-VL + VibeVoice-Realtime-0.5B. Vision to VibeVoice (EN): the demo is live.

Vision-to-VibeVoice-en [Demo]: prithivMLmods/Vision-to-VibeVoice-en
Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
Speech [VibeVoice-Realtime-0.5B]: microsoft/VibeVoice-Realtime-0.5B
Vision [Qwen2.5-VL]: Qwen/Qwen2.5-VL-7B-Instruct

To learn more, visit the app page or the respective model page.
Article

We're open-sourcing our text-to-image model and the process behind it (28 days ago)