---
license: cc-by-4.0
datasets:
- facebook/multilingual_librispeech
- parler-tts/libritts_r_filtered
language:
- en
pipeline_tag: text-to-speech
---
Model Description
A newer version of this model is available: OuteTTS-0.2-500M
OuteTTS-0.1-350M is a novel text-to-speech synthesis model that leverages pure language modeling, without external adapters or complex architectures. Built on the LLaMa architecture using our Oute3-350M-DEV base model, it demonstrates that high-quality speech synthesis is achievable through a straightforward approach combining crafted prompts and audio tokens.
Key Features
- Pure language modeling approach to TTS
- Voice cloning capabilities
- LLaMa architecture
- Compatible with llama.cpp and GGUF format
Technical Details
The model utilizes a three-step approach to audio processing:
- Audio tokenization using WavTokenizer (processing 75 tokens per second)
- CTC forced alignment for precise word-to-audio token mapping
- Structured prompt creation following the format:
[full transcription]
[word] [duration token] [audio tokens]
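The prompt structure above can be sketched as follows. This is an illustrative mock-up, not the actual outetts implementation: the `build_prompt` helper and the `<|t_…|>` / `<|a_…|>` marker syntax are hypothetical, but the layout mirrors the format shown above (full transcription first, then one line per word pairing it with a duration token and its audio tokens).

```python
# Hypothetical sketch of the structured prompt (not the real outetts code).
def build_prompt(transcription, aligned_words):
    """aligned_words: list of (word, duration_token, audio_token_ids),
    e.g. produced by CTC forced alignment plus WavTokenizer."""
    lines = [transcription]
    for word, duration, tokens in aligned_words:
        token_str = "".join(f"<|a_{t}|>" for t in tokens)
        lines.append(f"{word}<|t_{duration}|>{token_str}")
    return "\n".join(lines)

prompt = build_prompt(
    "hello world",
    [("hello", 30, [101, 102]), ("world", 45, [201])],
)
print(prompt)
# hello world
# hello<|t_30|><|a_101|><|a_102|>
# world<|t_45|><|a_201|>
```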
Technical Blog
https://www.outeai.com/blog/OuteTTS-0.1-350M
Limitations
As an experimental v0.1 release, the model has some known issues:
- Vocabulary constraints due to training data limitations
- String-only input support
- Given its compact 350M parameter size, the model may frequently substitute, insert, or omit words, leading to variable output quality
- Variable temperature sensitivity depending on use case
- Performs best with shorter sentences, as accuracy may decrease with longer inputs
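The last point follows from a back-of-envelope calculation: WavTokenizer produces 75 audio tokens per second, and the usage examples below generate with `max_length=4096`, so the context window caps how much audio one generation can produce.

```python
# Rough upper bound on audio length per generation, ignoring the tokens
# consumed by the text prompt itself (which reduce the budget further).
TOKENS_PER_SECOND = 75   # WavTokenizer rate stated in Technical Details
MAX_LENGTH = 4096        # generation budget used in the usage examples

max_audio_seconds = MAX_LENGTH / TOKENS_PER_SECOND
print(round(max_audio_seconds, 1))  # 54.6 -- under a minute of audio
```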
Speech Samples
Samples generated by OuteTTS-0.1-350M (audio players are available on the original model page):

Input | Notes
---|---
Hello, I can speak pretty well, but sometimes I make some mistakes. | (temperature=0.1, repetition_penalty=1.1)
Once upon a time, there was a | (temperature=0.1, repetition_penalty=1.1)
Scientists have discovered a new planet that may be capable of supporting life! | Using the Q4_K_M quantized model. (temperature=0.7, repetition_penalty=1.1)
Scientists have discovered a new planet that may be capable of supporting life! | The model partially failed to follow the input text. (temperature=0.1, repetition_penalty=1.1)
Scientists have discovered a new planet that may be capable of supporting life! | Raising the temperature from 0.1 to 0.7 produces more consistent output here. (temperature=0.7, repetition_penalty=1.1)
Installation
```shell
pip install outetts
```
Usage
The example below works with the older `outetts` version (`==0.1.7`). The new version (`>=0.2.0`) introduces changes to the interface; please refer to the GitHub Usage Example for updated examples.
Interface Usage
```python
from outetts.v0_1.interface import InterfaceHF, InterfaceGGUF

# Initialize the interface with the Hugging Face model
interface = InterfaceHF("OuteAI/OuteTTS-0.1-350M")

# Or initialize the interface with a GGUF model
# interface = InterfaceGGUF("path/to/model.gguf")

# Generate TTS output.
# Without a speaker reference, the model generates speech
# with random speaker characteristics.
output = interface.generate(
    text="Hello, am I working?",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096
)

# Play the generated audio
output.play()

# Save the generated audio to a file
output.save("output.wav")
```
Voice Cloning
```python
# Create a custom speaker from an audio file
speaker = interface.create_speaker(
    "path/to/reference.wav",
    "reference text matching the audio"
)

# Generate TTS with the custom voice
output = interface.generate(
    text="This is a cloned voice speaking",
    speaker=speaker,
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096
)
```
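Since the model performs best with shorter sentences (see Limitations), one practical workaround is to split long input text and synthesize each sentence separately. A minimal sketch of the splitting step is below; the naive regex splitter is an assumption for illustration, not part of outetts.

```python
import re

def split_sentences(text):
    # Naive splitter: break on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

parts = split_sentences("First sentence. Second one! And a third?")
print(parts)  # ['First sentence.', 'Second one!', 'And a third?']

# Each part could then be passed to interface.generate(...) on its own and
# saved as a separate clip, e.g. output.save(f"part_{i:02d}.wav").
```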
Model Details
- Model Type: LLaMa-based language model
- Size: 350M parameters
- Language Support: English
- License: CC BY 4.0
- Speech Datasets Used:
- LibriTTS-R (CC BY 4.0)
- Multilingual LibriSpeech (MLS) (CC BY 4.0)
Future Improvements
- Scaling up parameters and training data
- Exploring alternative alignment methods for better character compatibility
- Potential expansion into speech-to-speech assistant models
Credits
- WavTokenizer: https://github.com/jishengpeng/WavTokenizer
- CTC Forced Alignment: https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html
Disclaimer
By using this model, you acknowledge that you understand and assume the risks associated with its use. You are solely responsible for ensuring compliance with all applicable laws and regulations. We disclaim any liability for problems arising from the use of this open-source model, including but not limited to direct, indirect, incidental, consequential, or punitive damages. We make no warranties, express or implied, regarding the model's performance, accuracy, or fitness for a particular purpose. Your use of this model is at your own risk, and you agree to hold harmless and indemnify us, our affiliates, and our contributors from any claims, damages, or expenses arising from your use of the model.