|
--- |
|
license: |
|
- mit |
|
- apache-2.0 |
|
language: |
|
- en |
|
library_name: transformers |
|
pipeline_tag: audio-classification |
|
tags: |
|
- audio |
|
- tts |
|
--- |
|
|
|
# StyleTTS 2 Detector |
|
|
|
This is an audio-classification model trained on nearly 10,000 human and StyleTTS 2-generated audio clips. The model is based on [Whisper](https://huggingface.co/openai/whisper-base).
|
|
|
**NOTE: This model is not affiliated with the author(s) of StyleTTS 2 in any way.** |
|
|
|
**NOTE: This model only aims to detect audio generated by StyleTTS 2 and DOES NOT work for audio generated by other TTS models or fine-tunes. I'm aiming to create a universal classifier in the future.**
|
|
|
## Online Demo |
|
|
|
An online demo is available [here](https://huggingface.co/spaces/mrfakename/styletts2-detector). |
|
|
|
## Usage |
|
|
|
**IMPORTANT:** Please read the license, disclaimer, and model card before using the model. You may not use the model if you do not agree to the license and disclaimer. |
|
|
|
```python
from transformers import pipeline
import torch

pipe = pipeline(
    'audio-classification',
    model='mrfakename/styletts2-detector',
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

result = pipe('audio.wav')
print(result)
```
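The pipeline also accepts a raw NumPy waveform instead of a file path. Whisper-based models expect 16 kHz mono audio, and while the pipeline resamples file inputs automatically, a raw array must already be at the right rate. Below is a minimal, assumption-laden sketch of a linear-interpolation resampler (in practice you would likely use `librosa.resample` or `torchaudio` instead):

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    # Naive linear-interpolation resampler for mono float audio.
    # Illustrative only; dedicated resamplers apply proper anti-aliasing.
    n_out = int(round(len(audio) * target_sr / orig_sr))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

# One second of 44.1 kHz audio becomes 16,000 samples.
one_second = np.zeros(44100, dtype=np.float32)
resampled = resample_linear(one_second, orig_sr=44100)
print(len(resampled))  # 16000
```

The resulting array can then be passed directly to the pipeline, e.g. `pipe(resampled)`.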
|
|
|
## Tags |
|
|
|
The audio will be classified as either `real` (human speech) or `fake` (StyleTTS 2-generated).
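The pipeline returns a list of label/score dicts, so picking the top-scoring label gives the verdict. A small sketch using a hypothetical (hard-coded) result for illustration; real scores will vary:

```python
# Hypothetical output from the pipeline above.
result = [
    {"label": "fake", "score": 0.97},
    {"label": "real", "score": 0.03},
]

# Take the highest-scoring label as the classification.
top = max(result, key=lambda r: r["score"])
if top["label"] == "fake":
    print(f"Likely StyleTTS 2-generated (confidence {top['score']:.0%})")
else:
    print(f"Likely human speech (confidence {top['score']:.0%})")
```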
|
|
|
## Disclaimer |
|
|
|
The author(s) of this model cannot guarantee complete accuracy. False positives or negatives may occur. |
|
|
|
Usage of this model should not replace other precautions, such as invisible watermarking or audio watermarking. |
|
|
|
This model has been trained on outputs from the StyleTTS 2 base model, not fine-tunes. The model may not identify fine-tunes properly. |
|
|
|
The author(s) of this model disclaim all liability related to or in connection with the usage of this model. |
|
|
|
## Training Data |
|
|
|
This model was trained on the following data: |
|
|
|
* A dataset of real human audio and synthetic audio generated by StyleTTS 2:
  * A subset of the LibriTTS-R dataset, which is licensed under the CC-BY 4.0 license and includes public domain audio.
  * A custom synthetic dataset derived from a subset of the LibriTTS-R dataset and synthesized with StyleTTS 2. Transcripts from LibriTTS-R were used as prompts, and the StyleTTS 2 model used was trained on the LibriTTS dataset.
|
|
|
## License |
|
|
|
You may use this model under either the **MIT** or **Apache 2.0** license, at your option, provided that you include the disclaimer above in all redistributions and require all downstream redistributions to include it as well.
|
|
|
This model was trained partially on data from the [LibriTTS dataset](http://www.openslr.org/60/), the [LibriTTS-R dataset](https://google.github.io/df-conformer/librittsr/), and/or data generated using [StyleTTS 2](https://arxiv.org/abs/2306.07691) (which was trained on the LibriTTS dataset). |