|
--- |
|
license: |
|
- mit |
|
- apache-2.0 |
|
language: |
|
- en |
|
library_name: transformers |
|
pipeline_tag: audio-classification |
|
tags: |
|
- audio |
|
- tts |
|
--- |
|
|
|
# StyleTTS 2 Detector |
|
|
|
This is an audio-classification model trained on nearly 10,000 human and StyleTTS 2-generated audio clips. The model is based on [Whisper](https://huggingface.co/openai/whisper-base).
|
|
|
**NOTE: This model is not affiliated with the author(s) of StyleTTS 2 in any way.** |
|
|
|
**NOTE: This model only aims to detect audio generated by StyleTTS 2 and DOES NOT work for audio generated by other TTS models or fine-tunes. I'm aiming to create a universal classifier in the future.**
|
|
|
## Online Demo |
|
|
|
An online demo is available [here](https://huggingface.co/spaces/mrfakename/styletts2-detector). |
|
|
|
## Usage |
|
|
|
**IMPORTANT:** Please read the license, disclaimer, and model card before using the model. You may not use the model if you do not agree to the license and disclaimer. |
|
|
|
```python
from transformers import pipeline
import torch

pipe = pipeline(
    'audio-classification',
    model='mrfakename/styletts2-detector',
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

result = pipe('audio.wav')
print(result)
```
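The pipeline also accepts a raw NumPy waveform instead of a file path. Whisper-based models expect 16 kHz mono audio, and while the pipeline resamples file inputs automatically, a raw array must already be at the right rate. Below is a minimal, assumption-laden sketch of a linear-interpolation resampler (in practice you would likely use `librosa.resample` or `torchaudio` instead):

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    # Naive linear-interpolation resampler for mono float audio.
    # Illustrative only; dedicated resamplers apply proper anti-aliasing.
    n_out = int(round(len(audio) * target_sr / orig_sr))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

# One second of 44.1 kHz audio becomes 16,000 samples.
one_second = np.zeros(44100, dtype=np.float32)
resampled = resample_linear(one_second, orig_sr=44100)
print(len(resampled))  # 16000
```

The resulting array can then be passed directly to the pipeline, e.g. `pipe(resampled)`.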
|
|
|
## Tags |
|
|
|
The audio will be classified as either `real` (human speech) or `fake` (StyleTTS 2-generated).
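The pipeline returns a list of label/score dicts, so picking the top-scoring label gives the verdict. A small sketch using a hypothetical (hard-coded) result for illustration; real scores will vary:

```python
# Hypothetical output from the pipeline above.
result = [
    {"label": "fake", "score": 0.97},
    {"label": "real", "score": 0.03},
]

# Take the highest-scoring label as the classification.
top = max(result, key=lambda r: r["score"])
if top["label"] == "fake":
    print(f"Likely StyleTTS 2-generated (confidence {top['score']:.0%})")
else:
    print(f"Likely human speech (confidence {top['score']:.0%})")
```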
|
|
|
## Disclaimer |
|
|
|
The author(s) of this model cannot guarantee complete accuracy. False positives or negatives may occur. |
|
|
|
Usage of this model should not replace other precautions, such as invisible watermarking or audio watermarking. |
|
|
|
This model has been trained on outputs from the StyleTTS 2 base model, not fine-tunes. The model may not identify fine-tunes properly. |
|
|
|
The author(s) of this model disclaim all liability related to or in connection with the usage of this model. |
|
|
|
## Training Data |
|
|
|
This model was trained on the following data: |
|
|
|
* A dataset of real human audio and synthetic audio generated by StyleTTS 2:
  * A subset of the LibriTTS-R dataset, which is licensed under the CC-BY 4.0 license and includes public domain audio.
  * A custom synthetic dataset derived from a subset of the LibriTTS-R dataset and synthesized with StyleTTS 2. Transcripts from LibriTTS-R were used as prompts, and the StyleTTS 2 model used was trained on the LibriTTS dataset.
|
|
|
## License |
|
|
|
You may use this model under either the **MIT** or **Apache 2.0** license, at your option, provided that you include the disclaimer above in all redistributions and require all downstream redistributions to include it as well.
|
|
|
This model was trained partially on data from the [LibriTTS dataset](http://www.openslr.org/60/), the [LibriTTS-R dataset](https://google.github.io/df-conformer/librittsr/), and/or data generated using [StyleTTS 2](https://arxiv.org/abs/2306.07691) (which was trained on the LibriTTS dataset). |