GPA v1.5: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

TL;DR: This is the main Hugging Face checkpoint repo for GPA v1.5. Use it for native PyTorch / Hugging Face inference and fine-tuning. Runtime-optimized ONNX assets are published separately at AutoArk-AI/GPA-v1.5-onnx-runtime.

What Is GPA v1.5?

GPA stands for General Purpose Audio.

GPA v1.5 is a unified autoregressive audio-language model for speech understanding and generation. It currently supports:

ASR: automatic speech recognition.
TTS: text-to-speech with reference voice conditioning.
Training / fine-tuning: native Hugging Face Trainer workflow.
Deployment path: ONNX runtime assets and service code for local CLI, FastAPI, and browser UI testing.

Voice conversion support in the native v1.5 path is on the roadmap.

Figure: GPA unifies speech understanding and generation in a single autoregressive audio-language model.

Hugging Face and GitHub Mapping

This Hugging Face repo stores the large checkpoint assets. The code, examples, and docs live in the GitHub repo:

Goal	GitHub Entry Point	Hugging Face Assets
Native PyTorch / Hugging Face inference	`GPA_1.5/docs/infer.md`, `GPA_1.5/infer.py`	This repo: `AutoArk-AI/GPA-v1.5`
Fine-tuning / continued training	`GPA_1.5/docs/train.md`, `GPA_1.5/train.py`	This repo: `AutoArk-AI/GPA-v1.5`
ONNX CLI / FastAPI / browser UI runtime	`GPA_1.5/onnx_runtime/README.md`	`AutoArk-AI/GPA-v1.5-onnx-runtime`

Recommended Local Layout

For the least configuration, keep the GitHub checkout and the checkpoint directory side by side, with both Hugging Face repos under GPA-v1.5-HF/:

GPA-v1.5/
GPA-v1.5-HF/
  GPA-v1.5/
    spark_tokenizer_model/
  GPA-v1.5-onnx-runtime/

What each path is used for:

GPA-v1.5: clone of the GitHub repo, which holds the code, examples, and docs.
GPA-v1.5-HF/GPA-v1.5: native PyTorch training / inference checkpoint.
GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model: Spark tokenizer assets used by native TTS.
GPA-v1.5-HF/GPA-v1.5-onnx-runtime: ONNX CLI / service / browser UI asset bundle.

With this layout, the native inference, training, and ONNX smoke tests can run without editing source paths.
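Before running any of those entry points, it can help to confirm that the directories are where the layout expects them. The following is a minimal sketch, not part of the GPA code: the names `EXPECTED_DIRS` and `missing_layout_dirs` are hypothetical, and the paths are copied from the layout listed above.

```python
from pathlib import Path

# Expected directories, relative to the common parent folder.
# These mirror the recommended layout; nothing here is read from the GPA code.
EXPECTED_DIRS = (
    "GPA-v1.5",
    "GPA-v1.5-HF/GPA-v1.5",
    "GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model",
    "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",
)


def missing_layout_dirs(parent: Path) -> list[str]:
    """Return the expected directories that do not exist under `parent`."""
    return [rel for rel in EXPECTED_DIRS if not (parent / rel).is_dir()]
```

Calling `missing_layout_dirs(Path.cwd())` from the parent folder returns an empty list when all four directories are present, and otherwise lists the ones still to be cloned or downloaded.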

Download

git clone https://github.com/AutoArk/GPA.git GPA-v1.5
mkdir -p GPA-v1.5-HF

huggingface-cli download AutoArk-AI/GPA-v1.5 \
  --local-dir GPA-v1.5-HF/GPA-v1.5

huggingface-cli download AutoArk-AI/GPA-v1.5-onnx-runtime \
  --local-dir GPA-v1.5-HF/GPA-v1.5-onnx-runtime
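The two checkpoint downloads can also be scripted from Python. This is a sketch that assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the helper name `download_checkpoints` and the `CHECKPOINT_REPOS` mapping are hypothetical and only restate the repo ids and target directories from the commands above.

```python
from pathlib import Path

# Repo id -> local directory, matching the huggingface-cli commands above.
CHECKPOINT_REPOS = {
    "AutoArk-AI/GPA-v1.5": "GPA-v1.5-HF/GPA-v1.5",
    "AutoArk-AI/GPA-v1.5-onnx-runtime": "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",
}


def download_checkpoints(parent: Path = Path(".")) -> list[Path]:
    """Download both checkpoint repos under `parent` and return their local paths."""
    # Imported lazily so the mapping can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    local_dirs = []
    for repo_id, rel_dir in CHECKPOINT_REPOS.items():
        target = parent / rel_dir
        target.mkdir(parents=True, exist_ok=True)
        # Writes the repo files directly into `target`.
        snapshot_download(repo_id=repo_id, local_dir=str(target))
        local_dirs.append(target)
    return local_dirs
```

Running `download_checkpoints()` from the folder that contains the GitHub checkout should produce the same directory layout as the CLI commands.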

Where To Start

Fine-tuning / continued training: GPA_1.5/docs/train.md
Native PyTorch inference: GPA_1.5/docs/infer.md
ONNX runtime deployment: GPA_1.5/onnx_runtime/README.md

GPA v1.5 Release Overview

	GPA v1.5
Checkpoint	Open-sourced on Hugging Face
Native inference	Direct PyTorch / Hugging Face execution for ASR and TTS
Native training	Fine-tuning and continued training with Hugging Face `Trainer`
ONNX runtime	CLI inference, FastAPI service, browser UI, voice registration, and runtime validation
Planned	Voice conversion support in the native v1.5 path

Evaluation Metric Results

TTS Evaluation

Model	Open-Source	Model Size	test-zh CER (%) ↓	test-zh Sim (%) ↑	test-en WER (%) ↓	test-en Sim (%) ↑
Human	-	-	1.26	75.5	2.14	73.4
Seed-TTS	No	-	1.12	79.6	2.25	76.2
MiniMax-Speech	No	-	0.83	78.3	1.65	69.2
F5-TTS	Yes	0.3B	1.52	74.1	2.00	64.7
CosyVoice2	Yes	0.5B	1.45	75.7	2.57	65.9
FireRedTTS2	Yes	1.5B	1.14	73.2	1.95	66.5
Index-TTS2	Yes	1.5B	1.03	76.5	2.23	70.6
VibeVoice-1.5B	Yes	1.5B	1.16	74.4	3.04	68.9
VoxCPM	Yes	0.5B	0.93	77.2	1.85	72.9
Fun-CosyVoice3-0.5B-2512_RL	Yes	0.5B	0.81	77.4	1.68	69.5
Spark TTS	Yes	0.5B	1.20	66.0	1.98	57.3
GPA-v1.5	Yes	0.6B	1.03	70.2	1.43	63.5

ASR Evaluation

WER (%) is reported for LibriSpeech and CER (%) for AISHELL-1. Lower is better.

Model	Model Size	LibriSpeech test-clean	LibriSpeech test-other	AISHELL-1	test_Meeting	test_Net
Whisper-S	0.24B	3.43	7.63	-	-	-
GPA-v1.5	0.6B	2.78	5.02	2.83	7.40	6.49
Fun-ASR-nano	0.8B	1.76	4.33	1.80	6.60	6.01
FireRed-ASR	1.1B	1.84	4.52	0.54	4.95	4.94
GLM-ASR-nano	1.5B	2.00	4.19	1.81	6.73	-
Whisper-L	1.55B	1.86	3.43	4.72	18.39	11.89
Kimi-Audio	-	1.32	2.63	0.71	6.24	6.45
Step-Audio2	-	1.17	2.42	0.63	4.75	4.67
Seed-ASR	-	1.58	2.84	0.68	5.69	4.66
Fun-ASR	7.7B	1.51	3.03	1.22	6.17	5.46
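For readers unfamiliar with the metrics in the tables above, WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words (WER) or characters (CER). The sketch below illustrates that standard definition only. It is not the evaluation script behind the reported numbers, and the text normalization used for those numbers (casing, punctuation, spacing) is not stated here, so the whitespace handling is an assumption.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, a hypothesis that drops one word from a six-word reference has a WER of one sixth, about 16.7%. Because insertions count as errors, both rates can exceed 100%.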

License

This model is released under the Apache 2.0 license.

Citation

If you find GPA useful for your research or projects, please cite us:

@misc{cai2026unifyingspeechrecognitionsynthesis,
      title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
      author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
      year={2026},
      eprint={2601.10770},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.10770},
}