TL;DR This is the main Hugging Face checkpoint repo for GPA v1.5. Use it for native PyTorch / Hugging Face inference and fine-tuning. Runtime-optimized ONNX assets are published separately at AutoArk-AI/GPA-v1.5-onnx-runtime.
What Is GPA v1.5?
GPA stands for General Purpose Audio.
GPA v1.5 is a unified autoregressive audio-language model for speech understanding and generation. It currently supports:
- ASR: automatic speech recognition.
- TTS: text-to-speech with reference voice conditioning.
- Training / fine-tuning: native Hugging Face Trainer workflow.
- Deployment path: ONNX runtime assets and service code for local CLI, FastAPI, and browser UI testing.
Voice conversion support in the native v1.5 path is on the roadmap.
GPA unifies speech understanding and generation in a single autoregressive audio-language model.
Hugging Face and GitHub Mapping
This Hugging Face repo stores the large checkpoint assets. The code, examples, and docs live in the GitHub repo:
| Goal | GitHub Entry Point | Hugging Face Assets |
|---|---|---|
| Native PyTorch / Hugging Face inference | GPA_1.5/docs/infer.md, GPA_1.5/infer.py | This repo: AutoArk-AI/GPA-v1.5 |
| Fine-tuning / continued training | GPA_1.5/docs/train.md, GPA_1.5/train.py | This repo: AutoArk-AI/GPA-v1.5 |
| ONNX CLI / FastAPI / browser UI runtime | GPA_1.5/onnx_runtime/README.md | AutoArk-AI/GPA-v1.5-onnx-runtime |
Recommended Local Layout
For the least configuration, keep the checkpoint repos side by side:
GPA-v1.5/
GPA-v1.5-HF/
  GPA-v1.5/
    spark_tokenizer_model/
  GPA-v1.5-onnx-runtime/
What each path is used for:
- GPA-v1.5-HF/GPA-v1.5: native PyTorch train / inference checkpoint.
- GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model: Spark tokenizer assets used by native TTS.
- GPA-v1.5-HF/GPA-v1.5-onnx-runtime: ONNX CLI / service / browser UI asset bundle.
With this layout, the native inference, training, and ONNX smoke tests can run without editing source paths.
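Before running the smoke tests, it can help to confirm that the directories are where the layout expects them. The following is a minimal sketch, not part of the GPA codebase: the helper name `missing_layout_dirs` is hypothetical, and the directory list is copied from the layout described above, assuming the code checkout and `GPA-v1.5-HF/` share one parent directory.

```python
from pathlib import Path

# Expected directories relative to a common parent, copied from the layout above.
EXPECTED_DIRS = [
    "GPA-v1.5",                                    # GitHub code checkout
    "GPA-v1.5-HF/GPA-v1.5",                        # native PyTorch checkpoint
    "GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model",  # Spark tokenizer assets for native TTS
    "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",           # ONNX runtime asset bundle
]


def missing_layout_dirs(root):
    """Return the expected directories that do not exist under `root`.

    Hypothetical helper for illustration; it only checks that the
    directories exist, not that their contents are complete.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

Calling `missing_layout_dirs(".")` from the parent directory returns an empty list when all four directories are present, and otherwise lists the ones still to be cloned or downloaded.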
Download
git clone https://github.com/AutoArk/GPA.git GPA-v1.5
mkdir -p GPA-v1.5-HF
huggingface-cli download AutoArk-AI/GPA-v1.5 \
  --local-dir GPA-v1.5-HF/GPA-v1.5
huggingface-cli download AutoArk-AI/GPA-v1.5-onnx-runtime \
  --local-dir GPA-v1.5-HF/GPA-v1.5-onnx-runtime
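The two checkpoint downloads can also be scripted from Python. This is a sketch under stated assumptions rather than an official GPA script: it assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the helper names `download_plan` and `download_all` are hypothetical. The repo IDs and target directories are the ones used in the CLI commands above.

```python
from pathlib import Path

# Repo IDs and target folders, copied from the CLI commands above.
DOWNLOADS = {
    "AutoArk-AI/GPA-v1.5": "GPA-v1.5-HF/GPA-v1.5",
    "AutoArk-AI/GPA-v1.5-onnx-runtime": "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",
}


def download_plan(root="."):
    """Map each Hugging Face repo ID to its local target directory under `root`."""
    return {repo_id: str(Path(root) / rel) for repo_id, rel in DOWNLOADS.items()}


def download_all(root="."):
    """Download both repos. Needs network access and `pip install huggingface_hub`."""
    # Imported lazily so that download_plan() works without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(root).items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Running `download_all()` from the parent directory should produce the same `GPA-v1.5-HF/` contents as the CLI commands; the GitHub code checkout is still cloned separately with `git clone`.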
Where To Start
- Fine-tuning / continued training: GPA_1.5/docs/train.md
- Native PyTorch inference: GPA_1.5/docs/infer.md
- ONNX runtime deployment: GPA_1.5/onnx_runtime/README.md
GPA v1.5 Release Overview
| Component | GPA v1.5 |
|---|---|
| Checkpoint | Open-sourced on Hugging Face |
| Native inference | Direct PyTorch / Hugging Face execution for ASR and TTS |
| Native training | Fine-tuning and continued training with Hugging Face Trainer |
| ONNX runtime | CLI inference, FastAPI service, browser UI, voice registration, and runtime validation |
| Planned | Voice conversion support in the native v1.5 path |
Evaluation Metric Results
TTS Evaluation
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | No | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | No | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | Yes | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | Yes | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | Yes | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | Yes | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | Yes | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VoxCPM | Yes | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| Fun-CosyVoice3-0.5B-2512_RL | Yes | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| Spark TTS | Yes | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-v1.5 | Yes | 0.6B | 1.03 | 70.2 | 1.43 | 63.5 |
ASR Evaluation
WER (%) is reported for LibriSpeech. CER (%) is reported for AISHELL-1.
| Model | Model Size | LibriSpeech test-clean | LibriSpeech test-other | AISHELL-1 | test_Meeting | test_Net |
|---|---|---|---|---|---|---|
| Whisper-S | 0.24B | 3.43 | 7.63 | - | - | - |
| GPA-v1.5 | 0.6B | 2.78 | 5.02 | 2.83 | 7.40 | 6.49 |
| Fun-ASR-nano | 0.8B | 1.76 | 4.33 | 1.80 | 6.60 | 6.01 |
| FireRed-ASR | 1.1B | 1.84 | 4.52 | 0.54 | 4.95 | 4.94 |
| GLM-ASR-nano | 1.5B | 2.00 | 4.19 | 1.81 | 6.73 | - |
| Whisper-L | 1.55B | 1.86 | 3.43 | 4.72 | 18.39 | 11.89 |
| Kimi-Audio | - | 1.32 | 2.63 | 0.71 | 6.24 | 6.45 |
| Step-Audio2 | - | 1.17 | 2.42 | 0.63 | 4.75 | 4.67 |
| Seed-ASR | - | 1.58 | 2.84 | 0.68 | 5.69 | 4.66 |
| Fun-ASR | 7.7B | 1.51 | 3.03 | 1.22 | 6.17 | 5.46 |
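For readers unfamiliar with the metrics, WER and CER are both edit-distance error rates: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length, counted over words for WER and over characters for CER. The sketch below illustrates the standard definition only. It is not the evaluation script behind the numbers above, which may apply text normalization (casing, punctuation, number formatting) that this example omits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate in percent, ignoring spaces."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    hyp = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

As a usage example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word against six reference words, which is about 16.7%. Corpus-level scores are normally computed by summing errors and reference lengths over all utterances rather than averaging per-utterance rates.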
License
This model is released under the Apache 2.0 license.
Citation
If you find GPA useful for your research or projects, please cite us:
@misc{cai2026unifyingspeechrecognitionsynthesis,
title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
year={2026},
eprint={2601.10770},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2601.10770},
}