---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
language: en
datasets:
- YirongSun/LLaSO-Align
- YirongSun/LLaSO-Instruct
- YirongSun/LLaSO-Eval
---
|
|
|
|
|
# LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models |
|
|
|
|
|
This repository contains **LLaSO-Base-3.8B-Instruct**, a 3.8B-parameter reference model from the **LLaSO** framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs). |
|
|
|
|
|
LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks. |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Align"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Align-16a085.svg" alt="HF Align"></a> |
|
|
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Instruct"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Instruct-1abc9c.svg" alt="HF Ins"></a> |
|
|
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Eval"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Eval-27ae60.svg" alt="HF Eval"></a> |
|
|
<br> |
|
|
<a href="https://huggingface.co/papers/2508.15418"><img src="https://img.shields.io/badge/arXiv-2508.15418-B31B1B.svg" alt="arXiv"></a> |
|
|
<a href="https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct"><img src="https://img.shields.io/badge/HuggingFace-Model-ffcc00.svg" alt="HF Model"></a> |
|
|
<a href="https://github.com/EIT-NLP/LLaSO"><img src="https://img.shields.io/github/stars/EIT-NLP/LLaSO?style=social" alt="GitHub Stars"></a> |
|
|
</p> |
|
|
|
|
|
* **Paper:** [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418) |
|
|
* **Code & Project Page:** [https://github.com/EIT-NLP/LLaSO](https://github.com/EIT-NLP/LLaSO) |
|
|
|
|
|
## 🔍 What is LLaSO? |
|
|
**LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.** |
|
|
|
|
|
The framework provides three data resources together with this reference model (see the dataset-loading sketch after the list):
|
|
- **LLaSO-Align (12.0M):** An ASR-based alignment corpus for grounding speech in textual semantic space. |
|
|
- **LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs):** A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives. |
|
|
- **LLaSO-Eval (15,044 instances):** A stratified, reproducible benchmark for standardized evaluation, particularly of instruction following and cross-modality generalization.
|
|
- **LLaSO-Base (3.8B):** The reference model released here, adapted from LLaVA-style architectures and trained in two stages (ASR alignment, then instruction tuning) for robust compositional understanding.
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/radar.png" width="600" alt="LLaSO overall performance"> |
|
|
</p> |
|
|
<p align="center"><i> |
|
|
LLaSO-Base achieves a strong normalized overall score (0.72) on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
|
|
</i></p> |
|
|
|
|
|
## ✨ Key Features |
|
|
- **Fully Open, End-to-End Stack:** Unified release of corpora, benchmark, and model, enabling open research and fair comparison in speech-language modeling.
|
|
- **25.5M Samples, 20 Tasks, 3 Modality Configurations:** Supports all major instruction/input combinations (text instruction + audio input, audio instruction + text input, and pure audio), covering linguistic, semantic, and paralinguistic tasks.
|
|
- **Stratified Evaluation (15,044 instances):** Consistent design between the training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
|
|
- **Robust Reference Model (3.8B):** Two-stage training (ASR alignment → instruction tuning), easily reproducible and extensible for further research. |
|
|
- **Empirical Insights:** Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps. |
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/architecture_trim.png" width="350" alt="Architecture & Two-Stage Training (Figure 6)"><br> |
|
|
<i>Architecture & Two-Stage Training</i> |
|
|
</p> |
|
|
|
|
|
## 🚀 Usage |
|
|
For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the [LLaSO GitHub repository](https://github.com/EIT-NLP/LLaSO). |
|
|
|
|
|
## 📑 How to Cite |
|
|
If you use LLaSO in your research or applications, please cite our paper: |
|
|
|
|
|
```bibtex
@misc{sun2025llaso,
      title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model},
      author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
      year={2025},
      eprint={2508.15418},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.15418}
}
```