---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
language: en
datasets:
- YirongSun/LLaSO-Align
- YirongSun/LLaSO-Instruct
- YirongSun/LLaSO-Eval
---
# LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models
This repository contains **LLaSO-Base-3.8B-Instruct**, a 3.8B-parameter reference model from the **LLaSO** framework. LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research on Large Speech-Language Models (LSLMs).
LLaSO-Base is trained exclusively on the public data released with the LLaSO framework and achieves a reproducible baseline for compositional speech-language understanding: a normalized overall score of 0.72 across 20 tasks on LLaSO-Eval.
<p align="center">
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Align"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Align-16a085.svg" alt="HF Align"></a>
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Instruct"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Instruct-1abc9c.svg" alt="HF Ins"></a>
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Eval"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Eval-27ae60.svg" alt="HF Eval"></a>
<br>
<a href="https://huggingface.co/papers/2508.15418"><img src="https://img.shields.io/badge/arXiv-2508.15418-B31B1B.svg" alt="arXiv"></a>
<a href="https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct"><img src="https://img.shields.io/badge/HuggingFace-Model-ffcc00.svg" alt="HF Model"></a>
<a href="https://github.com/EIT-NLP/LLaSO"><img src="https://img.shields.io/github/stars/EIT-NLP/LLaSO?style=social" alt="GitHub Stars"></a>
</p>
* **Paper:** [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418)
* **Code & Project Page:** [https://github.com/EIT-NLP/LLaSO](https://github.com/EIT-NLP/LLaSO)
## 🔍 What is LLaSO?
**LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.**
The framework pairs three data resources with a reference model (a minimal loading sketch follows the list):
- **LLaSO-Align (12.0M):** An ASR-based alignment corpus for grounding speech in textual semantic space.
- **LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs):** A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives.
- **LLaSO-Eval (15,044 instances):** A reproducible benchmark for standardized evaluation, particularly of instruction following and cross-modality generalization.
- **LLaSO-Base (3.8B):** This model, a reference checkpoint trained in two stages (ASR alignment, then instruction tuning) on a LLaVA-style architecture for robust compositional understanding.
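The three corpora are hosted as standard Hugging Face datasets. Below is a minimal loading sketch; the split names and record layout are assumptions, so check each dataset card for the authoritative schema:

```python
# Minimal sketch: pull the LLaSO corpora from the Hugging Face Hub.
# Split names ("train"/"test") are assumptions; see each dataset card.
from datasets import load_dataset

# Stream the large training corpora rather than downloading them in full.
align = load_dataset("YirongSun/LLaSO-Align", split="train", streaming=True)
instruct = load_dataset("YirongSun/LLaSO-Instruct", split="train", streaming=True)
eval_set = load_dataset("YirongSun/LLaSO-Eval", split="test")  # 15,044 instances

print(next(iter(align)))  # inspect one record to discover the real field names
```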
<p align="center">
<img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/radar.png" width="600" alt="LLaSO overall performance">
</p>
<p align="center"><i>
LLaSO-Base achieves a normalized overall score of 0.72 on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
</i></p>
## ✨ Key Features
- **Fully Open, End-to-End Stack:** Unified release of corpus, benchmark, and model enabling open-source research and fair comparison in speech-language modeling.
- **25.5M Samples, 20 Tasks, 3 Modality Configurations:** Supports all major text ↔ audio combinations (text instruction + audio input, audio instruction + text input, pure audio), covering linguistic, semantic, and paralinguistic tasks; see the sketch after this list.
- **Stratified Evaluation (15,044 instances):** Cohesive design between the training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
- **Robust Reference Model (3.8B):** Two-stage training (ASR alignment → instruction tuning), easily reproducible and extensible for further research.
- **Empirical Insights:** Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps.
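To make the three modality configurations concrete, the sketch below spells them out as illustrative records. The field names are hypothetical, chosen only for readability; the real schema is defined by LLaSO-Instruct:

```python
# Illustrative (hypothetical) records for the three modality configurations.
# Field names are invented for clarity; LLaSO-Instruct defines the real schema.
examples = [
    {   # 1) text instruction + audio input
        "instruction": "Transcribe the recording.",
        "audio_input": "utterance_0001.wav",
    },
    {   # 2) audio instruction + text input
        "instruction_audio": "spoken_prompt_0002.wav",
        "text_input": "The quick brown fox jumps over the lazy dog.",
    },
    {   # 3) pure audio: both the instruction and the input are speech
        "instruction_audio": "spoken_prompt_0003.wav",
        "audio_input": "utterance_0003.wav",
    },
]
```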
<p align="center">
<img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/architecture_trim.png" width="350" alt="Architecture & Two-Stage Training (Figure 6)"><br>
<i>Architecture & Two-Stage Training</i>
</p>
## 🚀 Usage
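A minimal inference sketch, assuming the checkpoint loads through the standard `transformers` Auto classes with `trust_remote_code=True`. The processor call signature, audio preprocessing, and prompt handling below are assumptions rather than a confirmed API; the canonical pipeline lives in the GitHub repository:

```python
# Hedged sketch: class choices and the processor signature are assumptions.
# LLaSO-Base is LLaVA-style, so the repository may ship custom loading code.
import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "YirongSun/LLaSO-Base-3.8B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Hypothetical call: a text instruction paired with an input waveform
# (the text instruction + audio input modality configuration).
waveform, sampling_rate = sf.read("utterance_0001.wav")
inputs = processor(
    text="Transcribe the recording.",
    audios=waveform,
    sampling_rate=sampling_rate,
    return_tensors="pt",
).to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```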
For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the [LLaSO GitHub repository](https://github.com/EIT-NLP/LLaSO).
## 📑 How to Cite
If you use LLaSO in your research or applications, please cite our paper:
```bibtex
@misc{sun2025llaso,
      title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model},
      author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
      year={2025},
      eprint={2508.15418},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.15418},
}
```