---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
language: en
datasets:
  - YirongSun/LLaSO-Align
  - YirongSun/LLaSO-Instruct
  - YirongSun/LLaSO-Eval
---

# LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models

This repository contains **LLaSO-Base-3.8B-Instruct**, a 3.8B-parameter reference model from the **LLaSO** framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs).

LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks.

<p align="center">
  <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Align"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Align-16a085.svg" alt="HF Align"></a>
  <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Instruct"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Instruct-1abc9c.svg" alt="HF Ins"></a>
  <a href="https://huggingface.co/datasets/YirongSun/LLaSO-Eval"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Eval-27ae60.svg" alt="HF Eval"></a>
  <br>
  <a href="https://huggingface.co/papers/2508.15418"><img src="https://img.shields.io/badge/arXiv-2508.15418-B31B1B.svg" alt="arXiv"></a>
  <a href="https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct"><img src="https://img.shields.io/badge/HuggingFace-Model-ffcc00.svg" alt="HF Model"></a>
  <a href="https://github.com/EIT-NLP/LLaSO"><img src="https://img.shields.io/github/stars/EIT-NLP/LLaSO?style=social" alt="GitHub Stars"></a>
</p>

*   **Paper:** [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418)
*   **Code & Project Page:** [https://github.com/EIT-NLP/LLaSO](https://github.com/EIT-NLP/LLaSO)

## 🔍 What is LLaSO?
**LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.**

The framework provides three data and evaluation resources, plus a reference model:
-   **LLaSO-Align (12.0M samples):** An ASR-based alignment corpus for grounding speech in textual semantic space.
-   **LLaSO-Instruct (13.5M samples / 20 tasks / 3 modality configs):** A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives.
-   **LLaSO-Eval (15,044 instances):** A reproducible benchmark for standardized evaluation, particularly of instruction following and cross-modality generalization.
-   **LLaSO-Base (3.8B):** This model, a two-stage trained reference model adapted from LLaVA-style architectures for robust compositional understanding.

<p align="center">
  <img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/radar.png" width="600" alt="LLaSO overall performance">
</p>
<p align="center"><i>
LLaSO-Base achieves a strong normalized overall score on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
</i></p>

## ✨ Key Features
-   **Fully Open, End-to-End Stack:** Unified release of corpus, benchmark, and model enabling open-source research and fair comparison in speech-language modeling.
-   **25.5M Samples, 20 Tasks, 3 Modality Configurations:** Supports all major text ↔ audio combinations (text + audio, audio + text, pure audio), covering linguistic, semantic, and paralinguistic tasks.
-   **Stratified Evaluation (15,044 instances):** Cohesive design between training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
-   **Robust Reference Model (3.8B):** Two-stage training (ASR alignment → instruction tuning) that is easy to reproduce and extend for further research.
-   **Empirical Insights:** Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps.

<p align="center">
  <img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/architecture_trim.png" width="350" alt="Architecture & Two-Stage Training (Figure 6)"><br>
  <i>Architecture & Two-Stage Training</i>
</p>

## 🚀 Usage
For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the [LLaSO GitHub repository](https://github.com/EIT-NLP/LLaSO).
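As a starting point, the sketch below shows one way the checkpoint might be loaded with `transformers`. Because `llava_llama` is a custom architecture, loading is assumed to require the repository's remote code (`trust_remote_code=True`); the `<audio>` placeholder token, the `build_prompt` helper, and the processor interface are illustrative assumptions, not confirmed API — consult the GitHub repository for the supported entry points.

```python
# Hypothetical quickstart for LLaSO-Base-3.8B-Instruct.
# NOTE: the <audio> placeholder and the processor call pattern are
# assumptions based on LLaVA-style models; see the LLaSO repo for the
# actual inference interface.

MODEL_ID = "YirongSun/LLaSO-Base-3.8B-Instruct"


def build_prompt(instruction: str, audio_token: str = "<audio>") -> str:
    """Interleave the (assumed) audio placeholder with a text instruction."""
    return f"{audio_token}\n{instruction}"


def load_model(model_id: str = MODEL_ID):
    """Load the processor and model.

    Imports are kept local so this module stays importable without
    torch/transformers installed. Requires the repository's remote code;
    fp16 + device_map="auto" assumes a GPU is available.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )
    return processor, model
```

A typical call would then pair `build_prompt("Transcribe the speech.")` with a waveform passed through the processor before `model.generate(...)`.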

## 📑 How to Cite
If you use LLaSO in your research or applications, please cite our paper:

```bibtex
@misc{sun2025llaso,
      title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model}, 
      author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
      year={2025},
      eprint={2508.15418},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.15418}, 
}
```