---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
language: en
datasets:
- YirongSun/LLaSO-Align
- YirongSun/LLaSO-Instruct
- YirongSun/LLaSO-Eval
---
|
|
|
|
|
# LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models |
|
|
|
|
|
This repository contains **LLaSO-Base-3.8B-Instruct**, a 3.8B-parameter reference model from the **LLaSO** framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs). |
|
|
|
|
|
LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks. |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Align"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Align-16a085.svg" alt="HF Align"></a> |
|
|
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Instruct"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Instruct-1abc9c.svg" alt="HF Ins"></a> |
|
|
<a href="https://huggingface.co/datasets/YirongSun/LLaSO-Eval"><img src="https://img.shields.io/badge/HF%20Dataset-LLaSO--Eval-27ae60.svg" alt="HF Eval"></a> |
|
|
<br> |
|
|
<a href="https://huggingface.co/papers/2508.15418"><img src="https://img.shields.io/badge/arXiv-2508.15418-B31B1B.svg" alt="arXiv"></a> |
|
|
<a href="https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct"><img src="https://img.shields.io/badge/HuggingFace-Model-ffcc00.svg" alt="HF Model"></a> |
|
|
<a href="https://github.com/EIT-NLP/LLaSO"><img src="https://img.shields.io/github/stars/EIT-NLP/LLaSO?style=social" alt="GitHub Stars"></a> |
|
|
</p> |
|
|
|
|
|
* **Paper:** [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418) |
|
|
* **Code & Project Page:** [https://github.com/EIT-NLP/LLaSO](https://github.com/EIT-NLP/LLaSO) |
|
|
|
|
|
## 🔍 What is LLaSO? |
|
|
**LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.** |
|
|
|
|
|
The framework provides three data resources together with this reference model (see the dataset-loading sketch after the list):
|
|
- **LLaSO-Align (12.0M):** An ASR-based alignment corpus for grounding speech in textual semantic space. |
|
|
- **LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs):** A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives. |
|
|
- **LLaSO-Eval (15,044 instances):** A stratified, reproducible benchmark for standardized evaluation, particularly of instruction following and cross-modality generalization.
|
|
- **LLaSO-Base (3.8B):** The reference model released here, adapted from LLaVA-style architectures and trained in two stages (ASR alignment, then instruction tuning) for robust compositional understanding.
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/radar.png" width="600" alt="LLaSO overall performance"> |
|
|
</p> |
|
|
<p align="center"><i> |
|
|
LLaSO-Base achieves a strong normalized overall score (0.72) on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
|
|
</i></p> |
|
|
|
|
|
## ✨ Key Features |
|
|
- **Fully Open, End-to-End Stack:** Unified release of corpora, benchmark, and model, enabling open research and fair comparison in speech-language modeling.
|
|
- **25.5M Samples, 20 Tasks, 3 Modality Configurations:** Supports all major instruction/input combinations (text instruction + audio input, audio instruction + text input, and pure audio), covering linguistic, semantic, and paralinguistic tasks.
|
|
- **Stratified Evaluation (15,044 instances):** Consistent design between the training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
|
|
- **Robust Reference Model (3.8B):** Two-stage training (ASR alignment → instruction tuning), easily reproducible and extensible for further research. |
|
|
- **Empirical Insights:** Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps. |
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://github.com/EIT-NLP/LLaSO/raw/main/figures/architecture_trim.png" width="350" alt="Architecture & Two-Stage Training (Figure 6)"><br> |
|
|
<i>Architecture & Two-Stage Training</i> |
|
|
</p> |
|
|
|
|
|
## 🚀 Usage |
|
|
For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the [LLaSO GitHub repository](https://github.com/EIT-NLP/LLaSO). |
|
|
|
|
|
## 📑 How to Cite |
|
|
If you use LLaSO in your research or applications, please cite our paper: |
|
|
|
|
|
```bibtex
@misc{sun2025llaso,
      title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model},
      author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
      year={2025},
      eprint={2508.15418},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.15418}
}
```