---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
language: en
datasets:
- YirongSun/LLaSO-Align
- YirongSun/LLaSO-Instruct
- YirongSun/LLaSO-Eval
---

# LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models

This repository contains **LLaSO-Base-3.8B-Instruct**, a 3.8B-parameter reference model from the **LLaSO** framework. LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research on Large Speech-Language Models (LSLMs). Trained exclusively on the public data released with the framework, LLaSO-Base achieves a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks.
* **Paper:** [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://huggingface.co/papers/2508.15418)
* **Code & Project Page:** [https://github.com/EIT-NLP/LLaSO](https://github.com/EIT-NLP/LLaSO)

## 🔍 What is LLaSO?

**LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.**

The framework provides three essential data resources, plus this reference model (a sketch for loading the datasets follows the list):

- **LLaSO-Align (12.0M):** An ASR-based alignment corpus for grounding speech in textual semantic space.
- **LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs):** A multi-task instruction-tuning dataset covering linguistic, semantic, and paralinguistic objectives.
- **LLaSO-Eval (15,044):** A reproducible benchmark for standardized evaluation, particularly of instruction following and cross-modality generalization.
- **LLaSO-Base (3.8B):** This model, a two-stage-trained reference model adapted from LLaVA-style architectures for robust compositional understanding.
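As a minimal sketch, the three data resources can be pulled from the Hugging Face Hub with the `datasets` library. The split names and streaming choices below are assumptions; consult each dataset card for the actual schema.

```python
from datasets import load_dataset

# LLaSO-Eval is small enough (15,044 examples) to download in full.
# The split name "test" is an assumption -- check the dataset card.
llaso_eval = load_dataset("YirongSun/LLaSO-Eval", split="test")

# LLaSO-Align (12.0M) and LLaSO-Instruct (13.5M) are large, so streaming
# avoids materializing millions of samples on disk.
llaso_align = load_dataset("YirongSun/LLaSO-Align", streaming=True)
llaso_instruct = load_dataset("YirongSun/LLaSO-Instruct", streaming=True)

print(llaso_eval[0])  # inspect one stratified evaluation example
```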
On LLaSO-Eval, LLaSO-Base attains a normalized overall score of 0.72 across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
## ✨ Key Features

- **Fully Open, End-to-End Stack:** Unified release of corpus, benchmark, and model, enabling open-source research and fair comparison in speech-language modeling.
- **25.5M Samples, 20 Tasks, 3 Modality Configurations:** Supports all major text ↔ audio combinations (text + audio, audio + text, pure audio), covering linguistic, semantic, and paralinguistic tasks.
- **Stratified Evaluation (15,044):** Cohesive design between training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
- **Robust Reference Model (3.8B):** Two-stage training (ASR alignment → instruction tuning), easily reproducible and extensible for further research; a hedged inference sketch follows this list.
- **Empirical Insights:** Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can close part of the gap.
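Below is a minimal inference sketch assuming the checkpoint is published under `YirongSun/LLaSO-Base-3.8B-Instruct` and ships `transformers`-compatible remote code. The Auto classes, processor keyword arguments, and prompt format are all assumptions; refer to the GitHub repository for the supported inference path.

```python
import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed Hub repo id -- verify against this model card's actual location.
MODEL_ID = "YirongSun/LLaSO-Base-3.8B-Instruct"

# trust_remote_code assumes the repo bundles custom LLaVA-style modeling code.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# "text + audio" modality configuration: a text instruction paired with speech.
waveform, sampling_rate = sf.read("sample.wav")
# The `audio`/`sampling_rate` keyword names are hypothetical; the actual
# processor signature may differ.
inputs = processor(
    text="Transcribe the speech.",
    audio=waveform,
    sampling_rate=sampling_rate,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```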

## 🧩 Architecture & Two-Stage Training