Evaluating Language Models as Synthetic Data Generators
Abstract
Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality, including response quality, perplexity, and instruction difficulty, collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.
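The abstract describes the benchmark's core loop: a generator LM synthesizes training instances, a fixed student model is trained on them, and the student's resulting benchmark performance reflects the generator's data generation ability. A minimal sketch of such a loop is below; the function names (`generate_instances`, `finetune`, `evaluate`) and the improvement-over-baseline scoring are illustrative assumptions, not the released AgoraBench code or its exact metric.

```python
# Minimal sketch of an AgoraBench-style evaluation loop.
# All callables here are illustrative placeholders (assumptions),
# not the released AgoraBench implementation or its exact metric.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """One synthetic training example: an instruction and a response."""
    instruction: str
    response: str


def data_generation_score(
    generate_instances: Callable[[str, int], List[Instance]],  # wraps the generator LM
    finetune: Callable[[str, List[Instance]], str],            # returns a student checkpoint id
    evaluate: Callable[[str], float],                          # downstream benchmark score
    base_student: str,
    seed_task: str,
    n_instances: int = 10_000,
) -> float:
    """Score a generator LM by how much its synthetic data improves
    a fixed student model over that student's untrained baseline."""
    baseline = evaluate(base_student)                  # student before any synthetic data
    data = generate_instances(seed_task, n_instances)  # generator LM synthesizes training data
    student = finetune(base_student, data)             # train the student on the synthetic data
    return evaluate(student) - baseline                # improvement attributed to the data
```

The abstract also names perplexity as one intrinsic signal of data quality. A standard way to compute the perplexity of a generated instruction under a reference LM, using the Hugging Face transformers library (the choice of `gpt2` as the reference model is an arbitrary assumption):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(text: str, model_name: str = "gpt2") -> float:
    """Perplexity of `text` under a causal reference LM (lower = more predictable)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # mean cross-entropy over tokens
    return torch.exp(out.loss).item()
```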
Community
We release AgoraBench, a benchmark that compares the data generation capabilities of LMs!
Code, leaderboard, data, and checkpoints will be released soon. Stay tuned!
This is an automated message from Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Training and Evaluating Language Models with Template-based Data Generation (2024)
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models (2024)
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models (2024)
- Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation (2024)
- Evaluating Large Language Model Capability in Vietnamese Fact-Checking Data Generation (2024)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale (2024)
- A Survey on Data Synthesis and Augmentation for Large Language Models (2024)