Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Abstract
Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook a sample's utility in the training process. Instead, we propose the Mimic score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model's parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training yields consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer an accurate estimate of dataset quality.
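The following is a minimal sketch of the scoring idea described above, assuming a PyTorch model and a reference model with identical architecture. The function name `mimic_scores`, the per-sample loop, and the use of cosine similarity over all trainable parameters are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def mimic_scores(model, ref_params, inputs, targets, loss_fn):
    """Score each sample by how well its negative gradient aligns with the
    weight-space direction from the current model toward the reference model.

    ref_params: list of reference-model tensors, aligned with the trainable
    parameters of `model` (assumes matching architectures).
    """
    params = [p for p in model.parameters() if p.requires_grad]
    # Vector pointing from the current weights toward the reference weights.
    to_ref = torch.cat([(r - p).flatten() for r, p in zip(ref_params, params)])

    scores = []
    for x, y in zip(inputs, targets):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        g = torch.cat([g_.flatten() for g_ in grads])
        # Gradient descent moves along -g; a high score means the sample's
        # update pushes the new model toward the reference model.
        scores.append(F.cosine_similarity(-g, to_ref, dim=0).item())
    return scores
```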
Community
We introduce Mimic Score, a data quality metric that leverages a pretrained reference model to evaluate the usefulness of data samples for training a new model. This metric assesses the alignment between the gradient of the new model's parameters and the vector pointing toward the reference model in weight space. Samples that deviate significantly from this alignment are deemed low-value and can be filtered out. Building on the Mimic Score, we propose Grad-Mimic, a data selection framework that automates the identification and prioritization of high-value samples, enabling the creation of effective filters.
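As a rough illustration of how such scores can feed a filter, the sketch below keeps the top-scoring fraction of samples. The `keep_fraction` parameter and the simple top-fraction rule are assumptions used for illustration; Grad-Mimic itself automates filter construction rather than relying on a hand-set cutoff.

```python
import numpy as np

def build_filter(scores, keep_fraction=0.7):
    """Return a boolean mask that keeps the highest-scoring samples."""
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * keep_fraction))
    threshold = np.partition(scores, -k)[-k]  # k-th largest score
    return scores >= threshold

# Example usage (hypothetical): select training indices by Mimic score.
# mask = build_filter(mimic_scores, keep_fraction=0.7)
# selected_indices = np.nonzero(mask)[0]
```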
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Investigating the Impact of Data Selection Strategies on Language Model Performance (2025)
- Navigating Towards Fairness with Data Selection (2024)
- Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness (2024)
- ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis (2024)
- General Information Metrics for Improving AI Model Training Efficiency (2025)
- Boosting LLM via Learning from Data Iteratively and Selectively (2024)
- Pruning-based Data Selection and Network Fusion for Efficient Deep Learning (2025)