Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Abstract
Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook a sample's utility in the training process. Instead, we propose the Mimic score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model's parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training yields consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer an accurate estimate of dataset quality.
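The following is a minimal sketch of the scoring idea described above, assuming a PyTorch model and a reference model with identical architecture. The function name `mimic_scores`, the per-sample loop, and the use of cosine similarity over all trainable parameters are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def mimic_scores(model, ref_params, inputs, targets, loss_fn):
    """Score each sample by how well its negative gradient aligns with the
    weight-space direction from the current model toward the reference model.

    ref_params: list of reference-model tensors, aligned with the trainable
    parameters of `model` (assumes matching architectures).
    """
    params = [p for p in model.parameters() if p.requires_grad]
    # Vector pointing from the current weights toward the reference weights.
    to_ref = torch.cat([(r - p).flatten() for r, p in zip(ref_params, params)])

    scores = []
    for x, y in zip(inputs, targets):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        g = torch.cat([g_.flatten() for g_ in grads])
        # Gradient descent moves along -g; a high score means the sample's
        # update pushes the new model toward the reference model.
        scores.append(F.cosine_similarity(-g, to_ref, dim=0).item())
    return scores
```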
Community
We introduce Mimic Score, a data quality metric that leverages a pretrained reference model to evaluate the usefulness of data samples for training a new model. This metric assesses the alignment between the gradient of the new model's parameters and the vector pointing toward the reference model in weight space. Samples that deviate significantly from this alignment are deemed low-value and can be filtered out. Building on the Mimic Score, we propose Grad-Mimic, a data selection framework that automates the identification and prioritization of high-value samples, enabling the creation of effective filters.
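As a rough illustration of how such scores can feed a filter, the sketch below keeps the top-scoring fraction of samples. The `keep_fraction` parameter and the simple top-fraction rule are assumptions used for illustration; Grad-Mimic itself automates filter construction rather than relying on a hand-set cutoff.

```python
import numpy as np

def build_filter(scores, keep_fraction=0.7):
    """Return a boolean mask that keeps the highest-scoring samples."""
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * keep_fraction))
    threshold = np.partition(scores, -k)[-k]  # k-th largest score
    return scores >= threshold

# Example usage (hypothetical): select training indices by Mimic score.
# mask = build_filter(mimic_scores, keep_fraction=0.7)
# selected_indices = np.nonzero(mask)[0]
```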
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Investigating the Impact of Data Selection Strategies on Language Model Performance (2025)
- Navigating Towards Fairness with Data Selection (2024)
- Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness (2024)
- ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis (2024)
- General Information Metrics for Improving AI Model Training Efficiency (2025)
- Boosting LLM via Learning from Data Iteratively and Selectively (2024)
- Pruning-based Data Selection and Network Fusion for Efficient Deep Learning (2025)