arXiv:2509.09873

From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem

Published on Sep 11 · Submitted by Leo on Sep 23
Authors:
Hao Li,

Abstract

This study audits dataset and model licenses across the Hugging Face ecosystem and their use in downstream GitHub projects, revealing systemic non-compliance and proposing a rule engine to detect and resolve license conflicts in open-source AI.

AI-generated summary

Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face, as well as their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance: 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes almost 200 SPDX and model-specific clauses to detect license conflicts, and it can resolve 86.4% of the conflicts found in software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and tools necessary to enable automated, AI-aware compliance at scale.
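
For intuition, the core drift check can be pictured as follows. This is a minimal sketch, assuming a coarse four-tier restrictiveness ordering and a tiny hand-written license table; `drifts` and the tables are hypothetical helpers for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of license-drift detection (illustrative, not the paper's
# pipeline): a model -> application transition "drifts" when the downstream
# repo declares a strictly less restrictive license than the upstream model.

RESTRICTIVENESS = {
    "permissive": 0,       # e.g., MIT, Apache-2.0
    "weak-copyleft": 1,    # e.g., LGPL-3.0
    "copyleft": 2,         # e.g., GPL-3.0
    "use-restricted": 3,   # e.g., OpenRAIL-family or non-commercial terms
}

# Toy table; a real audit would map thousands of SPDX and ML-specific IDs.
LICENSE_CLASS = {
    "mit": "permissive",
    "apache-2.0": "permissive",
    "lgpl-3.0": "weak-copyleft",
    "gpl-3.0": "copyleft",
    "openrail": "use-restricted",
    "cc-by-nc-4.0": "use-restricted",
}

def drifts(model_license: str, repo_license: str) -> bool:
    """True when the repo's license silently drops upstream obligations."""
    up = RESTRICTIVENESS[LICENSE_CLASS[model_license.lower()]]
    down = RESTRICTIVENESS[LICENSE_CLASS[repo_license.lower()]]
    return down < up

# Example: an OpenRAIL model integrated into an MIT-licensed repository.
print(drifts("openrail", "mit"))        # True  -> restrictive clauses dropped
print(drifts("apache-2.0", "gpl-3.0"))  # False -> obligations not weakened
```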

Community

Paper author · Paper submitter

TL;DR

  • We audited 364k datasets, 1.6M models, and 140k GitHub repos.
  • License drift is systemic: 35.5% of model → application transitions drop upstream restrictions by relicensing under permissive terms.
  • ML-specific obligations vanish at integration: only 0.4% are retained in repos.
  • We built LicenseRec, an AI-aware rule engine (≈200 SPDX & ML clauses) that can resolve 86.4% of model → app conflicts with compliant license recommendations (toy sketch below).
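
As a rough illustration of the rule-engine idea, each license can be modeled as a set of obligation clauses: a downstream license is compliant only if it preserves every upstream obligation, and the recommendation is the least restrictive compliant candidate. The clause table and candidate list below are illustrative assumptions, not the released LicenseRec prototype.

```python
# Toy sketch of a LicenseRec-style rule engine. The clause table and the
# candidate list are assumptions; the actual prototype encodes roughly
# 200 SPDX and ML-specific clauses.

CLAUSES: dict[str, set[str]] = {
    "MIT": set(),
    "Apache-2.0": {"patent-grant"},
    "GPL-3.0": {"share-alike", "disclose-source"},
    "OpenRAIL-M": {"use-restrictions"},
    "CC-BY-NC-4.0": {"non-commercial"},
}

# Candidate outbound licenses for a software repo, least restrictive first.
# NC licenses are excluded: a repo cannot adopt them and remain open source.
CANDIDATES = ["MIT", "Apache-2.0", "GPL-3.0", "OpenRAIL-M"]

def required_clauses(upstreams: list[str]) -> set[str]:
    """Union of obligations inherited from all upstream components."""
    return set().union(*(CLAUSES[u] for u in upstreams))

def recommend(upstreams: list[str]) -> str | None:
    """Least restrictive candidate that preserves every upstream obligation."""
    required = required_clauses(upstreams)
    for candidate in CANDIDATES:
        if required <= CLAUSES[candidate]:
            return candidate
    return None  # unresolvable conflict, e.g., non-commercial terms upstream

print(recommend(["Apache-2.0"]))                  # Apache-2.0
print(recommend(["OpenRAIL-M", "MIT"]))           # OpenRAIL-M (ML terms kept)
print(recommend(["CC-BY-NC-4.0", "Apache-2.0"]))  # None (needs new components)
```

The `None` case mirrors the finding below that some conflicts cannot be fixed by license selection alone and require swapping out the offending component.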

Why it matters

Courts are still working out how copyright applies to AI, but the cost of getting it wrong is already huge. The community is unintentionally treating models like simple software libraries, stripping use-based and share-alike obligations at the last mile. That is a governance and legal risk, and a fixable one.

Key findings

  • 📉 Non-permissive obligations collapse at model → repo; permissive dominates (91.1%).
  • 🧲 A strong gravitational pull toward permissive licensing across the supply chain.
  • 🧩 Most violations are fixable via better license selection; a core remains unresolvable (e.g., NC upstream) and requires different component choices.
  • 🔧 Our ML-aware matrix catches conflicts that traditional matrices miss, especially those arising from use-based ML license terms (see the example after this list).
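
To make the last point concrete, here is a hedged sketch (the matrix entries are illustrative assumptions): a traditional SPDX-only compatibility matrix has no entry for ML licenses such as OpenRAIL, and checkers that default unknown pairs to "compatible" wave the conflict through, whereas an ML-aware matrix encodes the use-based clause and flags it.

```python
# Why an ML-aware compatibility matrix matters (entries are illustrative):
# (upstream, downstream) -> is the pairing compliant?

SPDX_MATRIX = {
    ("Apache-2.0", "MIT"): True,
    ("GPL-3.0", "MIT"): False,
}
ML_AWARE_MATRIX = {
    **SPDX_MATRIX,
    ("OpenRAIL-M", "MIT"): False,  # use-based terms cannot be relicensed away
}

def compatible(matrix: dict, upstream: str, downstream: str) -> bool:
    # Naive checkers often default unknown pairs to "compatible" --
    # exactly how ML-specific conflicts slip through.
    return matrix.get((upstream, downstream), True)

pair = ("OpenRAIL-M", "MIT")
print(compatible(SPDX_MATRIX, *pair))      # True  (conflict missed)
print(compatible(ML_AWARE_MATRIX, *pair))  # False (conflict caught)
```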
