BramVanroy's Collections

CommonCrawl-Creative Commons (C5)

updated Aug 15

Raw CommonCrawl crawls, annotated with Creative Commons license information

  • BramVanroy/CommonCrawl-CreativeCommons

    Updated Aug 28 • 739M • 1.9k • 34

  • BramVanroy/CommonCrawl-CreativeCommons-fine

    Updated Aug 28 • 75.1M • 278 • 2

    Note: Retains only samples that are also present in FineWeb or FineWeb-2.


  • BramVanroy/CommonCrawl-CreativeCommons-strict

    Updated Aug 28 • 32.8M • 1.96k • 1

    Note: Applies strong filters: retains only FineWeb data and removes non-commercial and Wiki data.
