Jared Sulzdorf (jsulz) PRO

AI & ML interests

NLP + (Law|Medicine) & Ethics


Organizations

Hugging Face · Spaces Examples · Blog-explorers · Journalists on Hugging Face · Hugging Face Discord Community · Xet Team · open/ acc

jsulz's activity

reacted to victor's post with 🚀🤗🔥❤️ about 2 hours ago
Hey everyone, we've given the https://hf.co/spaces page a fresh update!

Smart Search: Now just type what you want to do—like "make a viral meme" or "generate music"—and our search gets it.

New Categories: Check out the cool new filter bar with icons to help you pick a category fast.

Redesigned Space Cards: Reworked a bit to really show off the app descriptions, so you know what each Space does at a glance.

Random Prompt: Need ideas? Hit the dice button for a burst of inspiration.

We’d love to hear what you think—drop us some feedback plz!
reacted to clem's post with 🔥🚀 13 days ago
reacted to merve's post with 🔥 13 days ago
Oof, what a week! 🥵 So many things have happened, let's recap! merve/jan-24-releases-6793d610774073328eac67a9

Multimodal 💬
- We have released SmolVLM -- the tiniest VLMs, coming in 256M and 500M, with their retrieval models ColSmol for multimodal RAG 💗
- UI-TARS is a new family of models by ByteDance to unlock agentic GUI control 🤯, coming in 2B, 7B, and 72B
- Alibaba DAMO lab released VideoLlama3, new video LMs that come in 2B and 7B
- MiniMaxAI released MiniMax-VL-01, whose decoder is based on the MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE), a challenging new multimodal benchmark

LLMs 📖
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, plus six distilled dense models, on par with o1 and MIT-licensed! 🤯
- Qwen2.5-Math-PRM: new math models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, new family of models and their datasets (SFT and reward ones too!)

Audio 🗣️
- Llasa is a new speech synthesis model based on Llama that comes in 1B, 3B, and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO

Image/Video/3D Generation ⏯️
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris, similar to Flux
- Tencent released Hunyuan3D-2, a new model for 3D asset generation from images
reacted to julien-c's post with 🤗❤️🔥 about 2 months ago
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free and, barring blatant abuse, unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
reacted to dvilasuero's post with 🔥❤️ 2 months ago
🌐 Announcing Global-MMLU: an improved, open MMLU dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ 200+ contributors used Argilla to label MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. 🗽 Culturally Agnostic: no specific regional or cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural, or geographic knowledge to answer correctly.

Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges of making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU
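
If you want to poke at the two subsets yourself, here's a minimal sketch using 🤗 datasets. The "en" config name and the cultural_sensitivity_label column and its "CA"/"CS" values are my assumptions based on the dataset card, so double-check them there:

```python
from datasets import load_dataset

# Config and column names below are assumptions; verify on the dataset card.
ds = load_dataset("CohereForAI/Global-MMLU", "en", split="test")

# "CA" = Culturally Agnostic, "CS" = Culturally Sensitive (assumed labels)
agnostic = ds.filter(lambda row: row["cultural_sensitivity_label"] == "CA")
sensitive = ds.filter(lambda row: row["cultural_sensitivity_label"] == "CS")
print(len(agnostic), len(sensitive))
```
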
reacted to fdaudens's post with 🧠 2 months ago
replied to their post 2 months ago

I thought big and complex repos would be fun to visualize, and they can be! This image is from blanchon/RESISC45, a repo with 31,000 images from Google Earth, each bucketed into one of 45 categories with 700 images per category:

[treemap screenshot of blanchon/RESISC45]

But it's more fun when you find a repository whose structure (naming conventions and directories) lets you see the inequity in the bytes.

This is most apparent in multilingual NLP datasets, like the wikimedia/wikipedia dataset. If you zoom in on any of these (or run them yourself in the Space), you'll see a directory or file naming convention using the language abbreviation. Sections nearer to yellow (whether directories or files) mean more bytes devoted to that language.

Here's facebook/multilingual_librispeech:

[treemap image]

and mozilla-foundation/common_voice_17_0:

[treemap image]

and google/xtreme:

[treemap image]

and uonlp/CulturaX:

[treemap image]

Each dataset shows some imbalance in the languages represented, and this pattern holds true for other types of datasets as well. However, such discrepancies can be harder to spot when folder or file naming conventions prioritize machine over human readability.
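
If you'd rather quantify the imbalance than eyeball it, here's a rough sketch (not the Space's actual code) that sums bytes per language directory with huggingface_hub. The "20231101.en"-style directory naming is specific to wikimedia/wikipedia; other repos will need their own parsing:

```python
from collections import defaultdict

from huggingface_hub import HfApi

api = HfApi()
bytes_per_lang = defaultdict(int)
for entry in api.list_repo_tree(
    "wikimedia/wikipedia", repo_type="dataset", recursive=True
):
    size = getattr(entry, "size", None)  # folder entries carry no size
    if size is not None and "/" in entry.path:  # skip top-level files like README.md
        # Directories are named like "20231101.en"; grab the language code
        lang = entry.path.split("/")[0].split(".")[-1]
        bytes_per_lang[lang] += size

# Print the ten languages with the most bytes
for lang, total in sorted(bytes_per_lang.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{lang}: {total / 1e9:.2f} GB")
```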

Another fun example is the nguha/legalbench dataset, designed to evaluate legal reasoning in LLMs. It provides a clear view of the types of reasoning being tested:

[treemap: nguha/legalbench]

You might have to squint to see the labels, though. This is one where it might be best to head over to the Space https://huggingface.co/spaces/jsulz/repo-info and see it for yourself ;)

replied to their post 2 months ago

Datasets are among my favorites to visualize because of their mixture of file and folder structures. Here's huggingface/documentation-images, where, alongside documentation images, we store images for the Hugging Face blog:

[treemap image]

I also enjoy the wikimedia/wikipedia dataset. It's fascinating to see the distribution of bytes across languages.

Some datasets are actually quite difficult to visualize because the number of points in the Plotly graph causes the browser to crash on render. It's quite possible you'll run into this if you use the Space. A simple check for file count could help (see the sketch below), but for now I find myself running it a few times just to see if I can grab the image. allenai is home to many such datasets, but I eventually found allenai/paloma, an eval dataset that I could visualize.

For some of these larger datasets, I might run things locally and write the image out to see if there are any interesting findings.
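
The file-count check mentioned above could be as simple as this sketch; the threshold is a guess on my part, not a measured limit:

```python
from huggingface_hub import HfApi

MAX_FILES = 20_000  # arbitrary cutoff; tune to where your browser gives up

n_files = len(HfApi().list_repo_files("allenai/paloma", repo_type="dataset"))
if n_files > MAX_FILES:
    print(f"{n_files:,} files - render offline and save a static image instead")
else:
    print(f"{n_files:,} files - safe to render in the browser")
```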

posted an update 2 months ago
Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure.

To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:

- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)

And because I'm interested in how this will fit into our work to leverage content-defined chunking for versioning repos on the Hub (https://huggingface.co/blog/from-files-to-chunks), everything has the number of chunks (1 chunk = 64KB) as well as the total size in bytes.

Some of the treemaps are pretty cool. Attached are black-forest-labs/FLUX.1-dev and, for fun, laion/laion-audio-preview (which has nearly 10k .tar files 🤯)
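
For the curious, here's a minimal sketch of the treemap and chunk math (not the Space's actual code, and flattened to top-level directories, whereas the Space nests full paths):

```python
import math

import plotly.express as px
from huggingface_hub import HfApi

CHUNK = 64 * 1024  # 1 chunk = 64KB

# Sum file sizes per top-level directory of a repo
api = HfApi()
sizes = {}
for entry in api.list_repo_tree("black-forest-labs/FLUX.1-dev", recursive=True):
    size = getattr(entry, "size", None)  # folder entries carry no size
    if size is not None:
        top = entry.path.split("/")[0]
        sizes[top] = sizes.get(top, 0) + size

# Flat, one-level treemap, color coded by size
fig = px.treemap(
    names=list(sizes),
    parents=[""] * len(sizes),
    values=list(sizes.values()),
    color=list(sizes.values()),
    color_continuous_scale="Viridis",
)
fig.show()

for path, size in sizes.items():
    print(f"{path}: {size:,} bytes = {math.ceil(size / CHUNK):,} chunks")
```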

reacted to clem's post with 🔥🚀 2 months ago
Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):

- There will be the first major public protest related to AI
- A big company will see its market cap divided by two or more because of AI
- At least 100,000 personal AI robots will be pre-ordered
- China will start to lead the AI race (as a consequence of leading the open-source AI race)
- There will be big breakthroughs in AI for biology and chemistry
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face

How my predictions for 2024 turned out:

- A hyped AI company will go bankrupt or get acquired for a ridiculously low price
✅ (Inflection, AdeptAI, ...)

- Open-source LLMs will reach the level of the best closed-source LLMs
✅ with QwQ and dozens of others

- Big breakthroughs in AI for video, time-series, biology and chemistry
✅ for video 🔴 for time-series, biology, and chemistry

- We will talk much more about the cost (monetary and environmental) of AI
✅ monetary 🔴 environmental (😢)

- A popular media product will be mostly AI-generated
✅ with NotebookLM by Google

- 10 million AI builders on Hugging Face, leading to no increase in unemployment
🔜 currently 7M AI builders on Hugging Face
reacted to cfahlgren1's post with 👍🔥 2 months ago
We just dropped an LLM inside the SQL Console 🤯

The amazing, new Qwen/Qwen2.5-Coder-32B-Instruct model can now write SQL for any Hugging Face dataset ✨

It's 2025; you shouldn't be hand-writing SQL! This is a big step toward letting anyone do in-depth analysis on a dataset. Let us know what you think 🤗
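
Under the hood, the SQL Console runs DuckDB, so the model's output is the kind of query you could also run locally against Hub-hosted files (assuming a DuckDB version with hf:// support). A hypothetical example; the dataset path and label column are made up:

```python
import duckdb

# Hypothetical: SQL of the sort the model might generate when asked
# "how many rows per label?" - dataset path and column are placeholders.
result = duckdb.sql("""
    SELECT label, COUNT(*) AS n
    FROM 'hf://datasets/some-org/some-dataset/data/train.parquet'
    GROUP BY label
    ORDER BY n DESC
""").df()
print(result)
```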