93% of Gen Z workers use AI tools weekly, but nearly half of all workers aren't comfortable admitting it. The tech adoption gap isn't about usage; it's about openness. Why are we still treating AI tools like a workplace secret?
- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
Apache 2.0 licensed. V2 pre-training data mix coming soon!
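If you want to poke at the SFT dataset right away, here is a minimal sketch using the `datasets` library. The "all" config and the "messages" column are assumptions about how the dataset is organized; check the dataset card if they don't match.

```python
# Minimal sketch: load the SmolTalk SFT dataset with the `datasets` library.
# The "all" config name and "messages" column are assumptions; see the
# dataset card on the Hub for the exact subset and column names.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(ds[0]["messages"])  # chat-format conversation turns (assumed schema)
```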
I created something called 'Hyperbolic Embeddings'. I literally just embed the tokens into hyperbolic space instead of Euclidean space. At first, this did not get me the gains I was expecting. I was a sad panda. Then I thought about it: a hyperbolic embedding needs a hyperbolic optimizer. So, instead of Adam, I used Riemannian Adam (RAdam). "Ladies and Gentlemen, We Got 'Em!"
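For anyone who wants to try the same trick, here is a minimal sketch using the geoopt library: an embedding table on the Poincaré ball, optimized with geoopt's RiemannianAdam. The library choice, sizes, and learning rate are illustrative assumptions, not necessarily the author's setup.

```python
import torch
import geoopt

# Hypothetical sizes for illustration
vocab_size, dim = 10_000, 128

# Poincare ball model of hyperbolic space (curvature c=1.0)
ball = geoopt.PoincareBall(c=1.0)

# Embedding table whose rows live on the manifold; initialize near the origin
init = ball.expmap0(torch.randn(vocab_size, dim) * 1e-2)
embeddings = geoopt.ManifoldParameter(init, manifold=ball)

# Riemannian Adam performs updates that respect the manifold geometry,
# instead of taking plain Euclidean steps like vanilla Adam
optimizer = geoopt.optim.RiemannianAdam([embeddings], lr=1e-3)

# Toy training step: pull two related tokens together in hyperbolic distance
optimizer.zero_grad()
loss = ball.dist(embeddings[0], embeddings[1])
loss.backward()
optimizer.step()  # the updated embeddings stay on the ball
```

The key point is the pairing: if the parameters live on a manifold, the optimizer has to take steps along that manifold, which is exactly what swapping Adam for Riemannian Adam buys you.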
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That's where our chunk-based approach comes in.
Instead of versioning whole files (as Git and Git LFS do), we version variable-sized chunks of data. For the Hugging Face community, this means faster uploads and downloads of the models and datasets you already work with.
In our benchmarks, we found that using content-defined chunking (CDC) to store iterative model and dataset versions led to transfer speedups of ~2x. But this isn't just a performance boost. It's a rethinking of how we manage models and datasets on the Hub.
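For intuition, here is a minimal, illustrative sketch of content-defined chunking in Python. The rolling hash, chunk-size constants, and `dedup_store` helper are hypothetical stand-ins to show the idea, not the Hub's actual implementation.

```python
# Sketch of content-defined chunking (CDC): chunk boundaries depend on the
# bytes themselves, so an edit only disturbs chunks near the change.
# All constants and the toy rolling hash are illustrative assumptions.
import hashlib

MASK = (1 << 13) - 1          # cut when low 13 hash bits are zero (~8 KiB avg)
MIN_CHUNK, MAX_CHUNK = 2_048, 65_536

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # toy rolling hash
        length = i - start + 1
        at_cut = (h & MASK) == 0 and length >= MIN_CHUNK
        if at_cut or length >= MAX_CHUNK:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

def dedup_store(data: bytes, store: dict) -> None:
    """Store chunks keyed by content hash; identical chunks are kept once."""
    for s, e in chunk_boundaries(data):
        chunk = data[s:e]
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
```

Because boundaries come from the content rather than fixed offsets, uploading a new version of a large file reuses most previously stored chunks and only transfers the ones around the edit.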
We're planning to roll out our new storage backend to the Hub in early 2025. Check out our blog to dive deeper, and let us know: how could this improve your workflows?