3 7 5

Ann Huang PRO

erinys

https://huggingface.co/erinys

AI & ML interests

None yet

Recent Activity

Reacted to elliesleightholm's post with 🤗 2 days ago

I made a beginners guide to Hugging Face Spaces 🤗 I hope it's useful to some of you :) YouTube video: https://www.youtube.com/watch?v=xqdTFyRdtjQ Blog: https://www.marqo.ai/blog/how-to-create-a-hugging-face-space

Reacted to jsulz's post with 🔥 2 days ago

When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in. Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means: ⏩ Only upload the chunks that changed. 🚀 Download just the updates, not the whole file. 🧠 We store your file as deduplicated chunks In our benchmarks, we found that using CDC to store iterative model and dataset version led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub. We're planning on our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows? https://huggingface.co/blog/from-files-to-chunks

Reacted to reach-vb's post with 🚀 4 days ago

What a brilliant week for Open Source AI! Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B/ 32B (Base + Instruct) Code generation LLMs, with 32B tackling giants like Gemnini 1.5 Pro, Claude Sonnet https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17% https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents https://huggingface.co/collections/Nexusflow/athene-v2-6735b85e505981a794fb02cc Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1 Ultravox by FixieAI - 70B/ 8B model approaching GPT4o level, pick any LLM, train an adapter with Whisper as Audio Encoder https://huggingface.co/collections/reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71 JanusFlow 1.3 by DeepSeek - Next iteration of their Unified MultiModal LLM Janus with RectifiedFlow https://huggingface.co/deepseek-ai/JanusFlow-1.3B Common Corpus by Pleais - 2,003,039,184,047 multilingual, commercially permissive and high quality tokens! https://huggingface.co/datasets/PleIAs/common_corpus I'm sure I missed a lot, can't wait for the next week! Put down in comments what I missed! 🤗

View all activity

Articles

From Files to Chunks: Improving Hugging Face Storage Efficiency

4 days ago

• 32

Share your open ML datasets on Hugging Face Hub!

12 days ago

• 20

Organizations

erinys's activity

Reacted to elliesleightholm's post with 🤗 2 days ago

Post

2560

I made a beginners guide to Hugging Face Spaces 🤗 I hope it's useful to some of you :)

YouTube video: https://www.youtube.com/watch?v=xqdTFyRdtjQ

Blog: https://www.marqo.ai/blog/how-to-create-a-hugging-face-space

8 replies

Reacted to jsulz's post with 🔥 2 days ago

Post

2783

When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks

In our benchmarks, we found that using CDC to store iterative model and dataset version led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning on our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks

Reacted to reach-vb's post with 🚀🔥 4 days ago

Post

3986

What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B/ 32B (Base + Instruct) Code generation LLMs, with 32B tackling giants like Gemnini 1.5 Pro, Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/ 8B model approaching GPT4o level, pick any LLM, train an adapter with Whisper as Audio Encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3 by DeepSeek - Next iteration of their Unified MultiModal LLM Janus with RectifiedFlow
deepseek-ai/JanusFlow-1.3B

Common Corpus by Pleais - 2,003,039,184,047 multilingual, commercially permissive and high quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot, can't wait for the next week!

Put down in comments what I missed! 🤗

Reacted to maxiw's post with 🤗❤️ 10 days ago

Post

4530

I was curious to see what people post here on HF so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
❤️: 7048 times
🔥: 5921 times
👍: 4856 times
🚀: 2549 times
🤗: 2065 times

10 replies

liked a dataset 24 days ago

openfoodfacts/product-database

Viewer • Updated about 11 hours ago • 3.55M • 319 • 8

liked a Space 29 days ago

Running

📈

Hub Stats

updated a Space 29 days ago

Running

🔥

README

posted an update about 1 month ago

Post

2127

🌍 Super cool visualization of global PUT requests to Hugging Face over 24 hours, coded by object size, thanks to @port8080 !

We're putting this analysis to work to help us architect a more geo-distributed system for the HF storage backend.

Originally shared on LinkedIn: https://www.linkedin.com/posts/ajitbanerjee_one-of-the-joys-of-working-on-the-xethub-activity-7252688424732614656-tFGD

upvoted an article about 1 month ago

Article

How to optimize your data labelling project with custom interfaces

•

Oct 16

• 18

Reacted to jsulz's post with 🔥 about 1 month ago

Post

1650

The Hugging Face Hub hosts over 1.5M Model, Dataset, and Space repositories. To scale to 10M+, the XetHub team (https://huggingface.co/xet-team) is replacing Git LFS with a new technology that improves storage and transfer capabilities with some future developer experience benefits to boot.

Thanks to @yuchenglow and @port8080 (for their analysis covering LFS usage from March 2022–Sept 2024), we now have insights into what we’re storing. Check out the Gradio app to explore:
- Storage growth over time
- File types over all repositories
- Some simple optimizations we're investigating

xet-team/lfs-analysis

New activity in xet-team/lfs-analysis about 1 month ago

Compressed -> Deduped column header

#4 opened about 1 month ago by

erinys

liked a Space about 2 months ago

Running

📈

Hub LFS Analysis

An analysis of LFS files on the Hub.

upvoted an article about 2 months ago

Article

Improving Parquet Dedupe on Hugging Face Hub

Oct 5

• 31

New activity in xet-team/lfs-analysis about 2 months ago

Suggested text changes

#1 opened about 2 months ago by

erinys

replied to their post about 2 months ago

This is great feedback @John6666 - and I've seen your suggestions in the other thread as well. As a non-ML engineer myself, it's been really interesting to explore HF with fresh eyes! We're doing some early exploration on HF understandability and discoverability in our team - would you be open to chatting sometime about potential approaches? We'd love to get your feedback!

liked a dataset about 2 months ago

argilla/FinePersonas-v0.1

Viewer • Updated Sep 18 • 21.1M • 2.29k • 319

posted an update about 2 months ago

Post

1938

We shut down XetHub today after almost 2 years. What we learned from launching our Git-scaled product from scratch:
- Don't make me change my workflow
- Data inertia is real
- ML best practices are still evolving

Closing the door on our public product lets us focus on our new goal of scaling HF Hub's storage backend to improve devX for a larger community. We'd love to hear your thoughts on what experiences we can improve!

Read the full post: https://xethub.com/blog/shutting-down-xethub-learnings-and-takeaways

6 replies

Reacted to merve's post with 🔥 about 2 months ago

Post

3984

If you feel like you missed out for ECCV 2024, there's an app to browse the papers, rank for popularity, filter for open models, datasets and demos 📝

Get started at ECCV/ECCV2024-papers ✨