I don't think the collections feature of Hugging Face is widely used, even though it's an excellent way to organize and discover interesting resources. To do my bit to change that, I've created two carefully curated collections that combine my original work with other valuable datasets:
Educational Datasets
- Mostly English-Russian, but other languages are also included
- Extended by my new Begemot.ai dataset (2.7M+ Russian education records): nyuuzyou/begemot
Art Datasets
- Extensive art-focused collection, including my new datasets:
  - Buzzly.art (2K artworks): nyuuzyou/buzzlyart
  - Paintberri (60K+ pieces): nyuuzyou/paintberri
  - Itaku.ee (924K+ items): nyuuzyou/itaku
- Extended with other amazing datasets from the community
Collections should become a more common feature - hopefully this will encourage others to create and share their own curated collections. By organizing related datasets into these themed collections, I hope to make it easier for researchers and developers to discover and use these valuable resources.
Multimodal
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released a multimodal textbook - 22k hours worth of samples from instruction videos
> Dataset: SciCap, a captioning benchmark on scientific documents, is released along with its challenge!
LLMs
> Microsoft released Phi-4, a SOTA open-source 14B language model
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B
> Prime-RL released Eurus-2-7B-PRIME, a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Qwen2.5-3B-Instruct
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of instruction-code pairs
> Dataset: the Qwen team is on a roll; they just released CodeElo, a dataset of code preferences
Embeddings
> @MoritzLaurer released a zero-shot version of ModernBERT large
> KaLM is a new family of performant multilingual embedding models with an MIT license, built using Qwen2-0.5B
Image/Video Generation
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models that generate worlds from images, videos and text
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, a Cosmos-tokenized version of OpenVid-1M
Others
> Prior Labs released TabPFNv2, the best tabular transformer, out for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
I recently released PrAIvateSearch v2.0-beta.0 (https://github.com/AstraBert/PrAIvateSearch), my privacy-first, AI-powered, user-centered and data-safe application aimed at providing a local and open-source alternative to big AI search engines such as SearchGPT or Perplexity AI.
We have several key changes:
- New chat UI built with Next.js
- DuckDuckGo API used for web search instead of Google
- Qwen/Qwen2.5-1.5B-Instruct served as the language model over an API (via FastAPI)
- Crawl4AI crawler used for web scraping
- Optimizations in the data workflow inside the application
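The search-to-answer flow the changes above imply (search with DuckDuckGo, scrape with Crawl4AI, answer with a small local model) can be sketched roughly as follows. This is a minimal, hypothetical sketch: `build_prompt` and the snippet format are illustrative, not the actual PrAIvateSearch code.

```python
# Hypothetical sketch: scraped search results are trimmed into a bounded
# context and handed to the local model as a single prompt.

def build_prompt(query: str, snippets: list[str], max_chars: int = 2000) -> str:
    """Concatenate scraped snippets into a context-bounded prompt."""
    context = ""
    for snippet in snippets:
        if len(context) + len(snippet) > max_chars:
            break  # keep the prompt small enough for a 1.5B model
        context += snippet.strip() + "\n"
    return f"Answer using only this context:\n{context}\nQuestion: {query}"

prompt = build_prompt(
    "What does Crawl4AI do?",
    ["Crawl4AI is an open-source, LLM-friendly web crawler and scraper."],
)
```

A prompt like this would then be sent to Qwen2.5-1.5B-Instruct behind the FastAPI endpoint; the trimming step just illustrates why bounding the scraped context matters for a small local model.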
Polymarket is leveraging the "Chatbot Arena LLM Leaderboard" on Hugging Face for online gambling on "Top AI model on January 31?".
As of January 3rd, 2025:
1. Gemini (83%)
2. ChatGPT (13%)
3. Other (2%)
4. Claude (2%)
5. Grok (1%)
6. Llama (<1%)
The market opinion follows historical data. It's clearly biased towards the historical US AI giants, yet Polymarket is forbidden in the USA and for US citizens.
In "Other", you might find the Chinese AI labs that are probably the future AI leaders (Qwen, DeepSeek, Yi).
In the market resolution, if two models are tied in the evaluation, alphabetical order decides (e.g. if "Google" and "xAI" were tied, "Google" would resolve to "Yes" and "xAI" to "No").
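The stated tie-break is just lexicographic ordering on the lab name; a toy sketch (lab names are examples only, not a claim about the actual resolution code):

```python
def resolve_tie(tied_labs: list[str]) -> str:
    """Per the market rule quoted above: on a tie,
    the alphabetically-first lab resolves to 'Yes'."""
    return sorted(tied_labs)[0]

resolve_tie(["xAI", "Google"])  # "Google" beats "xAI" alphabetically
```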
That might violate the Chatbot Arena usage policy? And maybe Hugging Face's? @clem Or maybe the authors and contributors should get a cut each month as "market makers". @weichiang @angelopoulos
The 3C3H AraGen Leaderboard today welcomes deepseek-ai/DeepSeek-V3 and 12 other models (including the late gpt-3.5) to the ranking of the best LLMs in Arabic!
Observations:
- DeepSeek-V3 ranked 3rd and is the only open model among the top 5!
- A 14B open model (Qwen/Qwen2.5-14B-Instruct) outperforms gpt-3.5-turbo-0125 (from last year). This shows how far we've come in advancing and supporting Arabic presence within the LLM ecosystem!
- Contrary to what is observed on likelihood-accuracy leaderboards (like OALL/Open-Arabic-LLM-Leaderboard), further-finetuned models like maldv/Qwentile2.5-32B-Instruct actually perform worse than the original model Qwen/Qwen2.5-32B-Instruct. It's worth noting that the decrease is statistically insignificant, which implies that, at best, out-of-domain finetuning does not really hurt the capabilities the model acquired during pretraining. Previous work has addressed this (finetuning vs pretraining), but more investigation is required (any PhDs here? This could be your question ...)
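To make the "statistically insignificant" point concrete, here is one standard way to check whether a score gap between two models could be noise, assuming accuracy-style scores over n shared test items. The numbers below are made up for illustration; they are not AraGen scores, and AraGen's own significance methodology may differ.

```python
import math

def gap_is_significant(acc_a: float, acc_b: float, n: int, z: float = 1.96) -> bool:
    """Two-proportion z-test sketch: is |acc_a - acc_b| larger than a
    ~95% confidence bound on the difference, with n samples per model?"""
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return abs(acc_a - acc_b) > z * se

gap_is_significant(0.70, 0.69, 1000)  # a 1-point gap on 1k items: not significant
```

With only ~2% standard error on the difference at n=1000, small finetuning-induced drops fall well inside the noise band, which is exactly the situation described above.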
damn I love nvidia's bullish stance on taking AI to the edge - from being the overlord of compute to cutting-edge physical AI with SOTA multiverse simulation engines that bring the scaling laws under your control!!
My favorite: Cosmos - a fully open-source, open-weight, physics-based video gen platform. What an incredible way to start off the year!
OpenAI is losing money on the $200/month subscription. It's crazy how expensive it is to run the largest LLMs:
- ChatGPT Pro costs $200/month ($2,400/year) and is still unprofitable for OpenAI due to higher-than-expected usage.
- OpenAI reportedly expected losses of about $5 billion on revenue of $3.7 billion last year, with ChatGPT alone once costing an estimated $700,000 per day to operate.
- They build strong models and do great research. Whether this business model will work in the long run is one of the biggest questions in the AI economy today.
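A quick back-of-the-envelope check on those figures (all inputs are the estimates quoted above, and this deliberately ignores that Pro subscribers themselves drive heavy per-user inference cost):

```python
daily_cost = 700_000                  # estimated $/day to run ChatGPT
yearly_cost = daily_cost * 365        # ~$255.5M/year at that rate
pro_price = 200 * 12                  # ChatGPT Pro, $/year per subscriber
subscribers_to_cover = yearly_cost / pro_price
```

Roughly 106k Pro subscribers would be needed just to cover that daily-cost estimate, before any per-user compute - which is why higher-than-expected usage makes even a $200/month tier unprofitable.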
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars
But we are just getting started on agents, so we are hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.