Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

Recent Activity

liked a dataset about 21 hours ago
mlabonne/orca-agentinstruct-1M-v1-cleaned
reacted to cfahlgren1's post with ❤️ about 21 hours ago
posted an update about 23 hours ago

m-ric's activity

reacted to cfahlgren1's post with ❤️ about 21 hours ago
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together / adding a split column
- Converting string messages into a list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset
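If you'd rather replicate those steps locally than in the browser, here's a rough sketch using DuckDB in Python - the shard file names and the `messages` column are assumptions for illustration, not the dataset's exact schema:

```python
import json
import duckdb

# 1) Join the separate splits together and add a `split` column
# (the parquet shard names below are hypothetical placeholders)
rows = duckdb.sql("""
    SELECT *, 'creative_content' AS split FROM 'creative_content.parquet'
    UNION ALL
    SELECT *, 'text_modification' AS split FROM 'text_modification.parquet'
""").df()

# 2) Convert string-encoded messages into a list of structs (dicts)
rows["messages"] = rows["messages"].apply(json.loads)

# 3) Remove empty system prompts from each conversation
rows["messages"] = rows["messages"].apply(
    lambda msgs: [m for m in msgs
                  if not (m["role"] == "system" and m["content"] == "")]
)
```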

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
posted an update about 23 hours ago
🔍 Meta teams use a fine-tuned Llama model to fix production issues in seconds

One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy.

🤔 42%, isn't that too low?
➡️ Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack.
💡 So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds!

How did they do it?

🔄 Two-step approach (sketched in code below):
‣ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set
‣ A fine-tuned Llama-2-7B ranks the most likely culprits
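In schematic Python, the funnel might look like this - every name here is hypothetical, since Meta's internal system is not public:

```python
def find_root_cause(incident, recent_changes, heuristic_score, llm_rank):
    # Step 1: cheap heuristics (ownership, directory structure, runtime
    # graphs) shrink thousands of changes down to a small candidate set
    candidates = sorted(
        recent_changes,
        key=lambda change: heuristic_score(incident, change),
        reverse=True,
    )[:20]  # 2-20 candidates, matching the training distribution

    # Step 2: the fine-tuned Llama-2-7B ranks the surviving candidates
    return llm_rank(incident, candidates)
```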

🎓 Training pipeline:
‣ Continued pre-training on Meta's internal docs and wikis
‣ Supervised fine-tuning on past incident investigations
‣ Training data mimicked real-world constraints (2-20 potential changes per incident)

🔮 Now future developments await:
‣ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems)
‣ Improvements in model reasoning should boost accuracy further

Read it in full 👉 https://www.tryparity.com/blog/how-meta-uses-llms-to-improve-incident-response
reacted to reach-vb's post with 🔥 2 days ago
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B / 32B (Base + Instruct) code generation LLMs, with the 32B tackling giants like Gemini 1.5 Pro and Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B, excelling at Chat + Function Calling / JSON / Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/8B models approaching GPT-4o level; pick any LLM and train an adapter with Whisper as audio encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3B by DeepSeek - next iteration of their unified multimodal LLM Janus, now with Rectified Flow
deepseek-ai/JanusFlow-1.3B

Common Corpus by PleIAs - 2,003,039,184,047 multilingual, commercially permissive and high-quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot - can't wait for next week!

Put down in the comments what I missed! 🤗
posted an update 2 days ago
Great feature alert: You can now use any Space as a tool for your transformers.agents! 🛠️🔥🔥

This lets you take the coolest Spaces, like FLUX.1-dev, and use them in agentic workflows with a few lines of code! 🧑‍💻

In the video below, I set up my fake vacation pictures where I'm awesome at surfing (I'm really not) 🏄

Head to the doc to learn this magic 👉 https://huggingface.co/docs/transformers/main/en/agents_advanced#import-a-space-as-a-tool-
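A minimal sketch adapted from that doc (the exact agents API may differ between transformers versions):

```python
from transformers import ReactCodeAgent, Tool

# Import any Space from the Hub as an agent tool
image_generator = Tool.from_space(
    "black-forest-labs/FLUX.1-dev",
    name="image_generator",
    description="Generates an image from a text prompt.",
)

agent = ReactCodeAgent(tools=[image_generator])
agent.run("Generate a photo of me surfing a huge wave.")
```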
posted an update 6 days ago
๐— ๐—ฒ๐˜๐—ฎ ๐˜๐—ฒ๐—ฎ๐—บ ๐—ท๐˜‚๐˜€๐˜ ๐—ฑ๐—ฟ๐—ผ๐—ฝ๐—ฝ๐—ฒ๐—ฑ ๐˜๐—ต๐—ฒ ๐—ณ๐—ถ๐—ฟ๐˜€๐˜ ๐—ช๐—ฎ๐˜๐—ฒ๐—ฟ๐—บ๐—ฎ๐—ฟ๐—ธ๐—ถ๐—ป๐—ด ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น ๐˜๐—ต๐—ฎ๐˜ ๐—ป๐—ผ๐˜ ๐—ฒ๐—ฑ๐—ถ๐˜ ๐—ฐ๐—ฎ๐—ป ๐—ฏ๐—ฟ๐—ฒ๐—ฎ๐—ธ!๐Ÿ›ก๏ธ

🤔 Ever heard of watermarking? It's a technique that lets you mark an image with its original source. It's our best shield against AI-generated deepfakes, or content stolen from artists! 🎨

🎭 Watermarking systems are actually a pair of models: a watermark embedder that applies the watermark to the image, and a corresponding decoder that should detect the original watermark.

⛔ But current methods were very limited: they could only apply and detect the watermark on the image as a whole. So if you're an attacker, it's easy to break: just crop it, add text on top, or really anything - almost any edit would break the watermark.

A team of researchers at Meta was not happy with this. 😤

So to withstand real-world attacks, they decided to make a watermarking model that would also work on any sub-part of the image. It's a real paradigm shift: they consider watermarking not as an image classification task, but as an image segmentation task!

๐Ÿ—๏ธ ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ
โ–ธ The "Embedder" (a variational autoencoder + embedder, 1.1M parameters in total) encodes a n-bit message into a watermark signal that is added to the original image
โ–ธ [Only during training] The "Augmenter" randomly distorts the image: masks parts, crops, resizes, compresses. It's basically torture at this point.
โ–ธ The "Extractor" (a vision transformer, or ViT, with 96M parameters) then re-extracts the message from the distorted image, by predicting a (1+n) vector per pixel to predict the watermarked parts and decode corresponding messages.
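To make that (1+n) per-pixel prediction concrete, here's a toy decoding sketch - shapes and thresholds are illustrative, not the paper's exact implementation:

```python
import torch

n_bits = 32                                            # illustrative message length
extractor_out = torch.randn(1, 1 + n_bits, 256, 256)   # (batch, 1+n, H, W)

# Channel 0: "is this pixel watermarked?" logit; channels 1..n: bit logits
mask = (extractor_out[:, :1].sigmoid() > 0.5).float()
bit_logits = extractor_out[:, 1:]

# Decode the message by averaging bit logits over detected pixels only
flat_mask = mask.flatten(2)                            # (batch, 1, H*W)
flat_bits = bit_logits.flatten(2)                      # (batch, n, H*W)
message = ((flat_bits * flat_mask).sum(-1)
           / flat_mask.sum(-1).clamp(min=1) > 0).int()
print(message.shape)                                   # torch.Size([1, 32])
```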

The performance blows existing models out of the water; they even created new (segmentation-related) tasks just to measure it properly!

Great work @pierrefdz and @tomsander1998!

Paper here 👉 Watermark Anything with Localized Messages (2411.07231)
reacted to maxiw's post with ❤️🚀🔥 7 days ago
I was curious to see what people post here on HF so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
โค๏ธ: 7048 times
๐Ÿ”ฅ: 5921 times
๐Ÿ‘: 4856 times
๐Ÿš€: 2549 times
๐Ÿค—: 2065 times
posted an update 7 days ago
The next big social network is not 🦋, it's Hub Posts! [INSERT STONKS MEME WITH LASER EYES]

See below: I've gotten 105k impressions since I started regularly posting Hub Posts, coming close to my 275k on Twitter!

โš™๏ธ Computed with the great dataset maxiw/hf-posts
โš™๏ธ Thanks to Qwen2.5-Coder-32B for showing me how to access dict attributes in a SQL request!

cc @merve who's far ahead of me
posted an update 9 days ago
A non-Instruct LLM assistant is mostly useless. 🧐

Since it's mostly a model trained to complete text, when you ask it a question like "What to do during a stopover in Paris?", it can just go on and on adding more details to your question instead of answering it. That would be a valid way to complete text from its training corpus, but it's useless as an answer.
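You can see the failure mode for yourself with any base model - GPT-2 here is just a small illustrative choice:

```python
from transformers import pipeline

# A base model with no instruction tuning
generator = pipeline("text-generation", model="gpt2")
out = generator("What to do during a stopover in Paris?", max_new_tokens=40)
print(out[0]["generated_text"])
# Typical output keeps riffing on the question ("And where to eat? And...")
# instead of answering it.
```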

โžก๏ธ So the post-training stage includes an important Instruction tuning step where you teach your model how to be useful : answer questions, be concise, be polite... RLHF is a well known technique for this.

For people interested in understanding how this step works, the folks at Adaptive ML have made a great guide!

Read it here 👉 https://www.adaptive-ml.com/post/from-zero-to-ppo
posted an update 10 days ago
Qwen2.5-Coder-32B: new best-in-class open coding model, beats GPT-4o on most coding benchmarks! 💥

💪 It's the first time an open-source coding model of this size class clearly matches GPT-4o's coding capabilities!

✨ Completes the previous two Qwen 2.5 Coder releases with four new sizes: 0.5B, 3B, 14B, 32B
📚 Supports long context up to 128K tokens (for the 14B and 32B models)
✅ Drop-in replacement for GPT-4o as a coding assistant on Cursor or for Artifacts!
🤗 Models available right now on the Hub, under Apache 2.0 license!

They have set up a crazy Artifacts demo, you should go have a look!
👉 Qwen/Qwen2.5-Coder-Artifacts
posted an update 10 days ago
๐—”๐—ฟ๐—ฒ ๐˜€๐—ฐ๐—ฎ๐—น๐—ถ๐—ป๐—ด ๐—น๐—ฎ๐˜„๐˜€ ๐—ผ๐˜ƒ๐—ฒ๐—ฟ? ๐—” ๐—ฟ๐—ฒ๐—ฝ๐—ผ๐—ฟ๐˜ ๐—ณ๐—ฟ๐—ผ๐—บ ๐˜๐—ต๐—ฒ ๐—œ๐—ป๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฎ๐—ป๐—ป๐—ผ๐˜‚๐—ป๐—ฐ๐—ฒ๐—ฑ ๐˜๐—ต๐—ฎ๐˜ ๐—ข๐—ฝ๐—ฒ๐—ป๐—”๐—œ ๐—ถ๐˜€ ๐˜€๐—ฒ๐—ฒ๐—ถ๐—ป๐—ด ๐—ฑ๐—ถ๐—บ๐—ถ๐—ป๐—ถ๐˜€๐—ต๐—ถ๐—ป๐—ด ๐—ฟ๐—ฒ๐˜๐˜‚๐—ฟ๐—ป๐˜€ ๐—ณ๐—ฟ๐—ผ๐—บ ๐˜€๐—ฐ๐—ฎ๐—น๐—ถ๐—ป๐—ด ๐˜‚๐—ฝ ๐˜๐—ต๐—ฒ ๐—ป๐—ฒ๐˜…๐˜ ๐—š๐—ฃ๐—ง ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€.

📊 What are scaling laws? These are empirical laws that say: "Every time you increase the compute spent in training 10-fold, your LLM's performance goes up by a predictable tick." Of course, they apply only if you train your model with the right methods.
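As a toy numeric illustration (coefficients made up, not from any paper), a power law L(C) = a * C^(-b) gives exactly that predictable tick:

```python
a, b = 10.0, 0.05
for C in [1e21, 1e22, 1e23]:  # training compute in FLOPs
    print(f"C={C:.0e} -> loss {a * C ** -b:.3f}")
# Each 10x in compute multiplies the loss by 10**-b ~= 0.891:
# the same relative improvement, every time.
```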

The image below illustrates this: the curves are from a Google paper, "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", and they show how quality and instruction-following improve as you scale the model up (which is equivalent to scaling up the compute spent in training).

โžก๏ธ These scaling laws have immense impact: they triggered the largest gold rush ever, with companies pouring billions into scaling up theiur training. Microsoft and OpenAI spent 100B into their "Startgate" mega training cluster, due to start running in 2028.

🤔 So, what about these reports of scaling laws slowing down?

If they are true, it would mean a gigantic paradigm shift, as the hundreds of billions poured by AI companies into scaling could be a dead end. ⛔️

But I doubt it: until the most recent publications, scaling laws showed no sign of weakness, and the researchers at the higher end of the scale-up seem to imply that scaling continues to pay off.

Wait and see!
posted an update 13 days ago
๐—”๐—ป๐—ฑ๐—ฟ๐—ผ๐—ถ๐—ฑ๐—Ÿ๐—ฎ๐—ฏ: ๐—™๐—ถ๐—ฟ๐˜€๐˜ ๐—ฒ๐˜ƒ๐—ฒ๐—ฟ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ๐—ฎ๐˜๐—ถ๐—ฐ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ ๐—ณ๐—ผ๐—ฟ ๐—”๐—ป๐—ฑ๐—ฟ๐—ผ๐—ถ๐—ฑ ๐—บ๐—ผ๐—ฏ๐—ถ๐—น๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐˜€๐—ต๐—ผ๐˜„๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐˜€๐—บ๐—ฎ๐—น๐—น, ๐—ณ๐—ถ๐—ป๐—ฒ-๐˜๐˜‚๐—ป๐—ฒ๐—ฑ ๐—ผ๐—ฝ๐—ฒ๐—ป ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐—ฐ๐—ฎ๐—ป ๐—ฝ๐—ผ๐˜„๐—ฒ๐—ฟ ๐—ฎ ๐—๐—”๐—ฅ๐—ฉ๐—œ๐—ฆ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ ๐—ผ๐—ป ๐˜†๐—ผ๐˜‚๐—ฟ ๐˜€๐—บ๐—ฎ๐—ฟ๐˜๐—ฝ๐—ต๐—ผ๐—ป๐—ฒ ๐Ÿ“ฑ๐Ÿ”ฅ

A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.

They show that fine-tuning small open-source models can significantly boost performance, matching that of much bigger closed models like GPT-4o.

The team built:

📊 A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically

📝📱 A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces

✅ An instruction dataset of 10.5k operation traces for training mobile agents

Key insights:

- 📈 Fine-tuning improves performance BY A LOT: the open-source model Llama-3.1-8B improves from a 2% to a 24% success rate after training, nearly reaching GPT-4o's performance although it's much smaller
- ⚙️ Text-only agents match multimodal ones: XML-based agents achieve similar performance to screenshot-based multimodal agents.

Read their paper here 👉 AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents (2410.24024)
posted an update 16 days ago
๐—›๐˜‚๐—ป๐˜†๐˜‚๐—ฎ๐—ป-๐—Ÿ๐—ฎ๐—ฟ๐—ด๐—ฒ ๐—ท๐˜‚๐˜€๐˜ ๐—ฟ๐—ฒ๐—น๐—ฒ๐—ฎ๐˜€๐—ฒ๐—ฑ ๐—ฏ๐˜† ๐—ง๐—ฒ๐—ป๐—ฐ๐—ฒ๐—ป๐˜: ๐—Ÿ๐—ฎ๐—ฟ๐—ด๐—ฒ๐˜€๐˜ ๐—ฒ๐˜ƒ๐—ฒ๐—ฟ ๐—ผ๐—ฝ๐—ฒ๐—ป ๐— ๐—ผ๐—˜ ๐—Ÿ๐—Ÿ๐— , ๐—ผ๐—ป๐—น๐˜† ๐Ÿฑ๐Ÿฎ๐—• ๐—ฎ๐—ฐ๐˜๐—ถ๐˜ƒ๐—ฒ ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—บ๐—ฒ๐˜๐—ฒ๐—ฟ๐˜€ ๐—ฏ๐˜‚๐˜ ๐—ฏ๐—ฒ๐—ฎ๐˜๐˜€ ๐—Ÿ๐—Ÿ๐—ฎ๐— ๐—” ๐Ÿฏ.๐Ÿญ-๐Ÿฐ๐Ÿฌ๐Ÿฑ๐—• ๐—ผ๐—ป ๐—บ๐—ผ๐˜€๐˜ ๐—ฎ๐—ฐ๐—ฎ๐—ฑ๐—ฒ๐—บ๐—ถ๐—ฐ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ๐˜€ ๐Ÿš€

⚡ Mixture of Experts (MoE) architecture: 389B parameters in total, but only 52B are activated for any input

🧪 Trained on 7T tokens, including 1.5T tokens of synthetic data

๐Ÿ—๏ธ Architecture : Novel "recycle routing" prevents token dropping when experts are overrloaded

📊 Great benchmark results: surpasses Llama-3.1-405B-Instruct on most benchmarks although it has 8x fewer active parameters
‣ Impressive perf on MATH: 77.4

๐Ÿ‹ย Large context length: up to 256K tokens

🔒 License:
‣ Commercial use allowed, except if your products have >100M monthly active users
‣ No access in the EU

🤗 Model weights available on HF!

Read the full paper here 👉 Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (2411.02265)
posted an update 19 days ago
🧠 CLEAR: first multimodal benchmark to make models forget what we want them to forget

With privacy concerns rising, we sometimes need our models to "forget" specific information - like a person's data - while keeping everything else intact. Researchers just released CLEAR, the first benchmark to test how well this works with both text and images.

โŒย Bad news: Current methods either fail to truly forget or end up forgetting way too much. It's like trying to remove a single ingredient from a baked cake!

✨ But there's hope: adding simple mathematical constraints (L1 regularization) during the forgetting process significantly improves results.
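One plausible form of such a constraint, sketched in torch (illustrative only - see the paper for the exact objective):

```python
import torch

def unlearning_loss(model, reference, forget_loss, l1_lambda=1e-4):
    # Penalize the L1 norm of the weight update, so the model stays close
    # to its original weights while it "forgets" the targeted data
    l1_penalty = sum(
        (p - p_ref).abs().sum()
        for p, p_ref in zip(model.parameters(), reference.parameters())
    )
    return forget_loss + l1_lambda * l1_penalty
```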

🎯 Key insights:

✅ The benchmark tests forgetting on 200 fictional personas
‣ 3,770 visual Q&A pairs
‣ 4,000 textual Q&A pairs
‣ Additional real-world tests

🛑 Most current forgetting methods don't work well with both text and images
‣ They either remember what they should forget
‣ Or they forget too much unrelated information

✨ Simple mathematical constraints work surprisingly well
‣ L1 regularization prevents excessive forgetting
‣ Works especially well with the LLMU method

👉 Read the full paper here: CLEAR: Character Unlearning in Textual and Visual Modalities (2410.18057)
posted an update 20 days ago
> Oasis: First Real-Time Video Game Without a Game Engine! 🎮

DecartAI & Etched just released Oasis - a fully AI-generated video game running at 20 FPS (frames per second). The model takes keyboard inputs and generates everything - physics, rules, graphics - on the fly, without any game engine.

โšก๏ธ What makes this special? Current text-to-video models (Mochi-1, Sora, Kling) generate about 1 frame every 10-20 seconds (that's the kind of device I had to play LoL back in the day, thus my low rankings). Oasis is 200 times faster, making it the first playable AI-generated game.

โš™๏ธ Under the hood, it uses a vision transformer to encode space and a diffusion model to generate frames. The secret sauce is "dynamic noising" - a technique that keeps the video stable between frames.

Key insights:
โšก๏ธ Generates 20 FPS, vs 0.2 FPS for other DIT-based video models
โ€ฃ The specialized hardware Sohu developed by Etched allows to handle 10x more player than H100

🎮 Features real game mechanics
‣ Movement, jumping, item management
‣ Physics and lighting
‣ Procedurally generated worlds

⚠️ Current limitations
‣ Blurry graphics at a distance
‣ Objects sometimes change appearance
‣ Memory issues in long sessions

Try it yourself, the playable demo is impressive! 👉 https://oasis.decart.ai/welcome
Code 👉 https://github.com/etched-ai/open-oasis
Read it in full 👉 https://oasis-model.github.io/
posted an update 24 days ago
I'm very proud to have supported @CGIAR and @Digigreen in making http://Farmer.chat, an app that supports 20k smallholder farmers on a daily basis 🌾

There are ~500 million smallholder farmers globally, playing a critical role in global food security. Having access to accurate information is essential for them.

💬 An "agricultural extension service" offers technical advice on agriculture, and also supplies farmers with the necessary inputs and services to support their agricultural production.

But agricultural extension agents are too few to cope with all the requests, especially in countries like Kenya, India, Ethiopia, and Nigeria.

🚀 So the team set out to build an app called http://Farmer.Chat that provides an agricultural extension service, building on the immense knowledge accumulated by CGIAR.

✨ The app is technically impressive: behind the WhatsApp-style UX, an agent interprets the user's intent and identifies which tool to call to best answer their request: weather API, RAG on a CGIAR-provided knowledge base, market data, etc. The RAG on the knowledge base is in itself a work of art.

🎯 A key part of building such a complex system is being able to evaluate it properly. During our bi-weekly sessions with the team, I supported them in implementing the "LLM-as-a-judge" method to tackle this problem.
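The gist of LLM-as-a-judge is simple: ask a strong LLM to grade each answer against a rubric. A minimal sketch (judge model and rubric are illustrative, not Farmer.Chat's actual setup):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

def judge_answer(question: str, answer: str) -> int:
    prompt = (
        "Rate how well the answer below addresses the farmer's question, "
        "on a scale of 1 (unhelpful) to 4 (fully answers it). "
        "End your response with 'Score: <n>'.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    verdict = client.text_generation(prompt, max_new_tokens=200)
    return int(verdict.rsplit("Score:", 1)[-1].strip()[0])
```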

It worked really well: thanks to the amazing work of the team, the app has now successfully answered over 300 thousand requests, in 6 different languages, and it keeps growing!

โžก๏ธ @Vinsingh , @rajgreen and I just wrote a blog post to describe how the app works, especially the LLM-as-a-judge system!

Read it here 👉 https://huggingface.co/blog/digital-green-llm-judge
posted an update 28 days ago
🌟🌎 Cohere releases Aya Expanse 8B & 32B: SOTA multilingual models for 23 languages!

How did they manage to beat top contenders while also adding 23 languages?

🔄 Train on synthetic data:
• Synthetic data has been said to cause model collapse after too much training
• Cohere introduced "data arbitrage" to prevent this, by strategically sampling from a pool of several teacher models instead of one single teacher
• First, train a model pool for each group of languages, and employ an internal reward model named "Arbiter" to evaluate and select the optimal generation. Only the best generation is then kept as the final completion for each prompt
➡️ This process is particularly effective in the multilingual setting, where no single teacher model performs well in all languages: here, "Multilingual Arbitrage" single-handedly improves win rates of the 8B model vs Gemma-2-9B by 10 points!

🧩 Use model merging: rather than struggling to find the right data mix for training a single multilingual model, just train language-specific models, then merge them! (a toy sketch of the merging recipe follows below)
• Maximize diversity between merged checkpoints by training each on a different language family.
• They experimented with fancy techniques (SLERP, TIES, DARE-TIES) but found weighted averaging to be the most consistent!
➡️ Merging brought 3x more gains at the 35B scale vs the 8B scale - consistent with literature findings that merging is more effective at scale

⚡️ Great performance: automatic evaluations on the Arena-Hard-Auto dataset:
➡️ Aya Expanse 8B beats models from its weight class such as Gemma 2 9B, Llama 3.1 8B, and the recent Ministral 8B, with win rates ranging from 60.4% to 70.6%
➡️ Aya Expanse 32B outperforms Gemma 2 27B, Mistral 8x22B, and Llama 3.1 70B (2x its size)
• ⚠️ But this performance eval comes from only one benchmark! Let's wait for Open LLM Leaderboard evals.

🔒 CC-BY-NC license

Blog post here: https://huggingface.co/blog/aya-expanse
posted an update about 1 month ago
๐—›๐—ผ๐˜„ ๐˜๐—ผ ๐—ฟ๐—ฒ-๐—ฟ๐—ฎ๐—ป๐—ธ ๐˜†๐—ผ๐˜‚๐—ฟ ๐˜€๐—ป๐—ถ๐—ฝ๐—ฝ๐—ฒ๐˜๐˜€ ๐—ถ๐—ป ๐—ฅ๐—”๐—š โ‡’ ColBERT, Rerankers, Cross-Encoders

Let's say you're doing RAG, and in an effort to improve performance, you try to rerank a few possible source snippets by their relevancy to a query.

How can you score similarity between your query and any source document? 🤔 📄 ↔️ 📑

1. Just use embeddings: no-interaction 🏎️

This means that you encode each token from both the query and the doc as separate vectors, then average the token vectors of each separately to get 2 vectors in total, then compute similarity via cosine distance.
➡️ Notable examples: check the top of the MTEB leaderboard!

2. Late interaction: this is ColBERT 🏃

These encode each token from both query and doc as separate vectors, as before, but compare them all together without first averaging them away and losing information.

This is more accurate than no-interaction but also slower, because you have to compare n*m vectors instead of 2. At least you can still precompute and store document representations. And ColBERT has some optimisations, like pooling, to go faster.

โžก๏ธ Notable examples: ColBERTv2, mxbai-colbert-large-v1, jina-colbert-v2

3. Early interaction: cross-encoders 🏋️

Basically, you run the concatenated query + document through a model to get a final score.

This is the most accurate, but also the slowest, since it takes really long when you have many docs to rerank! And you cannot pre-store embeddings.

โžก๏ธ Notable examples: MixedBread or Jina AI rerankers!

🚀 So what you choose is a trade-off between speed and accuracy: I think ColBERT is often a really good choice!

Based on this great post by Jina AI 👉 https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter
posted an update about 1 month ago
By far the coolest release of the day!
> The Open LLM Leaderboard, the most comprehensive suite for comparing open LLMs on many benchmarks, just released a comparator tool that lets you dig into the details of the differences between any two models.

Here's me checking how the new Llama-3.1-Nemotron-70B that we've heard so much about compares to the original Llama-3.1-70B. 🤔🔎

Try it out here 👉 open-llm-leaderboard/comparator