Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

m-ric's activity

posted an update 1 day ago
๐—”๐—ป๐—ฑ๐—ฟ๐—ผ๐—ถ๐—ฑ๐—Ÿ๐—ฎ๐—ฏ: ๐—™๐—ถ๐—ฟ๐˜€๐˜ ๐—ฒ๐˜ƒ๐—ฒ๐—ฟ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ๐—ฎ๐˜๐—ถ๐—ฐ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ ๐—ณ๐—ผ๐—ฟ ๐—”๐—ป๐—ฑ๐—ฟ๐—ผ๐—ถ๐—ฑ ๐—บ๐—ผ๐—ฏ๐—ถ๐—น๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐˜€๐—ต๐—ผ๐˜„๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐˜€๐—บ๐—ฎ๐—น๐—น, ๐—ณ๐—ถ๐—ป๐—ฒ-๐˜๐˜‚๐—ป๐—ฒ๐—ฑ ๐—ผ๐—ฝ๐—ฒ๐—ป ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐—ฐ๐—ฎ๐—ป ๐—ฝ๐—ผ๐˜„๐—ฒ๐—ฟ ๐—ฎ ๐—๐—”๐—ฅ๐—ฉ๐—œ๐—ฆ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ ๐—ผ๐—ป ๐˜†๐—ผ๐˜‚๐—ฟ ๐˜€๐—บ๐—ฎ๐—ฟ๐˜๐—ฝ๐—ต๐—ผ๐—ป๐—ฒ ๐Ÿ“ฑ๐Ÿ”ฅ

A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.

They show that fine-tuning small open-source models can significantly boost performance, matching that of much bigger closed models like GPT-4o.

The team built:

📊 A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically

📝📱 A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces

✅ An instruction dataset of 10.5k operation traces for training mobile agents

Key insights:

- 📈 Fine-tuning improves performance BY A LOT: the open-source model Llama-3.1-8B goes from a 2% to a 24% success rate after training, nearly reaching GPT-4o performance although it's much smaller
- ⚙️ Text-only agents match multimodal ones: XML-based agents achieve similar performance to screenshot-based multimodal agents.

Read their paper here 👉 AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents (2410.24024)
posted an update 5 days ago
Hunyuan-Large just released by Tencent: largest ever open MoE LLM, only 52B active parameters but beats LLaMA 3.1-405B on most academic benchmarks 🚀

⚡ Mixture of Experts (MoE) architecture: 389B parameters in total, but only 52B are activated for any input

🧪 Trained on 7T tokens, including 1.5T tokens of synthetic data

๐Ÿ—๏ธ Architecture : Novel "recycle routing" prevents token dropping when experts are overrloaded

📊 Great benchmark results: surpasses Llama-3.1-405B-Instruct on most benchmarks, although it has 8x fewer active parameters
‣ Impressive performance on MATH: 77.4

๐Ÿ‹ย Large context length: up to 256K tokens

🔒 License:
‣ Commercial use allowed, except if your products have >100M monthly active users
‣ No access in the EU

🤗 Model weights available on HF!

Read the full paper here 👉 Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (2411.02265)
posted an update 7 days ago
🧠 CLEAR: first multimodal benchmark to make models forget what we want them to forget

With privacy concerns rising, we sometimes need our models to "forget" specific information - like a person's data - while keeping everything else intact. Researchers just released CLEAR, the first benchmark to test how well this works with both text and images.

โŒย Bad news: Current methods either fail to truly forget or end up forgetting way too much. It's like trying to remove a single ingredient from a baked cake!

✨ But there's hope: adding simple mathematical constraints (L1 regularization) during the forgetting process significantly improves results.

🎯 Key insights:

✅ The benchmark tests forgetting on 200 fictional personas
‣ 3,770 visual Q&A pairs
‣ 4,000 textual Q&A pairs
‣ Additional real-world tests

🛑 Most current forgetting methods don't work well with both text and images
‣ They either remember what they should forget
‣ Or they forget too much unrelated information

✨ Simple mathematical constraints work surprisingly well
‣ L1 regularization prevents excessive forgetting
‣ Works especially well with the LLMU method
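To give an idea of the kind of constraint they mean, here is a minimal sketch (my own illustration, not the paper's exact recipe) of an unlearning objective with an L1 penalty that keeps the weights close to the original model, so the "forgetting" stays targeted:

```python
import torch

def unlearning_loss(model, forget_batch, retain_batch, ref_params, l1_lambda=1e-4):
    # Push the model away from the data it should forget (gradient-ascent-style term)
    forget_loss = -model(**forget_batch).loss
    # Keep behaving normally on the data it should retain
    retain_loss = model(**retain_batch).loss
    # L1 penalty on the parameter drift: this is what limits excessive forgetting
    l1_penalty = sum((p - p0).abs().sum()
                     for p, p0 in zip(model.parameters(), ref_params))
    return forget_loss + retain_loss + l1_lambda * l1_penalty
```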

👉 Read the full paper here: CLEAR: Character Unlearning in Textual and Visual Modalities (2410.18057)
posted an update 8 days ago
> Oasis: First Real-Time Video Game Without a Game Engine! 🎮

DecartAI & Etched just released Oasis - a fully AI-generated video game running at 20 FPS (frames per second). The model takes keyboard inputs and generates everything - physics, rules, graphics - on the fly, without any game engine.

⚡️ What makes this special? Current text-to-video models (Mochi-1, Sora, Kling) generate about 1 frame every 10-20 seconds (that's the kind of frame rate I had to play LoL with back in the day, hence my low rankings). Oasis is 200 times faster, making it the first playable AI-generated game.

⚙️ Under the hood, it uses a vision transformer to encode space and a diffusion model to generate frames. The secret sauce is "dynamic noising" - a technique that keeps the video stable between frames.

Key insights:
⚡️ Generates 20 FPS, vs 0.2 FPS for other DiT-based video models
‣ The specialized Sohu hardware developed by Etched can handle 10x more players than an H100

🎮 Features real game mechanics
‣ Movement, jumping, item management
‣ Physics and lighting
‣ Procedurally generated worlds

⚠️ Current limitations
‣ Blurry graphics at a distance
‣ Objects sometimes change appearance
‣ Memory issues in long sessions

Try it yourself, the playable demo is impressive! 👉 https://oasis.decart.ai/welcome
Code 👉 https://github.com/etched-ai/open-oasis
Read it in full 👉 https://oasis-model.github.io/
posted an update 12 days ago
I'm very proud to have supported @CGIAR and @Digigreen in making http://Farmer.chat, an app that supports 20k smallholder farmers on a daily basis 🌾

There are ~500 million smallholder farmers globally, playing a critical role in global food security. Having access to accurate information is essential for them.

💬 An "agricultural extension service" offers technical advice on agriculture, and also supplies farmers with the necessary inputs and services to support their agricultural production.

But agricultural extension agents are not numerous enough to cope with all the requests, especially in countries like Kenya, India, Ethiopia, and Nigeria.

🚀 So the team set out to build an app called http://Farmer.Chat to provide an agricultural extension service, building on the immense knowledge accumulated by CGIAR.

✨ The app is technically impressive: behind the WhatsApp-like UX, an agent interprets the user's intent and identifies which tool to call to best answer their request: weather API, RAG on a CGIAR-provided knowledge base, market data, etc. The RAG on the knowledge base is in itself a work of art.

🎯 A key part of building such a complex system is being able to evaluate it properly. During our bi-weekly sessions with the team, I could support them in implementing the "LLM-as-a-judge" method to tackle this problem.
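For the curious, here is a minimal sketch of what an LLM-as-a-judge setup looks like (illustrative only: the prompt, the 1-5 scale, and the call_llm helper are placeholders, not Farmer.Chat's actual implementation):

```python
JUDGE_PROMPT = """You are evaluating an answer given to a farmer's question.
Question: {question}
Retrieved knowledge: {reference}
Answer to evaluate: {answer}
Rate the answer from 1 (wrong or unhelpful) to 5 (accurate and actionable),
then justify your rating in one sentence.
Respond exactly as: SCORE: <number> REASON: <text>"""

def judge_answer(question: str, reference: str, answer: str, call_llm) -> str:
    """call_llm is any function that sends a prompt to a strong LLM and returns its text."""
    return call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
```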

It worked really well: thanks to the amazing work of the team, the app has now successfully answered over 300,000 requests, in 6 different languages, and it keeps growing!

➡️ @Vinsingh, @rajgreen and I just wrote a blog post describing how the app works, especially the LLM-as-a-judge system!

Read it here 👉 https://huggingface.co/blog/digital-green-llm-judge
posted an update 16 days ago
🌟🌎 Cohere releases Aya 8B & 32B: SOTA multilingual models for 23 languages!

How did they manage to beat top contenders while also adding 23 languages?

🔄 Train on synthetic data:
• Synthetic data has been said to cause model collapse after too much training
• Cohere introduced "data arbitrage" to prevent this, by strategically sampling from a pool of several teacher models instead of one single teacher
• First, train a pool of models for different groups of languages, and employ an internal reward model named "Arbiter" to evaluate and select the optimal generation. Only the best generation is then kept as the final completion for each prompt
➡️ This process is particularly effective in the multilingual setting, where no single teacher model performs well in all languages: here "Multilingual Arbitrage" single-handedly improves win rates of the 8B model vs Gemma-2-9B by 10 points!

🧩 Use model merging: rather than struggling to find the right data mix to train a single model for multilingual use, just train language-specific models, then merge them!
• Maximize diversity between merged checkpoints by training each on a different language family.
• They experimented with fancy techniques (SLERP, TIES, DARE-TIES) but found weighted averaging to be the most consistent! (See the sketch below.)
➡️ Merging brought 3x more gains at the larger 35B scale than at the 8B scale - consistent with findings in the literature that merging is more effective at scale

⚡️ Great performance: automatic evaluations on the Arena-Hard-Auto dataset:
➡️ Aya Expanse 8B beats models from its weight class such as Gemma 2 9B, Llama 3.1 8B, and the recent Ministral 8B, with win rates ranging from 60.4% to 70.6%
➡️ Aya Expanse 32B outperforms Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B (2x its size)
• ⚠️ But this performance eval comes from only one benchmark! Let's wait for Open LLM Leaderboard evals.

🔒 CC-BY-NC license

Blog post here: https://huggingface.co/blog/aya-expanse
posted an update 22 days ago
How to re-rank your snippets in RAG ⇒ ColBERT, Rerankers, Cross-Encoders

Let's say you're doing RAG, and in an effort to improve performance, you try to rerank a few possible source snippets by their relevancy to a query.

How can you score similarity between your query and any source document? 🤔 📄 ↔️ 📑

1. Just use embeddings: No-interaction 🏎️

This means you encode each token from both the query and the doc as separate vectors, then average the token vectors of each side separately to get 2 vectors in total, then compute similarity between them, e.g. with cosine similarity.
➡️ Notable examples: check the top of the MTEB leaderboard!

2. Late-interaction: this is ColBERT 🏃

These encode each token from both query and doc as separate vectors, as before, but compare them all together without first averaging them and losing information.

This is more accurate than no-interaction, but also slower, because you have to compare n×m vectors instead of 2. At least you can pre-compute and store the document embeddings. And ColBERT has some optimisations, like pooling, to be faster.

➡️ Notable examples: ColBERTv2, mxbai-colbert-large-v1, jina-colbert-v2

3. Early interaction: Cross-encoders 🏋️

Basically, you run the concatenated query + document through a model to get a final score.

This is the most accurate, but also the slowest, since the input gets really long when you have many docs to rerank! And you cannot pre-compute and store embeddings.

➡️ Notable examples: MixedBread or Jina AI rerankers!

🚀 So what you choose is a trade-off between speed and accuracy: I think ColBERT is often a really good choice!

Based on this great post by Jina AI 👉 https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter
posted an update 23 days ago
By far the coolest release of the day!
> The Open LLM Leaderboard, the most comprehensive suite for comparing open LLMs on many benchmarks, just released a comparator tool that lets you dig into the details of the differences between any two models.

Here's me checking how the new Llama-3.1-Nemotron-70B that we've heard so much about compares to the original Llama-3.1-70B. 🤔🔎

Try it out here 👉 open-llm-leaderboard/comparator
posted an update 26 days ago
⚡️ This month's most important breakthrough: Differential Transformer vastly improves attention ⇒ better retrieval and fewer hallucinations!

Thought that self-attention could not be improved anymore?

Microsoft researchers have dropped a novel "differential attention" mechanism that amplifies focus on relevant context while canceling out noise. It sounds like a free lunch, but it really does seem to vastly improve LLM performance!

๐—ž๐—ฒ๐˜† ๐—ถ๐—ป๐˜€๐—ถ๐—ด๐—ต๐˜๐˜€:

🧠 Differential attention computes the difference between two separate softmax attention maps, canceling out noise and promoting sparse attention patterns

🔥 DIFF Transformer outperforms standard Transformers while using 35-40% fewer parameters or training tokens

📏 Scales well to long contexts up to 64K tokens, leveraging increasing context length more effectively

🔎 Dramatically improves key information retrieval, enhances in-context learning, and possibly reduces the risk of hallucinations 🤯

🔢 Reduces activation outliers, potentially enabling lower-bit quantization without performance drop!

⚙️ Can be directly implemented using existing FlashAttention kernels

This new architecture could lead to much more capable LLMs, with vastly improved strengths in long-context understanding and factual accuracy.

But they didn't release weights on the Hub: let's wait for the community to train the first open-weights DiffTransformer! 🚀

Read their paper 👉 Differential Transformer (2410.05258)
posted an update about 1 month ago
Rhymes AI drops Aria: small multimodal MoE that beats GPT-4o and Gemini-1.5-Flash ⚡️

New player entered the game! Rhymes AI has just been announced, and unveiled Aria - a multimodal powerhouse that's punching above its weight.

Key insights:

🧠 Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.

🌈 Multimodal: text/image/video → text.

📚 Novel training approach: "multimodal-native", where multimodal training starts directly during pre-training, not just tacked on later

📏 Long 64K-token context window

🔓 Apache 2.0 license, with weights, code, and demos all open

⚡️ On the benchmark side, Aria leaves some big names in the dust.

- It beats Pixtral 12B and Llama-3.2-11B on several vision benchmarks like MMMU and MathVista.
- It even overcomes the much bigger GPT-4o on long-video tasks, and outshines Gemini 1.5 Flash when it comes to parsing lengthy documents.

But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented search app called "BeaGo". It's handling even recent events with great accuracy!

And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.

Read their paper for Aria 👉 Aria: An Open Multimodal Native Mixture-of-Experts Model (2410.05993)

Try BeaGo 🐶 👉 https://rhymes.ai/blog-details/introducing-beago-your-smarter-faster-ai-search
posted an update about 1 month ago
💥 L-Mul: Addition-Only Multiplication can slash computational costs by 80%!

Microsoft researchers dropped a groundbreaking technique that could slash the energy use of transformer computations: their novel "linear-complexity multiplication" (L-Mul) algorithm approximates floating-point multiplication using energy-efficient integer addition instead of costly multiplications.

💡 Quick reminder on how floats are coded on 8 bits (FP8):
In the e4m3 FP8 standard, you encode a number as:
Sign (1 bit) | Exponent (4 bits) | Mantissa (3 bits)
Example: 0 (positive) | 1000 (8) | 101 (1/2 + 1/8 = 0.625)
Calculation: you add one to the mantissa, and multiply it by 2 to the power of (the exponent minus a bias term, which is 7 for e4m3):

➡️ You get (1 + 0.625) × 2^(8-7) = 3.25
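If you want to check the arithmetic yourself, here is that decoding as a tiny sketch (it only handles normal e4m3 values, i.e. it ignores subnormals and NaN):

```python
def decode_e4m3(sign_bit: int, exponent: int, mantissa: int) -> float:
    """Decode a normal e4m3 FP8 value from its bit fields (exponent bias = 7)."""
    sign = -1.0 if sign_bit else 1.0
    significand = 1.0 + mantissa / 8.0        # 3 mantissa bits -> steps of 1/8
    return sign * significand * 2.0 ** (exponent - 7)

print(decode_e4m3(0, 0b1000, 0b101))          # 3.25, as in the example above
```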

Now back to the paper. Key insights:

⚡️ Multiplication is extremely energy-intensive compared to addition. For 32-bit operations, multiplication (3.7 pJ) uses 37x more energy than addition (0.1 pJ)!

🧮 Traditional floating-point multiplication goes like this (noting xm, ym the mantissas and xe, ye the exponents): Mul(x,y) = (1 + xm) · 2^xe · (1 + ym) · 2^ye = (1 + xm + ym + xm · ym) · 2^(xe+ye)

💡 L-Mul cleverly approximates this as: L-Mul(x,y) = (1 + xm + ym + 2^-l(m)) · 2^(xe+ye), eliminating the costly xm · ym term

🔧 The l(m) term is set adaptively based on the mantissa size for optimal accuracy
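Here is a small numeric sketch of the approximation (the exact choice of l(m) in the code is my reading of the paper, so treat it as an assumption):

```python
def l_of_m(m: int) -> int:
    # Offset exponent used by L-Mul, chosen from the mantissa size m (assumed values)
    if m <= 3:
        return m
    return 3 if m == 4 else 4

def exact_mul(xm, xe, ym, ye):
    return (1 + xm) * 2.0**xe * (1 + ym) * 2.0**ye

def l_mul(xm, xe, ym, ye, m=3):
    # No xm*ym product anymore: just additions plus a constant correction term
    return (1 + xm + ym + 2.0 ** -l_of_m(m)) * 2.0 ** (xe + ye)

print(exact_mul(0.625, 1, 0.25, 0), l_mul(0.625, 1, 0.25, 0))  # 4.0625 vs 4.0
```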

📊 Benchmarks on the Llama-3.1-8B-Instruct model show L-Mul preserves precision across various NLP tasks, with performance nearly identical to full BFloat16 precision

💬 Authors claim: "We can achieve the same model inference performance while reducing the energy cost of attention computations by 80%."

This breakthrough is still theoretical and would need implementation on dedicated hardware to confirm real-world gains, but it's a really exciting path for more sustainable AI! 🌱

Read the paper here 👉 Addition is All You Need for Energy-efficient Language Models (2410.00907)
posted an update about 1 month ago
📜 Old-school RNNs can actually rival fancy transformers!

Researchers from Mila and Borealis AI have just shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.

They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
โถ Removed dependencies on previous hidden states in the gates
โท Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
โธ Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)

⚡️ As a result, you can use a "parallel scan" algorithm to train these new, minimal RNNs in parallel, taking 88% more memory but also making them 200x faster than their traditional counterparts for long sequences

🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.

And for Language Modeling, they need 2.5x fewer training steps than Transformers to reach the same performance! ๐Ÿš€

🤔 Why does this matter?

By showing there are simpler models with similar performance to transformers, this challenges the narrative that we need advanced architectures for better performance!

💬 François Chollet wrote in a tweet about this paper:

"The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)"

"Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape."

It's the Bitter Lesson by Rich Sutton striking again: you don't need fancy thinking architectures, just scale up your model and data!

Read the paper 👉 Were RNNs All We Needed? (2410.01201)
reacted to MoritzLaurer's post with 🤗❤️ about 1 month ago
#phdone - I defended my PhD yesterday! A key lesson: it is amazing how open science and open source can empower beginners with limited resources:

I first learned about instruction-based classifiers like BERT-NLI 3-4 years ago, through the @HuggingFace ZeroShotClassificationPipeline. Digging deeper into this, it was surprisingly easy to find new datasets, newer base models, and reusable fine-tuning scripts on the HF Hub to create my own zeroshot models - although I didn't know much about fine-tuning at the time.

Thanks to the community effect of the Hub, my models were downloaded hundreds of thousands of times after a few months. Seeing my research being useful for people motivated me to improve and upload newer models. Leaving my contact details in the model cards led to academic cooperation and consulting contracts (and eventually my job at HF).

That's the power of open science & open source: learning, sharing, improving, collaborating.

I mean every word in my thesis acknowledgments (screenshot). I'm very grateful to my supervisors @vanatteveldt @CasAndreu @KasperWelbers for their guidance; to @profAndreaRenda and @CEPS_thinktank for enabling me to work part-time during the first year; to @huggingface for creating awesome tools and an awesome platform; and to many others who are not active on social media.

Links to the full thesis and the collection of my most recent models are below.

PS: If someone happens to speak Latin, let me know if my diploma contains some hidden Illuminati code or something :D
posted an update about 1 month ago
🇨🇳⛵️ 出海 ("sailing abroad"): Chinese AI is expanding globally

Fact: Chinese LLMs are heavily underrated, for instance the recent and excellent Deepseek-v2.5 or the Qwen models.

Luckily for us, @AdinaY just wrote an excellent blog post explaining the Chinese AI ecosystem!

My key takeaways:

Since Google, OpenAI and Anthropic models are not available in China, local companies are fighting for the market. And it's a really good market: AI has much higher penetration there than in the rest of the world, both with companies and individual users!

💰 But since Deepseek heavily cut prices in May 2024, this has spiraled into a price war that created a cut-throat environment with unsustainably low prices.

📋 On top of this, local regulation is stringent: models must undergo licensing by a local censor (the Cyberspace Administration of China), which for instance requires models to refuse to answer certain questions about the CCP. Although this is certainly simpler to implement than certain conditions of the European AI Act.

💸 If this wasn't enough, VC investment in AI is drying up: by mid-2024, Chinese AI startups had raised approximately $4.4 billion, vs $55B for US startups in Q2 2024 alone.

📱 To reach profitability, companies have shifted from foundation models to model + application, for instance PopAI from [01.AI](http://01.ai/), with millions of users and high profitability.

⛏️ They also try to drill down into specific industries: but these niches are also getting crowded.

➡️ Since their home market is becoming both too crowded and inhospitable, Chinese companies are now going for the international market, "sailing abroad" following the expression consecrated by Zheng He's legendary voyages in the 15th century.

There, they'll have to adapt to different infrastructures and regulations, but they have bright prospects for growth!

Read her post 👉 https://huggingface.co/blog/AdinaY/chinese-ai-global-expansion
posted an update about 1 month ago
Emu3: Next-token prediction conquers multimodal tasks 🔥

This is the most important research in months: we're now very close to having a single architecture to handle all modalities. The folks at the Beijing Academy of Artificial Intelligence (BAAI) just released Emu3, a single model that handles text, images, and videos all at once.

๐—ช๐—ต๐—ฎ๐˜'๐˜€ ๐˜๐—ต๐—ฒ ๐—ฏ๐—ถ๐—ด ๐—ฑ๐—ฒ๐—ฎ๐—น?
๐ŸŒŸ Emu3 is the first model to truly unify all these different types of data (text, images, video) using just one simple trick: predicting the next token.
And itโ€™s only 8B, but really strong:
๐Ÿ–ผ๏ธ For image generation, it's matching the best specialized models out there, like SDXL.
๐Ÿ‘๏ธ In vision tasks, it's outperforming top models like LLaVA-1.6-7B, which is a big deal for a model that wasn't specifically designed for this.
๐ŸŽฌ It's the first to nail video generation without using complicated diffusion techniques.

How does it work?
🧩 Emu3 uses a special tokenizer (SBER-MoVQGAN) to turn images and video clips into sequences of 4,096 tokens.
🔗 Then, it treats everything - text, images, and videos - as one long series of tokens to predict.
🔮 During training, it just tries to guess the next token, whether that's a word, part of an image, or a video frame.
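In pseudocode, the training objective is nothing more than standard language modeling over a mixed token stream (the tokenizer and model calls below are placeholders, not Emu3's actual API):

```python
import torch
import torch.nn.functional as F

def unified_next_token_loss(model, text_tokenizer, vision_tokenizer, caption, image):
    text_ids = text_tokenizer(caption)              # list[int]
    image_ids = vision_tokenizer(image)             # list[int], same vocabulary space
    tokens = torch.tensor([text_ids + image_ids])   # one long mixed sequence

    logits = model(tokens[:, :-1])                  # predict every next token...
    return F.cross_entropy(                         # ...whether it is text or image
        logits.reshape(-1, logits.shape[-1]),
        tokens[:, 1:].reshape(-1),
    )
```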

Caveats on the results:
👉 In image generation, Emu3 beats SDXL, but it's also much bigger (8B vs 3.5B). It would be more difficult to beat the real diffusion GOAT, FLUX-dev.
👉 In vision, the authors also don't show a comparison against all the current SOTA models like Qwen-VL or Pixtral.

This approach is exciting because it's simple (next-token prediction) and scalable (it handles all sorts of data)!

Read the paper 👉 Emu3: Next-Token Prediction is All You Need (2409.18869)
posted an update about 1 month ago
Add source highlighting to your RAG system! 📄💡

RAG systems are supposed to make your LLM's answers more trustworthy by inserting into the prompt some supporting documents from a knowledge base: we say that we're "adding some context".

👎 But if you don't know which part of the answer has been generated based on which input tokens, it's hard to tell whether it was effectively grounded in the context knowledge or not!

🤔 I've been working on the question: is it possible to add notes to the answer, linking to the parts of the context they were generated from?

And I've found a great solution: a technique called Layer-wise Relevance Propagation (LRP), showcased in a paper at ICML '24 by Reduan Achtibat et al., allows you to precisely score how important each input token was in generating your output! They've made it into a library called LXT.

📊 For each generated output token, LXT gives you attribution scores for each input token.

⚙️ So I've worked a bit more on aggregating these scores into meaningful spans between successive input and output tokens, and I finally obtained my desired result: RAG with source highlighting!

Try the demo here 👉 m-ric/rag_highlights

Caveats:
- It slows down generation (quite a lot for now; hopefully this can be reduced)
- For now it supports only specific models: Llama models and Mixtral

If there's enough interest in this solution, I can improve it further and spin it off into a specific library for RAG! 🚀
posted an update about 1 month ago
Transformers v4.45.0 released: includes a lightning-fast method to build tools! ⚡️

During user research with my colleagues @MoritzLaurer and @Jofthomas, we discovered that the class definition currently used to define a Tool in transformers.agents is a bit tedious to use, because it goes into great detail.

➡️ So I've made an easier way to build tools: just write a function with type hints + a docstring, and add a @tool decorator in front.

✅ Voilà, you're good to go!
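For instance, something along these lines (adapted from the linked doc; the get_weather tool and its body are just placeholders):

```python
from transformers import tool

@tool
def get_weather(city: str) -> str:
    """
    Returns a short weather report for a city.

    Args:
        city: Name of the city to get the weather for.
    """
    # Dummy body: call a real weather API here.
    return f"It is sunny in {city} today."
```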

Read all about it in the new doc here: https://huggingface.co/docs/transformers/main/en/agents#create-a-new-tool

And don't hesitate to give feedback, I'm all ears! 🤗
posted an update about 2 months ago
🌎 The first ever foundation weather model: Prithvi WxC enables life-saving weather predictions

Hurricane Katrina killed hundreds of people as it made landfall on New Orleans in 2005 - many of these deaths could have been avoided if alerts had been given one day earlier. Accurate weather forecasts are really life-saving.

🔥 Now, NASA and IBM just dropped a game-changing new model: the first ever foundation model for weather! This means it's the first time we have a generalist model, not restricted to one task, but able to predict 160 weather variables!

Prithvi WxC (Prithvi, "पृथ्वी", is the Sanskrit name for Earth) is a 2.3-billion-parameter model, with an architecture close to previous vision transformers like Hiera.

💡 But it comes with some important tweaks: under the hood, Prithvi WxC uses a clever transformer-based architecture with 25 encoder and 5 decoder blocks. It alternates between "local" and "global" attention to capture both regional and global weather patterns.

๐—ž๐—ฒ๐˜† ๐—ถ๐—ป๐˜€๐—ถ๐—ด๐—ต๐˜๐˜€:
๐Ÿ”ฎ Nails short-term forecasts - Prithvi WxC crushed it on 6-12 hour predictions, even outperforming some traditional numerical weather models
๐ŸŒ€ Tracks hurricanes like a champ - For Hurricane Ida, it predicted the landfall location within 5 km (vs 20+ km errors from other AI models), which is a huge progress!
๐Ÿ” 6x downscaling power - Can zoom in on weather data to 6x higher resolution with 4x lower error than basic methods
๐ŸŒŠ Models elusive gravity waves - Accurately simulates these crucial but hard-to-capture atmospheric oscillations

As climate change intensifies, tools like Prithvi WxC will become more and more crucial to avoid disasters!

Announcement post 👉 https://newsroom.ibm.com/2024-09-23-ibm-and-nasa-release-open-source-ai-model-on-hugging-face-for-weather-and-climate-applications

Model on the Hub 👉 https://huggingface.co/Prithvi-WxC

Thank you @clem for highlighting it!
posted an update about 2 months ago
🧠 Stanford paper might be the key to OpenAI o1's performance: What's so effective about Chain of Thought? ⇒ it unlocks radically different sequential tasks!

💭 Reminder: a Chain of Thought (CoT) means that you instruct the model to "think step by step". Often it's literally just putting "let's think step by step" in the prompt.
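Concretely, the difference between the two prompting styles is as small as this (a toy illustration):

```python
question = "If I buy 3 packs of 12 eggs and 5 of them break, how many intact eggs do I have?"

direct_prompt = f"{question}\nAnswer with just the number."
cot_prompt = f"{question}\nLet's think step by step."  # that's the whole trick
```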

🤔 This method has been shown to be unreasonably effective at increasing performance on benchmarks. However, why it works so well remains unclear.

Here's the scoop: Transformers are amazing at parallel processing, but they've always struggled with tasks that require sequential reasoning.

⛔️ For instance, if you ask them the result of 3^2^2^2^…, with 20 iterations, they'll nearly always fail.

💡 Indeed, researchers prove mathematically, by modeling transformer networks as logical circuits, that they effectively cannot solve sequential tasks that require more than a certain number of sequential steps.

But CoT enables sequential reasoning:

- 🧱 Each step in the CoT corresponds to simulating one operation in a complex circuit.
- 🔄 This allows the transformer to "reset" the depth of intermediate outputs, overcoming previous limitations.
- 🚀 Thus, with CoT, constant-depth transformers can now solve ANY problem computable by polynomial-size circuits! (That's a huge class of problems in computer science.)
- 🔑 Transformers can now handle tricky tasks like iterated squares (computing 3^2^2^2^2), composed permutations and evaluating circuits - stuff that requires serial computation.
- 📊 The improvement is especially dramatic for transformers with limited depth. Empirical tests on four arithmetic problems showed massive accuracy gains with CoT on inherently serial tasks.

Main takeaway: Chain-of-thought isn't just a neat trick - it fundamentally expands what transformer models can do!

Read the paper 👉 Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (2402.12875)