fblgit (FBL)

reacted to thomwolf's post with 🔥 10 months ago

Post

1561

Very exciting new mistralai/Pixtral-Large-Instruct-2411 model from Mistral-AI

Impressive performances, huge congrats @patrickvonplaten @sgvaze @pandora-s @devendrachaplot @sophiamyang and team!

Very nice to have SOTA Multilingual OCR and Chart understanding in an open-weights model

posted an update 10 months ago

Post

1363

Introducing miniclaus 1.5B, a tiny but powerful model. Trained with MagPie and based on the Qwen2.5 1.5B model, it performs very well on many tasks, scoring at the top of its category with impressive results:
* MATH Hard 9.81
* MMLU-Pro 29.37
* GPQA 29.19
* MUSR 42.85
* BBH 42.04

Already available on the Hub:
fblgit/miniclaus-qw1.5B-UNAMGS

posted an update 10 months ago

Post

825

Cybertron is back:

Today we released the newest version of Cybertron: V4, based on Qwen2.5 7B and trained on MagPie. It scores as the #1 LLM in the 7B & 8B class.

The model hasn't gone through DPO, so the weights are in good shape to welcome further training sessions and optimizations.
Enjoy it in the hub as usual:
fblgit/cybertron-v4-qw7B-MGS

1 reply

·

replied to m-ric's post 11 months ago

I still haven't been able to get those impressive marks. I'm trying to reproduce something simple with wikitext, and there's not much "performance" coming out of it.
Has anyone made this work and gotten positive results?

reacted to m-ric's post with 👍 11 months ago

Post

3092

📜 𝐎𝐥𝐝-𝐬𝐜𝐡𝐨𝐨𝐥 𝐑𝐍𝐍𝐬 𝐜𝐚𝐧 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 𝐫𝐢𝐯𝐚𝐥 𝐟𝐚𝐧𝐜𝐲 𝐭𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬!

Researchers from Mila and Borealis AI have just shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.

They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
❶ Removed dependencies on previous hidden states in the gates
❷ Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
❸ Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)

⚡️ As a result, you can use a “parallel scan” algorithm to train these new, minimal RNNs, in parallel, taking 88% more memory but also making them 200x faster than their traditional counterparts for long sequences
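
To see why removing the hidden-state dependency from the gates makes a parallel scan possible, here is a minimal sketch for a scalar minGRU-style recurrence. It assumes the update h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, where z_t and h~_t are computed from x_t only. The scalar weights `w_z` and `w_h` and all function names are illustrative, and the log-space formulation used for numerical stability is left out. Each step is then an affine map of the previous state, and composing affine maps is associative, which is what a scan needs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def min_gru_sequential(xs, w_z, w_h, h0=0.0):
    """Reference loop: h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t."""
    h, out = h0, []
    for x in xs:
        z = sigmoid(w_z * x)   # gate depends on x_t only
        h_tilde = w_h * x      # candidate state, no tanh
        h = (1.0 - z) * h + z * h_tilde
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine maps h -> a*h + b (left is applied first)."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def min_gru_scan(xs, w_z, w_h, h0=0.0):
    """Same outputs via an inclusive prefix scan over affine maps."""
    elems = []
    for x in xs:
        z = sigmoid(w_z * x)
        elems.append((1.0 - z, z * w_h * x))
    # Hillis-Steele scan: about log2(n) rounds; within a round every
    # position is independent, so a real implementation runs them in parallel.
    step = 1
    while step < len(elems):
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(len(elems))
        ]
        step *= 2
    return [a * h0 + b for a, b in elems]
```

The list comprehension here still runs serially; the point is only that no round needs the result of an earlier position within the same round.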

🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.

And for Language Modeling, they need 2.5x fewer training steps than Transformers to reach the same performance! 🚀

🤔 Why does this matter?

By showing there are simpler models with similar performance to transformers, this challenges the narrative that we need advanced architectures for better performance!

💬 François Chollet wrote in a tweet about this paper:

“The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)”

“Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape.”

It’s the Bitter Lesson by Rich Sutton striking again: you don’t need fancy thinking architectures, just scale up your model and data!

Read the paper 👉 Were RNNs All We Needed? (2410.01201)

2 replies

·

replied to their post about 1 year ago

The latest 3.5 version of the Claude model is even more impressive. SEVERAL problems (AI/ML, basically torch) where GPT4o fails epically were solved by Claude zero-shot.
That said, GPT4o is very impressive when using its sandbox.. kudos to that!

posted an update over 1 year ago

Post

2617

Introducing UNA-ThePitbull Series

We are happy to announce the release of our latest model, UNA-ThePitbull, the most powerful model below 70B in the industry. In this new generation, inspired by our previous Beagle series, we curated a model that nicely balances EQ and IQ. It was trained with some of the latest datasets, including:
* Replete-AI/code_bagel_hermes-2.5
* mlabonne/orpo-dpo-mix-40k
* jondurbin/py-dpo-v0.1
Available on the Hub at fblgit/UNA-ThePitbull-21.4B-v2, and you can grab quantized versions sponsored by @bartowski at bartowski/UNA-ThePitbull-21.4B-v2-GGUF, fully compatible with Ollama, llama.cpp, etc.

UNA
In this case we tried something new by alternating uniformity across layers of both MLP & Attention, reducing computational requirements while keeping a highly performant result.

We trained it under these terms:
* ThePitbull-v1 as base: SFT maxLR 1e-4 minLR 5e-5 for 1 Epoch
* DPO maxLR 1e-4 minLR 5e-5 for 1 Epoch
You can continue the training by merely using 5e-5 maxLR and 0 warmup steps; this should minimize catastrophic forgetting of the model.
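
For readers who want to picture the maxLR/minLR terms, here is a small sketch of a learning-rate schedule that decays from a maximum to a minimum value. The post does not say which scheduler was used, so the cosine shape, the `lr_at` helper, and the step counts are assumptions for illustration only.

```python
import math


def lr_at(step, total_steps, max_lr, min_lr, warmup_steps=0):
    """Hypothetical schedule: optional linear warmup, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One epoch with the stated terms (total_steps is a made-up number).
sft_schedule = [lr_at(s, 1000, max_lr=1e-4, min_lr=5e-5) for s in range(1001)]

# Continued training: start directly at 5e-5 with no warmup.
# min_lr=0.0 is an arbitrary placeholder, not a value from the post.
first_continued_lr = lr_at(0, 1000, max_lr=5e-5, min_lr=0.0, warmup_steps=0)
```

With zero warmup steps the very first update already uses the full 5e-5, which matches the recipe above for continuing from the released weights.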

Remember if you do so, please include a Pitbull picture on your model and cite :) Have fun!

posted an update over 1 year ago

Post

Over the past week, I've been putting Claude through its paces, focusing primarily on productivity tasks (you know, the good old BAU – Business As Usual).

1. Python/Torch/Transformers/AI/ML
Right off the bat, I threw some complex AI/ML tasks at Claude, and I must say, it handled them with finesse. It even caught a few things that GPT missed! However, let's not get too carried away – we're not quite at the auto-code level just yet.

2. Brainstorming
This is where Claude falls a bit short. It seems to be more grounded than its competitors, which might not be ideal for generating novel ideas. If you're looking for a brainstorming partner, you might want to look elsewhere.

3. Attention
Despite the claims of super-large attention in the paper, Claude's "forgetting" mechanism seems to be more pronounced. It tends to miss entire chunks of information rather than just specific details like GPT does.

4. Following / Tasks
I hit a roadblock when Claude couldn't generate a LaTeX document. It's not the best at following complex, multi-step tasks.

5. Hallucinations
Oh boy, does Claude hallucinate! And when it does, it's on a whole new level of nonsense. The hallucinations seem to align with its grounded nature, making them even more convincing within the context of the prompt.

6. Sycophancy
Claude is quite the people-pleaser. I've found that using an adversarial brainstorming approach is more beneficial and time-efficient, as it forces me to highlight Claude's mistakes rather than letting it focus on being a sweet, pleasant minion.

7. Interface / UI
There's definitely room for improvement here. Basic features like stepping back on a prompt and stopping generation with the ESC key are missing. These are essential for extracting and composing content effectively.

Despite these limitations, I firmly believe that Claude is currently the #1

4 replies

·

reacted to akhaliq's post with ❤️ over 1 year ago

Post

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (2402.17764)

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
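
As a rough illustration of what ternary weights look like, the sketch below applies absmean-style rounding (scale by the mean absolute weight, round, clip to [-1, 1]) to a plain list of floats. The function name and the sample weights are made up for illustration, and this is a toy version rather than the paper's training procedure. The "1.58" in the name corresponds to log2(3), the number of bits needed to encode one of three values.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Map each weight to -1, 0, or 1 after scaling by the mean absolute value."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return quantized, gamma


sample = [0.9, -0.05, 0.4, -1.2]          # made-up weights
ternary, scale = absmean_ternary(sample)   # scale lets you approximately rescale later
bits_per_weight = math.log2(3)             # about 1.58
```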

reacted to santiviquez's post with 👍 over 1 year ago

Post

Where I work, we are obsessed with what happens to a model's performance after it has been deployed. We call this post-deployment data science.

Let me tell you about a post-deployment data science algorithm that we recently developed to measure the impact of Concept Drift on a model's performance.

How can we detect Concept Drift? 🤔

All ML models are designed to do one thing: learn a probability distribution in the form of P(y|X). In other words, they try to learn how to model an outcome 'y' given the input variables 'X'. 🧠

This probability distribution, P(y|X), is also called Concept. Therefore, if the Concept changes, the model may become invalid.

❓But how do we know if there is a new Concept in our data?
❓Or, more importantly, how do we measure if the new Concept is affecting the model's performance?

💡 We came up with a clever solution where the main ingredients are a reference dataset, one where the model's performance is known, and a dataset with the latest data we would like to monitor.

👣 Step-by-Step solution:

1️⃣ We start by training an internal model on a chunk of the latest data. ➡️ This allows us to learn the new possible Concept presented in the data.

2️⃣ Next, we use the internal model to make predictions on the reference dataset.

3️⃣ We then estimate the monitored model's performance on the reference dataset, treating the internal model's predictions as ground truth.

4️⃣ If the estimated performance and the monitored model's actual performance on reference are very different, we then say that there has been a Concept Drift.

To quantify how this Concept impacts performance, we subtract the actual model's performance on reference from the estimated performance and report a delta of the performance metric. ➡️ This is what the plot below shows. The change of the F1-score due to Concept drift! 🚨

This process is repeated for every new chunk of data that we get. 🔁
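
The steps above can be sketched in a few lines. This is a toy reading of the procedure, not the production algorithm: the internal model is a hypothetical one-feature threshold rule, accuracy stands in for the F1-score, and step 3 is read as "score the monitored model on reference, using the internal model's predictions as labels". All names are illustrative.

```python
def fit_threshold(xs, ys):
    """Hypothetical internal model: the best rule 'x >= t' (optionally flipped)."""
    best = None
    for t in sorted(set(xs)):
        for flip in (False, True):
            preds = [int((x >= t) != flip) for x in xs]
            acc = accuracy(preds, ys)
            if best is None or acc > best[0]:
                best = (acc, t, flip)
    _, t, flip = best
    return lambda x: int((x >= t) != flip)


def accuracy(preds, targets):
    return sum(p == y for p, y in zip(preds, targets)) / len(targets)


def concept_drift_delta(monitored_model, ref_x, ref_y, chunk_x, chunk_y):
    internal = fit_threshold(chunk_x, chunk_y)              # step 1: learn the new concept
    pseudo_truth = [internal(x) for x in ref_x]             # step 2: predict on reference
    monitored_preds = [monitored_model(x) for x in ref_x]
    estimated = accuracy(monitored_preds, pseudo_truth)     # step 3: estimated performance
    actual = accuracy(monitored_preds, ref_y)               # known performance on reference
    return estimated - actual                               # step 4: delta due to drift
```

A delta near zero suggests the latest chunk follows the same concept as the reference data; a clearly negative delta suggests the monitored model would score worse under the new concept.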

reacted to their post with 🤗 over 1 year ago

Post

Senku-70B is still undefeated on EQ-Bench. The latest updates from the author show a further increase in performance, reaching a new score of 85.09.

This new mark outperforms some GPT-4 models, further closing the very thin gap between open-community LLMs and closed-source models.

ShinojiResearch/Senku-70B-Full

1 reply

·

replied to their post over 1 year ago

UNA is a modification of the modeling_$model.py of transformers. I port it to the different transformers versions and models, keeping it clean and performant, so it works with any of these frameworks, like #axolotl.

reacted to their post with ❤️ over 1 year ago

Post

Introducing UNA-SimpleSmaug-34b:

Based on Smaug-34B-v0.1, it is capable of slightly outperforming its base model, with increased math and reasoning thanks to the simple-math dataset.
The model exhibits great performance across diverse tasks, with excellent and balanced behaviour.
It scores 77.41 AVG on the Leaderboard, landing in the #1 position among 34B models.

Available in the hub already:
fblgit/UNA-SimpleSmaug-34b-v1beta
fblgit/simple-math

In this case, we applied UNA to the Attention layers of the model while performing SFT with simple-math, a highly complex generated mathematics dataset, demonstrating the effect of simple-math on LLMs.

2 replies

·

reacted to their post with ❤️ over 1 year ago

Post

Introducing model-similarities, a new simple tool to contrast two models

A straightforward yet insightful tool designed to shed light on the similarities between various models. Discover it now at [Model Similarity GitHub Repository](https://github.com/fblgit/model-similarity).

This project is in its nascent stages, and we're eager for contributions and enhancements. Crafted with simplicity at its core, the tool performs two primary comparisons:
- Weight similarities, utilizing a simple approach to contrast vector differences (A != B).
- Cosine similarity between the parameters of models A and B, providing a nuanced measure of their alignment.
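
A minimal sketch of the two comparisons, written against plain Python lists rather than real checkpoints. The function names and the dict-of-lists layout are assumptions for illustration and are not taken from the repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened parameter vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fraction_equal(a, b):
    """Share of positions where the two vectors hold exactly the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def compare_models(params_a, params_b):
    """Compare every parameter that exists in both models with the same size."""
    report = {}
    for name in sorted(params_a.keys() & params_b.keys()):
        a, b = params_a[name], params_b[name]
        if len(a) != len(b):
            continue  # shapes differ, so an element-wise comparison is not meaningful
        report[name] = {
            "equal": fraction_equal(a, b),
            "cosine": cosine_similarity(a, b),
        }
    return report


# Toy example: the MLP weights are untouched, the attention weights were tuned.
base = {"mlp": [1.0, 2.0, 3.0], "attn": [0.5, -0.5, 1.0]}
tuned = {"mlp": [1.0, 2.0, 3.0], "attn": [0.5, -0.4, 1.1]}
report = compare_models(base, tuned)
```

In this toy case the untouched component comes back fully equal, while the tuned one shows a lower match, which is the kind of signal used to check which parts of a model were actually trained.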

Included in the repository are sample analyses and reports that validate model card claims, particularly regarding the training specifics of transformer components such as MLP, Attention, etc. Remarkably, these samples reveal 100% similarity scores between those parts of the models, pinpointing the exact base model utilized.

Join us in refining and expanding this tool. Whether you're looking to contribute code, ideas, or both, your input will help transform this into a resource for everyone.

reacted to their post with ❤️ over 1 year ago

Post

Presenting: SimpleMath

Recently we uploaded to the Hub our LATEST and most powerful version of the SimpleMath SFT dataset.
Today we are happy to present SimpleMath DPO Pairs, further improving the mathematical capabilities of LLMs.

Our first results show clear improvements on GSM8k, MATHQA, ARC, TQA, MMLU and BBH. Feel free to experiment and generate your own dataset, as we also provide the code to generate them synthetically.
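
To give a feel for what synthetically generated math preference pairs can look like, here is a small sketch. It is not the generator shipped with the dataset: the `make_pair` helper, the operand range, and the prompt/chosen/rejected field names are assumptions chosen to mirror a common DPO layout.

```python
import random


def make_pair(rng):
    """Build one arithmetic prompt with a correct and a slightly wrong answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    op = rng.choice(["+", "-", "*"])
    correct = {"+": a + b, "-": a - b, "*": a * b}[op]
    wrong = correct + rng.choice([-2, -1, 1, 2])  # near miss, never equal to correct
    return {
        "prompt": f"{a} {op} {b} =",
        "chosen": str(correct),
        "rejected": str(wrong),
    }


rng = random.Random(0)  # fixed seed so the sample is reproducible
pairs = [make_pair(rng) for _ in range(50)]
```

Near-miss rejected answers are one possible design choice; they make the preference signal about exact arithmetic rather than about format.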

fblgit/simple-math
fblgit/simple-math-DPO
fblgit/UNA-34BeagleSimpleMath-32K-v1

2 replies

·

replied to ehartford's post over 1 year ago

We're working on it, my friend; the LASER team is awesome. We are further investigating how these two amplify each other when combined. The performance improvements are larger than usual, though we are still testing this empirically.

FBL

AI & ML interests

Organizations

fblgit's activity