Andrea Soria

asoria

AI & ML interests

Maintainer of ๐Ÿค—Datasets: Data processing

Recent Activity

updated a dataset 32 minutes ago
asoria/my_dataset

Articles

Organizations

asoria's activity

posted an update 23 days ago
view post
Post
1728
๐Ÿš€ Exploring Topic Modeling with BERTopic ๐Ÿค–

When you come across an interesting dataset, you often wonder:
Which topics frequently appear in these documents? ๐Ÿค”
What is this data really about? ๐Ÿ“Š

Topic modeling helps answer these questions by identifying recurring themes within a collection of documents. This process enables quick and efficient exploratory data analysis.

Iโ€™ve been working on an app that leverages BERTopic, a flexible framework designed for topic modeling. Its modularity makes BERTopic powerful, allowing you to switch components with your preferred algorithms. It also supports handling large datasets efficiently by merging models using the BERTopic.merge_models approach. ๐Ÿ”—

๐Ÿ” How do we make this work?
Hereโ€™s the stack weโ€™re using:

๐Ÿ“‚ Data Source โžก๏ธ Hugging Face datasets with DuckDB for retrieval
๐Ÿง  Text Embeddings โžก๏ธ Sentence Transformers (all-MiniLM-L6-v2)
โšก Dimensionality Reduction โžก๏ธ RAPIDS cuML UMAP for GPU-accelerated performance
๐Ÿ” Clustering โžก๏ธ RAPIDS cuML HDBSCAN for fast clustering
โœ‚๏ธ Tokenization โžก๏ธ CountVectorizer
๐Ÿ”ง Representation Tuning โžก๏ธ KeyBERTInspired + Hugging Face Inference Client with Meta-Llama-3-8B-Instruct
๐ŸŒ Visualization โžก๏ธ Datamapplot library
Check out the space and see how you can quickly generate topics from your dataset: datasets-topics/topics-generator

Powered by @MaartenGr - BERTopic
reacted to celinah's post with โค๏ธ about 1 month ago
view post
Post
1062
๐Ÿ“ฃ ๐š‘๐šž๐š๐š๐š’๐š—๐š๐š๐šŠ๐šŒ๐šŽ_๐š‘๐šž๐š‹ v0.26.0 is out with some new features and improvements!

โœจ ๐—ง๐—ผ๐—ฝ ๐—›๐—ถ๐—ด๐—ต๐—น๐—ถ๐—ด๐—ต๐˜๐˜€:
- ๐Ÿ”ย Multiple access tokens support: Easily manage multiple access tokens with new CLI commands. Perfect for handling multiple tokens with specific permissions in production or when collaborating with external teams.
- ๐Ÿ–ผ๏ธ Conversational VLMs inference is now supported withย InferenceClient's chat completion!
- ๐Ÿ“„ Daily Papers API: Seamlessly search and retrieve detailed paper information from the Hub!

Weโ€™ve also introduced multiple bug fixes and quality-of-life improvements - thanks to the awesome contributions from our community! ๐Ÿค—

Check out the release notes here: Wauplin/huggingface_hub#9

and you can try it out now ๐Ÿ‘‡
pip install huggingface_hub==0.26.0

reacted to davidberenstein1957's post with ๐Ÿ”ฅ about 2 months ago
reacted to anakin87's post with ๐Ÿ‘ about 2 months ago
view post
Post
1693
๐Ÿ•ต๐Ÿป ๐€๐ ๐ž๐ง๐ญ๐ข๐œ ๐‘๐€๐† ๐ฐ๐ข๐ญ๐ก ๐Ÿฆ™ ๐‹๐ฅ๐š๐ฆ๐š 3.2

I was excited to explore Llama 3.2, but as a simple ๐Ÿ‡ช๐Ÿ‡บ EU guy, I don't have access to Meta's multimodal models ๐Ÿ˜ฟ

๐Ÿค” So I thought: why not challenge the small 3B text model with Agentic RAG?

๐ŸŽฏ The plan:
- Build a system that tries to answer questions using a knowledge base.
- If the documents don't contain the answer, use Web search for additional context.


Check out my experimental notebook here: ๐Ÿ““ https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/llama32_agentic_rag.ipynb


My stack:
๐Ÿ—๏ธ haystack (https://haystack.deepset.ai/): open-source LLM orchestration framework
๐Ÿฆ™ meta-llama/Llama-3.2-3B-Instruct
๐Ÿฆ†๐ŸŒ free DuckDuckGo API, integrated with Haystack

โœจ ๐˜›๐˜ฉ๐˜ฆ ๐˜ณ๐˜ฆ๐˜ด๐˜ถ๐˜ญ๐˜ต๐˜ด? ๐˜Œ๐˜ฏ๐˜ค๐˜ฐ๐˜ถ๐˜ณ๐˜ข๐˜จ๐˜ช๐˜ฏ๐˜จ - ๐˜ข ๐˜ง๐˜ฆ๐˜ธ ๐˜ฎ๐˜ฐ๐˜ฏ๐˜ต๐˜ฉ๐˜ด ๐˜ข๐˜จ๐˜ฐ, ๐˜ต๐˜ฉ๐˜ช๐˜ด ๐˜ญ๐˜ฆ๐˜ท๐˜ฆ๐˜ญ ๐˜ฐ๐˜ง ๐˜ฑ๐˜ฆ๐˜ณ๐˜ง๐˜ฐ๐˜ณ๐˜ฎ๐˜ข๐˜ฏ๐˜ค๐˜ฆ ๐˜ง๐˜ณ๐˜ฐ๐˜ฎ ๐˜ข ๐˜ด๐˜ฎ๐˜ข๐˜ญ๐˜ญ ๐˜ฎ๐˜ฐ๐˜ฅ๐˜ฆ๐˜ญ ๐˜ธ๐˜ฐ๐˜ถ๐˜ญ๐˜ฅ'๐˜ท๐˜ฆ ๐˜ฃ๐˜ฆ๐˜ฆ๐˜ฏ ๐˜ถ๐˜ฏ๐˜ต๐˜ฉ๐˜ช๐˜ฏ๐˜ฌ๐˜ข๐˜ฃ๐˜ญ๐˜ฆ!
This probably reflects the impressive IFEval score of the model (comparable to Llama 3.1 8B).
posted an update about 2 months ago
view post
Post
2357
๐Ÿ“ I wrote a tutorial on how to get started with the fine-tuning process using Hugging Face tools, providing an end-to-end workflow.

The tutorial covers creating a new dataset using the new SQL Console ๐Ÿ›ข and fine-tuning a model with SFT, guided by the Notebook Creator App ๐Ÿ“™.

๐Ÿ‘‰ You can read the full article here:
https://huggingface.co/blog/asoria/easy-fine-tuning-with-hf
asoria/auto-notebook-creator
posted an update 2 months ago
view post
Post
958
๐Ÿš€ Excited to share the latest update to the Notebook Creator Tool!

Now with basic fine-tuning support using Supervised Fine-Tuning! ๐ŸŽฏ

How it works:
1๏ธโƒฃ Choose your Hugging Face dataset and notebook type (SFT)
2๏ธโƒฃ Automatically generate your training notebook
3๏ธโƒฃ Start fine-tuning with your data!

Link to the app ๐Ÿ‘‰ https://lnkd.in/e_3nmWrB
๐Ÿ’ก Want to contribute with new notebooks? ๐Ÿ‘‰https://lnkd.in/eWcZ92dS
reacted to m-ric's post with ๐Ÿ‘€ 2 months ago
view post
Post
1648
๐—”๐—ฟ๐—ฒ ๐—”๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐—ฐ๐—ฎ๐—ฝ๐—ฎ๐—ฏ๐—น๐—ฒ ๐—ฒ๐—ป๐—ผ๐˜‚๐—ด๐—ต ๐—ณ๐—ผ๐—ฟ ๐——๐—ฎ๐˜๐—ฎ ๐—ฆ๐—ฐ๐—ถ๐—ฒ๐—ป๐—ฐ๐—ฒ? โ‡’ ๐— ๐—ฒ๐—ฎ๐˜€๐˜‚๐—ฟ๐—ฒ ๐˜๐—ต๐—ฒ๐—ถ๐—ฟ ๐—ฝ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐—ป๐—ฐ๐—ฒ ๐˜„๐—ถ๐˜๐—ต ๐——๐—ฆ๐—•๐—ฒ๐—ป๐—ฐ๐—ต ๐Ÿ“Š

A team from Tencent AI wanted to evaluate agentic systems on data science (DS) tasks : but they noticed that existing agentic benchmarks were severely limited in several aspects: they were limited to text and did not include tables or images, were only specific to certain packages, only performed exact match evaluationโ€ฆ

โžก๏ธ So they set out to build a much more exhaustive approach, to finally make the definitive DS agent benchmark.

๐—ง๐—ต๐—ฒ ๐——๐—ฆ๐—•๐—ฒ๐—ป๐—ฐ๐—ต ๐—ฑ๐—ฎ๐˜๐—ฎ๐˜€๐—ฒ๐˜
โ–ช๏ธDS bench has 466 data analysis tasks and 74 data modelling tasks
โ–ช๏ธThe tasks are sourced from ModelOff and Kaggle, the platforms hosting the most popular data science competitions
โ–ช๏ธDifference with previous DS benchmarks:
โถ This benchmark leverages various modalities on top of text: images, Excel files, tables
โท Complex tables: sometimes several tables should be leveraged to answer one question
โธ The context is richer, with longer descriptions.
โ–ช๏ธ Evaluation metrics : the benchmark is scored with an LLM as a judge, using a specific prompt.

๐—œ๐—ป๐˜€๐—ถ๐—ด๐—ต๐˜๐˜€ ๐—ณ๐—ฟ๐—ผ๐—บ ๐—ฒ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ป๐—ด ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€
โ–ช๏ธ Their evaluation confirms that using LLMs in an agent setup, for instance by allowing them to run a single step of code execution, is more costly (especially with multi-turn frameworks like autogen) but also much more performant than the vanilla LLM.
โ–ช๏ธ The sets of tasks solved by different models (like GPT-3.5 vs Llama-3-8B) has quite low overlap, which suggests that different models tend to try very different approches.

This new benchmark is really welcome, can't wait to try transformers agents on it! ๐Ÿค—

Read their full paper ๐Ÿ‘‰ DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (2409.07703)
posted an update 3 months ago
view post
Post
816
I've been working on a Space to make it super easy to create notebooks and help users quickly understand and manipulate their data!
With just a few clicks automatically generate notebooks for:

๐Ÿ“Š Exploratory Data Analysis
๐Ÿง  Text Embeddings
๐Ÿค– Retrieval-Augmented Generation (RAG)

โœจ Automatic training is coming soon!
Check it out here asoria/auto-notebook-creator
Appreciate any feedback to improve this tool ๐Ÿค—
reacted to davanstrien's post with ๐Ÿš€ 3 months ago
view post
Post
3152
๐Ÿš€ Introducing Hugging Face Similar: a Chrome extension to find relevant datasets!

โœจ Adds a "Similar Datasets" section to Hugging Face dataset pages
๐Ÿ” Recommendations based on dataset READMEs
๐Ÿ—๏ธ Powered by https://huggingface.co/chromadb and https://huggingface.co/Snowflake embeddings.

You can try it here: https://chromewebstore.google.com/detail/hugging-face-similar/aijelnjllajooinkcpkpbhckbghghpnl?authuser=0&hl=en.

I am very happy to get feedback on whether this could be useful or not ๐Ÿค—
ยท
reacted to m-ric's post with ๐Ÿš€ 4 months ago
view post
Post
2266
๐—”๐—ด๐—ฒ๐—ป๐˜๐—ถ๐—ฐ ๐——๐—ฎ๐˜๐—ฎ ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜€๐˜: ๐—ฑ๐—ฟ๐—ผ๐—ฝ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—ฑ๐—ฎ๐˜๐—ฎ ๐—ณ๐—ถ๐—น๐—ฒ, ๐—น๐—ฒ๐˜ ๐˜๐—ต๐—ฒ ๐—Ÿ๐—Ÿ๐—  ๐—ฑ๐—ผ ๐˜๐—ต๐—ฒ ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜€๐—ถ๐˜€ ๐Ÿ“Šโš™๏ธ

Need to make quick exploratory data analysis? โžก๏ธ Get help from an agent.

I was impressed by Llama-3.1's capacity to derive insights from data. Given a csv file, it makes quick work of exploratory data analysis and can derive interesting insights.

On the data from the Kaggle titanic challenge, that records which passengers survived the Titanic wreckage, it was able by itself to derive interesting trends like "passengers that paid higher fares were more likely to survive" or "survival rate was much higher for women than men".

The cookbook even lets the agent built its own submission to the challenge, and it ranks under 3,000 out of 17,000 submissions: ๐Ÿ‘ not bad at all!

Try it for yourself in this Space demo ๐Ÿ‘‰ m-ric/agent-data-analyst
  • 2 replies
ยท
reacted to albertvillanova's post with ๐Ÿ”ฅ 6 months ago
view post
Post
2695
Easily convert your script-based datasets to Parquet and explore them in the dataset viewer. ๐ŸŒŸ

๐Ÿ› ๏ธ Use @huggingface Datasets CLI:
$ ๐š๐šŠ๐š๐šŠ๐šœ๐šŽ๐š๐šœ-๐šŒ๐š•๐š’ ๐šŒ๐š˜๐š—๐šŸ๐šŽ๐š›๐š_๐š๐š˜_๐š™๐šŠ๐š›๐šš๐šž๐šŽ๐š ๐š„๐š‚๐™ด๐š๐™ฝ๐™ฐ๐™ผ๐™ด/๐™ณ๐™ฐ๐šƒ๐™ฐ๐š‚๐™ด๐šƒ_๐™ฝ๐™ฐ๐™ผ๐™ด

Learn more: https://huggingface.co/docs/datasets/main/en/cli#convert-to-parquet
#Data #AI
reacted to davanstrien's post with ๐Ÿ”ฅ 6 months ago
view post
Post
951
In my ongoing quest to learn more about building synthetic datasets, I've created an "Awesome Synthetic Datasets" list.

The aim is to lightly curate a collection of resources, tutorials, and tools for generating synthetic datasets using large language models.

I plan to add some "key techniques" to the repo, but for now, it focuses on important datasets, papers, and tools.

๐Ÿ”— https://github.com/davanstrien/awesome-synthetic-datasets
reacted to tomaarsen's post with โค๏ธ 9 months ago
view post
Post
๐Ÿค— Sentence Transformers v2.4.0 for embedding models is now out! It introduces a lot of powerful features, such as:

1. Matryoshka Loss function - you can now train & perform inference on ๐Ÿช† Matryoshka Embedding models. See also our blogpost: https://huggingface.co/blog/matryoshka

2. CoSENTLoss & AnglELoss: State of the art loss functions. These are quite interesting, they outperform CosineSimilarityLoss on nearly all benchmarks as a drop-in replacement! See also the docs: https://sbert.net/docs/package_reference/losses.html#cosentloss

3. Prompt templates: Many popular models such as intfloat/multilingual-e5-large and BAAI/bge-large-en-v1.5 prefix their texts with prompts, so this adds configuration options to automatically include prompts using model.encode(..., prompt_name="query") which will include a prompt with the name "query". More info in the docs: https://sbert.net/examples/applications/computing-embeddings/README.html#prompt-templates

4. Instructor support: Support for the INSTRUCTOR line of models, such as hkunlp/instructor-large. Learn how to use them here: https://sbert.net/docs/pretrained_models.html#instructor-models

5. Removed NLTK & sentencepiece dependencies: Should allow for a smaller installation & a slightly faster import!

6. Updated documentation: a new Loss Overview section: https://sbert.net/docs/training/loss_overview.html and more detailed loss functions: https://sbert.net/docs/package_reference/losses.html

And much more! See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.4.0

Some more very exciting updates are still on the horizon!
ยท