Aaron Chibb

aari1995

AI & ML interests

Multilinguality and German LLMs

Recent Activity

liked a dataset 12 days ago
gretelai/gretel-pii-masking-en-v1
liked a model 17 days ago
UKPLab/triple-encoders-dailydialog
liked a model 27 days ago
ibm-granite/granite-3.0-8b-instruct

Organizations

aari1995's activity

reacted to louisbrulenaudet's post with 👍 3 months ago
🚀 RAGoon is now available on PyPI, GitHub, and as a Space on Hugging Face for batched embeddings generation 🤗

RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.

At this stage, 5 major classes are available via RAGoon to facilitate:
- the production of chain embeddings for several models, to simplify a continuous deployment process;
- the production of LLM requests for web querying and content retrieval via the Google API;
- recursive token-based chunking;
- data visualization: loading embeddings from a FAISS index, reducing their dimensionality using PCA and/or t-SNE, and visualizing them in an interactive 3D graph;
- the creation of binary indexes for search with scalar (int8) rescoring.
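
For the last item, here is a minimal sketch of what binary search with int8 rescoring can look like, using only NumPy. This is not RAGoon's API: the function names, the oversampling factor, and the random data are assumptions made purely for illustration.

```python
import numpy as np


def quantize_binary(emb):
    """Keep 1 bit per dimension (the sign) and pack 8 bits into each uint8."""
    return np.packbits(emb > 0, axis=1)


def quantize_int8(emb, lo, hi):
    """Linearly map each dimension's [lo, hi] range onto [-128, 127]."""
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    q = np.round((emb - lo) / scale) - 128
    return np.clip(q, -128, 127).astype(np.int8)


def search(query, bin_index, int8_index, top_k=3, oversample=4):
    """Shortlist by Hamming distance on the binary index, then rescore with int8."""
    q_bits = np.packbits(query > 0)
    # XOR marks the differing bits; counting them gives the Hamming distance.
    hamming = np.unpackbits(bin_index ^ q_bits, axis=1).sum(axis=1)
    n_candidates = min(top_k * oversample, len(bin_index))
    candidates = np.argsort(hamming, kind="stable")[:n_candidates]
    # Rescore only the shortlist: float query against int8 document vectors.
    scores = int8_index[candidates].astype(np.float32) @ query
    order = np.argsort(-scores, kind="stable")[:top_k]
    return candidates[order], scores[order]


# Hypothetical data: random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 64)).astype(np.float32)
query = rng.normal(size=64).astype(np.float32)

bin_index = quantize_binary(docs)
int8_index = quantize_int8(docs, docs.min(axis=0), docs.max(axis=0))
ids, scores = search(query, bin_index, int8_index, top_k=3)
print(ids, scores)
```

The binary index is 32x smaller than float32 and cheap to scan, while the int8 pass over a small shortlist recovers some of the ranking quality lost to the 1-bit quantization.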

Link to GitHub: https://github.com/louisbrulenaudet/ragoon
Link to the 🤗 Space: louisbrulenaudet/ragoon
replied to tomaarsen's post 4 months ago

Great as always!! Mostly ColBERT, I think, would be great. The people at RAGatouille are also doing great stuff, but having it integrated in ST would be sooo cool!

reacted to tomaarsen's post with ❤️ 4 months ago
I just published Sentence Transformers v3.0.1: the first patch release since v3 from last week. It introduces gradient checkpointing, pushing model checkpoints to Hugging Face while training, model card improvements and fixes. Details:

1️⃣ Gradient checkpointing allows for much lower memory usage at a cost of ~20% training speed. This seems to allow for higher batch sizes, which is quite important for loss functions with in-batch negatives.
2️⃣ You can specify args.push_to_hub=True and args.hub_model_id to upload your model checkpoints to Hugging Face while training. It also uploads your emissions (if codecarbon is installed) and your Tensorboard logs (if tensorboard is installed).
3️⃣ Model card improvements: improved automatic widget examples, better tags, and the default of "sentence_transformers_model_id" now gets replaced when possible.
4️⃣ Several evaluator fixes, see release notes for details.
5️⃣ Fixed a bug with MatryoshkaLoss throwing an error if the supplied Matryoshka dimensions are ascending instead of descending.
6️⃣ Full Safetensors support; even the uncommon modules can now save and load "model.safetensors" files: no more pickle risks.
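
To illustrate why batch size matters for the losses mentioned in point 1️⃣, here is a rough NumPy sketch of an in-batch negatives loss. It follows the general idea behind losses such as MultipleNegativesRankingLoss but is not the library's implementation; the function name, the scale value, and the random data are assumptions for illustration.

```python
import numpy as np


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Cross-entropy over a (batch x batch) similarity matrix.

    Row i treats positives[i] as the correct match for anchors[i] and every
    other positive in the batch as a negative, so a batch of size B gives
    each anchor B - 1 negatives for free.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                    # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # correct pairs sit on the diagonal


# Hypothetical embeddings: 8 anchors with 16 dimensions each.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))

loss_matched = in_batch_negatives_loss(anchors, anchors)
loss_mismatched = in_batch_negatives_loss(anchors, np.roll(anchors, 1, axis=0))
print(loss_matched, loss_mismatched)
```

Doubling the batch size doubles the number of negatives each anchor is compared against, which is why the memory freed by gradient checkpointing can translate into a stronger training signal.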

Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.1

And let me know what kind of features you'd like to see next! I have some plans already (ONNX, Sparse models, ColBERT, PEFT), but I don't yet know how I should prioritize everything.
reacted to alielfilali01's post with 🔥 6 months ago
Yesterday was just CRAZY! HF x LangChain, PaliGemma and Google I/O ... which made me totally forget to post here about our newly released leaderboard (The Open Arabic LLM Leaderboard - OALL)

Here's a quick update for our community that is waiting for new results. Some of you noticed that since the release yesterday, the finished evaluations tab has stayed at 14 models up until now (May 15th, 12 PM). For those concerned, rest assured—we had a minor memory issue in our cluster yesterday that we overlooked. The problem is now fixed, and 7 models are currently being evaluated in parallel, so expect to hit the 20 milestone today! 🎉

Check the discussion below for more info:

OALL/Open-Arabic-LLM-Leaderboard#3
reacted to not-lain's post with 🔥 7 months ago
🚀 just reached 3K+ readers on this blog post about RAG using only HF🤗 related tools in just a little over 1 week from publishing.

📃the most interesting thing about it is that you can use the FAISS index in the datasets library to retrieve your most similar documents.

🔗https://huggingface.co/blog/not-lain/rag-chatbot-using-llama3

Happy reading everyone ✨
reacted to urchade's post with ❤️ 8 months ago
**Some updates on GLiNER**

🆕 A new multilingual version that permits commercial use is available: urchade/gliner_multiv2.1

🐛 A subtle bug that causes performance degradation on some models has been corrected. Thanks to @yyDing1 for raising the issue.

from gliner import GLiNER

# Initialize GLiNER
model = GLiNER.from_pretrained("urchade/gliner_multiv2.1")

text = "This is a text about Bill Gates and Microsoft."

# Labels for entity prediction
labels = ["person", "organization", "email"]

entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
replied to their post 8 months ago

No, there will be some filtering; I'm currently working on the algorithm for it.

reacted to their post with 🚀 8 months ago
ARABIC CHINESE FRENCH GERMAN RUSSIAN SPANISH TURKISH

mLLM - first release:
orca_dpo_pairs by Intel (translated into 7 languages)

ARABIC CHINESE FRENCH GERMAN RUSSIAN SPANISH TURKISH

Upcoming:
- more datasets
- cleaning steps
- a blogpost
- stay updated at https://hf.co/multilingual

multilingual/orca_dpo_pairs
reacted to m-ric's post with ❤️ 9 months ago
📚🔎 If you're building RAG applications, you should check this out:

⚙️ I've built a new space to let you visualize the chunks you get with different text splitting methods!

➡️ Visualize your chunks here:
m-ric/chunk_visualizer
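
To give a feel for how splitting methods differ, here is a small pure-Python sketch comparing fixed-size splitting with a recursive, separator-aware splitter. It is not the code behind the Space; the function names, the separator list, and the sample text are assumptions for illustration.

```python
def split_fixed(text, chunk_size):
    """Cut every chunk_size characters, ignoring sentence and word boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_recursive(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, recursing on oversized pieces.

    Separators at chunk boundaries are dropped in this simplified version.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate        # keep merging small pieces
                continue
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # Still too long: retry with the finer separators only.
                chunks.extend(split_recursive(part, chunk_size, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator left to try: fall back to a hard cut.
    return split_fixed(text, chunk_size)


text = (
    "RAG pipelines split documents into chunks.\n\n"
    "Each chunk is embedded and indexed. At query time, the closest chunks "
    "are retrieved.\n\n"
    "Chunk boundaries affect retrieval quality."
)
fixed = split_fixed(text, 60)
recursive = split_recursive(text, 60)
for chunk in recursive:
    print(repr(chunk))
```

Fixed-size splitting tends to cut words and sentences in half, while the recursive variant prefers paragraph, line, and sentence boundaries and only falls back to hard cuts when nothing else fits.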
posted an update 9 months ago
Looking at the tokenizer and the naming ("_en"), Google Gemma is very likely to have a multilingual counterpart. 👀

Thoughts?
posted an update 9 months ago
@clem is this the first non-English post on Hugging Face? 👋🏽 🇩🇪🇫🇷🇮🇹🇪🇸🇮🇳…