Kenneth Hamilton PRO
ZennyKenny
AI & ML interests
Development and Ops for LLMs and CV.
Recent Activity
posted
an
update
about 21 hours ago
On-demand audio transcription is an often-requested service without many good options on the market.
Using Hugging Face Spaces with Gradio SDK and the OpenAI Whisper model, I've put together a simple interface that supports the transcription and summarisation of audio files up to five minutes in length, completely open source and running on CPU upgrade. The cool thing is that it's built without a dedicated inference endpoint, completely on public infrastructure.
Check it out: https://huggingface.co/spaces/ZennyKenny/AudioTranscribe
I wrote a short article about the backend mechanics for those who are interested: https://huggingface.co/blog/ZennyKenny/on-demand-public-transcription
published
an
article
about 21 hours ago
On-Demand Audio Transcription using Public Infrastructure
updated
a Space
about 22 hours ago
ZennyKenny/AudioTranscribe
Articles
Organizations
ZennyKenny's activity
posted
an
update
about 21 hours ago
Post
261
On-demand audio transcription is an often-requested service without many good options on the market.
Using Hugging Face Spaces with Gradio SDK and the OpenAI Whisper model, I've put together a simple interface that supports the transcription and summarisation of audio files up to five minutes in length, completely open source and running on CPU upgrade. The cool thing is that it's built without a dedicated inference endpoint, completely on public infrastructure.
Check it out: ZennyKenny/AudioTranscribe
I wrote a short article about the backend mechanics for those who are interested: https://huggingface.co/blog/ZennyKenny/on-demand-public-transcription
Using Hugging Face Spaces with Gradio SDK and the OpenAI Whisper model, I've put together a simple interface that supports the transcription and summarisation of audio files up to five minutes in length, completely open source and running on CPU upgrade. The cool thing is that it's built without a dedicated inference endpoint, completely on public infrastructure.
Check it out: ZennyKenny/AudioTranscribe
I wrote a short article about the backend mechanics for those who are interested: https://huggingface.co/blog/ZennyKenny/on-demand-public-transcription
reacted to
takarajordan's
post with 🔥
about 2 months ago
Post
1209
I'm not sure why I haven't done this already!
I just made a space to count and visualize tokens for Diffusion models, no more guesswork! It's super fast too.
Check it out here and try out your prompts: takarajordan/DiffusionTokenizer
Uses these tokenizers below:
openai/clip-vit-large-patch14
google/t5-v1_1-xxl
I just made a space to count and visualize tokens for Diffusion models, no more guesswork! It's super fast too.
Check it out here and try out your prompts: takarajordan/DiffusionTokenizer
Uses these tokenizers below:
openai/clip-vit-large-patch14
google/t5-v1_1-xxl
reacted to
davidberenstein1957's
post with 🔥
about 2 months ago
Post
1709
Let’s make a generation of amazing image-generation models
The best image generation models are trained on human preference datasets, where annotators have selected the best image from a choice of two. Unfortunately, many of these datasets are closed source so the community cannot train open models on them. Let’s change that!
The community can contribute image preferences for an open-source dataset that could be used for building AI models that convert text to image, like the flux or stable diffusion families. The dataset will be open source so everyone can use it to train models that we can all use.
Blog: https://huggingface.co/blog/burtenshaw/image-preferences
The best image generation models are trained on human preference datasets, where annotators have selected the best image from a choice of two. Unfortunately, many of these datasets are closed source so the community cannot train open models on them. Let’s change that!
The community can contribute image preferences for an open-source dataset that could be used for building AI models that convert text to image, like the flux or stable diffusion families. The dataset will be open source so everyone can use it to train models that we can all use.
Blog: https://huggingface.co/blog/burtenshaw/image-preferences
reacted to
davanstrien's
post with 🔥
about 2 months ago
Post
2488
First dataset for the new Hugging Face Bluesky community organisation:
bluesky-community/one-million-bluesky-posts 🦋
📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗
Excited to see people build more open tools for a more open social media platform!
📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗
Excited to see people build more open tools for a more open social media platform!
reacted to
vincentg64's
post with 🧠
about 2 months ago
Post
1186
There is no such thing as a Trained LLM https://mltblog.com/3CEJ9Pt
What I mean here is that traditional LLMs are trained on tasks irrelevant to what they will do for the user. It’s like training a plane to efficiently operate on the runway, but not to fly. In short, it is almost impossible to train an LLM, and evaluating is just as challenging. Then, training is not even necessary. In this article, I dive on all these topics.
➡️ Training LLMs for the wrong tasks
Since the beginnings with Bert, training an LLM typically consists of predicting the next tokens in a sentence, or removing some tokens and then have your algorithm fill the blanks. You optimize the underlying deep neural networks to perform these supervised learning tasks as well as possible. Typically, it involves growing the list of tokens in the training set to billions or trillions, increasing the cost and time to train. However, recently, there is a tendency to work with smaller datasets, by distilling the input sources and token lists. After all, out of one trillion tokens, 99% are noise and do not contribute to improving the results for the end-user; they may even contribute to hallucinations. Keep in mind that human beings have a vocabulary of about 30,000 keywords, and that the number of potential standardized prompts on a specialized corpus (and thus the number of potential answers) is less than a million.
➡️ Read the full articles at https://mltblog.com/3CEJ9Pt, also featuring issues with evaluation metrics and the benefits of untrained LLMs.
What I mean here is that traditional LLMs are trained on tasks irrelevant to what they will do for the user. It’s like training a plane to efficiently operate on the runway, but not to fly. In short, it is almost impossible to train an LLM, and evaluating is just as challenging. Then, training is not even necessary. In this article, I dive on all these topics.
➡️ Training LLMs for the wrong tasks
Since the beginnings with Bert, training an LLM typically consists of predicting the next tokens in a sentence, or removing some tokens and then have your algorithm fill the blanks. You optimize the underlying deep neural networks to perform these supervised learning tasks as well as possible. Typically, it involves growing the list of tokens in the training set to billions or trillions, increasing the cost and time to train. However, recently, there is a tendency to work with smaller datasets, by distilling the input sources and token lists. After all, out of one trillion tokens, 99% are noise and do not contribute to improving the results for the end-user; they may even contribute to hallucinations. Keep in mind that human beings have a vocabulary of about 30,000 keywords, and that the number of potential standardized prompts on a specialized corpus (and thus the number of potential answers) is less than a million.
➡️ Read the full articles at https://mltblog.com/3CEJ9Pt, also featuring issues with evaluation metrics and the benefits of untrained LLMs.
reacted to
luigi12345's
post with 👍
about 2 months ago
Post
3725
MinimalScrap
Only Free Dependencies. Save it.It is quite useful uh.
Only Free Dependencies. Save it.It is quite useful uh.
!pip install googlesearch-python requests
from googlesearch import search
import requests
query = "Glaucoma"
for url in search(f"{query} site:nih.gov filetype:pdf", 20):
if url.endswith(".pdf"):
with open(url.split("/")[-1], "wb") as f: f.write(requests.get(url).content)
print("✅" + url.split("/")[-1])
print("Done!")
posted
an
update
about 2 months ago
Post
1213
I've joined the Bluesky community. Interested to see what decentralized social media looks like in action: https://bsky.app/profile/kghamilton.bsky.social
Looking forward to following other AI builders, tech enthusiasts, goth doomscrollers, and ironic meme creators.
Looking forward to following other AI builders, tech enthusiasts, goth doomscrollers, and ironic meme creators.
reacted to
malhajar's
post with 🔥
about 2 months ago
Post
4379
🇫🇷 Lancement officiel de l'OpenLLM French Leaderboard : initiative open-source pour référencer l’évaluation des LLMs francophones
Après beaucoup d’efforts et de sueurs avec Alexandre Lavallee, nous sommes ravis d’annoncer que le OpenLLMFrenchLeaderboard est en ligne sur Hugging Face (space url: le-leadboard/OpenLLMFrenchLeaderboard) la toute première plateforme dédiée à l’évaluation des grands modèles de langage (LLM) en français. 🇫🇷✨
Ce projet de longue haleine est avant tout une œuvre de passion mais surtout une nécessité absolue. Il devient urgent et vital d'oeuvrer à plus de transparence dans ce domaine stratégique des LLM dits multilingues. La première pièce à l'édifice est donc la mise en place d'une évaluation systématique et systémique des modèles actuels et futurs.
Votre modèle IA français est-il prêt à se démarquer ? Soumettez le dans notre espace, et voyez comment vous vous comparez par rapport aux autres modèles.
❓ Comment ça marche :
Soumettez votre LLM français pour évaluation, et nous le testerons sur des benchmarks de référence spécifiquement adaptés pour la langue française — notre suite de benchmarks comprend :
- BBH-fr : Raisonnement complexe
- IFEval-fr : Suivi d'instructions
- GPQA-fr : Connaissances avancées
- MUSR-fr : Raisonnement narratif
- MATH_LVL5-fr : Capacités mathématiques
- MMMLU-fr : Compréhension multitâche
Le processus est encore manuel, mais nous travaillons sur son automatisation, avec le soutien de la communauté Hugging Face.
@clem , on se prépare pour une mise à niveau de l’espace ? 😏👀
Ce n'est pas qu'une question de chiffres—il s'agit de créer une IA qui reflète vraiment notre langue, notre culture et nos valeurs. OpenLLMFrenchLeaderboard est notre contribution personnelle pour façonner l'avenir des LLM en France.
Après beaucoup d’efforts et de sueurs avec Alexandre Lavallee, nous sommes ravis d’annoncer que le OpenLLMFrenchLeaderboard est en ligne sur Hugging Face (space url: le-leadboard/OpenLLMFrenchLeaderboard) la toute première plateforme dédiée à l’évaluation des grands modèles de langage (LLM) en français. 🇫🇷✨
Ce projet de longue haleine est avant tout une œuvre de passion mais surtout une nécessité absolue. Il devient urgent et vital d'oeuvrer à plus de transparence dans ce domaine stratégique des LLM dits multilingues. La première pièce à l'édifice est donc la mise en place d'une évaluation systématique et systémique des modèles actuels et futurs.
Votre modèle IA français est-il prêt à se démarquer ? Soumettez le dans notre espace, et voyez comment vous vous comparez par rapport aux autres modèles.
❓ Comment ça marche :
Soumettez votre LLM français pour évaluation, et nous le testerons sur des benchmarks de référence spécifiquement adaptés pour la langue française — notre suite de benchmarks comprend :
- BBH-fr : Raisonnement complexe
- IFEval-fr : Suivi d'instructions
- GPQA-fr : Connaissances avancées
- MUSR-fr : Raisonnement narratif
- MATH_LVL5-fr : Capacités mathématiques
- MMMLU-fr : Compréhension multitâche
Le processus est encore manuel, mais nous travaillons sur son automatisation, avec le soutien de la communauté Hugging Face.
@clem , on se prépare pour une mise à niveau de l’espace ? 😏👀
Ce n'est pas qu'une question de chiffres—il s'agit de créer une IA qui reflète vraiment notre langue, notre culture et nos valeurs. OpenLLMFrenchLeaderboard est notre contribution personnelle pour façonner l'avenir des LLM en France.
posted
an
update
about 2 months ago
Post
360
Using AI to teach English as a Foreign Language? EFL teachers often have busy schedules, variable class sizes, and unexpected cancellations. Introducting VocabSova:
ZennyKenny/VocabSova
VocabSova is a simple chatbot interface that helps teachers create topical vocabulary lists, custom worksheets using that vocabulary, and group activities on a defined theme for a specific English-speaking level (according to CEFR international standards).
There is a great use case for AI in nearly every field, and language learning is a particularly apt domain in my opinion. VocabSova is in active development during its Alpha release, all feedback welcome.
VocabSova is a simple chatbot interface that helps teachers create topical vocabulary lists, custom worksheets using that vocabulary, and group activities on a defined theme for a specific English-speaking level (according to CEFR international standards).
There is a great use case for AI in nearly every field, and language learning is a particularly apt domain in my opinion. VocabSova is in active development during its Alpha release, all feedback welcome.
reacted to
jsulz's
post with 🔥
about 2 months ago
Post
2924
When the XetHub crew joined Hugging Face this fall,
@erinys
and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.
Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:
⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks
In our benchmarks, we found that using CDC to store iterative model and dataset version led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.
We're planning on our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?
https://huggingface.co/blog/from-files-to-chunks
Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:
⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks
In our benchmarks, we found that using CDC to store iterative model and dataset version led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.
We're planning on our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?
https://huggingface.co/blog/from-files-to-chunks
reacted to
jsulz's
post with 🚀
about 2 months ago
Post
2074
In August, the XetHub team joined Hugging Face
- https://huggingface.co/blog/xethub-joins-hf - and we’ve been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.
Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.
You can read more about the findings (with some jaw-dropping stats + charts) here https://www.linkedin.com/feed/update/urn:li:activity:7244486280351285248
- https://huggingface.co/blog/xethub-joins-hf - and we’ve been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.
Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.
You can read more about the findings (with some jaw-dropping stats + charts) here https://www.linkedin.com/feed/update/urn:li:activity:7244486280351285248
reacted to
davanstrien's
post with 🚀
about 2 months ago
Post
1314
huggingface.co/DIBT
is dead! Long live https://huggingface.co/data-is-better-together!
We're working on some very cool projects so we're doing a bit of tidying of the Data is Better Together Hub org 🤓
reacted to
ArthurZ's
post with 🔥
about 2 months ago
Post
2900
Native tensor parallel has landed in transformers!!! https://github.com/huggingface/transformers/pull/34184 thanks a lot to the torch team for their support!
Contributions are welcome to support more models! 🔥
Contributions are welcome to support more models! 🔥
reacted to
fdaudens's
post with 🚀
2 months ago
Post
1178
🪄 MagicQuill: AI that reads your mind for image edits! Point at what bugs you, and it suggests the perfect fixes. No more manual editing headaches. Try it here:
AI4Editing/MagicQuill
posted
an
update
5 months ago
Post
693
Very excited to have made the list and been invited to OpenAI DevDay 2024 at the London event 30 October! Looking forward to seeing what the future of AI dev holds, connecting with other professionals in the field, and advocating for open source AI!
https://openai.com/devday/
https://openai.com/devday/
reacted to
Taylor658's
post with 👍
5 months ago
Post
2350
💡Andrew Ng recently gave a strong defense of Open Source AI models and the need to slow down legislative efforts in the US and the EU to restrict innovation in Open Source AI at Stanford GSB.
🎥See video below
https://youtu.be/yzUdmwlh1sQ?si=bZc690p8iubolXm_
🎥See video below
https://youtu.be/yzUdmwlh1sQ?si=bZc690p8iubolXm_
As usual, Andrew Ng states the cogent position concisely and clearly for people who may not be familiar with the memes of the AI world.
Personally, I think some government committee or agency that focuses on AI could be a good thing, but having seen regulatory body after regulatory body in the United States fumble well meaning attempts to stay informed and turn those attempts into suffocating legislation, it seems that the only realistic position to advocate is no regulation whatsoever simply because any foot in the door oversight or law is simply going to be warped into red tape and bureaucracy based on the ever-changing winds of the election cycle.
Looking forward to trying!
Thank you for sharing.