Pedro Cuenca's picture

Pedro Cuenca

pcuenq

AI & ML interests

None yet

Recent Activity

liked a Space about 10 hours ago
huggingfacejs/inference-widgets
liked a model about 10 hours ago
llamaindex/vdr-2b-multi-v1
liked a model about 10 hours ago
vidore/colpali-v1.2
View all activity

Articles

Organizations

Hugging Face's profile picture Google's profile picture Sentence Transformers's profile picture 🧨Diffusers's profile picture Flax Community's profile picture Hugging Face Internal Testing Organization's profile picture DALLE mini's profile picture ControlNet 1.1 Preview's profile picture I Hackathon Somos NLP: PLN en Español's profile picture SomosNLP's profile picture Huggingface.js's profile picture HuggingFaceM4's profile picture Apple's profile picture (De)fusing's profile picture Open-Source AI Meetup's profile picture Huggingface Projects's profile picture CompVis's profile picture CompVis Community's profile picture Diffusers Pipelines Library for Stable Diffusion's profile picture Core ML Projects's profile picture LocalCodeLLMs's profile picture Code Llama's profile picture UniverseTBD's profile picture Hands-On Generative AI with Transformers and Diffusion Models's profile picture Diffusers Demo at ICCV 2023's profile picture Hugging Face TB Research's profile picture Core ML Files's profile picture huggingPartyParis's profile picture adept-hf-collab's profile picture Enterprise Explorers's profile picture Latent Consistency's profile picture TTS Eval (OLD)'s profile picture ggml.ai's profile picture kotol's profile picture LocalLLaMA's profile picture gg-hf's profile picture Mistral AI EAP's profile picture Llzama's profile picture MLX Community's profile picture Hugging Face Assignments's profile picture IBM Granite's profile picture On-device Squad's profile picture TTS AGI's profile picture Social Post Explorers's profile picture Apple CoreNet Models 's profile picture hsramall's profile picture diffusers-internal-dev's profile picture gg-tt's profile picture Hugging Face Discord Community's profile picture LLHF's profile picture SLLHF's profile picture lbhf's profile picture Hugging Quants's profile picture Meta Llama's profile picture kmhf's profile picture nltpt's profile picture s0409's profile picture Mt Metrics's profile picture nltpt-q's profile picture dummyosan's profile picture Test Org's profile picture metavision's profile picture mv's profile picture Bert ... but new's profile picture qrias's profile picture open/ acc's profile picture wut?'s profile picture DDUF's profile picture None yet's profile picture Hugging Face Agents Course's profile picture

Posts 1

view post
Post
4848
OpenELM in Core ML

Apple recently released a set of efficient LLMs in sizes varying between 270M and 3B parameters. Their quality, according to benchmarks, is similar to OLMo models of comparable size, but they required half the pre-training tokens because they use layer-wise scaling, where the number of attention heads increases in deeper layers.

I converted these models to Core ML, for use on Apple Silicon, using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084. The converted models were uploaded to this community in the Hub for anyone that wants to integrate inside their apps: corenet-community/openelm-core-ml-6630c6b19268a5d878cfd194

The conversion was done with the following parameters:
- Precision: float32.
- Sequence length: fixed to 128.

With swift-transformers (https://github.com/huggingface/swift-transformers), I'm getting about 56 tok/s with the 270M on my M1 Max, and 6.5 with the largest 3B model. These speeds could be improved by converting to float16. However, there's some precision loss somewhere and generation doesn't work in float16 mode yet. I'm looking into this and will keep you posted! Or take a look at this issue if you'd like to help: https://github.com/huggingface/swift-transformers/issues/95

I'm also looking at optimizing inference using an experimental kv cache in swift-transformers. It's a bit tricky because the layers have varying number of attention heads, but I'm curious to see how much this feature can accelerate performance in this model family :)

Regarding the instruct fine-tuned models, I don't know the chat template that was used. The models use the Llama 2 tokenizer, but the Llama 2 chat template, or the default Alignment Handbook one that was used to train, are not recognized. Any ideas on this welcome!