
Stephen Oates PRO

soates

AI & ML interests

None yet

Organizations

None yet

soates's activity

upvoted an article 2 months ago
Article: Fine-tuning LLMs to 1.58bit: extreme quantization made easy • 203 upvotes
upvoted 2 articles 3 months ago
Article: Llama-3.1-Storm-8B: Improved SLM with Self-Curation + Model Merging • by akjindal53244 • 73 upvotes
Article: A failed experiment: Infini-Attention, and why we should keep trying? • 50 upvotes
upvoted an article 6 months ago
updated a collection 7 months ago
Reacted to BramVanroy's post with 👍 8 months ago
Does anyone have experience with finetuning Gemma? Even the 2B variant feels more memory-heavy than Mistral 7B. I know its vocabulary is much larger (250k), but I'm a bit surprised that the max batch size I can fit on an A100 80GB is only 2, whereas I could fit 4 with Mistral 7B, even though Gemma is much smaller everywhere except the embedding layer. Both runs used FA (Flash Attention), the same sequence length, and the same DeepSpeed ZeRO-3 settings. And yes, I'm using the most recent hotfix of transformers, which solves a memory issue with Gemma and other models.

Any prior experience you can share, or suggestions to improve throughput?
• 4 replies
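For context, here is a minimal sketch of the kind of run the post describes: Gemma-2B finetuned with Flash Attention 2 and DeepSpeed ZeRO-3 through the transformers Trainer API. The batch size and sequence setup mirror the numbers in the post, but the output directory and the ZeRO-3 config path are illustrative placeholders, not BramVanroy's actual script.

```python
# Hypothetical sketch of the setup described in the post: Gemma-2B with
# Flash Attention 2 and DeepSpeed ZeRO-3 via the transformers Trainer.
# output_dir and the ZeRO-3 JSON path are made-up placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "google/gemma-2b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # the "FA" mentioned in the post
)
model.gradient_checkpointing_enable()  # common memory-saving lever

training_args = TrainingArguments(
    output_dir="gemma-2b-sft",
    per_device_train_batch_size=2,   # the max the post reports on an A100 80GB
    gradient_accumulation_steps=8,   # recover a larger effective batch size
    bf16=True,
    deepspeed="ds_zero3.json",       # hypothetical ZeRO-3 config file
)
```

One plausible contributor to the memory gap, an inference on my part rather than something the post states: the cross-entropy logits tensor scales with vocabulary size (batch × sequence × 256k for Gemma), which can dwarf the equivalent tensor for Mistral's 32k vocabulary even when the rest of the model is smaller.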