Optimizing Mixtral-8x7B-Instruct-v0.1 for Hugging Face Chat

#54

by Husain - opened Dec 19, 2023

Husain

Dec 19, 2023

edited Dec 19, 2023

What kind of optimizations are used to run MistralAI/Mixtral-8x7B-Instruct-v0.1 in Hugging Face Chat (https://huggingface.co/chat)? Is it the default model running in full precision?
Or are there optimizations that reduce the memory needed to run the model, such as float16 or 8-bit/4-bit quantization with bitsandbytes?
Is Flash Attention 2 used too?

ybelkada

Dec 20, 2023

Hi @Husain
I think HuggingChat uses TGI (Text Generation Inference) under the hood: https://github.com/huggingface/text-generation-inference
Specifically, the Mixtral modeling code is here: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py
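
For a sense of scale on the memory question, here is a rough back-of-the-envelope sketch of the weight footprint at the precisions mentioned above. It assumes roughly 46.7B total parameters for Mixtral-8x7B (the figure Mistral AI has reported) and counts weights only. It ignores the KV cache, activations, and the extra scale/zero-point data that quantized formats carry, so real usage will be higher. It says nothing about which precision HuggingChat actually runs.

```python
# Rough, weights-only memory estimate for Mixtral-8x7B at several precisions.
# Assumption: ~46.7e9 total parameters (approximate figure reported by Mistral AI).
TOTAL_PARAMS = 46.7e9

# Nominal bytes per parameter; quantized formats carry some extra metadata
# that is deliberately ignored here.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(TOTAL_PARAMS, nbytes)
    for name, nbytes in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name:>17}: ~{gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone need on the order of 93 GB, which is why the model is usually sharded across several GPUs or quantized to fit on fewer.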
