license: apache-2.0
Model Card for cerebras/Cerebras-LLaVA-7B
This checkpoint contains the language encoder and projector weights of the multimodal LLaVA-7B model, trained with our Cerebras implementation and training recipe. The vision encoder checkpoint for this model can be found at cerebras/Cerebras-ViT-L-336-patch14-llava7b-ShareGPT4V.
Note: ShareGPT4V is added to the vision model name to ensure that checkpoints load correctly in the LLaVA source repo.
For full details of this model and its training, please read our paper and release blog post, both to be released shortly.
Model Architecture
Cerebras-LLaVA-7B is a transformer model with the following architecture details:
- Vision encoder: CLIP-VisionModel-Large. It handles images of size 336 x 336 with a patch size of 14.
- Large Language Model: starts from Vicuna-7B checkpoints and is instruction-finetuned on various datasets.
- Projector: the module that connects the LLM and the vision encoder. It consists of two linear layers with GELU activation (mlp2x-gelu).
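As a quick check on the vision encoder numbers above, the following sketch works out how many patches a 336 x 336 image yields at patch size 14. The helper name `patch_grid` is hypothetical and used only for illustration. Whether an additional class token is kept alongside the patch features depends on the LLaVA configuration and is not stated in this card.

```python
# Values taken from the architecture list above.
IMAGE_SIZE = 336  # input resolution in pixels (square image)
PATCH_SIZE = 14   # side length of each square patch in pixels


def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of patch size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(IMAGE_SIZE, PATCH_SIZE)
print(per_side, total)  # 24 576
```

Each image is therefore split into a 24 x 24 grid, giving 576 patch features for the projector to map into the language model's embedding space.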
Loading the model
This model can be loaded directly using the LLaVA source code repository. For installation, please refer to the instructions in the source code repository.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "cerebras/Cerebras-LLaVA-7B"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
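The snippet above imports `eval_model` without calling it. The sketch below shows one way it might be used, following the quick-start pattern in the LLaVA repository's README. The helper names `build_eval_args` and `run_eval`, the image path, and the prompt are hypothetical placeholders, and the generation settings are illustrative values rather than recommendations from this card.

```python
from types import SimpleNamespace


def build_eval_args(model_path, image_file, query):
    """Bundle the settings that eval_model reads from its args object."""
    return SimpleNamespace(
        model_path=model_path,
        model_base=None,
        query=query,
        conv_mode=None,        # leave the conversation template choice to LLaVA
        image_file=image_file,
        sep=",",               # separator used when several image files are given
        temperature=0,         # illustrative generation settings
        top_p=None,
        num_beams=1,
        max_new_tokens=512,
    )


def run_eval(args):
    # Imported lazily so that building the arguments does not require LLaVA.
    from llava.eval.run_llava import eval_model
    from llava.mm_utils import get_model_name_from_path

    args.model_name = get_model_name_from_path(args.model_path)
    eval_model(args)


args = build_eval_args(
    "cerebras/Cerebras-LLaVA-7B",
    "path/to/image.jpg",       # hypothetical placeholder path
    "Describe this image.",    # hypothetical prompt
)
# run_eval(args)  # uncomment once LLaVA is installed; this downloads the checkpoint
```

Note that `eval_model` loads the model itself from `args.model_path`, so it does not reuse the `tokenizer` and `model` objects returned by `load_pretrained_model`.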
Acknowledgements
We are thankful to all Cerebras engineers, past and present, who made this work possible.