---
license: apache-2.0
---

# Model Card for Zamba2-1.2B-Instruct

Zamba2-1.2B-Instruct is obtained from [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) by fine-tuning on instruction-following and chat datasets. Specifically:

1. SFT of the base [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) model on [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) and [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
2. DPO of the SFT checkpoint on [ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), [orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs), and [OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences)

Zamba2-1.2B-Instruct is a hybrid model composed of state-space ([Mamba2](https://github.com/state-spaces/mamba)) and transformer blocks. It is based on the [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) architecture.

## Quick start

### Prerequisites

To download Zamba2-1.2B-Instruct, clone Zyphra's fork of transformers:

1. `git clone https://github.com/Zyphra/transformers_zamba2.git`
2. `cd transformers_zamba2`
3. Install the repository: `pip install -e .`
4. `pip install accelerate`

You can run the model without the optimized Mamba2 kernels, but this is **not** recommended, as it results in significantly higher latency and memory usage. To run on CPU, specify `use_mamba_kernels=False` when loading the model with `AutoModelForCausalLM.from_pretrained` (a minimal sketch is included at the end of this card).

### Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Instantiate model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-1.2B-instruct", device_map="cuda", torch_dtype=torch.bfloat16)

# Format the input as a chat template
prompt = "What factors contributed to the fall of the Roman Empire?"
sample = [{'role': 'user', 'content': prompt}]
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)

# Tokenize input and generate output
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=150, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)
print(tokenizer.decode(outputs[0]))
```

## Performance

Zamba2-1.2B-Instruct achieves leading instruction-following and multi-turn chat performance for a model of its size and matches strong models that are significantly larger. For instance, Zamba2-1.2B-Instruct outperforms Gemma2-2B-Instruct, a very strong model over 2x its size, on MT-Bench.

| Model | Size | MT-Bench | IFEval |
|-------------|----|----|----|
| **Zamba2-1.2B-Instruct** | 1.2B | **59.53** | **41.45** |
| Gemma2-2B-Instruct | 2.7B | 51.69 | 42.20 |
| H2O-Danube-1.6B-Chat | 1.6B | 49.78 | 27.95 |
| StableLM-1.6B-Chat | 1.6B | 49.87 | 33.77 |
| SmolLM-1.7B-Instruct | 1.7B | 43.37 | 16.53 |
| Qwen2-1.5B-Instruct | 1.5B | N/A | 34.68 |

Moreover, due to its hybrid SSM architecture, Zamba2-1.2B-Instruct achieves very low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.
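
As noted under Prerequisites, the model can also be loaded for CPU-only inference by disabling the optimized Mamba2 kernels. The sketch below mirrors the GPU example above but passes `use_mamba_kernels=False`; the choice of `device_map="cpu"` and `torch.float32` here is an illustrative assumption rather than a tested configuration, and generation will be noticeably slower than with the kernels enabled.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load on CPU with the optimized Mamba2 kernels disabled (see the
# Prerequisites note above). Expect higher latency and memory usage
# than the GPU path.
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-1.2B-instruct",
    device_map="cpu",
    torch_dtype=torch.float32,  # assumption: float32 as a safe CPU default; bfloat16 may also work
    use_mamba_kernels=False,    # required when the CUDA Mamba2 kernels are unavailable
)

# Same chat-template formatting and greedy generation as the GPU example
prompt = "What factors contributed to the fall of the Roman Empire?"
sample = [{'role': 'user', 'content': prompt}]
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)

input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False)
outputs = model.generate(**input_ids, max_new_tokens=150, use_cache=True, do_sample=False)
print(tokenizer.decode(outputs[0]))
```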