---
language: en
tags:
- llama
- template-mlp
- parameter-efficient
- mlp-modification
datasets:
- none
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# RECASTMLP-LLaMA

This model implements a parameter-efficient modification of the LLaMA architecture that replaces the standard MLP layers with template-based shared MLPs. It keeps LLaMA's attention mechanism unchanged while reducing the number of parameters in the feed-forward networks.

## Model Description

### Overview

RECASTMLP-LLaMA modifies the original LLaMA architecture by introducing template banks for the MLP layers. Instead of maintaining separate MLP weights for every transformer layer, it uses a shared set of template weights that are combined with learned, per-layer coefficients (an illustrative sketch is included at the end of this card).

### Architecture Details

- **Base Model:** LLaMA 3.1 8B
- **Number of Templates:** 4
- **Number of Groups:** 8
- **Coefficients per Template:** 1
- **Coefficients:** 392
- **Hidden Size:** 4096
- **Intermediate Size:** 14336
- **Number of Attention Heads:** 32
- **Number of Key-Value Heads:** 8
- **Number of Layers:** 32
- **Max Position Embeddings:** 131072
- **Vocabulary Size:** 128256

### Key Features

1. **Template Banks:** Uses shared template weights across groups of layers
2. **Parameter Efficiency:** Reduces the total number of parameters by sharing MLP weights
3. **Group-wise Sharing:** Organizes layers into groups that share template banks
4. **Coefficient Learning:** Uses learned coefficients to combine template weights

## Usage

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass (the base model returns hidden states, not generated text)
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state
```
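
## Illustrative Sketch: Template-Based MLP Sharing

The snippet below is a minimal sketch of the template-bank idea described above, not the repository's actual implementation. It shows how a shared bank of template weight tensors can be mixed by small, per-layer learned coefficients to produce each layer's SwiGLU projections. All names (`TemplateBankMLP`, `coeff_gate`, `_mix`, ...) and the toy dimensions are illustrative assumptions; the real model uses a hidden size of 4096, an intermediate size of 14336, and 4 templates shared across 8 groups of layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemplateBankMLP(nn.Module):
    """Hypothetical template-bank MLP: the heavy projection weights live in a
    shared bank of templates; each layer learns only a small coefficient vector
    per projection to mix those templates into its own weight matrices."""

    def __init__(self, bank_gate, bank_up, bank_down, num_templates):
        super().__init__()
        # Shared template tensors, shape: (num_templates, out_features, in_features)
        self.bank_gate = bank_gate
        self.bank_up = bank_up
        self.bank_down = bank_down
        # Per-layer mixing coefficients (one scalar per template and projection)
        self.coeff_gate = nn.Parameter(torch.randn(num_templates))
        self.coeff_up = nn.Parameter(torch.randn(num_templates))
        self.coeff_down = nn.Parameter(torch.randn(num_templates))

    @staticmethod
    def _mix(bank, coeffs):
        # Weighted sum over the template axis -> one (out_features, in_features) matrix
        return torch.einsum("t,toi->oi", coeffs, bank)

    def forward(self, x):
        gate = F.linear(x, self._mix(self.bank_gate, self.coeff_gate))
        up = F.linear(x, self._mix(self.bank_up, self.coeff_up))
        down_w = self._mix(self.bank_down, self.coeff_down)
        return F.linear(F.silu(gate) * up, down_w)  # SwiGLU, as in LLaMA's MLP


# Toy dimensions for readability (the real model uses hidden=4096, intermediate=14336)
hidden, inter, num_templates = 64, 172, 4
bank_gate = nn.Parameter(0.02 * torch.randn(num_templates, inter, hidden))
bank_up = nn.Parameter(0.02 * torch.randn(num_templates, inter, hidden))
bank_down = nn.Parameter(0.02 * torch.randn(num_templates, hidden, inter))

# Layers in the same group reference the same bank; each layer adds only its coefficients.
layer_a = TemplateBankMLP(bank_gate, bank_up, bank_down, num_templates)
layer_b = TemplateBankMLP(bank_gate, bank_up, bank_down, num_templates)
print(layer_a(torch.randn(1, 8, hidden)).shape)  # torch.Size([1, 8, 64])
```

Under this kind of scheme, the dominant MLP parameter cost is paid once per group of layers rather than once per layer; each additional layer in a group contributes only its coefficient vectors, which is where the savings over the standard per-layer LLaMA MLP come from.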