---
language: en
tags:
- llama
- template-mlp
- parameter-efficient
- mlp-modification
datasets:
- none
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# RECASTMLP-LLaMA

This model is a parameter-efficient modification of the LLaMA architecture that replaces the standard per-layer MLP blocks with template-based shared MLPs. LLaMA's attention mechanism is left unchanged; the parameter reduction comes entirely from the feed-forward networks.

## Model Description

### Overview
RECASTMLP-LLaMA modifies the original LLaMA architecture by introducing template banks for MLP layers. Instead of having separate MLP weights for each transformer layer, it uses a shared set of template weights that are combined using learned coefficients.
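
To make the idea concrete, here is a minimal NumPy sketch of how one layer's weight matrix could be formed from a template bank. It is an illustration under assumptions, not the model's actual implementation: the toy sizes, the coefficient values, and the helper name `combine_templates` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only; the real model uses a hidden size of 4096.
num_templates, d_in, d_out = 4, 8, 16

# Template bank: a stack of weight matrices shared by several layers.
templates = rng.standard_normal((num_templates, d_out, d_in))

# One learned coefficient per template for a given layer (hypothetical values).
coefficients = np.array([0.5, -0.25, 1.0, 0.1])


def combine_templates(templates, coefficients):
    """Return the weighted sum of templates: sum_k c_k * T_k."""
    return np.tensordot(coefficients, templates, axes=1)


# The layer never stores its own full matrix, only its coefficients.
layer_weight = combine_templates(templates, coefficients)

# Apply the reconstructed weight like an ordinary linear projection.
x = rng.standard_normal(d_in)
y = layer_weight @ x
```

In this sketch each layer contributes only `num_templates` scalars of its own, while the large matrices live in the shared bank.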

### Architecture Details
- **Base Model:** LLaMA 3.1 8B
- **Number of Templates:** 4
- **Number of Groups:** 8
- **Coefficients per Template:** 1
- **Coefficients:** 392
- **Hidden Size:** 4096
- **Intermediate Size:** 14336
- **Number of Attention Heads:** 32
- **Number of Key-Value Heads:** 8
- **Number of Layers:** 32
- **Max Position Embeddings:** 131072
- **Vocabulary Size:** 128256
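
For a sense of scale, the following arithmetic estimates the MLP parameter count of the unmodified base model from the sizes listed above. It assumes the standard LLaMA MLP layout of three bias-free projections (gate, up, and down); it does not compute the parameter count of this modified model.

```python
hidden_size = 4096
intermediate_size = 14336
num_layers = 32

# A standard LLaMA MLP has three bias-free projections: gate, up, and down.
params_per_mlp = 3 * hidden_size * intermediate_size
total_mlp_params = num_layers * params_per_mlp

print(f"{params_per_mlp:,} MLP parameters per layer")
print(f"{total_mlp_params:,} MLP parameters across {num_layers} layers")
```

This baseline is the budget that template sharing targets.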


### Key Features
1. **Template Banks:** Uses shared template weights across groups of layers
2. **Parameter Efficiency:** Reduces the total number of parameters by sharing MLP weights
3. **Group-wise Sharing:** Organizes layers into groups that share template banks
4. **Coefficient Learning:** Uses learned coefficients to combine template weights
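
Group-wise sharing can be sketched as a mapping from layer index to group index. The example below assumes that the 32 layers are split into 8 contiguous, equal-sized groups; the actual grouping used by the model may differ, and `group_of` is a hypothetical helper.

```python
num_layers, num_groups = 32, 8

# Assumption: groups are contiguous and equal-sized.
layers_per_group = num_layers // num_groups


def group_of(layer_idx):
    """Return the index of the template bank that a layer would draw from."""
    return layer_idx // layers_per_group


# Collect the layers assigned to each group.
groups = {
    g: [i for i in range(num_layers) if group_of(i) == g]
    for g in range(num_groups)
}

for g, layer_ids in groups.items():
    print(f"group {g}: layers {layer_ids}")
```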

## Usage

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Run a forward pass (AutoModel returns hidden states, not generated text)
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_size)