---
language: en
tags:
- llama
- template-mlp
- parameter-efficient
- mlp-modification
datasets:
- none
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---
# RECASTMLP-LLaMA
This model implements a parameter-efficient modification of the LLaMA architecture by replacing the standard MLP layers with template-based shared MLPs. The model maintains LLaMA's attention mechanism while reducing parameters in the feed-forward networks.
## Model Description

### Overview
RECASTMLP-LLaMA modifies the original LLaMA architecture by introducing template banks for MLP layers. Instead of having separate MLP weights for each transformer layer, it uses a shared set of template weights that are combined using learned coefficients.
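The repository's remote code defines the actual layer; the following is only a minimal sketch of the idea, with hypothetical names and toy sizes, showing how one projection's weight could be formed as a coefficient-weighted sum of shared templates:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemplateLinear(nn.Module):
    """Sketch only: a projection whose weight is a learned mix of shared templates."""

    def __init__(self, templates: nn.Parameter):
        super().__init__()
        # templates: (num_templates, out_features, in_features), shared by several layers
        self.templates = templates
        # one scalar coefficient per template, private to this layer
        self.coefficients = nn.Parameter(
            torch.full((templates.shape[0],), 1.0 / templates.shape[0])
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Collapse the bank into a single weight matrix, then apply it as usual.
        weight = torch.einsum("t,toi->oi", self.coefficients, self.templates)
        return F.linear(x, weight)

# Toy usage (real sizes are 4096 -> 14336): two layers share one bank of 4 templates,
# so the bank's weights are stored once while each layer keeps only its coefficients.
bank = nn.Parameter(torch.randn(4, 128, 64) * 0.02)
layer_a, layer_b = TemplateLinear(bank), TemplateLinear(bank)
y = layer_a(torch.randn(2, 64))  # shape: (2, 128)
```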
### Architecture Details
- Base Model: LLaMA 3.1 8B
- Number of Templates: 4
- Number of Groups: 8
- Coefficients per Template: 1
- Coefficients: 392
- Hidden Size: 4096
- Intermediate Size: 14336
- Number of Attention Heads: 32
- Number of Key-Value Heads: 8
- Number of Layers: 32
- Max Position Embeddings: 131072
- Vocabulary Size: 128256
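These values can be read back from the published configuration. The fields below are standard LLaMA config entries, assuming the custom config keeps the usual field names; the template- and group-related fields are defined by the repository's remote code, so their exact names are not assumed here.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True)

print(config.hidden_size)              # 4096
print(config.intermediate_size)        # 14336
print(config.num_hidden_layers)        # 32
print(config.num_attention_heads)      # 32
print(config.num_key_value_heads)      # 8
print(config.max_position_embeddings)  # 131072
print(config.vocab_size)               # 128256
```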
### Key Features
- Template Banks: Uses shared template weights across groups of layers
- Parameter Efficiency: Reduces the total number of parameters by sharing MLP weights
- Group-wise Sharing: Organizes layers into groups that share template banks
- Coefficient Learning: Uses learned coefficients to combine template weights
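As a rough illustration of how the group-wise sharing described above could be wired (hypothetical layout and names, toy sizes; the repository's remote code is authoritative): 32 layers split into 8 contiguous groups, each group holding one bank of 4 templates, with only the mixing coefficients being per-layer.

```python
import torch
import torch.nn as nn

num_layers, num_groups, num_templates = 32, 8, 4
layers_per_group = num_layers // num_groups  # 4 layers share each bank
out_dim, in_dim = 128, 64                    # toy sizes (real: 14336, 4096)

# One shared template bank per group (only one projection shown, for brevity).
banks = nn.ParameterList(
    [nn.Parameter(torch.randn(num_templates, out_dim, in_dim) * 0.02) for _ in range(num_groups)]
)
# Per-layer trainable state: just the coefficients that mix the group's templates.
coeffs = nn.ParameterList(
    [nn.Parameter(torch.full((num_templates,), 1.0 / num_templates)) for _ in range(num_layers)]
)

def projection_weight(layer_idx: int) -> torch.Tensor:
    """Build layer `layer_idx`'s weight from its group's shared bank."""
    group_idx = layer_idx // layers_per_group  # e.g. layers 0-3 -> bank 0
    return torch.einsum("t,toi->oi", coeffs[layer_idx], banks[group_idx])

w5 = projection_weight(5)  # uses bank 1 combined with layer 5's own coefficients
```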
## Usage
```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8b")

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass (returns hidden states, not generated text)
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state
```
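The snippet above returns hidden states rather than generated text. Since the card's pipeline tag is `text-generation`, generation should follow the usual `transformers` pattern, assuming the repository's remote code registers a causal-LM class with `AutoModelForCausalLM` (check the repository if this assumption does not hold):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8b")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Greedy decoding of a short continuation; sampling flags can be added as usual.
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```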