---
language: en
tags:
- llama
- template-mlp
- parameter-efficient
- mlp-modification
datasets:
- none
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---
# RECASTMLP-LLaMA
This model implements a parameter-efficient modification of the LLaMA architecture by replacing the standard MLP layers with template-based shared MLPs. The model maintains LLaMA's attention mechanism while reducing parameters in the feed-forward networks.
## Model Description

### Overview
RECASTMLP-LLaMA modifies the original LLaMA architecture by introducing template banks for MLP layers. Instead of having separate MLP weights for each transformer layer, it uses a shared set of template weights that are combined using learned coefficients.
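The repository's remote code defines the actual layer; the following is only a minimal sketch of the idea, with hypothetical names and toy sizes, showing how one projection's weight could be formed as a coefficient-weighted sum of shared templates:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemplateLinear(nn.Module):
    """Sketch only: a projection whose weight is a learned mix of shared templates."""

    def __init__(self, templates: nn.Parameter):
        super().__init__()
        # templates: (num_templates, out_features, in_features), shared by several layers
        self.templates = templates
        # one scalar coefficient per template, private to this layer
        self.coefficients = nn.Parameter(
            torch.full((templates.shape[0],), 1.0 / templates.shape[0])
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Collapse the bank into a single weight matrix, then apply it as usual.
        weight = torch.einsum("t,toi->oi", self.coefficients, self.templates)
        return F.linear(x, weight)

# Toy usage (real sizes are 4096 -> 14336): two layers share one bank of 4 templates,
# so the bank's weights are stored once while each layer keeps only its coefficients.
bank = nn.Parameter(torch.randn(4, 128, 64) * 0.02)
layer_a, layer_b = TemplateLinear(bank), TemplateLinear(bank)
y = layer_a(torch.randn(2, 64))  # shape: (2, 128)
```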
### Architecture Details
- Base Model: LLaMA 3.1 8B
- Number of Templates: 4
- Number of Groups: 8
- Coefficients per Template: 1
- Coefficients: 392
- Hidden Size: 4096
- Intermediate Size: 14336
- Number of Attention Heads: 32
- Number of Key-Value Heads: 8
- Number of Layers: 32
- Max Position Embeddings: 131072
- Vocabulary Size: 128256
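These values can be read back from the published configuration. The fields below are standard LLaMA config entries, assuming the custom config keeps the usual field names; the template- and group-related fields are defined by the repository's remote code, so their exact names are not assumed here.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True)

print(config.hidden_size)              # 4096
print(config.intermediate_size)        # 14336
print(config.num_hidden_layers)        # 32
print(config.num_attention_heads)      # 32
print(config.num_key_value_heads)      # 8
print(config.max_position_embeddings)  # 131072
print(config.vocab_size)               # 128256
```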
### Key Features
- Template Banks: Uses shared template weights across groups of layers
- Parameter Efficiency: Reduces the total number of parameters by sharing MLP weights
- Group-wise Sharing: Organizes layers into groups that share template banks
- Coefficient Learning: Uses learned coefficients to combine template weights
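As a rough illustration of how the group-wise sharing described above could be wired (hypothetical layout and names, toy sizes; the repository's remote code is authoritative): 32 layers split into 8 contiguous groups, each group holding one bank of 4 templates, with only the mixing coefficients being per-layer.

```python
import torch
import torch.nn as nn

num_layers, num_groups, num_templates = 32, 8, 4
layers_per_group = num_layers // num_groups  # 4 layers share each bank
out_dim, in_dim = 128, 64                    # toy sizes (real: 14336, 4096)

# One shared template bank per group (only one projection shown, for brevity).
banks = nn.ParameterList(
    [nn.Parameter(torch.randn(num_templates, out_dim, in_dim) * 0.02) for _ in range(num_groups)]
)
# Per-layer trainable state: just the coefficients that mix the group's templates.
coeffs = nn.ParameterList(
    [nn.Parameter(torch.full((num_templates,), 1.0 / num_templates)) for _ in range(num_layers)]
)

def projection_weight(layer_idx: int) -> torch.Tensor:
    """Build layer `layer_idx`'s weight from its group's shared bank."""
    group_idx = layer_idx // layers_per_group  # e.g. layers 0-3 -> bank 0
    return torch.einsum("t,toi->oi", coeffs[layer_idx], banks[group_idx])

w5 = projection_weight(5)  # uses bank 1 combined with layer 5's own coefficients
```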
## Usage
```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8b")

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass (returns hidden states, not generated text)
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state
```
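The snippet above returns hidden states rather than generated text. Since the card's pipeline tag is `text-generation`, generation should follow the usual `transformers` pattern, assuming the repository's remote code registers a causal-LM class with `AutoModelForCausalLM` (check the repository if this assumption does not hold):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8b")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Greedy decoding of a short continuation; sampling flags can be added as usual.
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```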