---
language: en
tags:
  - llama
  - template-mlp
  - parameter-efficient
  - mlp-modification
datasets:
  - none
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# RECASTMLP-LLaMA

This model implements a parameter-efficient modification of the LLaMA architecture by replacing the standard MLP layers with template-based shared MLPs. The model maintains LLaMA's attention mechanism while reducing parameters in the feed-forward networks.

## Model Description

### Overview

RECASTMLP-LLaMA modifies the original LLaMA architecture by introducing template banks for MLP layers. Instead of having separate MLP weights for each transformer layer, it uses a shared set of template weights that are combined using learned coefficients.
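The weight-sharing idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual module: the class name `TemplateBankMLP` and the per-layer `coefficients` argument are invented here, and the gate projection of LLaMA's SwiGLU MLP is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemplateBankMLP(nn.Module):
    """Sketch of a template-bank MLP: a small set of shared template weights
    is combined into per-layer MLP weights via learned coefficients."""

    def __init__(self, hidden_size=4096, intermediate_size=14336, num_templates=4):
        super().__init__()
        # Templates shared by every layer in a group (LLaMA's gate projection omitted for brevity)
        self.up_templates = nn.Parameter(torch.randn(num_templates, hidden_size, intermediate_size) * 0.02)
        self.down_templates = nn.Parameter(torch.randn(num_templates, intermediate_size, hidden_size) * 0.02)

    def forward(self, x, coefficients):
        # coefficients: shape (num_templates,), learned separately for each layer
        up = torch.einsum("t,tij->ij", coefficients, self.up_templates)
        down = torch.einsum("t,tij->ij", coefficients, self.down_templates)
        return F.silu(x @ up) @ down

# Tiny usage with small sizes, just to check shapes
bank = TemplateBankMLP(hidden_size=8, intermediate_size=16, num_templates=4)
coeffs = torch.softmax(torch.randn(4), dim=0)  # one learned coefficient per template
y = bank(torch.randn(2, 5, 8), coeffs)         # -> shape (2, 5, 8)
```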

### Architecture Details

- Base Model: LLaMA 3.1 8B
- Number of Templates: 4
- Number of Groups: 8
- Coefficients per Template: 1
- Coefficients: 392
- Hidden Size: 4096
- Intermediate Size: 14336
- Number of Attention Heads: 32
- Number of Key-Value Heads: 8
- Number of Layers: 32
- Max Position Embeddings: 131072
- Vocabulary Size: 128256
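The same values can be read from the hosted configuration. The snippet below is a sketch: the standard LLaMA fields (`hidden_size`, `num_hidden_layers`) should be present, while the exact names of the template-bank fields are assumptions and may differ in this repository's custom config class.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True)
print(config.hidden_size)        # 4096
print(config.num_hidden_layers)  # 32
# Template-bank-specific fields -- attribute names are guesses, check the repo's config.json:
# print(config.num_templates, config.num_groups)
```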

### Key Features

1. Template Banks: Uses shared template weights across groups of layers
2. Parameter Efficiency: Reduces the total number of parameters by sharing MLP weights
3. Group-wise Sharing: Organizes layers into groups that share template banks
4. Coefficient Learning: Uses learned coefficients to combine template weights
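To make the group-wise sharing concrete, here is a minimal sketch of the layer-to-group assignment implied by the numbers above (32 layers, 8 groups). The contiguous mapping is an assumption for illustration; the actual grouping logic lives in the repository's modeling code.

```python
num_layers, num_groups = 32, 8
layers_per_group = num_layers // num_groups  # 4 consecutive layers share one template bank

group_of_layer = [layer_idx // layers_per_group for layer_idx in range(num_layers)]
print(group_of_layer)
# [0, 0, 0, 0, 1, 1, 1, 1, ..., 7, 7, 7, 7]
```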

## Usage

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("appledora/RECASTMLP-llama3.1-f8t4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8b")

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Run a forward pass (returns hidden states)
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state
```
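Note that `AutoModel` loads the base model and returns hidden states rather than generated text. A quick sanity check on the output shape (the 4096 in the comment assumes the hidden size listed above):

```python
print(hidden_states.shape)  # torch.Size([1, sequence_length, 4096])
```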