Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
Abstract
To make foundation models more efficient and effective, our idea is to combine sequence transformation and state transformation. First, we prove the viability of rotary position embedding in the state space duality algorithm, which reduces the perplexity of the hybrid of quadratic causal self-attention and state space duality by more than 4%, ensuring that the combined sequence transformations share a unified position encoding. Second, we propose dynamic mask attention, which maintains 100% accuracy on the more challenging multi-query associative recall task, an improvement of more than 150% over quadratic causal self-attention and state space duality, ensuring that the combined sequence transformations selectively filter relevant information. Third, we design cross-domain mixture of experts, which makes expert retrieval with more than 1024 experts 8 to 10 times faster than a standard mixture of experts, ensuring that the combined state transformations retrieve the expert mixture quickly. Finally, we summarize these matrix algorithms that can form a foundation model: Wonderful Matrices, which can be a competitor to popular model architectures.
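As a concrete illustration of the first point, the sketch below applies standard rotary position embedding to query/key projections that a hybrid block could share between its quadratic attention path and its state space duality path. This is a minimal sketch under assumed conventions (PyTorch, rotate-by-halves, a `(batch, seq_len, heads, head_dim)` layout), not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's code): standard rotary position embedding
# applied once to queries/keys, so that both the quadratic attention branch and the
# state space duality branch see the same position encoding.
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (batch, seq_len, heads, head_dim)."""
    b, t, h, d = x.shape
    half = d // 2
    # One inverse frequency per channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=x.dtype, device=x.device) / half))
    pos = torch.arange(t, dtype=x.dtype, device=x.device)
    angles = torch.einsum("t,f->tf", pos, inv_freq)      # (t, half)
    cos = angles.cos()[None, :, None, :]                  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage: q and k are the shared projections fed to both sequence-transformation paths.
q = rotary_embedding(torch.randn(2, 128, 8, 64))
k = rotary_embedding(torch.randn(2, 128, 8, 64))
```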
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval (2024)
- Selective Attention: Enhancing Transformer through Principled Context Control (2024)
- LLM Tree Search (2024)
- IRLab@iKAT24: Learned Sparse Retrieval with Multi-aspect LLM Query Generation for Conversational Search (2024)
- Improving Tool Retrieval by Leveraging Large Language Models for Query Generation (2024)
- Efficient Ternary Weight Embedding Model: Bridging Scalability and Performance (2024)
- Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse GraphRAG (2024)