Collections
Collections including the paper arxiv:2412.09764 (Memory Layers at Scale). Each group below is one collection.
- STaR: Bootstrapping Reasoning With Reasoning
  Paper • 2203.14465 • Published • 8
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 47
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  Paper • 2405.04434 • Published • 17
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 29

- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
  Paper • 2402.15627 • Published • 35
- One Wide Feedforward is All You Need
  Paper • 2309.01826 • Published • 32
- Fast Feedforward Networks
  Paper • 2308.14711 • Published • 3
- Memory Layers at Scale
  Paper • 2412.09764 • Published • 3

- Prompt-to-Prompt Image Editing with Cross Attention Control
  Paper • 2208.01626 • Published • 2
- BERT Rediscovers the Classical NLP Pipeline
  Paper • 1905.05950 • Published • 2
- A Multiscale Visualization of Attention in the Transformer Model
  Paper • 1906.05714 • Published • 2
- Analyzing Transformers in Embedding Space
  Paper • 2209.02535 • Published • 3