Collections
Collections including the paper arxiv:2412.09764 (Memory Layers at Scale). Each group below is one collection.
- STaR: Bootstrapping Reasoning With Reasoning
  Paper • 2203.14465 • Published • 8
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 47
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  Paper • 2405.04434 • Published • 17
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 29

- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
  Paper • 2402.15627 • Published • 35
- One Wide Feedforward is All You Need
  Paper • 2309.01826 • Published • 32
- Fast Feedforward Networks
  Paper • 2308.14711 • Published • 3
- Memory Layers at Scale
  Paper • 2412.09764 • Published • 3

- Prompt-to-Prompt Image Editing with Cross Attention Control
  Paper • 2208.01626 • Published • 2
- BERT Rediscovers the Classical NLP Pipeline
  Paper • 1905.05950 • Published • 2
- A Multiscale Visualization of Attention in the Transformer Model
  Paper • 1906.05714 • Published • 2
- Analyzing Transformers in Embedding Space
  Paper • 2209.02535 • Published • 3