Q-Sparse: All Large Language Models can be Fully Sparsely-Activated • arXiv:2407.10969 • Published Jul 15, 2024
Direct Preference Knowledge Distillation for Large Language Models • arXiv:2406.19774 • Published Jun 28, 2024
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models • arXiv:2402.13064 • Published Feb 20, 2024
BitNet: Scaling 1-bit Transformers for Large Language Models • arXiv:2310.11453 • Published Oct 17, 2023
Retentive Network: A Successor to Transformer for Large Language Models • arXiv:2307.08621 • Published Jul 17, 2023