leonardlin's Collections
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Paper
•
2312.11514
•
Published
•
257
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Paper
•
2312.12456
•
Published
•
40
Accelerating LLM Inference with Staged Speculative Decoding
Paper
•
2308.04623
•
Published
•
23
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Paper
•
2208.07339
•
Published
•
4
Efficient Memory Management for Large Language Model Serving with PagedAttention
Paper
•
2309.06180
•
Published
•
25
Efficient LLM inference solution on Intel GPU
Paper
•
2401.05391
•
Published
•
9
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Paper
•
2401.08671
•
Published
•
14
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper
•
2401.10774
•
Published
•
54
Zero Bubble Pipeline Parallelism
Paper
•
2401.10241
•
Published
•
23
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Paper
•
2312.12728
•
Published
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Paper
•
2401.15077
•
Published
•
19
SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Paper
•
2401.15024
•
Published
•
69
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Paper
•
2311.03687
•
Published
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Paper
•
2402.05099
•
Published
•
19
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Paper
•
2402.02750
•
Published
•
3
Paper
•
2402.04925
•
Published
•
3
FAST: Factorizable Attention for Speeding up Transformers
Paper
•
2402.07901
•
Published
•
1
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Paper
•
2401.18079
•
Published
•
7
CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models
Paper
•
2404.08763
•
Published
•
2
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
Paper
•
2405.10637
•
Published
•
19
Distributed Speculative Inference of Large Language Models
Paper
•
2405.14105
•
Published
•
16
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Paper
•
2406.16858
•
Published
•
1
PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Paper
•
2406.06282
•
Published
•
36