stereoplegic's Collections
S^{3}: Increasing GPU Utilization during Generative Inference for Higher Throughput (arXiv:2306.06000)
Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920)
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline (arXiv:2305.13144)
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference (arXiv:2303.06182)
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers (arXiv:2305.15805)
FlashDecoding++: Faster Large Language Model Inference on GPUs (arXiv:2311.01282)
S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv:2311.03285)
Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)
Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934)
RecycleGPT: An Autoregressive Language Model with Recyclable Module (arXiv:2308.03421)
Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens (arXiv:2305.04241)
Latency Adjustable Transformer Encoder for Language Understanding (arXiv:2201.03327)
Punica: Multi-Tenant LoRA Serving (arXiv:2310.18547)
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity (arXiv:2309.10285)
Distributed Inference and Fine-tuning of Large Language Models Over The Internet (arXiv:2312.08361)
SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv:2401.00448)
Fast Inference of Mixture-of-Experts Language Models with Offloading (arXiv:2312.17238)
Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference (arXiv:2401.08383)
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (arXiv:2402.07033)
IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (arXiv:2405.02842)