Huiqiang Jiang PRO

iofu728

AI & ML interests

None yet

Recent Activity

updated a Space about 1 month ago
microsoft/MInference
upvoted a paper about 2 months ago
Differential Transformer

iofu728's activity

updated a Space about 1 month ago: microsoft/MInference
upvoted an article 2 months ago
Fine-tuning LLMs to 1.58bit: extreme quantization made easy • 203
upvoted an article 3 months ago
A failed experiment: Infini-Attention, and why we should keep trying? • 50
published an article 4 months ago
How to Optimize TTFT of 8B LLMs with 1M Tokens to 20s
By iofu728 • 2
New activity in microsoft/MInference 4 months ago
setup.py error • 1
#1 opened 5 months ago by NicDev
upvoted 2 articles 5 months ago
RegMix: Data Mixture as Regression for Language Model Pre-training
By SivilTaram • 10
MInference 1.0: 10x Faster Million Context Inference with a Single GPU
By liyucheng • 11
posted an update 5 months ago • 1067
Welcome to MInference! It leverages the dynamic sparse nature of LLM attention, which exhibits some static patterns, to speed up pre-filling for million-token LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse indices online and dynamically computes attention with optimal custom kernels. This approach achieves up to a 10x pre-filling speedup on an A100 while maintaining accuracy at 1M tokens.
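
To make the idea concrete, here is a minimal PyTorch sketch of attention restricted to a toy "vertical-and-slash" sparse pattern, one of the static pattern families assigned per head. This is an illustration only, not the optimized kernels: the pattern parameters are made-up examples and the mask is materialized densely, so it shows the pattern rather than the speedup.

```python
import torch

def vertical_slash_mask(seq_len, n_vertical=4, n_slash=4, device="cpu"):
    # Toy "vertical + slash" pattern: a few global columns plus a few
    # recent diagonals, kept causal. Parameter values are illustrative.
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool, device=device)
    mask[:, :n_vertical] = True  # vertical lines: always attend to the first tokens
    for d in range(n_slash):     # slash lines: attend along the last few diagonals
        idx = torch.arange(d, seq_len, device=device)
        mask[idx, idx - d] = True
    return torch.tril(mask)      # enforce causality

def sparse_attention(q, k, v, mask):
    # Dense reference with the sparse mask applied; the real system
    # computes only the kept entries with custom kernels.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

seq_len, d_head = 16, 8
q = k = v = torch.randn(1, seq_len, d_head)
out = sparse_attention(q, k, v, vertical_slash_mask(seq_len))
print(out.shape)  # torch.Size([1, 16, 8])
```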

For more details, please check:
project page: https://aka.ms/MInference
code: https://github.com/microsoft/MInference
paper: MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (2407.02490)
hf demo: microsoft/MInference
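
For orientation, a minimal usage sketch along the lines of the GitHub README: the checkpoint name is just an example long-context model, and you should check the repo above for the current API.

```python
from transformers import pipeline
from minference import MInference

# Any supported long-context model; this checkpoint is only an example.
model_name = "gradientai/Llama-3-8B-Instruct-262k"
pipe = pipeline("text-generation", model=model_name,
                torch_dtype="auto", device_map="auto")

# Patch the loaded model so pre-filling runs with MInference's
# dynamic sparse attention kernels.
minference_patch = MInference("minference", model_name)
pipe.model = minference_patch(pipe.model)

print(pipe("Summarize the following document: ...", max_new_tokens=64))
```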
New activity in zero-gpu-explorers/README 5 months ago