Add pipeline tag and paper link to model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +22 -11
README.md CHANGED
@@ -1,9 +1,10 @@
  ---
- license: apache-2.0
- datasets:
- - open-r1/OpenR1-Math-220k
  base_model:
  - Qwen/Qwen3-14B
+ datasets:
+ - open-r1/OpenR1-Math-220k
+ license: apache-2.0
+ pipeline_tag: text-generation
  tags:
  - math
  - trimkv
@@ -12,15 +13,18 @@ tags:
  - Compression
  ---

- > TRIM-KV is an efficient and learnable key–value eviction strategy designed to improve the efficiency of large language models (LLMs) in long-horizon inference.
+ # TrimKV: Token Retention for Memory-Bounded Key-Value Eviction
+
+ TRIM-KV is an efficient and learnable key–value eviction strategy designed to improve the efficiency of large language models (LLMs) in long-horizon inference. It was introduced in the paper [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649) by Ngoc Bui, Hieu Trung Nguyen, Arman Cohan, and Rex Ying.

  The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call *token retention*, and then decay this importance exponentially over time to mimic the standard inference running with eviction.

  The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step, making them local, myopic, and highly dependent on the transient decoding state.

-
  <a href="https://arxiv.org/pdf/2512.03324"><img src="https://img.shields.io/badge/arxiv-2512.03324-red?style=for-the-badge"></a>

+ - **Official Code:** [GitHub - ngocbh/trimkv](https://github.com/ngocbh/trimkv)
+ - **Paper:** [https://huggingface.co/papers/2605.09649](https://huggingface.co/papers/2605.09649)

  ### Why TRIM-KV?

@@ -62,8 +66,6 @@ And it's interpretable
  pip install -r requirements.txt
  ```

- This is a minimal set of requirements for training purposes. Additional dependencies may be needed for running specific experiments. We provided a full example of the environment used in our experiments in [`examples/env.yaml`](examples/env.yaml).
-
  ### Installation

  From the root of the repo:
@@ -72,7 +74,7 @@ From the root of the repo:
  git clone https://github.com/ngocbh/trimkv.git
  cd trimkv
  pip install -e .
- ````
+ ```

  ---

@@ -84,7 +86,7 @@ from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
  from trimkv.cache_utils import TrimKVCache
  from transformers import AutoTokenizer

- model_path = "<TrimKV model_path here>"
+ model_path = "ngocbh/TrimKV-Qwen3-14B-Math"
  download_from = "huggingface" # options: "wandb", "local", "huggingface"

  model = TrimKVQwen3ForCausalLM.from_pretrained(
@@ -112,7 +114,7 @@ tokenizer = AutoTokenizer.from_pretrained(
  # Note: TRIM-KV uses TrimKVCache under the hood. So please pass TrimKVCache to model.generate
  ```

- For a runnable end-to-end example, see [`examples/test_qwen3.py`](examples/test_qwen3.py).
+ For a runnable end-to-end example, see [`examples/test_qwen3.py`](https://github.com/ngocbh/trimkv/blob/main/examples/test_qwen3.py).

  ## Released Models

@@ -126,4 +128,13 @@ For a runnable end-to-end example, see [`examples/test_qwen3.py`](examples/test_
  | Phi-3-mini-128k-instruct | [TrimKV-Phi-3-mini-128k-instruct](https://huggingface.co/ngocbh/TrimKV-Phi-3-mini-128k-instruct) | LongAlpaca | 128K | 512 |
  | DeepSeek-R1-Distill-Llama-8B | [TrimKV-DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/ngocbh/TrimKV-DeepSeek-R1-Distill-Llama-8B) | OpenR1-Math-220k | 32K | 256 |

- ---
+ ## Citation
+
+ ```bibtex
+ @article{bui2025make,
+ title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
+ author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
+ journal={arXiv preprint arXiv:2512.03324},
+ year={2025}
+ }
+ ```
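
Note on the mechanism described in the README text above: the model card says each key–value pair receives a retention score at creation time, and that this score decays exponentially over time. The toy sketch below illustrates one way such a score could drive eviction under a fixed cache budget. It is not the trimkv implementation: `CacheEntry`, `decayed_score`, `evict_to_budget`, the hand-picked retention values, and the "keep the top-scoring entries" rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    position: int     # decoding step at which the key-value pair was created
    retention: float  # hypothetical learned retention score in (0, 1)


def decayed_score(entry: CacheEntry, step: int) -> float:
    """Importance at `step`: the retention score raised to the token's age."""
    return entry.retention ** (step - entry.position)


def evict_to_budget(cache: list, step: int, budget: int) -> list:
    """Keep the `budget` entries with the highest decayed scores (assumed rule)."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(cache, key=lambda e: decayed_score(e, step), reverse=True)
    # Restore positional order for the surviving entries.
    return sorted(ranked[:budget], key=lambda e: e.position)


# Hand-picked retention values; a real model would predict these per token.
cache = [
    CacheEntry(position=0, retention=0.99),
    CacheEntry(position=1, retention=0.50),
    CacheEntry(position=2, retention=0.90),
    CacheEntry(position=3, retention=0.60),
]
kept = evict_to_budget(cache, step=4, budget=2)
kept_positions = [e.position for e in kept]
print(kept_positions)  # [0, 2]
```

In this toy run the most recent token (position 3) is evicted while two older, high-retention tokens survive. This matches the README's point that retention is query-agnostic and tracks long-term utility rather than recency or the current attention pattern.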