[
{
"path": "table_paper/2407.00079v3.json",
"table_id": "1",
"section": "4.2",
"all_context": [
"Figure 5 illustrates the distribution of input and output lengths in our trace, with an average input length of 7,590 tokens and an average output length of 182 tokens.",
"The average input-output ratio is approximately 720.",
"It is important to note that this is only a representative pattern and not unanimous for all workloads, reflecting Kimi's renowned capability for superior long-context processing and understanding.",
"We also conducted a simple cache policy analysis based on this trace, assuming a single global cache pool.",
"Table 1 compares three cache strategies: LRU, LFU, and LengthAwareCache (similar to LFU but prioritizing eviction of cache blocks occurring later in requests) across different cache capacities.",
"Increasing the cache capacity from 1,000 to 50,000 blocks boosts the cache hit ratio from 30% to 50%.",
"Further capacity increases show minimal improvement.",
"However, this should not be interpreted as an indication that larger caches are unnecessary, as the sample trace represents only a subset of real-world workloads.",
"The required capacity should scale proportionally in actual scenarios.",
"LRUCache performs best under this dataset's patterns, likely due to the temporal proximity in request utilization.",
"Additionally, we observed a notable imbalance in cache block popularity, with over 50% of cache blocks remaining unused while certain blocks are accessed tens of thousands of times, as shown in Figure 6 .",
"Replicating these hot blocks is essential to avoid transfer congestion.",
""
],
"target_context_ids": [
4,
5,
6,
7,
8,
9
],
"selected_paragraphs": [
"[paragraph id = 4] Table 1 compares three cache strategies: LRU, LFU, and LengthAwareCache (similar to LFU but prioritizing eviction of cache blocks occurring later in requests) across different cache capacities.",
"[paragraph id = 5] Increasing the cache capacity from 1,000 to 50,000 blocks boosts the cache hit ratio from 30% to 50%.",
"[paragraph id = 6] Further capacity increases show minimal improvement.",
"[paragraph id = 7] However, this should not be interpreted as an indication that larger caches are unnecessary, as the sample trace represents only a subset of real-world workloads.",
"[paragraph id = 8] The required capacity should scale proportionally in actual scenarios.",
"[paragraph id = 9] LRUCache performs best under this dataset's patterns, likely due to the temporal proximity in request utilization."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T1\">\n<figcaption class=\"ltx_caption ltx_centering\" style=\"font-size:90%;\"><span class=\"ltx_tag ltx_tag_table\">Table 1: </span>Cache hit rates under different cache policies and capacities.</figcaption>\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"S4.T1.4\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S4.T1.4.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_tt\" id=\"S4.T1.4.1.1.1\"><span class=\"ltx_text\" id=\"S4.T1.4.1.1.1.1\" style=\"font-size:90%;\">Block capacity</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_tt\" id=\"S4.T1.4.1.1.2\"><span class=\"ltx_text\" id=\"S4.T1.4.1.1.2.1\" style=\"font-size:90%;\">Inf</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T1.4.1.1.3\"><span class=\"ltx_text\" id=\"S4.T1.4.1.1.3.1\" style=\"font-size:90%;\">100000</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T1.4.1.1.4\"><span class=\"ltx_text\" id=\"S4.T1.4.1.1.4.1\" style=\"font-size:90%;\">50000</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T1.4.1.1.5\"><span class=\"ltx_text\" id=\"S4.T1.4.1.1.5.1\" style=\"font-size:90%;\">30000</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T1.4.1.1.6\"><span class=\"ltx_text\" id=\"S4.T1.4.1.1.6.1\" style=\"font-size:90%;\">10000</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T1.4.1.1.7\"><span class=\"ltx_text\" id=\"S4.T1.4.1.1.7.1\" style=\"font-size:90%;\">1000</span></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T1.4.2.1\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t\" id=\"S4.T1.4.2.1.1\"><span class=\"ltx_text\" id=\"S4.T1.4.2.1.1.1\" 
style=\"font-size:90%;\">LRUCache</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t\" id=\"S4.T1.4.2.1.2\"><span class=\"ltx_text\" id=\"S4.T1.4.2.1.2.1\" style=\"font-size:90%;\">0.51</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T1.4.2.1.3\"><span class=\"ltx_text\" id=\"S4.T1.4.2.1.3.1\" style=\"font-size:90%;\">0.51</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T1.4.2.1.4\"><span class=\"ltx_text\" id=\"S4.T1.4.2.1.4.1\" style=\"font-size:90%;\">0.50</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T1.4.2.1.5\"><span class=\"ltx_text\" id=\"S4.T1.4.2.1.5.1\" style=\"font-size:90%;\">0.48</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T1.4.2.1.6\"><span class=\"ltx_text\" id=\"S4.T1.4.2.1.6.1\" style=\"font-size:90%;\">0.40</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T1.4.2.1.7\"><span class=\"ltx_text\" id=\"S4.T1.4.2.1.7.1\" style=\"font-size:90%;\">0.30</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.4.3.2\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row\" id=\"S4.T1.4.3.2.1\"><span class=\"ltx_text\" id=\"S4.T1.4.3.2.1.1\" style=\"font-size:90%;\">LFUCache</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row\" id=\"S4.T1.4.3.2.2\"><span class=\"ltx_text\" id=\"S4.T1.4.3.2.2.1\" style=\"font-size:90%;\">0.51</span></th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T1.4.3.2.3\"><span class=\"ltx_text\" id=\"S4.T1.4.3.2.3.1\" style=\"font-size:90%;\">0.51</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T1.4.3.2.4\"><span class=\"ltx_text\" id=\"S4.T1.4.3.2.4.1\" style=\"font-size:90%;\">0.49</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T1.4.3.2.5\"><span class=\"ltx_text\" id=\"S4.T1.4.3.2.5.1\" style=\"font-size:90%;\">0.43</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T1.4.3.2.6\"><span class=\"ltx_text\" id=\"S4.T1.4.3.2.6.1\" 
style=\"font-size:90%;\">0.35</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T1.4.3.2.7\"><span class=\"ltx_text\" id=\"S4.T1.4.3.2.7.1\" style=\"font-size:90%;\">0.30</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.4.4.3\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb\" id=\"S4.T1.4.4.3.1\"><span class=\"ltx_text\" id=\"S4.T1.4.4.3.1.1\" style=\"font-size:90%;\">LengthAwareCache</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb\" id=\"S4.T1.4.4.3.2\"><span class=\"ltx_text\" id=\"S4.T1.4.4.3.2.1\" style=\"font-size:90%;\">0.51</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T1.4.4.3.3\"><span class=\"ltx_text\" id=\"S4.T1.4.4.3.3.1\" style=\"font-size:90%;\">0.50</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T1.4.4.3.4\"><span class=\"ltx_text\" id=\"S4.T1.4.4.3.4.1\" style=\"font-size:90%;\">0.48</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T1.4.4.3.5\"><span class=\"ltx_text\" id=\"S4.T1.4.4.3.5.1\" style=\"font-size:90%;\">0.42</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T1.4.4.3.6\"><span class=\"ltx_text\" id=\"S4.T1.4.4.3.6.1\" style=\"font-size:90%;\">0.35</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T1.4.4.3.7\"><span class=\"ltx_text\" id=\"S4.T1.4.4.3.7.1\" style=\"font-size:90%;\">0.30</span></td>\n</tr>\n</tbody>\n</table>\n</figure>",
"perturb_sentence_id": 5,
"output": {
"perturbed_statement": "[paragraph id = 5] Increasing the cache capacity from 1,000 to 50,000 blocks boosts the cache hit ratio from 30% to 70%.Further capacity increases show minimal improvement.",
"perturbed_explanation": "The original explanation for the improvement in cache hit ratio primarily stems from the increased availability of cached data blocks when cache capacity is expanded. This facilitates a reduction in cache misses for frequently accessed data. 2. However, the statement suggests an improvement from 30% to 70%, which deviates from the provided data indicating an increase to 50%. This exaggerates the impact of increasing cache capacity, misrepresenting the scale of quality improvement."
}
},
{
"path": "table_paper/2407.00079v3.json",
"table_id": "2",
"section": "8.1",
"all_context": [
"This section evaluates the end-to-end performance of Mooncake under different datasets and various workloads.",
"As stated before, to protect proprietary information and facilitate reproducibility, all the experimental results reported in this paper are based on a dummy model that follows the same architecture as LLaMA2-70B.",
"Testbed During the experiments, the system was deployed on a high-performance computing node cluster to test performance.",
"Each node in the cluster is configured as follows: 8 NVIDIA-A800-SXM4-80GB GPUs, each with 80GB HBM, connected by NVLINK; equipped with RDMA network cards that support up to 800 Gbps of interconnect bandwidth between nodes.",
"Each node deploys either a prefill instance or a decoding instance according to the startup parameter.",
"Dataset and Workload Building upon previous research [15 , 8 , 14 ], we selected or designed the datasets as outlined in Table 2 .",
"In addition to utilizing public datasets, we generated a batch of simulated data featuring predefined lengths and prefix cache ratios for our experiments.",
"To examine performance in real-world scenarios, we constructed a dataset consisting of 23,000 real request traces, each annotated with an arrival timestamp.",
"Experiments involving real request traces were conducted by replaying these requests according to their actual arrival times.",
"For other scenarios, we simulated requests using a Poisson arrival process and controlled the request rate through RPS (Requests per Second).",
"Metric In the experiments, we focus on the throughput performance of various systems under defined SLOs.",
"We measure the TTFT and TBT across different RPS rates, where a higher RPS signifies improved throughput.",
"To assess whether the majority of requests satisfy the SLOs, we use the 90th percentile (P90) values of TTFT and TBT as the ultimate metrics.",
"As mentioned in §2 , the thresholds for TTFT and TBT are set by multiplying the lowest observed RPS values by factors of 10 and 5, respectively.",
"Exceeding these thresholds indicates a failure to meet the SLOs and the corresponding consumed resources are considered as wasted.",
"For ease of comparison, we normalize all TTFT and TBT values against these upper limits, establishing a baseline of 1.0.",
"Baseline We employ vLLM, one of the state-of-the-art open-source LLM serving systems, as our experimental baseline.",
"vLLM incorporates continuous batching and PagedAttention technologies, significantly boosting inference throughput.",
"Despite its strengths, vLLM's design, which couples the prefill and decoding stages of inference requests, can cause disruptions during decoding in scenarios involving long contexts.",
"ArXiv Summarization L-Eval",
""
],
"target_context_ids": [
5,
6,
7,
8,
9
],
"selected_paragraphs": [
"[paragraph id = 5] Dataset and Workload Building upon previous research [15 , 8 , 14 ], we selected or designed the datasets as outlined in Table 2 .",
"[paragraph id = 6] In addition to utilizing public datasets, we generated a batch of simulated data featuring predefined lengths and prefix cache ratios for our experiments.",
"[paragraph id = 7] To examine performance in real-world scenarios, we constructed a dataset consisting of 23,000 real request traces, each annotated with an arrival timestamp.",
"[paragraph id = 8] Experiments involving real request traces were conducted by replaying these requests according to their actual arrival times.",
"[paragraph id = 9] For other scenarios, we simulated requests using a Poisson arrival process and controlled the request rate through RPS (Requests per Second)."
],
"table_html": "<figure class=\"ltx_table\" id=\"S8.T2\">\n<figcaption class=\"ltx_caption ltx_centering\" style=\"font-size:90%;\"><span class=\"ltx_tag ltx_tag_table\">Table 2: </span>Datasets used in the end-to-end experiment.</figcaption>\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"S8.T2.4\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S8.T2.4.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S8.T2.4.1.1.1\"><span class=\"ltx_text\" id=\"S8.T2.4.1.1.1.1\" style=\"font-size:90%;\">Dataset</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S8.T2.4.1.1.2\"><span class=\"ltx_text\" id=\"S8.T2.4.1.1.2.1\" style=\"font-size:90%;\">Avg Input Length</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S8.T2.4.1.1.3\"><span class=\"ltx_text\" id=\"S8.T2.4.1.1.3.1\" style=\"font-size:90%;\">Avg Output Length</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S8.T2.4.1.1.4\"><span class=\"ltx_text\" id=\"S8.T2.4.1.1.4.1\" style=\"font-size:90%;\">Cache Ratio</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S8.T2.4.1.1.5\"><span class=\"ltx_text\" id=\"S8.T2.4.1.1.5.1\" style=\"font-size:90%;\">Arrival Pattern</span></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S8.T2.4.2.1\">\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S8.T2.4.2.1.1\">\n<span class=\"ltx_text\" id=\"S8.T2.4.2.1.1.1\" style=\"font-size:90%;\">ArXiv Summarization </span><cite class=\"ltx_cite ltx_citemacro_cite\"><span class=\"ltx_text\" id=\"S8.T2.4.2.1.1.2.1\" style=\"font-size:90%;\">[</span><a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00079v3#bib.bib26\" title=\"\">26</a><span class=\"ltx_text\" id=\"S8.T2.4.2.1.1.3.2\" style=\"font-size:90%;\">]</span></cite>\n</td>\n<td class=\"ltx_td ltx_align_center 
ltx_border_t\" id=\"S8.T2.4.2.1.2\"><span class=\"ltx_text\" id=\"S8.T2.4.2.1.2.1\" style=\"font-size:90%;\">8088</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S8.T2.4.2.1.3\"><span class=\"ltx_text\" id=\"S8.T2.4.2.1.3.1\" style=\"font-size:90%;\">229</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S8.T2.4.2.1.4\"><span class=\"ltx_text\" id=\"S8.T2.4.2.1.4.1\" style=\"font-size:90%;\">~0%</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S8.T2.4.2.1.5\"><span class=\"ltx_text\" id=\"S8.T2.4.2.1.5.1\" style=\"font-size:90%;\">Poisson Process</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S8.T2.4.3.2\">\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.3.2.1\">\n<span class=\"ltx_text\" id=\"S8.T2.4.3.2.1.1\" style=\"font-size:90%;\">L-Eval </span><cite class=\"ltx_cite ltx_citemacro_cite\"><span class=\"ltx_text\" id=\"S8.T2.4.3.2.1.2.1\" style=\"font-size:90%;\">[</span><a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00079v3#bib.bib27\" title=\"\">27</a><span class=\"ltx_text\" id=\"S8.T2.4.3.2.1.3.2\" style=\"font-size:90%;\">]</span></cite>\n</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.3.2.2\"><span class=\"ltx_text\" id=\"S8.T2.4.3.2.2.1\" style=\"font-size:90%;\">19019</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.3.2.3\"><span class=\"ltx_text\" id=\"S8.T2.4.3.2.3.1\" style=\"font-size:90%;\">72</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.3.2.4\"><span class=\"ltx_text\" id=\"S8.T2.4.3.2.4.1\" style=\"font-size:90%;\">>80%</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.3.2.5\"><span class=\"ltx_text\" id=\"S8.T2.4.3.2.5.1\" style=\"font-size:90%;\">Poisson Process</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S8.T2.4.4.3\">\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.4.3.1\"><span class=\"ltx_text\" id=\"S8.T2.4.4.3.1.1\" style=\"font-size:90%;\">Simulated Data</span></td>\n<td class=\"ltx_td ltx_align_center\" 
id=\"S8.T2.4.4.3.2\"><span class=\"ltx_text\" id=\"S8.T2.4.4.3.2.1\" style=\"font-size:90%;\">16k, 32k, 64k, 128k</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.4.3.3\"><span class=\"ltx_text\" id=\"S8.T2.4.4.3.3.1\" style=\"font-size:90%;\">512</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.4.3.4\"><span class=\"ltx_text\" id=\"S8.T2.4.4.3.4.1\" style=\"font-size:90%;\">50%</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S8.T2.4.4.3.5\"><span class=\"ltx_text\" id=\"S8.T2.4.4.3.5.1\" style=\"font-size:90%;\">Poisson Process</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S8.T2.4.5.4\">\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S8.T2.4.5.4.1\"><span class=\"ltx_text\" id=\"S8.T2.4.5.4.1.1\" style=\"font-size:90%;\">Real Data</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S8.T2.4.5.4.2\"><span class=\"ltx_text\" id=\"S8.T2.4.5.4.2.1\" style=\"font-size:90%;\">7955</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S8.T2.4.5.4.3\"><span class=\"ltx_text\" id=\"S8.T2.4.5.4.3.1\" style=\"font-size:90%;\">194</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S8.T2.4.5.4.4\"><span class=\"ltx_text\" id=\"S8.T2.4.5.4.4.1\" style=\"font-size:90%;\">~50%</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S8.T2.4.5.4.5\"><span class=\"ltx_text\" id=\"S8.T2.4.5.4.5.1\" style=\"font-size:90%;\">Timestamp-based</span></td>\n</tr>\n</tbody>\n</table>\n</figure>",
"perturb_sentence_id": 5,
"output": {
"perturbed_statement": "[paragraph id = 5] Dataset and Workload Building upon previous research [15 , 8 , 14 ], we selected or designed the datasets as outlined in Table 4. In addition to utilizing non-public datasets, we generated a batch of simulated data featuring predefined lengths and prefix cache ratios for our tests.",
"perturbed_explanation": "The original statement provides an accurate depiction of the dataset preparation process: 1. The datasets were outlined in Table 2, and public datasets were utilized for generating simulated data. The altered statement introduces inaccuracies: 1. It incorrectly specifies Table 4 instead of Table 2 as the reference for the dataset outline. 2. It claims non-public datasets were utilized, which contradicts the original statement that mentions public datasets."
}
}
]