[ { "path": "table_paper/2407.00079v3.json", "table_id": "1", "section": "4.2", "all_context": [ "Figure 5 illustrates the distribution of input and output lengths in our trace, with an average input length of 7,590 tokens and an average output length of 182 tokens.", "The average input-output ratio is approximately 720.", "It is important to note that this is only a representative pattern and not unanimous for all workloads, reflecting Kimi s renowned capability for superior long-context processing and understanding.", "We also conducted a simple cache policy analysis based on this trace, assuming a single global cache pool.", "Table 1 compares three cache strategies: LRU, LFU, and LengthAwareCache (similar to LFU but prioritizing eviction of cache blocks occurring later in requests) across different cache capacities.", "Increasing the cache capacity from 1,000 to 50,000 blocks boosts the cache hit ratio from 30% to 50%.", "Further capacity increases show minimal improvement.", "However, this should not be interpreted as an indication that larger caches are unnecessary, as the sample trace represents only a subset of real-world workloads.", "The required capacity should scale proportionally in actual scenarios.", "LRUCache performs best under this dataset s patterns, likely due to the temporal proximity in request utilization.", "Additionally, we observed a notable imbalance in cache block popularity, with over 50% of cache blocks remaining unused while certain blocks are accessed tens of thousands of times, as shown in Figure 6 .", "Replicating these hot blocks is essential to avoid transfer congestion.", "" ], "target_context_ids": [ 4, 5, 6, 7, 8, 9 ], "selected_paragraphs": [ "[paragraph id = 4] Table 1 compares three cache strategies: LRU, LFU, and LengthAwareCache (similar to LFU but prioritizing eviction of cache blocks occurring later in requests) across different cache capacities.", "[paragraph id = 5] Increasing the cache capacity from 1,000 to 50,000 blocks boosts the cache hit ratio from 30% to 50%.", "[paragraph id = 6] Further capacity increases show minimal improvement.", "[paragraph id = 7] However, this should not be interpreted as an indication that larger caches are unnecessary, as the sample trace represents only a subset of real-world workloads.", "[paragraph id = 8] The required capacity should scale proportionally in actual scenarios.", "[paragraph id = 9] LRUCache performs best under this dataset s patterns, likely due to the temporal proximity in request utilization." ], "table_html": "
<figure>
<figcaption>Table 1: Cache hit rates under different cache policies and capacities.</figcaption>
<table>
<tr><th>Block capacity</th><th>Inf</th><th>100000</th><th>50000</th><th>30000</th><th>10000</th><th>1000</th></tr>
<tr><td>LRUCache</td><td>0.51</td><td>0.51</td><td>0.50</td><td>0.48</td><td>0.40</td><td>0.30</td></tr>
<tr><td>LFUCache</td><td>0.51</td><td>0.51</td><td>0.49</td><td>0.43</td><td>0.35</td><td>0.30</td></tr>
<tr><td>LengthAwareCache</td><td>0.51</td><td>0.50</td><td>0.48</td><td>0.42</td><td>0.35</td><td>0.30</td></tr>
</table>
</figure>
", "perturb_sentence_id": [ 5, 6 ], "output": { "perturbed_statement": "[paragraph id = 5] Increasing the cache capacity from 1,000 to 60,000 blocks boosts the cache hit ratio from 30% to 60%. Further capacity increases show minimal improvement.", "perturbed_explanation": "1. The original explanation suggests that increasing cache capacity from 1,000 to 50,000 blocks improves the cache hit ratio from 30% to 50%, indicating a specific relationship between capacity and hit ratio. 2. The statement alters this relationship, incorrectly suggesting that increasing capacity to 60,000 blocks results in a 60% hit ratio. The context does not support this claim, and it falsely implies that increasing capacity beyond 50,000 continues to significantly improve the hit ratio, contrary to the noted minimal improvement beyond a certain point." } }, { "path": "table_paper/2407.00079v3.json", "table_id": "2", "section": "8.1", "all_context": [ "This section evaluates the end-to-end performance of Mooncake under different datasets and various workloads.", "As stated before, to protect proprietary information and facilitate reproducibility, all the experimental results reported in this paper are based on a dummy model that follows the same architecture as LLaMA2-70B.", "Testbed During the experiments, the system was deployed on a high-performance computing node cluster to test performance.", "Each node in the cluster is configured as follows: 8 NVIDIA-A800-SXM4-80GB GPUs, each with 80GB HBM, connected by NVLINK; equipped with RDMA network cards that supporting up to 800 Gbps of interconnect bandwidth between nodes.", "Each node deploys either a prefill instance or a decoding instance according to the startup parameter.", "Dataset and Workload Building upon previous research [15 , 8 , 14 ], we selected or designed the datasets as outlined in Table 2 .", "In addition to utilizing public datasets, we generated a batch of simulated data featuring predefined lengths and prefix cache ratios for our experiments.", "To examine performance in real-world scenarios, we constructed a dataset consisting of 23,000 real request traces, each annotated with an arrival timestamp.", "Experiments involving real request traces were conducted by replaying these requests according to their actual arrival times.", "For other scenarios, we simulated requests using a Poisson arrival process and controlled the request rate through RPS (Requests per Second).", "Metric In the experiments, we focus on the throughput performance of various systems under defined SLOs.", "We measure the TTFT and TBT across different RPS rates, where a higher RPS signifies improved throughput.", "To assess whether the majority of requests satisfy the SLOs, we use the 90th percentile (P90) values of TTFT and TBT as the ultimate metrics.", "As mentioned in §2 , the thresholds for TTFT and TBT are set by multiplying the lowest observed RPS values by factors of 10 and 5, respectively.", "Exceeding these thresholds indicates a failure to meet the SLOs and the corresponding consumed resources are considered as wasted.", "For ease of comparison, we normalize all TTFT and TBT values against these upper limits, establishing a baseline of 1.0.", "Baseline We employ vLLM, one of the state-of-the-art open-source LLM serving systems, as our experimental baseline.", "vLLM incorporates continuous batching and PagedAttention technologies, significantly boosting inference throughput.", "Despite its strengths, vLLM s design, which couples the prefill and decoding stages of 
inference requests, can cause disruptions during decoding in scenarios involving long contexts.", "ArXiv Summarization L-Eval", "" ], "target_context_ids": [ 5, 6, 7, 8, 9 ], "selected_paragraphs": [ "[paragraph id = 5] Dataset and Workload Building upon previous research [15 , 8 , 14 ], we selected or designed the datasets as outlined in Table 2 .", "[paragraph id = 6] In addition to utilizing public datasets, we generated a batch of simulated data featuring predefined lengths and prefix cache ratios for our experiments.", "[paragraph id = 7] To examine performance in real-world scenarios, we constructed a dataset consisting of 23,000 real request traces, each annotated with an arrival timestamp.", "[paragraph id = 8] Experiments involving real request traces were conducted by replaying these requests according to their actual arrival times.", "[paragraph id = 9] For other scenarios, we simulated requests using a Poisson arrival process and controlled the request rate through RPS (Requests per Second)." ], "table_html": "
<figure>
<figcaption>Table 2: Datasets used in the end-to-end experiment.</figcaption>
<table>
<tr><th>Dataset</th><th>Avg Input Length</th><th>Avg Output Length</th><th>Cache Ratio</th><th>Arrival Pattern</th></tr>
<tr><td>ArXiv Summarization [26]</td><td>8088</td><td>229</td><td>~0%</td><td>Poisson Process</td></tr>
<tr><td>L-Eval [27]</td><td>19019</td><td>72</td><td>&gt;80%</td><td>Poisson Process</td></tr>
<tr><td>Simulated Data</td><td>16k, 32k, 64k, 128k</td><td>512</td><td>50%</td><td>Poisson Process</td></tr>
<tr><td>Real Data</td><td>7955</td><td>194</td><td>~50%</td><td>Timestamp-based</td></tr>
</table>
</figure>
", "perturb_sentence_id": [ 5, 6 ], "output": { "perturbed_statement": "[paragraph id = 5] Dataset and Workload Building upon previous research [15 , 8 , 14 ], we selected or designed the datasets as outlined in Table 2. In addition to utilizing public datasets, we generated a batch of simulated data featuring variable timestamps and postfix cache ratios for our experiments.", "perturbed_explanation": "In the context, the original explanation states that simulated data was generated with predefined lengths and prefix cache ratios. 1. The statement refers to 'variable timestamps' instead of 'predefined lengths,' which is a factual alteration because there is no mention of varying timestamps in the experiments described. 2. The statement refers to 'postfix cache ratios' instead of 'prefix cache ratios,' introducing another factual error because the term 'postfix cache ratios' is not consistent with the original focus on prefix cache management in the dataset." } } ]