[ { "path": "table_paper/2407.00023v2.json", "table_id": "1", "section": "4.4", "all_context": [ "We now provide a detailed analysis of Preble, including an ablation study and global scheduler scalability test.", "Because of H100 GPUs high cost and low availability, we run all experiments in this section with A6000 GPUs.", "Ablation study.", "To understand where the benefits of Preble come from, we evaluate Preble by incrementally adding features presented in Section 3 .", "We chose the tool use workload with a Zipf-1.1 popularity distribution among the prompts in the dataset to represent real-life skewed tool popularity.", "Other workloads and distributions benefit from a different set of techniques.", "We start with using the SGLang round-robin baseline.", "We first add the per-request E2 policy (Section 3.2 ), which results in an improvement on both average and p99 request latency because of E2 s dynamic load partitioning.", "We then add the post-assignment global rebalancing and autoscaling, which successfully balances out load even more, resulting in further improvement, especially with p99.", "Further adding the prefill/decode-aware handling results in more improvement on both average and p99, since it considers the current batch composition and is able to better utilize the GPU resources.", "Finally, we add the local-scheduler priority-based wait-queue scheduling (§3.3 ), which, as expected, improves p99 but not average latency, as its goal is fairness.", "Global scheduler performance and scalability.", "We measure the maximum throughput of Preble s global scheduler by sending a large number of requests (e.g., 50,000) at once to eliminate the effect of request arrival patterns and saturate the scheduler.", "Since the global prefix tree search is the most time-consuming task at the global scheduler, we test the Toolbench and VideoQA workloads, which have the most complex and simplest prefix tree structures in our five workloads.", "Preble s global scheduler achieves a processing rate of 245 and 2931 requests per second for Toolbench and VideoQA.", "We also measure the network processing speed and find it not to be the bottleneck.", "With the peak GPU processing rate (30-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads output length (Table 1 ), one Preble global scheduler can sustain at least 70 to 391 concurrent A100 GPUs.", "If accounting for prefill time or running bigger models, our scheduler would sustain even more GPUs.", "" ], "target_context_ids": [ 16 ], "selected_paragraphs": [ "[paragraph id = 16] With the peak GPU processing rate (30-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads output length (Table 1 ), one Preble global scheduler can sustain at least 70 to 391 concurrent A100 GPUs." ], "table_html": "
<table>
<caption>Table 1: LLM Prompt Properties. Each cell except for number of requests shows (mean, standard deviation). Length represented using number of tokens. “KeyPort.” stands for Key Portion.</caption>
<thead>
<tr><th>Workload</th><th>Prompt Len</th><th>Output Len</th><th>Shared Prefix in Prompt</th><th>KeyPort. in Prompt</th><th>Req Share KeyPort.</th></tr>
</thead>
<tbody>
<tr><td>Toolbench</td><td>(1835, 742)</td><td>(43, 16)</td><td>(85%, 13%)</td><td>(76%, 16%)</td><td>(39, 64)</td></tr>
<tr><td>Embodied Agent</td><td>(2285, 471)</td><td>(16, 13)</td><td>(97%, 14%)</td><td>(76%, 12%)</td><td>(48, 8)</td></tr>
<tr><td>Programming</td><td>(3871, 1656)</td><td>(190, 343)</td><td>(97%, 7.4%)</td><td>(78%, 13%)</td><td>(126, 2157)</td></tr>
<tr><td>Video QA</td><td>(9865, 5976)</td><td>(4, 1.5)</td><td>(88%, 32%)</td><td>(99%, 0.2%)</td><td>(8.6, 2)</td></tr>
<tr><td>LooGLE</td><td>(23474, 6105)</td><td>(16, 9.9)</td><td>(91%, 24%)</td><td>(94%, 15%)</td><td>(18, 8.6)</td></tr>
</tbody>
</table>
", "perturb_sentence_id": [ 16, 17 ], "output": { "perturbed_statement": "[paragraph id = 16] With the peak GPU processing rate (30-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads output length (Table 1 ), one Preble global scheduler can sustain at least 70 to 191 concurrent A100 GPUs. If accounting for prefill time or running bigger models, our scheduler would sustain even more GPUs.", "perturbed_explanation": "The original explanation explains the capacity of the Preble global scheduler in terms of concurrent A100 GPUs it can sustain. By reducing the upper limit of concurrent A100 GPUs from 391 to 191 in the statement, it underestimates the scheduler's actual capacity. 1. The original range of 70 to 391 concurrent GPUs accurately represents the scheduler's ability based on the provided processing rate and workload details. 2. By stating that the scheduler can only sustain a maximum of 191 concurrent GPUs, the statement incorrectly diminishes its true capability, which can impact understanding of the scheduler's efficiency and performance scalability." } } ]