[ { "path": "table_paper/2407.00023v2.json", "table_id": "1", "section": "4.4", "all_context": [ "We now provide a detailed analysis of Preble, including an ablation study and global scheduler scalability test.", "Because of H100 GPUs' high cost and low availability, we run all experiments in this section with A6000 GPUs.", "Ablation study.", "To understand where the benefits of Preble come from, we evaluate Preble by incrementally adding features presented in Section 3.", "We chose the tool use workload with a Zipf-1.1 popularity distribution among the prompts in the dataset to represent real-life skewed tool popularity.", "Other workloads and distributions benefit from a different set of techniques.", "We start with using the SGLang round-robin baseline.", "We first add the per-request E2 policy (Section 3.2), which results in an improvement on both average and p99 request latency because of E2's dynamic load partitioning.", "We then add the post-assignment global rebalancing and autoscaling, which successfully balances out load even more, resulting in further improvement, especially with p99.", "Further adding the prefill/decode-aware handling results in more improvement on both average and p99, since it considers the current batch composition and is able to better utilize the GPU resources.", "Finally, we add the local-scheduler priority-based wait-queue scheduling (§3.3), which, as expected, improves p99 but not average latency, as its goal is fairness.", "Global scheduler performance and scalability.", "We measure the maximum throughput of Preble's global scheduler by sending a large number of requests (e.g., 50,000) at once to eliminate the effect of request arrival patterns and saturate the scheduler.", "Since the global prefix tree search is the most time-consuming task at the global scheduler, we test the Toolbench and VideoQA workloads, which have the most complex and simplest prefix tree structures in our five workloads.", "Preble's global scheduler 
achieves a processing rate of 245 and 2931 requests per second for Toolbench and VideoQA, respectively.", "We also measure the network processing speed and find it not to be the bottleneck.", "With the peak GPU processing rate (30-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads' output length (Table 1), one Preble global scheduler can sustain at least 70 to 391 concurrent A100 GPUs.", "If accounting for prefill time or running bigger models, our scheduler would sustain even more GPUs.", "" ], "target_context_ids": [ 16 ], "selected_paragraphs": [ "[paragraph id = 16] With the peak GPU processing rate (30-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads' output length (Table 1), one Preble global scheduler can sustain at least 70 to 391 concurrent A100 GPUs." ], "table_html": "
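<table>
<tr><th>Workload</th><th>Prompt Len</th><th>Output Len</th><th>Shared Prefix in Prompt</th><th>KeyPort. in Prompt</th><th>Req Share KeyPort.</th></tr>
<tr><td>Toolbench</td><td>(1835, 742)</td><td>(43, 16)</td><td>(85%, 13%)</td><td>(76%, 16%)</td><td>(39, 64)</td></tr>
<tr><td>Embodied Agent</td><td>(2285, 471)</td><td>(16, 13)</td><td>(97%, 14%)</td><td>(76%, 12%)</td><td>(48, 8)</td></tr>
<tr><td>Programming</td><td>(3871, 1656)</td><td>(190, 343)</td><td>(97%, 7.4%)</td><td>(78%, 13%)</td><td>(126, 2157)</td></tr>
<tr><td>Video QA</td><td>(9865, 5976)</td><td>(4, 1.5)</td><td>(88%, 32%)</td><td>(99%, 0.2%)</td><td>(8.6, 2)</td></tr>
<tr><td>LooGLE</td><td>(23474, 6105)</td><td>(16, 9.9)</td><td>(91%, 24%)</td><td>(94%, 15%)</td><td>(18, 8.6)</td></tr>
</table>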
\n
Table 1: LLM Prompt Properties. Each cell except for number of requests shows (mean, standard deviation). Length is represented in number of tokens. “KeyPort.” stands for Key Portion.\n
\n
", "perturb_sentence_id": 16, "output": { "perturbed_statement": "[paragraph id = 16] With the peak GPU processing rate (25-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads output length (Table 1 ), one Preble global scheduler can sustain at least 70 to 391 concurrent A100 GPUs. If accounting for prefill time or running bigger models, our scheduler would sustain even fewer GPUs.", "perturbed_explanation": "The original explanation is as follows: The statement highlights the efficiency of a specific scheduling system. This efficiency is supported by its ability to sustain a substantial number of GPUs simultaneously, especially under varying operational scenarios. Firstly, the processing rate specified (30-150 tokens per second) allows for consistent parallel processing. Secondly, considerations such as prefill time or model complexity augment the scheduler's sustaining capacities. The altered statement introduces an incorrect processing rate range (25-150 tokens per second) and mistakenly claims fewer GPUs could be supported when using larger models. Both these modifications contradict the factual dependencies and capacities described, undermining the representation of the system's capabilities." } } ]