[ | |
{ | |
"path": "table_paper/2407.00023v2.json", | |
"table_id": "1", | |
"section": "4.4", | |
"all_context": [ | |
"We now provide a detailed analysis of Preble, including an ablation study and global scheduler scalability test.", | |
"Because of the high cost and low availability of H100 GPUs, we run all experiments in this section on A6000 GPUs.", | |
"Ablation study.", | |
"To understand where the benefits of Preble come from, we evaluate Preble by incrementally adding the features presented in Section 3.", | |
"We chose the tool use workload with a Zipf-1.1 popularity distribution among the prompts in the dataset to represent real-life skewed tool popularity.", | |
"Other workloads and distributions benefit from a different set of techniques.", | |
"We start with the SGLang round-robin baseline.", | |
"We first add the per-request E2 policy (Section 3.2), which improves both average and p99 request latency because of E2's dynamic load partitioning.", | |
"We then add the post-assignment global rebalancing and autoscaling, which successfully balances out load even more, resulting in further improvement, especially with p99.", | |
"Further adding the prefill/decode-aware handling results in more improvement on both average and p99, since it considers the current batch composition and is able to better utilize the GPU resources.", | |
"Finally, we add the local-scheduler priority-based wait-queue scheduling (§3.3), which, as expected, improves p99 but not average latency, as its goal is fairness.", | |
"Global scheduler performance and scalability.", | |
"We measure the maximum throughput of Preble's global scheduler by sending a large number of requests (e.g., 50,000) at once to eliminate the effect of request arrival patterns and saturate the scheduler.", | |
"Since the global prefix tree search is the most time-consuming task at the global scheduler, we test the Toolbench and VideoQA workloads, which have, respectively, the most complex and the simplest prefix tree structures among our five workloads.", | |
"Preble's global scheduler achieves a processing rate of 245 and 2931 requests per second for Toolbench and VideoQA, respectively.", | |
"We also measure the network processing speed and find it not to be the bottleneck.", | |
"With the peak GPU processing rate (30-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads' output lengths (Table 1), one Preble global scheduler can sustain at least 70 to 391 concurrent A100 GPUs.", | |
"If accounting for prefill time or running bigger models, our scheduler would sustain even more GPUs.", | |
"" | |
], | |
"target_context_ids": [ | |
16 | |
], | |
"selected_paragraphs": [ | |
"[paragraph id = 16] With the peak GPU processing rate (30-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads' output lengths (Table 1), one Preble global scheduler can sustain at least 70 to 391 concurrent A100 GPUs." | |
], | |
"table_html": "<figure class=\"ltx_table\" id=\"A1.T1\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"A1.T1.1\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"A1.T1.1.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_l ltx_border_r ltx_border_t\" id=\"A1.T1.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.1.1.1.1\">Workload</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t\" id=\"A1.T1.1.1.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.1.1.2.1\">Prompt Len</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t\" id=\"A1.T1.1.1.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.1.1.3.1\">Output Len</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t\" id=\"A1.T1.1.1.1.4\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.1.1.4.1\">Shared Prefix</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t\" id=\"A1.T1.1.1.1.5\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.1.1.5.1\">KeyPort.</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t\" id=\"A1.T1.1.1.1.6\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.1.1.6.1\">Req Share KeyPort.</span></th>\n</tr>\n<tr class=\"ltx_tr\" id=\"A1.T1.1.2.2\">\n<th class=\"ltx_td ltx_th ltx_th_column ltx_border_l ltx_border_r\" id=\"A1.T1.1.2.2.1\"></th>\n<th class=\"ltx_td ltx_th ltx_th_column ltx_border_r\" id=\"A1.T1.1.2.2.2\"></th>\n<th class=\"ltx_td ltx_th ltx_th_column ltx_border_r\" id=\"A1.T1.1.2.2.3\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r\" id=\"A1.T1.1.2.2.4\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.2.2.4.1\">in Prompt</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r\" id=\"A1.T1.1.2.2.5\"><span 
class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.2.2.5.1\">in Prompt</span></th>\n<th class=\"ltx_td ltx_th ltx_th_column ltx_border_r\" id=\"A1.T1.1.2.2.6\"></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A1.T1.1.3.1\">\n<td class=\"ltx_td ltx_align_center ltx_border_l ltx_border_r ltx_border_t\" id=\"A1.T1.1.3.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.3.1.1.1\">Toolbench</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"A1.T1.1.3.1.2\">(1835, 742)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"A1.T1.1.3.1.3\">(43, 16)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"A1.T1.1.3.1.4\">(85%, 13%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"A1.T1.1.3.1.5\">(76%, 16%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r ltx_border_t\" id=\"A1.T1.1.3.1.6\">(39, 64)</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A1.T1.1.4.2\">\n<td class=\"ltx_td ltx_align_center ltx_border_l ltx_border_r\" id=\"A1.T1.1.4.2.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.4.2.1.1\">Embodied Agent</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.4.2.2\">(2285, 471)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.4.2.3\">(16, 13)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.4.2.4\">(97%, 14%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.4.2.5\">(76%, 12%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.4.2.6\">(48, 8)</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A1.T1.1.5.3\">\n<td class=\"ltx_td ltx_align_center ltx_border_l ltx_border_r\" id=\"A1.T1.1.5.3.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.5.3.1.1\">Programming</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.5.3.2\">(3871, 1656)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.5.3.3\">(190, 
343)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.5.3.4\">(97%, 7.4%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.5.3.5\">(78%, 13%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.5.3.6\">(126, 2157)</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A1.T1.1.6.4\">\n<td class=\"ltx_td ltx_align_center ltx_border_l ltx_border_r\" id=\"A1.T1.1.6.4.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.6.4.1.1\">Video QA</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.6.4.2\">(9865, 5976)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.6.4.3\">(4, 1.5)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.6.4.4\">(88%, 32%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.6.4.5\">(99%, 0.2%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_r\" id=\"A1.T1.1.6.4.6\">(8.6, 2)</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A1.T1.1.7.5\">\n<td class=\"ltx_td ltx_align_center ltx_border_b ltx_border_l ltx_border_r\" id=\"A1.T1.1.7.5.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.1.7.5.1.1\">LooGLE</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_b ltx_border_r\" id=\"A1.T1.1.7.5.2\">(23474, 6105)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_b ltx_border_r\" id=\"A1.T1.1.7.5.3\">(16, 9.9)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_b ltx_border_r\" id=\"A1.T1.1.7.5.4\">(91%, 24%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_b ltx_border_r\" id=\"A1.T1.1.7.5.5\">(94%, 15%)</td>\n<td class=\"ltx_td ltx_align_center ltx_border_b ltx_border_r\" id=\"A1.T1.1.7.5.6\">(18, 8.6)</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 1: </span><span class=\"ltx_text ltx_font_bold\" id=\"A1.T1.4.1\">LLM Prompt Properties</span> <span class=\"ltx_text ltx_font_italic\" id=\"A1.T1.5.2\">Each cell except for number of requests shows 
(mean, standard deviation). Length represented using number of tokens. “KeyPort.” stands for Key Portion.\n</span></figcaption>\n</figure>", | |
"perturb_sentence_id": 16, | |
"output": { | |
"perturbed_statement": "[paragraph id = 16] With the peak GPU processing rate (25-150 tokens per second decoding speed with Mistral 7B on A100) and our workloads' output lengths (Table 1), one Preble global scheduler can sustain at least 70 to 391 concurrent A100 GPUs. If accounting for prefill time or running bigger models, our scheduler would sustain even fewer GPUs.", | |
"perturbed_explanation": "The original explanation is as follows: The statement highlights the efficiency of a specific scheduling system. This efficiency is supported by its ability to sustain a substantial number of GPUs simultaneously, especially under varying operational scenarios. Firstly, the processing rate specified (30-150 tokens per second) allows for consistent parallel processing. Secondly, considerations such as prefill time or model complexity augment the scheduler's sustaining capacities. The altered statement introduces an incorrect processing rate range (25-150 tokens per second) and mistakenly claims fewer GPUs could be supported when using larger models. Both these modifications contradict the factual dependencies and capacities described, undermining the representation of the system's capabilities." | |
} | |
} | |
] |