[ { "path": "table_paper/2407.00088v1.json", "table_id": "2", "section": "5.1", "all_context": [ "As shown in Table 2 , we evaluate T-MAC across four distinct edge devices.", "These devices range from high-performance ones like M2-Ultra to less powerful ones like Raspberry Pi.", "The CPUs tested encompass Intel Core, Apple Silicon, and Cortex series.", "The operating systems include OSX, Linux, and Windows.", "This evaluation guarantees T-MAC's cross-platform compatibility and consistent performance across different instruction sets and various edge deployment scenarios.", "To evaluate the performance of T-MAC, we conduct extensive benchmarks using real-world low-bit LLMs and scenarios.", "For the kernel performance benchmark, we select matrix shapes derived from the Llama-2-7B and Llama-2-13B models, ensuring our evaluation reflects the practical demands.", "To conduct an end-to-end throughput test, we employed actual quantized models to demonstrate the practical efficacy of T-MAC across different bit-width configurations.", "Specifically, we employ 4-bit, 3-bit, 2-bit, and 1-bit quantized Llama models, and also 1-bit and 1.58-bit BitNet models that are trained from scratch.", "The 4-bit Llama models are from GPTQ (frantar2022gptq, ).", "The 3-bit and 2-bit Llama models are from BitDistiller (du2024bitdistiller, ).", "The 1-bit Llama models are from OneBit (xu2024onebit, ).", "We compared the performance of T-MAC with llama.cpp, a state-of-the-art implementation for LLM deployment on edge devices.", "We chose llama.cpp as the baseline for several compelling reasons.", "Firstly, llama.cpp represents the cutting edge in LLM deployment on edge devices, featuring highly optimized kernel implementations tailored to each hardware platform.", "Its versatility and robust performance make it an ideal benchmark for assessing the efficacy of new methodologies.", "Additionally, llama.cpp is implemented in plain C/C++ without any dependencies, ensuring maximum compatibility and efficiency across diverse hardware configurations.", "For kernel performance benchmarks, we utilized the optimized kernels provided by llama.cpp as the baselines on the respective hardware devices.", "In our end-to-end throughput evaluations, we integrate the LUT-based kernels from T-MAC into llama.cpp and compare it with the original llama.cpp.", "We perform both kernel-level and model-level measurements.", "To obtain precise and consistent kernel-level latency on CPU, we first perform a warmup of 10 iterations, followed by 100 runs to calculate an average.", "The warmup on M2-Ultra differs slightly from the others, requiring at least 1 second to maximize performance.", "To measure model-level latency, we integrate T-MAC into llama.cpp.", "We repeatedly generate 64 tokens for 20 iterations to evaluate token generation throughput.", "" ], "target_context_ids": [ 0, 1, 2, 3, 4 ], "selected_paragraphs": [ "[paragraph id = 0] As shown in Table 2 , we evaluate T-MAC across four distinct edge devices.", "[paragraph id = 1] These devices range from high-performance ones like M2-Ultra to less powerful ones like Raspberry Pi.", "[paragraph id = 2] The CPUs tested encompass Intel Core, Apple Silicon, and Cortex series.", "[paragraph id = 3] The operating systems include OSX, Linux, and Windows.", "[paragraph id = 4] This evaluation guarantees T-MAC's cross-platform compatibility and consistent performance across different instruction sets and various edge deployment scenarios." ], "table_html": "
<table>
<tr><th>Device</th><th>Processor</th><th>Performance Cores</th><th>Max. Memory Bandwidth (GB/s)</th></tr>
<tr><td>M2-Ultra</td><td>Apple M2-Ultra</td><td>16</td><td>819.2</td></tr>
<tr><td>Raspberry Pi 5</td><td>ARM Cortex-A76</td><td>4</td><td>17.1</td></tr>
<tr><td>Jetson AGX Orin</td><td>ARM Cortex-A78AE</td><td>12</td><td>204.8</td></tr>
<tr><td>Surface Book 3</td><td>Intel Core i5-1035G7</td><td>4</td><td>58.2</td></tr>
</table>
Table 2. Hardware device specification.
", "perturb_sentence_id": 0, "output": { "perturbed_statement": "[paragraph id = 0] As shown in Table 2 , we evaluate T-MAC across four distinct edge devices. These devices range from high-performance ones like Intel Xeon to less powerful ones like Cortex A357.", "perturbed_explanation": "The statement is incorrect on two points: 1. The high-performance device evaluated is the Apple M2-Ultra, not an Intel Xeon. 2. The less powerful device is the Raspberry Pi; no device named Cortex A357 appears in the context or in Table 2. The perturbed statement therefore misidentifies the devices used for evaluation." } }, { "path": "table_paper/2407.00088v1.json", "table_id": "4", "section": "5.6", "all_context": [ "There are two sources of error compared to a conventional mpGEMM implementation: (a) table quantization, which is an algorithmic approximation included in our method, and (b) fast aggregation, whose error is introduced during instruction execution within the fixed CPU architecture.", "We evaluate the impact of these two error sources at both the kernel level and the model level.", "We use the unquantized GEMV as the benchmark.", "The weights and activation of the GEMV are randomly generated FP16 values following a Gaussian distribution, which are then quantized to 4-bit for execution by llama.cpp and T-MAC.", "The Normalized Mean Squared Error (NMSE) is then computed between the ground truth and the mpGEMV outputs.", "As shown in Table 3 ,", "the NMSE difference between llama.cpp and T-MAC is negligible, indicating that the table quantization error is minimal.", "However, after applying fast aggregation, the NMSE increases to 2.5.", "To examine the impact of these errors on real-world models, we chose Llama-2-7B for testing.", "The models are the GGUF model converted from the official Llama-2-7B weights for the un-quantized ground truth and the original llama-2-7b.Q4_0.gguf model (gguf-models, ) released with llama.cpp for mpGEMM.", "After integrating T-MAC into llama.cpp, we conduct the evaluation through the perplexity (llamacpp-perplexity, ) tool provided by llama.cpp.", "The evaluation is performed on three different tasks: WikiText-2 (merity2016pointer, ) and lambada_openai (paperno-etal-2016-lambada, ; radford2019language, ) for perplexity (the lower the better), and WinoGrande (ai2:winogrande, ) for question answering accuracy (the higher the better).", "As shown in Table 4 , on all of the three tasks, T-MAC delivers the same results compared to llama.cpp, suggesting that the error introduced by T-MAC is negligible for real-world models.", "After toggling on the fast aggregation, the perplexity increases by 0.4 and 1.0 respectively and the accuracy drops by 3%.", "In summary, T-MAC introduces negligible error to model inference while offering significant speedup.", "The fast aggregation can further enhance performance, but at the cost of model quality.", "We offer this as an option for users in scenarios that prioritize real-time performance and are less sensitive to accuracy.", "Without fast aggregation, T-MAC can still achieve substantial gains according to Figure 10 .", "In the future, we anticipate that the error introduced by fast aggregation can be mitigated with straightforward optimizations of the CPU micro-architecture.", "" ], "target_context_ids": [ 11, 12, 13, 14, 15 ],
"selected_paragraphs": [ "[paragraph id = 11] The evaluation is performed on three different tasks: WikiText-2 (merity2016pointer, ) and lambada_openai (paperno-etal-2016-lambada, ; radford2019language, ) for perplexity (the lower the better), and WinoGrande (ai2:winogrande, ) for question answering accuracy (the higher the better).", "[paragraph id = 12] As shown in Table 4 , on all of the three tasks, T-MAC delivers the same results compared to llama.cpp, suggesting that the error introduced by T-MAC is negligible for real-world models.", "[paragraph id = 13] After toggling on the fast aggregation, the perplexity increases by 0.4 and 1.0 respectively and the accuracy drops by 3%.", "[paragraph id = 14] In summary, T-MAC introduces negligible error to model inference while offering significant speedup.", "[paragraph id = 15] The fast aggregation can further enhance performance, but at the cost of model quality." ], "table_html": "
<table>
<tr><th>Framework</th><th>Throughput (Tokens/sec)</th><th>WikiText2 (PPL)</th><th>lambada_openai (PPL)</th><th>WinoGrande (Acc.)</th></tr>
<tr><td>Un-quantized</td><td>3.79</td><td>5.80</td><td>12.65</td><td>71.0</td></tr>
<tr><td>llama.cpp</td><td>5.65</td><td>5.96</td><td>12.95</td><td>70.8</td></tr>
<tr><td>T-MAC</td><td>7.34</td><td>5.96</td><td>12.95</td><td>70.8</td></tr>
<tr><td>T-MAC (+FA)</td><td>8.97</td><td>6.38</td><td>13.99</td><td>67.8</td></tr>
</table>
Table 4. End-to-end throughput and model quality of Llama-2-7B-4bit on M2-Ultra with single-thread. T-MAC improves throughput by 1.3× compared to llama.cpp with the same model quality. Fast Aggregation (FA) can further improve the throughput gain to 1.6×, but the model quality will drop because of the numerical error of current CPU instructions.
", "perturb_sentence_id": 11, "output": { "perturbed_statement": "[paragraph id = 11] The evaluation is performed on four different tasks: WikiText-2 (merity2016pointer, ), lambada_openai (paperno-etal-2016-lambada, ; radford2019language, ), a custom summarization task (smith2020summarization, ), and WinoGrande (ai2:winogrande, ) for question answering accuracy (the higher the better). As shown in Table 4 , on all of these tasks, T-MAC delivers the same results compared to llama.cpp, suggesting that the error introduced by T-MAC is negligible for real-world models.", "perturbed_explanation": "The statement is incorrect because: 1. The evaluation involves only three tasks: WikiText-2, lambada_openai, and WinoGrande, each targeting perplexity or accuracy. 2. The statement claims 'four different tasks' and adds a 'custom summarization task' (smith2020summarization), which does not appear in the context or in Table 4. This discrepancy makes the statement factually incorrect." } }, { "path": "table_paper/2407.00088v1.json", "table_id": "5", "section": "5.7", "all_context": [ "GPUs are widely used in LLM deployments.", "We compare T-MAC on CPU with llama.cpp on GPU to illustrate the efficiency of T-MAC.", "llama.cpp is the state-of-the-art LLM inference framework on both CPU and GPU.", "Figure 11 shows the mpGEMV kernel performance comparison of T-MAC (CPU) and llama.cpp (GPU) on NVIDIA Jetson AGX Orin, a platform with an ARM CPU and an NVIDIA CUDA GPU.", "The kernel configurations are all from Llama-2-7B.", "T-MAC significantly outperforms the GPU on W1A16 in all cases, while achieving comparable performance on W2A16 and W3A16.", "Although the GPU performs better at higher bit-widths and larger shapes due to its powerful parallel computing capacity, this evaluation still shows the huge potential of CPU-based LLM deployments on edge devices.", "Table 5 shows the end-to-end comparison of the Llama-2-7B-2bit model on NVIDIA Jetson AGX Orin.", "Without T-MAC, CPU only performs better than GPU in power; however, the energy consumption is still worse than GPU due to lower throughput.", "Compared to llama.cpp on CPU, T-MAC not only improves the throughput to 2.2×, but also reduces the power to 69%, resulting in 3.2× energy efficiency.", "Compared to llama.cpp on GPU, although T-MAC only achieves 78% throughput, T-MAC only needs 34% power, resulting in 2.3× energy efficiency.", "Note that Figure 11 shows T-MAC outperforms the GPU on the mpGEMV kernels.", "The reason why the throughput of T-MAC is still lower than that of the GPU is the performance of the kernels other than mpGEMV in llama.cpp on CPU.", "" ], "target_context_ids": [ 7, 8, 9, 10 ], "selected_paragraphs": [ "[paragraph id = 7] Table 5 shows the end-to-end comparison of the Llama-2-7B-2bit model on NVIDIA Jetson AGX Orin.", "[paragraph id = 8] Without T-MAC, CPU only performs better than GPU in power; however, the energy consumption is still worse than GPU due to lower throughput.", "[paragraph id = 9] Compared to llama.cpp on CPU, T-MAC not only improves the throughput to 2.2×, but also reduces the power to 69%, resulting in 3.2× energy efficiency.", "[paragraph id = 10] Compared to llama.cpp on GPU, although T-MAC only achieves 78% throughput, T-MAC only needs 34% power, resulting in 2.3× energy efficiency." ], "table_html": "
<table>
<tr><th>Framework</th><th>Throughput (Tokens/sec)</th><th>Power (W)</th><th>Energy (J/token)</th></tr>
<tr><td>llama.cpp (CPU)</td><td>7.08</td><td>15.0</td><td>2.12</td></tr>
<tr><td>llama.cpp (GPU)</td><td>20.03</td><td>30.8</td><td>1.54</td></tr>
<tr><td>T-MAC (CPU)</td><td>15.62</td><td>10.4</td><td>0.66</td></tr>
</table>
Table 5. Llama-2-7B-2bit end-to-end inference throughput, power and energy comparisons on NVIDIA Jetson AGX Orin.
", "perturb_sentence_id": 10, "output": { "perturbed_statement": "[paragraph id = 10] Compared to llama.cpp on GPU, although T-MAC only achieves 3.1 throughput, T-MAC only needs 34 power, resulting in 0.9 energy efficiency.Note that Figure 11 shows T-MAC outperforms the GPU on the mpGEMV kernels.", "perturbed_explanation": "The statement is incorrect because the original comparison with llama.cpp on GPU reports that T-MAC achieves 78% of the GPU throughput (15.62 vs. 20.03 tokens/sec) and needs only 34% of the power (10.4 W vs. 30.8 W), resulting in 2.3× energy efficiency (0.66 vs. 1.54 J/token). The altered values of 3.1 throughput and 0.9 energy efficiency contradict both the original text and Table 5." } } ]