Device Ranking System
Overview
The ranking system implements a multi-dimensional approach to evaluating and comparing device performance across LLM (GGUF) benchmark runs.
Scoring Algorithm
Standard Benchmark Conditions
PP_CONFIG = 512 # Standard prompt processing token count
TG_CONFIG = 128 # Standard token generation count
# Component Weights
TG_WEIGHT = 0.6 # Token generation weight (60%)
PP_WEIGHT = 0.4 # Prompt processing weight (40%)
- PP is given 40% weight because prompt processing is a one-time cost per prompt
- TG is given the higher 60% weight because token generation represents the ongoing cost of producing output (see the numeric sketch below)
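For intuition, here is a small numeric sketch of how the weighting plays out; the speeds are hypothetical, not measured values:

# Hypothetical device: 400 t/s prompt processing, 25 t/s token generation
pp_speed = 400.0
tg_speed = 25.0
base_score = tg_speed * TG_WEIGHT + pp_speed * PP_WEIGHT  # 25*0.6 + 400*0.4 = 175.0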
Quantization Quality Factors
QUANT_TIERS = {
    "F16": 1.0,
    "F32": 1.0,
    "Q8": 0.8,
    "Q6": 0.6,
    "Q5": 0.5,
    "Q4": 0.4,
    "Q3": 0.3,
    "Q2": 0.2,
    "Q1": 0.1,
}
- Linear scale from 0.1 to 1.0 based on quantization level
- F16 and F32 are both treated as 1.0, which skews the results somewhat in favor of less-quantized models (see the lookup sketch below)
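GGUF quantization labels usually carry suffixes such as Q4_K_M. A small helper along these lines (the function name and the prefix-based parsing rule are illustrative assumptions, not from the source) maps a label onto the tiers above:

def get_quant_factor(quant_name: str) -> float:
    # Reduce e.g. "Q4_K_M" to its tier prefix "Q4" before the lookup
    tier = quant_name.upper().split("_")[0]
    # Assumption: unknown tiers fall back to the lowest factor
    return QUANT_TIERS.get(tier, 0.1)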
Performance Score Formula
The final performance score is calculated as follows:
Base Performance:
base_score = (TG_speed * TG_WEIGHT + PP_speed * PP_WEIGHT)
Size and Quantization Adjustment:
# Direct multiplication by model size (in billions)
performance_score = base_score * model_size * quant_factor
- Model size (in billions of parameters) acts as a linear multiplier, so running larger models scores proportionally higher
Normalization:
normalized_score = (performance_score / max_performance_score) * 100
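Putting the three steps together, a minimal worked example (all input numbers are hypothetical):

# Hypothetical run: 7B model at Q4, 400 t/s PP, 25 t/s TG
base_score = 25.0 * TG_WEIGHT + 400.0 * PP_WEIGHT       # 175.0
performance_score = base_score * 7 * QUANT_TIERS["Q4"]  # 175.0 * 7 * 0.4 = 490.0
# Assuming the best device in the table scored 980.0:
normalized_score = (performance_score / 980.0) * 100    # 50.0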
Filtering
- Only benchmarks matching the standard conditions are considered (see the filtering sketch below):
- PP_CONFIG (512) tokens for prompt processing
- TG_CONFIG (128) tokens for token generation
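A filter along these lines keeps only runs at the standard conditions; the record field names (pp_tokens, tg_tokens) are assumptions about the benchmark data, not from the source:

def matches_standard_conditions(run: dict) -> bool:
    # Keep only runs benchmarked at PP=512 / TG=128 tokens
    return run["pp_tokens"] == PP_CONFIG and run["tg_tokens"] == TG_CONFIG

standard_runs = [run for run in all_runs if matches_standard_conditions(run)]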
Data Aggregation Strategy
Primary Grouping
- Groups data by Normalized Device ID and Platform
- Uses normalized device IDs to ensure consistent device identification across different submissions
def normalize_device_id(device_info: dict) -> str:
    # iOS devices are identified by model name alone
    if device_info["systemName"].lower() == "ios":
        return f"iOS/{device_info['model']}"
    # Other devices are further split into memory tiers, e.g. "8GB"
    memory_tier = f"{device_info['totalMemory'] // (1024**3)}GB"
    return f"{device_info['brand']}/{device_info['model']}/{memory_tier}"