ai-phone-leaderboard / docs /ranking_system.md

agh123

feat(scoring): use model size as direct multiplier

19c7047 about 1 month ago


	# Device Ranking System

	## Overview
	The ranking system implements a multi-dimensional approach to evaluate and compare device performance across different aspects of LLM (GGUF) model runs.

	## Scoring Algorithm

	### Standard Benchmark Conditions
	```python
	PP_CONFIG = 512 # Standard prompt processing token count
	TG_CONFIG = 128 # Standard token generation count

	# Component Weights
	TG_WEIGHT = 0.6 # Token generation weight (60%)
	PP_WEIGHT = 0.4 # Prompt processing weight (40%)
	```
	- PP receives a 40% weight because it is a one-time cost per prompt
	- TG receives the higher weight (60%) because it represents ongoing performance

	### Quantization Quality Factors
	```python
	QUANT_TIERS = {
	    "F16": 1.0,
	    "F32": 1.0,
	    "Q8": 0.8,
	    "Q6": 0.6,
	    "Q5": 0.5,
	    "Q4": 0.4,
	    "Q3": 0.3,
	    "Q2": 0.2,
	    "Q1": 0.1,
	}
	```

	- Linear scale from 0.1 to 1.0 based on quantization level: each listed `Qn` tier maps to n / 10 (for example, Q8 is 0.8 and Q4 is 0.4)
	- F16 and F32 are both assigned 1.0 (this skews the results a bit towards quantization)
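
As an illustration of how a quantization label might be mapped to one of these tiers, here is a minimal sketch. The helper name `get_quant_factor`, the prefix-matching rule (so that labels such as `Q4_K_M` or `IQ3_XS` fall into `Q4` and `Q3`), and the fallback value for unrecognized labels are all assumptions made for this example, not something the leaderboard code is confirmed to do.

```python
import re

QUANT_TIERS = {
    "F16": 1.0, "F32": 1.0,
    "Q8": 0.8, "Q6": 0.6, "Q5": 0.5, "Q4": 0.4,
    "Q3": 0.3, "Q2": 0.2, "Q1": 0.1,
}


def get_quant_factor(quant_name: str, default: float = 0.1) -> float:
    """Hypothetical helper: map a quantization label to its tier factor.

    Assumption: only the leading tier of the label matters, so
    "Q4_K_M" is treated as "Q4" and "IQ3_XS" as "Q3".
    """
    name = quant_name.upper()
    # Either a leading F16/F32, or Q<digit> optionally preceded by "I".
    match = re.match(r"^(F16|F32)|^I?(Q\d)", name)
    if not match:
        return default  # assumed fallback for unrecognized labels
    tier = match.group(1) or match.group(2)
    return QUANT_TIERS.get(tier, default)
```

Under these assumptions, `get_quant_factor("Q4_K_M")` returns `0.4` and `get_quant_factor("f16")` returns `1.0`. How unrecognized labels should be handled is a design choice; returning the lowest tier is only one option.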


	### Performance Score Formula
	The final performance score is calculated as follows:

	1. Base Performance:
	```
	base_score = (TG_speed * TG_WEIGHT + PP_speed * PP_WEIGHT)
	```

	2. Size and Quantization Adjustment:
	```
	# Direct multiplication by model size (in billions)
	performance_score = base_score * model_size * quant_factor
	```
	- Model size acts as a linear multiplier: at equal speed and quantization, doubling the model size doubles the score

	3. Normalization:
	```
	normalized_score = (performance_score / max_performance_score) * 100
	```
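
The three steps above can be sketched end to end as follows. The function names, the dictionary keys, and the sample numbers are illustrative assumptions (they are not real benchmark results), and the maximum is assumed to be taken over the set of scores being compared.

```python
TG_WEIGHT = 0.6  # token generation weight
PP_WEIGHT = 0.4  # prompt processing weight


def performance_score(tg_speed: float, pp_speed: float,
                      model_size_b: float, quant_factor: float) -> float:
    """Steps 1 and 2: weighted speed, scaled by model size and quantization."""
    base_score = tg_speed * TG_WEIGHT + pp_speed * PP_WEIGHT
    return base_score * model_size_b * quant_factor


def normalize_scores(scores: list[float]) -> list[float]:
    """Step 3: rescale so the best score in the list becomes 100."""
    max_score = max(scores, default=0.0)
    if max_score <= 0:
        return [0.0 for _ in scores]  # avoid division by zero
    return [s / max_score * 100 for s in scores]


# Made-up runs: speeds in tokens/s, size in billions, quant factor from QUANT_TIERS.
runs = [
    {"tg": 10.0, "pp": 100.0, "size_b": 1.0, "quant": 0.4},
    {"tg": 5.0, "pp": 50.0, "size_b": 3.0, "quant": 0.8},
]
raw = [performance_score(r["tg"], r["pp"], r["size_b"], r["quant"]) for r in runs]
normalized = normalize_scores(raw)
```

With these inputs the second run has the higher raw score (about 55.2 versus about 18.4) even though it is slower, because the larger model and higher-precision quantization outweigh the speed difference. After normalization it sits at 100 and the first run at roughly 33.3.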

	### Filtering
	- Only benchmarks matching standard conditions are considered:
	  - PP_CONFIG (512) tokens for prompt processing
	  - TG_CONFIG (128) tokens for token generation
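
A minimal sketch of this filter, assuming each benchmark is a dictionary. The keys `pp_tokens` and `tg_tokens` and the sample records are hypothetical and chosen only for the example.

```python
PP_CONFIG = 512  # standard prompt processing token count
TG_CONFIG = 128  # standard token generation count


def matches_standard_conditions(benchmark: dict) -> bool:
    """Keep a benchmark only if both token counts match the standard setup."""
    return (
        benchmark.get("pp_tokens") == PP_CONFIG
        and benchmark.get("tg_tokens") == TG_CONFIG
    )


# Made-up records for illustration.
benchmarks = [
    {"device": "A", "pp_tokens": 512, "tg_tokens": 128},
    {"device": "B", "pp_tokens": 256, "tg_tokens": 128},
    {"device": "C", "pp_tokens": 512, "tg_tokens": 64},
]
standard = [b for b in benchmarks if matches_standard_conditions(b)]
```

Only device `A` survives here: `B` uses a shorter prompt and `C` generates fewer tokens, so neither is comparable under the standard conditions.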

	## Data Aggregation Strategy

	### Primary Grouping
	- Groups data by `Normalized Device ID` and `Platform`
	- Uses normalized device IDs to ensure consistent device identification across different submissions

	```python
	def normalize_device_id(device_info: dict) -> str:
	    # iOS devices are identified by model alone
	    if device_info["systemName"].lower() == "ios":
	        return f"iOS/{device_info['model']}"

	    # Other platforms add a memory tier: total bytes floored to whole GiB
	    memory_tier = f"{device_info['totalMemory'] // (1024**3)}GB"
	    return f"{device_info['brand']}/{device_info['model']}/{memory_tier}"
	```
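
To show how the normalized ID could drive the grouping, here is a small sketch. The sample submissions are invented, and using `systemName` as the `Platform` half of the grouping key is an assumption. The function is repeated so the example runs on its own.

```python
from collections import defaultdict


def normalize_device_id(device_info: dict) -> str:
    if device_info["systemName"].lower() == "ios":
        return f"iOS/{device_info['model']}"
    memory_tier = f"{device_info['totalMemory'] // (1024**3)}GB"
    return f"{device_info['brand']}/{device_info['model']}/{memory_tier}"


# Invented submissions; the two Android entries report slightly different totals.
submissions = [
    {"systemName": "iOS", "brand": "Apple", "model": "ExamplePhone",
     "totalMemory": 8 * 1024**3},
    {"systemName": "Android", "brand": "acme", "model": "Model X",
     "totalMemory": int(7.6 * 1024**3)},
    {"systemName": "Android", "brand": "acme", "model": "Model X",
     "totalMemory": int(7.4 * 1024**3)},
]

# Group by (normalized device ID, platform); platform is assumed to be systemName.
groups = defaultdict(list)
for info in submissions:
    key = (normalize_device_id(info), info["systemName"])
    groups[key].append(info)
```

Both Android entries land in the same group, `("acme/Model X/7GB", "Android")`, because the floor division drops the fractional part of the reported memory. The iOS entry is keyed as `("iOS/ExamplePhone", "iOS")` with no memory tier.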