Pokémon Battle Outcome Predictor

Given two Pokémon, who wins? A complete data-science pipeline — from raw battle logs to a deployed model — built as part of the AI & Data Science course at Reichman University.

Author: Roie Kugel · Course: AI & Data Science (Assignment #2) · Date: April 2026

TL;DR


| | |
|---|---|
| Task | Predict whether Player 1's Pokémon wins a 1-v-1 battle |
| Dataset | `Diarmuid27/pokemon-battle-outcomes`: 50,000 battles · 25 columns |
| Best model | Gradient Boosting Classifier |
| Headline result | 94.72% accuracy · 0.978 AUC-ROC |
| Key insight | Relative feature gaps (P1 minus P2) carry the signal; raw per-Pokémon stats do not |

1. The Research Question

Can we predict, with high confidence, the outcome of a Pokémon battle from each combatant's base stats alone?

The project answers this question in two passes.

The regression pass estimates the probability that Player 1 wins — a continuous score between 0 and 1 that is interpretable as confidence. The classification pass turns that probability into a hard Win / Loss decision so the model can be deployed as a real predictor.
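The relationship between the two passes can be sketched as a simple thresholding step. This is a minimal illustration only: the helper name `to_decision` is hypothetical, and the 0.5 cut-off is assumed to be the default rather than taken from the notebook.

```python
def to_decision(p_win: float, threshold: float = 0.5) -> str:
    """Map a win probability to a hard label (threshold is an assumed default)."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    return "Win" if p_win >= threshold else "Loss"

# Regression-pass output (a probability) -> classification-pass output (a label)
for p in (0.87, 0.50, 0.12):
    print(f"P(P1 wins) = {p:.2f} -> {to_decision(p)}")
```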

Both passes share the same feature pipeline. The split between regression and classification is methodological, not data-driven, and is explained in §7 and §8.

2. The Dataset

Source. Diarmuid27/pokemon-battle-outcomes on Hugging Face.

Size. 50,000 battles × 25 columns.

Schema.

| Group | Columns | Type |
|---|---|---|
| Identity | `Name_First`, `Name_Second`, `Type 1_`, `Type 2_` | text |
| Stats | `HP_`, `Attack_`, `Defense_`, `Sp. Atk_`, `Sp. Def_`, `Speed_` | numeric |
| Meta | `Generation_`, `Legendary_` | numeric / boolean |
| Battle | `First_pokemon`, `Second_pokemon`, `Winner` | numeric (IDs) |

The suffixes _First and _Second denote Player 1's and Player 2's Pokémon respectively.

2.1 What we cleaned, and why

| Issue found | Volume | Decision | Reasoning |
|---|---|---|---|
| `Name_First` or `Name_Second` is `NaN` | small subset of rows | Dropped these battles | A battle with an unknown combatant carries no meaningful signal: neither stats nor identity are recoverable. |
| `Type 2_*` is `NaN` | ≈ 50% of rows | Filled with `"None"` | This is not missingness: single-type Pokémon have no second type by design. Replacing with a sentinel preserves the information rather than discarding half the data. |
| Column-name inconsistency (`Sp. Atk` vs `Sp. Def` vs `HP`) | structural | Normalized to lowercase snake_case (e.g. `sp_atk_first`) | Prevents silent `KeyError`s during feature engineering. |

After cleaning, every row in the dataset is a complete battle record.
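The three cleaning decisions above could look roughly like this in pandas. It is a sketch on made-up rows, not the notebook's code: the exact raw column names (`Type 2_First`, `Sp. Atk_First`) and the `normalize` helper are assumptions inferred from the schema table.

```python
import pandas as pd

# Toy rows standing in for the real battle table (values are made up).
raw = pd.DataFrame({
    "Name_First": ["Pikachu", None, "Onix"],
    "Name_Second": ["Geodude", "Mew", "Staryu"],
    "Type 2_First": [None, "Flying", "Ground"],
    "Sp. Atk_First": [50, 100, 30],
})

def normalize(col: str) -> str:
    """Hypothetical helper: 'Sp. Atk_First' -> 'sp_atk_first'."""
    return col.lower().replace(". ", "_").replace(" ", "_")

cleaned = (
    raw.dropna(subset=["Name_First", "Name_Second"])  # unknown combatant: drop the battle
       .fillna({"Type 2_First": "None"})              # single-type Pokémon: sentinel value
       .rename(columns=normalize)                     # consistent snake_case names
       .reset_index(drop=True)
)
print(cleaned)
```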

3. The Story the Data Tells (EDA)

Five research questions guided the exploration. Each one is paired with a single chart and a one-paragraph answer.

3.1 Does Legendary status decide battles?

Answer — yes, decisively. When Player 1 fields a Legendary against a Common Pokémon, the win rate jumps to roughly 78%. The mirror scenario (Common vs. Legendary) collapses to 17%. Within the same tier (both Legendary, both Common) the win rate sits near 50% — a fair fight. Legendary status is the single most predictive feature in the raw dataset, and motivates the engineered legendary_advantage feature in §5.

3.2 Which stats correlate most strongly with winning?

Answer — Speed is the top single correlate, but no individual stat is enough on its own.

Note: total_first is a feature derived during EDA — it sums all six base stats of P1's Pokémon (hp_first + attack_first + defense_first + sp_atk_first + sp_def_first + speed_first). It is the standard "Base Stat Total" from the Pokémon games, used here as a single-number summary of overall power. total_second is the same sum for P2.

Reading the bottom row of the heatmap (correlation of each P1 feature with p1_won):

| Feature | Correlation with `p1_won` |
|---|---|
| `speed_first` | 0.49 |
| `total_first` | 0.34 |
| `attack_first` | 0.26 |
| `sp_atk_first` | 0.25 |
| `sp_def_first` | 0.16 |
| `hp_first` | 0.14 |
| `defense_first` | 0.06 |

Of all single features, speed has the strongest correlation with the outcome, consistent with the §3.3 finding that whoever strikes first usually wins. total_first ranks second, ahead of attack. The heatmap also shows substantial collinearity among the stats (attack ↔ sp_atk = 0.40, total ↔ each stat = 0.58–0.74), meaning a linear model fed all six stats would receive largely redundant signals.

This motivates engineering total_diff = total_first − total_second (the power gap between the two combatants) and total_ratio = total_first / total_second (the proportional advantage) — not because total is the strongest individual signal, but because aggregating the six correlated raw stats into one compact, multicollinearity-resistant feature gives the model a clean overall-power signal without six overlapping inputs.
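As a small worked example, the totals and the two derived features can be computed as follows. The stat values are illustrative only; the column names follow the snake_case convention used in this section.

```python
import pandas as pd

STATS = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]

# Two toy battles; the stat values are illustrative only.
battles = pd.DataFrame({
    "hp_first": [45, 80], "attack_first": [49, 82], "defense_first": [49, 83],
    "sp_atk_first": [65, 100], "sp_def_first": [65, 100], "speed_first": [45, 80],
    "hp_second": [39, 44], "attack_second": [52, 48], "defense_second": [43, 65],
    "sp_atk_second": [60, 50], "sp_def_second": [50, 64], "speed_second": [65, 43],
})

# Base Stat Total for each side, then the difference and the ratio.
battles["total_first"] = battles[[f"{s}_first" for s in STATS]].sum(axis=1)
battles["total_second"] = battles[[f"{s}_second" for s in STATS]].sum(axis=1)
battles["total_diff"] = battles["total_first"] - battles["total_second"]
battles["total_ratio"] = battles["total_first"] / battles["total_second"]

print(battles[["total_first", "total_second", "total_diff", "total_ratio"]])
```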

3.3 Does being faster than your opponent translate into victory?

Answer — yes, with a clean separation. When P1 wins, the distribution of speed_first − speed_second skews positive. When P1 loses, it skews negative. The dashed line at zero (equal speed) sits cleanly between the two distributions. Speed advantage is the second-most reliable single signal after total stats.

3.4 Does the Pokémon's generation matter?

Answer — no, and showing this matters. Across every generation gap from −5 to +5, the win rate stays inside a narrow band of ≈ 0.45 to 0.55 — within the noise zone. The right-hand panel confirms sample sizes are large enough for the result to be trusted. Generation is dropped from the feature set. Including features that don't predict the target is a common student mistake; ruling them out explicitly is part of rigorous modeling.

3.5 Does total power gap separate winners from losers?

Answer: yes, this is the strongest engineered signal. The two KDE distributions barely overlap. P1 wins cluster on the positive side of zero (P1 had more total stats), losses cluster on the negative side. The vertical line at zero acts as a natural decision boundary; indeed, an "always pick the higher-total Pokémon" heuristic would already achieve roughly 80% accuracy. The model's job is to push past that baseline on the borderline matchups, where total stats are close.
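The "higher total wins" heuristic is easy to express in code. The sketch below uses eight made-up battles, so its accuracy is not the ~80% figure quoted above; it only shows how such a baseline would be scored.

```python
import numpy as np

# Toy data: total-stat gap (P1 minus P2) and the observed outcome (1 = P1 won).
total_diff = np.array([120, -45, 10, -200, 60, -5, 30, -80])
p1_won = np.array([1, 0, 0, 0, 1, 1, 1, 0])

# Heuristic: predict a P1 win whenever P1 has the higher base-stat total.
heuristic_pred = (total_diff > 0).astype(int)
accuracy = (heuristic_pred == p1_won).mean()

print(f"Heuristic accuracy on the toy sample: {accuracy:.0%}")
```

The two misses in the toy sample are the battles with the smallest gaps (+10 and −5), which mirrors the point that borderline matchups are where a learned model has room to add value.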

4. Phase 1 — The Baseline (Linear Regression on Raw Features)

Before any feature engineering, we establish an honest reference point: Linear Regression trained on the raw per-Pokémon stats only — no gaps, no totals, no clusters. Just the 14 raw inputs the dataset hands us (12 stats + 2 legendary flags).

Result: R² ≈ 0.47, MAE ≈ 0.30.

That's decent — but the diagnostics show why it is a ceiling, not an answer.

The scatter forms two horizontal bands rather than tracking the diagonal — a tell-tale sign that a linear model is the wrong shape for a binary target. Residuals are bimodal, not normal.
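A toy one-dimensional fit illustrates the shape problem. The data below is made up, and the logistic curve is an unfitted sigmoid shown purely for comparison: a least-squares line through a 0/1 target produces predictions outside [0, 1], while a sigmoid cannot.

```python
import numpy as np

# Toy 1-D example: x is a stat gap, y is the binary outcome.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Ordinary least-squares line through the 0/1 target.
slope, intercept = np.polyfit(x, y, deg=1)
linear_pred = slope * x + intercept

# An unfitted logistic curve, for comparison only.
logistic_pred = 1.0 / (1.0 + np.exp(-x))

print("linear:  ", linear_pred.round(2))    # extremes fall below 0 and above 1
print("logistic:", logistic_pred.round(2))  # always strictly between 0 and 1
```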

Even at the baseline level, the model already detects the right signals: the legendary flags carry the strongest coefficients, followed by speed. This previews what later models will confirm at scale — but a linear model can never fully express the non-linear physics of a battle.

This baseline motivates everything that follows: feature engineering (§5), clustering (§6), and the three real models (§7).

5. Feature Engineering — Rebuilding the Feature Space

Raw per-Pokémon stats give the model 14 numeric columns. But the underlying physics of a battle is not "How big is P1's HP?" — it is "How does P1's HP compare to P2's?" Feature engineering rebuilds the dataset around relative signals.

5.1 The four families of engineered features

| Family | Features | What it captures |
|---|---|---|
| Gaps | `hp_gap`, `attack_gap`, `defense_gap`, `sp_atk_gap`, `sp_def_gap`, `speed_gap` | Per-stat relative advantage (P1 − P2). |
| Totals | `total_first`, `total_second`, `total_diff`, `total_ratio` | Overall power balance, both as a difference and as a proportional ratio. |
| Legendary | `legendary_advantage` ∈ {−1, 0, +1} | Replaces two raw flags with a single mismatch signal. |
| Cluster | `cluster_id`, `dist_to_centroid`, `prob_cluster_0..3` | Battle archetype (see §6). |

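The engineered set replaces the 14 raw features with 15 richer, model-ready signals: the six gaps, total_diff, total_ratio, legendary_advantage, and the six cluster features. total_first and total_second serve as intermediate sums; they do not appear in the 15-feature list used in §11.3.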
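The gap and legendary features can be sketched for a single battle as below. The function name `relative_features`, the dictionary layout, and the sign convention for `legendary_advantage` (+1 when only P1 is Legendary) are assumptions chosen to match the "P1 minus P2" convention; the stat values are illustrative.

```python
STATS = ("hp", "attack", "defense", "sp_atk", "sp_def", "speed")

def relative_features(p1: dict, p2: dict) -> dict:
    """Build the six gap features and the legendary mismatch signal for one battle."""
    features = {f"{s}_gap": p1[s] - p2[s] for s in STATS}
    # +1: only P1 is Legendary, -1: only P2 is, 0: same tier (assumed convention).
    features["legendary_advantage"] = int(p1["legendary"]) - int(p2["legendary"])
    return features

# Illustrative combatants (values chosen for the example).
p1 = {"hp": 106, "attack": 110, "defense": 90, "sp_atk": 154,
      "sp_def": 90, "speed": 130, "legendary": True}
p2 = {"hp": 35, "attack": 55, "defense": 40, "sp_atk": 50,
      "sp_def": 50, "speed": 90, "legendary": False}

feats = relative_features(p1, p2)
print(feats)
```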

5.2 Visualizing the lift from engineering alone

Before introducing tree models, the same Linear Regression is retrained on the engineered features to isolate the lift produced by feature engineering by itself:

total_diff, legendary_advantage, and speed_gap rise to the top of the coefficient ranking — exactly the three signals EDA pre-flagged as most informative. This validates the engineering choices before the heavy artillery arrives.

6. Clustering — Discovering Battle Archetypes

A Pokémon battle is rarely "just stats." A glass-cannon vs. tank match plays differently than two balanced rivals. To capture this, we run K-Means clustering on the gap features and turn cluster identity into model inputs.

6.1 Choosing K

The elbow method on inertia from K = 2 to 10 lands cleanly at K = 4.
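An elbow scan of this kind can be sketched with scikit-learn. The data here is synthetic (four well-separated blobs from `make_blobs`) and stands in for the scaled gap features; `n_init` and `random_state` are assumed settings, not the notebook's.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled gap features (4 well-separated blobs).
X, _ = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)

# Fit K-Means for K = 2..10 and record the inertia (within-cluster sum of squares).
inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f"K={k:2d}  inertia={inertia:,.0f}")
```

The "elbow" is the K after which inertia stops dropping sharply; plotting `inertias` against K makes it visible.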

6.2 Visualizing the clusters with PCA

Reducing the 9-dimensional engineered feature space to two principal components (PC1 + PC2 ≈ 70% of variance) shows the four clusters as distinct regions of the space:

6.3 Three features derived from clustering

cluster_id — which battle archetype does this match-up belong to?
dist_to_centroid — how typical is this battle within its archetype? Small distance = textbook example. Large distance = ambiguous, on the boundary between archetypes.
prob_cluster_k — soft membership scores derived from inverse distances. A battle split 0.45 / 0.40 / 0.10 / 0.05 is genuinely ambiguous; one at 0.95 / 0.02 / 0.02 / 0.01 is unambiguous. The model uses this to calibrate its own confidence.
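One way to derive these three features from fitted centroids is sketched below. The exact soft-membership formula is not given in the text, so normalised inverse Euclidean distance is an assumption here, as are the helper name `cluster_features` and the toy centroids.

```python
import numpy as np

def cluster_features(x: np.ndarray, centroids: np.ndarray, eps: float = 1e-9) -> dict:
    """Derive hard id, distance to the assigned centroid, and soft memberships."""
    dists = np.linalg.norm(centroids - x, axis=1)  # distance to every centroid
    inv = 1.0 / (dists + eps)                      # closer centroid -> larger weight
    probs = inv / inv.sum()                        # normalise so the scores sum to 1
    cluster_id = int(np.argmin(dists))
    return {
        "cluster_id": cluster_id,
        "dist_to_centroid": float(dists[cluster_id]),
        "probs": probs,
    }

# Four toy centroids in 2-D and one battle lying close to the first centroid.
centroids = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
feats = cluster_features(np.array([0.5, 0.0]), centroids)
print(feats["cluster_id"], round(feats["dist_to_centroid"], 2), feats["probs"].round(2))
```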

7. Phase 2 — The Three Real Models

With the engineered + clustered features in hand, we move past linear models and train the three "real" candidates on the upgraded feature set:

| # | Model | Trained in | Why this model |
|---|---|---|---|
| 1 | Logistic Regression | Part 4 | The proper probabilistic linear model for a binary target: outputs always live in [0, 1], unlike Linear Regression. Direct upgrade over the baseline. |
| 2 | Random Forest | Part 5 | Ensemble of independent decision trees. Captures non-linear interactions ("speed advantage matters most when total stats are close") that no linear model can see. |
| 3 | Gradient Boosting | Part 5 | Sequential ensemble: each tree is trained specifically to fix the residuals of the previous trees. The strongest tabular performer in the lineup. |

For the two tree models, the probability output (.predict_proba()[:, 1]) plays the role of the continuous regression score. This avoids the structural weakness of Linear Regression on binary targets and guarantees a value in [0, 1].

🏆 7.1 The Winner: Gradient Boosting

After training and 5-fold cross-validation:

| Model | Accuracy | AUC-ROC |
|---|---|---|
| Logistic Regression | 0.8857 | 0.9271 |
| Random Forest | 0.9450 | 0.9751 |
| Gradient Boosting ⭐ | 0.9472 | 0.9780 |

Gradient Boosting wins with 94.72% accuracy and AUC-ROC = 0.978. That is roughly 6 percentage points above Logistic Regression and a narrow edge over Random Forest (about 0.2 percentage points of accuracy and 0.003 AUC). Cross-validation confirms the result is stable, not a lucky split.
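A 5-fold cross-validation of this kind can be sketched with scikit-learn's `cross_val_score`. The data below is synthetic (`make_classification` with 15 features), and the hyperparameters are assumptions, so the scores it prints are unrelated to the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 15 engineered features and the binary target.
X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Each call refits the model on 5 train/validation splits.
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

print(f"accuracy: {acc.mean():.3f} ± {acc.std():.3f}")
print(f"AUC-ROC:  {auc.mean():.3f} ± {auc.std():.3f}")
```

A small standard deviation across folds is what "stable, not a lucky split" means in practice.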

The winning model is saved at pokemon_gradient_boosting.pkl. Call .predict_proba(X)[:, 1] on it to obtain the probability in [0, 1].

8. Reframing as Classification

The same problem can also be framed as classification: not "what is P(P1 wins)?" but "will P1 win — yes or no?" The target p1_won is already binary (1 = win, 0 = loss), so the engineered features carry over directly.

Class balance check. Classes are nearly balanced (≈ 52.7% losses, 47.3% wins, ratio 1.11 : 1). Accuracy is therefore a trustworthy headline metric.
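The arithmetic behind the balance check, using the shares quoted above, also yields the majority-class baseline that any useful classifier has to beat:

```python
# Class shares as reported in the balance check above.
loss_share, win_share = 0.527, 0.473

# Ratio of the larger class to the smaller one.
imbalance_ratio = loss_share / win_share

# Accuracy of a trivial model that always predicts the majority class ("loss").
majority_baseline = max(loss_share, win_share)

print(f"imbalance ratio:   {imbalance_ratio:.2f} : 1")
print(f"majority baseline: {majority_baseline:.1%}")
```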

The same three models — Logistic Regression, Random Forest, Gradient Boosting — are retrained on the engineered features in classification mode. Gradient Boosting wins again and is saved as pokemon_classification_model.pkl (a self-contained bundle: model + scaler + feature names).

9. Diagnostic Deep Dive

9.1 ROC curves — comparing the three models

The Gradient Boosting curve hugs the top-left corner more tightly than the others: for a given false-positive rate it achieves a higher true-positive rate across the threshold range, not just at the default 0.5.
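For intuition about what the AUC number summarises: it equals the probability that a randomly chosen win receives a higher score than a randomly chosen loss. The sketch below computes it by brute-force pair counting on four toy predictions; `pairwise_auc` is a hypothetical helper written for illustration, and in practice `sklearn.metrics.roc_auc_score` does the same job.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 3 of the 4 pairs are ordered correctly
```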

9.2 Feature importance — what the models actually learned

Both tree models converge on the same top three drivers: total_diff, speed_gap, and legendary_advantage — the three signals that EDA pre-flagged as the most informative. The clustering features (cluster_id, dist_to_centroid, prob_cluster_*) contribute incrementally; they do not dominate, but they sharpen the borderline cases.

9.3 Confusion matrices — where the models still fail

The remaining errors concentrate on near-tie battles — match-ups where total_diff is close to zero and legendary_advantage is also zero. These are genuinely ambiguous battles where even a domain expert could not be certain of the outcome. The model is not failing on easy cases; it is failing on unfair-to-call cases.

10. Why Gradient Boosting Wins

Sequential error correction. Random Forest votes among independent trees. Gradient Boosting builds each new tree specifically to fix the previous trees' mistakes — exactly what the borderline near-tie cases need.
Native non-linearity. Battle outcomes depend on interaction terms (speed advantage matters most when total stats are close). Tree splits encode these interactions without being told to look for them; linear models cannot.
Direct loss optimization. Gradient Boosting optimizes log-loss at every step — the loss function aligned with the binary target.
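The sequential-correction behaviour can be observed directly through scikit-learn's `train_score_` attribute, which records the training loss after each tree is added. This is a sketch on synthetic data with assumed hyperparameters, not the project's trained model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=400, n_features=15, n_informative=5, random_state=0)

gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=0).fit(X, y)

# train_score_[i] is the training loss after tree i has been added.
losses = gb.train_score_
print(f"loss after 1 tree:   {losses[0]:.3f}")
print(f"loss after 50 trees: {losses[-1]:.3f}")
```

Each additional tree lowers the training loss because it is fit to the errors left by the trees before it.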

The improvement over Random Forest is small (~0.2 pp) because the engineered features already do most of the work. This is the correct order of operations: good features first, fancy models second.

11. How to Use This Model

The repo ships two pickle files, each suited to a different use-case. Both are powered by the same Gradient Boosting model, but they are packaged differently and answer slightly different questions.

11.1 What each model returns

| File | Use-case | What you get back |
|---|---|---|
| `pokemon_gradient_boosting.pkl` (regression pass) | "How confident are we that P1 wins?" | A probability between 0 and 1, e.g. `0.87` means "there is an 87% chance P1 wins." |
| `pokemon_classification_model.pkl` (classification pass) | "Will P1 win — yes or no?" | A binary label (`1` = P1 wins, `0` = P1 loses) and, on request, the underlying probability. Also bundles the fitted `StandardScaler` and feature list, so it is fully self-contained. |

In short: the regression file gives you a percentage; the classification file gives you a decision (and the percentage if you ask for it). Both files describe the same underlying battle prediction — they just expose the answer at different levels of granularity.

11.2 Loading the regression model (probability output)

```python
import pickle

with open("pokemon_gradient_boosting.pkl", "rb") as f:
    reg_model = pickle.load(f)

# X must hold the engineered features, already scaled with the same
# StandardScaler used in training (see §11.3 for one way to build it).
# Returns the probability of P1 winning (0.0 to 1.0).
probability = reg_model.predict_proba(X)[:, 1]
print(f"Chance of P1 winning: {probability[0]:.1%}")
```

11.3 Loading the classification model (decision + probability)

```python
import pickle

import pandas as pd

# This file is a SELF-CONTAINED bundle: model + scaler + feature names.
with open("pokemon_classification_model.pkl", "rb") as f:
    bundle = pickle.load(f)

model    = bundle["model"]      # GradientBoostingClassifier
scaler   = bundle["scaler"]     # fitted StandardScaler from training
features = bundle["features"]   # ordered list of expected feature names

# Build a one-row DataFrame containing the 15 engineered features
new_battle = pd.DataFrame([{
    "hp_gap": 30, "attack_gap": -15, "defense_gap": 10,
    "sp_atk_gap": 25, "sp_def_gap": -5, "speed_gap": 40,
    "total_diff": 85, "total_ratio": 1.18,
    "legendary_advantage": 1,
    "cluster_id": 2, "dist_to_centroid": 0.84,
    "prob_cluster_0": 0.05, "prob_cluster_1": 0.10,
    "prob_cluster_2": 0.78, "prob_cluster_3": 0.07,
}])

X = scaler.transform(new_battle[features])

decision    = model.predict(X)[0]                # 0 or 1
probability = model.predict_proba(X)[0, 1]       # P(P1 wins)

print(f"Outcome: {'P1 WINS' if decision == 1 else 'P1 LOSES'} "
      f"(confidence: {probability:.1%})")
```

11.4 Which file should I use?

Both files contain the same Gradient Boosting Classifier — the difference is only how it is packaged:

| File | What's inside | When to choose it |
|---|---|---|
| `pokemon_gradient_boosting.pkl` | A bare `GradientBoostingClassifier` object | When you already have your own `StandardScaler` and engineered features and want minimal overhead |
| `pokemon_classification_model.pkl` | A bundle dict: `{"model": …, "scaler": …, "features": […]}` | When you want a self-contained, ready-to-use predictor; the scaler and ordered feature list are bundled in |

Either file supports both outputs:

| Method on the loaded model | What it returns |
|---|---|
| `model.predict(X)` | `int64` array of `0`/`1`: the hard decision |
| `model.predict_proba(X)[:, 1]` | `float64` array in `[0, 1]`: the win probability |
| `model.predict_proba(X)` | `(n, 2)` array: `[P(loss), P(win)]` per row |

For a yes/no answer, call .predict(). For a confidence score, call .predict_proba(X)[:, 1]. For a label that comes with its probability, call both.

12. Repository Contents

```
RKugel/pokemon_battle_Dor/
├── README.md                                       ← you are here
├── Rkugel_assigment_2_,_AI_data_science.ipynb      ← full notebook (EDA + modeling)
├── pokemon_gradient_boosting.pkl                   ← regression model (probability output)
├── pokemon_classification_model.pkl                ← classification model + scaler + features
├── 01_eda_legendary_scenarios.png
├── 02_eda_correlation_heatmap.png
├── 03_eda_speed_gap_violin.png
├── 04_eda_generation_gap_no_signal.png
├── 05_eda_total_stat_kde.png
├── 06_baseline_diagnostics.png
├── 07_baseline_coefficients.png
├── 08_kmeans_elbow.png
├── 09_pca_clusters.png
├── 10_engineered_coefficients.png
├── 11_roc_curves.png
├── 12_feature_importance.png
├── 13_confusion_matrices_part5.png
├── 14_class_balance.png
└── 15_confusion_matrices_classifiers.png
```

13. Citation

```bibtex
@misc{kugel2026pokemon,
  title  = {Pokémon Battle Outcome Predictor},
  author = {Kugel, Roie},
  year   = {2026},
  school = {Reichman University},
  course = {AI \& Data Science},
  note   = {Assignment \#2}
}
```

Acknowledgements

Dataset: Diarmuid27/pokemon-battle-outcomes on Hugging Face.
Toolchain: scikit-learn, pandas, numpy, matplotlib, seaborn.
Course: AI & Data Science, Reichman University, Spring 2026.

If you read this far — thank you. Now go battle.

Dedicated to "Dor", my cousin's son. Thanks for the endless Pokémon stories and for giving me the best idea for this project.


Evaluation results (self-reported, on pokemon-battle-outcomes)

accuracy: 0.947
roc_auc: 0.978
f1: 0.945