🎥 Project Video

▶️ Watch the project walkthrough video


Pokémon Battle Outcome Predictor

Given two Pokémon, who wins? A complete data-science pipeline, from raw battle logs to a deployed model, built as part of the AI & Data Science course at Reichman University.

Author: Roie Kugel · Course: AI & Data Science (Assignment #2) · Date: April 2026


TL;DR

| Item | Detail |
| --- | --- |
| Task | Predict whether Player 1's Pokémon wins a 1-v-1 battle |
| Dataset | Diarmuid27/pokemon-battle-outcomes: 50,000 battles · 25 columns |
| Best model | Gradient Boosting Classifier |
| Headline result | 94.72% accuracy · 0.978 AUC-ROC |
| Key insight | Relative feature gaps (P1 minus P2) carry the signal; raw per-Pokémon stats on their own do not |

1. The Research Question

Can we predict, with high confidence, the outcome of a Pokémon battle from each combatant's base stats alone?

The project answers this question in two passes.

The regression pass estimates the probability that Player 1 wins: a continuous score between 0 and 1 that can be read as confidence. The classification pass turns that probability into a hard Win / Loss decision so the model can be deployed as a real predictor.

Both passes share the same feature pipeline. The split between regression and classification is methodological, not data-driven, and is explained in §7 and §8.


2. The Dataset

Source. Diarmuid27/pokemon-battle-outcomes on Hugging Face.

Size. 50,000 battles × 25 columns.

Schema.

| Group | Columns | Type |
| --- | --- | --- |
| Identity | Name_First, Name_Second, Type 1_*, Type 2_* | text |
| Stats | HP_*, Attack_*, Defense_*, Sp. Atk_*, Sp. Def_*, Speed_* | numeric |
| Meta | Generation_*, Legendary_* | numeric / boolean |
| Battle | First_pokemon, Second_pokemon, Winner | numeric (IDs) |

The suffixes _First and _Second denote Player 1's and Player 2's Pokémon respectively.

2.1 What we cleaned, and why

| Issue found | Volume | Decision | Reasoning |
| --- | --- | --- | --- |
| Name_First or Name_Second is NaN | small subset of rows | Dropped these battles | A battle with an unknown combatant carries no meaningful signal: neither stats nor identity are recoverable. |
| Type 2_* is NaN | ≈ 50% of rows | Filled with "None" | This is not missingness: single-type Pokémon have no second type by design. Replacing NaN with a sentinel preserves the information rather than discarding half the data. |
| Column-name inconsistency (Sp. Atk vs Sp. Def vs HP) | structural | Normalized to lowercase names (e.g. sp_atk_first) | Prevents KeyErrors from mismatched column names during feature engineering. |

After cleaning, every row in the dataset is a complete battle record.
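
The three cleaning decisions above can be sketched in pandas. This is a minimal illustration, not the notebook's actual code: the toy rows are invented, and the helper name `clean_battles` and the exact renaming rule (lowercase, with `. ` and spaces turned into underscores) are assumptions chosen to match the column names used later in this README.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the raw battle table; all values are made up.
raw = pd.DataFrame({
    "Name_First": ["Pikachu", None, "Onix"],
    "Name_Second": ["Onix", "Mew", "Pikachu"],
    "Type 2_First": [np.nan, np.nan, "Ground"],
    "Type 2_Second": ["Ground", np.nan, np.nan],
    "Sp. Atk_First": [50, 100, 30],
})

def clean_battles(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning decisions from the table above (hypothetical helper)."""
    # 1. Drop battles where either combatant's name is unknown.
    out = df.dropna(subset=["Name_First", "Name_Second"]).copy()
    # 2. A missing second type means "single-type", so use a sentinel value.
    type2_cols = [c for c in out.columns if c.startswith("Type 2_")]
    out[type2_cols] = out[type2_cols].fillna("None")
    # 3. Normalize column names: "Sp. Atk_First" becomes "sp_atk_first".
    out.columns = [c.lower().replace(". ", "_").replace(" ", "_") for c in out.columns]
    return out

clean = clean_battles(raw)
print(clean.columns.tolist())
```

Filling `Type 2_*` before renaming keeps the sentinel logic independent of whichever naming rule is finally used.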


3. The Story the Data Tells (EDA)

Five research questions guided the exploration. Each one is paired with a single chart and a one-paragraph answer.

3.1 Does Legendary status decide battles?

Win rate by legendary scenario

Answer: yes, decisively. When Player 1 fields a Legendary against a Common Pokémon, the win rate jumps to roughly 78%. The mirror scenario (Common vs. Legendary) collapses to 17%. Within the same tier (both Legendary, both Common) the win rate sits near 50%, a fair fight. Legendary status is the single most predictive feature in the raw dataset, and motivates the engineered legendary_advantage feature in §5.


3.2 Which stats correlate most strongly with winning?

Correlation heatmap of P1 stats with win outcome

Answer: Speed is the top single correlate, but no individual stat is enough on its own.

Note: total_first is a feature derived during EDA. It sums all six base stats of P1's Pokémon (hp_first + attack_first + defense_first + sp_atk_first + sp_def_first + speed_first). It is the standard "Base Stat Total" from the Pokémon games, used here as a single-number summary of overall power. total_second is the same sum for P2.

Reading the bottom row of the heatmap (correlation of each P1 feature with p1_won):

| Feature | Correlation with p1_won |
| --- | --- |
| speed_first | 0.49 |
| total_first | 0.34 |
| attack_first | 0.26 |
| sp_atk_first | 0.25 |
| sp_def_first | 0.16 |
| hp_first | 0.14 |
| defense_first | 0.06 |

Speed has the strongest correlation with the outcome of any single feature, consistent with the §3.3 finding that whoever strikes first usually wins. total_first is second (0.34), ahead of attack_first (0.26). The heatmap also shows multicollinearity among the stats (attack ↔ sp_atk = 0.40, total ↔ each stat = 0.58–0.74), meaning a linear model fed all six stats would receive redundant signals.

This motivates engineering total_diff = total_first − total_second (the power gap between the two combatants) and total_ratio = total_first / total_second (the proportional advantage). The reason is not that total is the strongest individual signal; it is that aggregating the six correlated raw stats into one compact feature gives the model a clean overall-power signal without six overlapping inputs.
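
A minimal pandas sketch of these total features follows. The helper name `add_total_features` is hypothetical and the stat values are illustrative only; the column names follow the lowercase convention used in this section.

```python
import pandas as pd

STATS = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]

def add_total_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add total_first, total_second, total_diff and total_ratio (hypothetical helper)."""
    out = df.copy()
    for side in ("first", "second"):
        # Base Stat Total: the sum of the six base stats for one combatant.
        out[f"total_{side}"] = out[[f"{s}_{side}" for s in STATS]].sum(axis=1)
    out["total_diff"] = out["total_first"] - out["total_second"]
    out["total_ratio"] = out["total_first"] / out["total_second"]
    return out

# One illustrative battle; the numbers are made up for the example.
battle = pd.DataFrame([{
    "hp_first": 35, "attack_first": 55, "defense_first": 40,
    "sp_atk_first": 50, "sp_def_first": 50, "speed_first": 90,
    "hp_second": 35, "attack_second": 45, "defense_second": 160,
    "sp_atk_second": 30, "sp_def_second": 45, "speed_second": 70,
}])

totals = add_total_features(battle)
print(totals[["total_first", "total_second", "total_diff", "total_ratio"]])
```

Here P1 totals 320 and P2 totals 385, so total_diff is negative and total_ratio is below 1: both features point the same way, but on different scales.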


3.3 Does being faster than your opponent translate into victory?

Speed gap distribution by outcome

Answer: yes, with a clean separation. When P1 wins, the distribution of speed_first − speed_second skews positive. When P1 loses, it skews negative. The dashed line at zero (equal speed) sits between the two distributions. Speed advantage is the second-most reliable single signal, after the total-stat gap (§3.5).


3.4 Does the Pokémon's generation matter?

Generation gap shows no predictive power

Answer: no, and showing this matters. Across every generation gap from −5 to +5, the win rate stays inside a narrow band of ≈ 0.45 to 0.55, within the noise zone. The right-hand panel confirms sample sizes are large enough for the result to be trusted. Generation is dropped from the feature set. Including features that don't predict the target is a common student mistake; ruling them out explicitly is part of rigorous modeling.


3.5 Does total power gap separate winners from losers?

Total stat difference KDE for wins vs losses

Answer: yes, this is the strongest engineered signal. The two KDE distributions are clearly separated. P1 wins cluster on the positive side of zero (P1 had more total stats), losses cluster on the negative side. The vertical line at zero acts as a natural decision boundary, and indeed an "always pick the higher-total Pokémon" heuristic would already achieve roughly 80% accuracy. The model's job is to push past that baseline on the borderline matchups, where total stats are close.
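
The "always pick the higher-total Pokémon" heuristic is simple enough to write down directly. The sketch below uses six invented battles purely to show the mechanics; it does not reproduce the roughly 80% figure, which comes from the real dataset.

```python
import numpy as np

def higher_total_heuristic(total_diff) -> np.ndarray:
    """Predict a P1 win (1) whenever P1's base stat total is strictly higher."""
    return (np.asarray(total_diff) > 0).astype(int)

# Invented total_diff values and outcomes, for illustration only.
total_diff = np.array([85, -40, 10, -5, 120, -90])
p1_won = np.array([1, 0, 0, 1, 1, 0])

pred = higher_total_heuristic(total_diff)
accuracy = (pred == p1_won).mean()
print(f"Heuristic accuracy on the toy sample: {accuracy:.2f}")
```

In this toy sample the two misses are exactly the battles with a small gap (10 and −5), which is the region the trained models are meant to improve on.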


4. Phase 1: The Baseline (Linear Regression on Raw Features)

Before any feature engineering, we establish an honest reference point: Linear Regression trained on the raw per-Pokémon stats only (no gaps, no totals, no clusters), using just the 14 raw inputs the dataset hands us (12 stats + 2 legendary flags).

Result: R² ≈ 0.47, MAE ≈ 0.30.

That's decent, but the diagnostics show why it is a ceiling, not an answer.

Baseline regression diagnostics

The scatter forms two horizontal bands rather than tracking the diagonal, a tell-tale sign that a linear model is the wrong shape for a binary target. Residuals are bimodal, not normal.
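
The shape problem can be reproduced on synthetic data. The sketch below is an assumption-laden toy (one made-up feature, a noisy threshold rule for the outcome, ordinary least squares via NumPy), not the notebook's baseline, but it shows the same two symptoms: predictions that leave [0, 1] and residuals that split into two groups.

```python
import numpy as np

rng = np.random.default_rng(0)

# One synthetic feature and a binary outcome driven by it plus noise.
x = rng.normal(size=500)
y = (x + rng.normal(scale=0.5, size=500) > 0).astype(float)

# Ordinary least squares with an intercept: y is approximated by b0 + b1 * x.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
residuals = y - pred

print(f"prediction range: [{pred.min():.2f}, {pred.max():.2f}]")
print(f"mean residual when y=1: {residuals[y == 1].mean():+.2f}")
print(f"mean residual when y=0: {residuals[y == 0].mean():+.2f}")
```

Because the target only takes the values 0 and 1, every residual is either `1 - pred` or `-pred`, which is what produces the two bands.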

Baseline linear regression coefficients

Even at the baseline level, the model already detects the right signals: the legendary flags carry the strongest coefficients, followed by speed. This previews what later models will confirm at scale, but a linear model can never fully express the non-linear physics of a battle.

This baseline motivates everything that follows: feature engineering (§5), clustering (§6), and the three real models (§7).


5. Feature Engineering: Rebuilding the Feature Space

Raw per-Pokémon stats give the model 14 numeric columns. But the underlying physics of a battle is not "How big is P1's HP?"; it is "How does P1's HP compare to P2's?" Feature engineering rebuilds the dataset around relative signals.

5.1 The four families of engineered features

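| Family | Features | What it captures |
| --- | --- | --- |
| Gaps | hp_gap, attack_gap, defense_gap, sp_atk_gap, sp_def_gap, speed_gap | Per-stat relative advantage (P1 − P2). |
| Totals | total_first, total_second, total_diff, total_ratio | Overall power balance, both as a difference and as a proportional ratio. |
| Legendary | legendary_advantage ∈ {−1, 0, +1} | Replaces two raw flags with a single mismatch signal. |
| Cluster | cluster_id, dist_to_centroid, prob_cluster_0..3 | Battle archetype (see §6). |

The engineered set replaces the 14 raw features with 15 richer, model-ready signals: the six gaps, total_diff, total_ratio, legendary_advantage, and the six cluster features. total_first and total_second are intermediate sums used to derive the two total features; they are not among the 15 model inputs listed in §11.3.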
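
The gap and legendary families can be sketched as follows. The helper name `add_gap_features` is hypothetical, the two example battles are invented, and computing legendary_advantage as the difference of the two flags is an assumption consistent with its stated range of {−1, 0, +1}.

```python
import pandas as pd

STATS = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]

def add_gap_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the six per-stat gaps and legendary_advantage (hypothetical helper)."""
    out = df.copy()
    for stat in STATS:
        # Positive gap: P1 is ahead on this stat. Negative gap: P2 is ahead.
        out[f"{stat}_gap"] = out[f"{stat}_first"] - out[f"{stat}_second"]
    # Assumed definition: +1 if only P1 is Legendary, -1 if only P2 is, else 0.
    out["legendary_advantage"] = (
        out["legendary_first"].astype(int) - out["legendary_second"].astype(int)
    )
    return out

# Two invented battles, for illustration only.
battles = pd.DataFrame({
    "hp_first": [80, 45], "hp_second": [60, 45],
    "attack_first": [100, 50], "attack_second": [70, 90],
    "defense_first": [90, 40], "defense_second": [70, 85],
    "sp_atk_first": [110, 60], "sp_atk_second": [60, 95],
    "sp_def_first": [90, 50], "sp_def_second": [65, 90],
    "speed_first": [95, 70], "speed_second": [55, 100],
    "legendary_first": [True, False], "legendary_second": [False, True],
})

engineered = add_gap_features(battles)
print(engineered[["speed_gap", "legendary_advantage"]])
```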

5.2 Visualizing the lift from engineering alone

Before introducing tree models, the same Linear Regression is retrained on the engineered features to isolate the lift produced by feature engineering by itself:

Engineered model coefficients

total_diff, legendary_advantage, and speed_gap rise to the top of the coefficient ranking, exactly the three signals EDA pre-flagged as most informative. This validates the engineering choices before the heavy artillery arrives.


6. Clustering: Discovering Battle Archetypes

A Pokémon battle is rarely "just stats." A glass-cannon vs. tank match plays differently than two balanced rivals. To capture this, we run K-Means clustering on the gap features and turn cluster identity into model inputs.

6.1 Choosing K

The elbow method on inertia from K = 2 to 10 lands cleanly at K = 4.

K-Means elbow method
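
For readers unfamiliar with the elbow method, here is a minimal scikit-learn sketch. The data is synthetic (four well-separated blobs generated on purpose so the elbow sits at K = 4); the notebook's actual inputs and settings such as `n_init` and `random_state` are not stated in this README and are assumptions here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled features: four blobs of 100 points each.
centers = np.array([[-5, -5], [-5, 5], [5, -5], [5, 5]], dtype=float)
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])

# Inertia = within-cluster sum of squared distances, recorded for each K.
inertias = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# The elbow is where the drop in inertia from adding one more cluster shrinks sharply.
drops = {k: inertias[k - 1] - inertias[k] for k in range(3, 11)}
for k in range(3, 8):
    print(f"K={k}: inertia={inertias[k]:.0f}, drop from K-1={drops[k]:.0f}")
```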

6.2 Visualizing the clusters with PCA

Reducing the 9-dimensional engineered feature space to two principal components (PC1 + PC2 ≈ 70% of variance) shows the four clusters as distinct regions of the space:

PCA visualization of K-Means clusters

6.3 Three features derived from clustering

  • cluster_id: which battle archetype does this match-up belong to?
  • dist_to_centroid: how typical is this battle within its archetype? Small distance = textbook example. Large distance = ambiguous, on the boundary between archetypes.
  • prob_cluster_k: soft membership scores derived from inverse distances. A battle split 0.45 / 0.40 / 0.10 / 0.05 is genuinely ambiguous; one at 0.95 / 0.02 / 0.02 / 0.01 is unambiguous. This gives the downstream model a direct signal of how ambiguous a match-up is.
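
One way to derive these three features from a set of centroids is sketched below in plain NumPy. The centroids and points are invented, the helper name `cluster_features` is hypothetical, and the normalization (inverse distance divided by the row sum, with a small `eps` to avoid division by zero) is an assumption about what "derived from inverse distances" means in the notebook.

```python
import numpy as np

def cluster_features(X: np.ndarray, centroids: np.ndarray, eps: float = 1e-9):
    """Return cluster_id, dist_to_centroid and soft memberships (hypothetical helper)."""
    # Euclidean distance from every row of X to every centroid: shape (n, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    cluster_id = dists.argmin(axis=1)        # nearest centroid
    dist_to_centroid = dists.min(axis=1)     # distance to that centroid
    inv = 1.0 / (dists + eps)                # closer centroid gives a larger weight
    probs = inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1
    return cluster_id, dist_to_centroid, probs

# Invented centroids; in practice they would come from the fitted K-Means model.
centroids = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
X = np.array([
    [0.1, 0.0],   # sits almost on centroid 0: a textbook member
    [2.0, 2.0],   # equidistant from all four: maximally ambiguous
])

cluster_id, dist_to_centroid, probs = cluster_features(X, centroids)
print(cluster_id, dist_to_centroid.round(2))
print(probs.round(2))
```

The second point shows why the soft scores add information: its hard cluster_id is an arbitrary tie-break, while its membership row is flat.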

7. Phase 2: The Three Real Models

With the engineered + clustered features in hand, we move past linear models and train the three "real" candidates on the upgraded feature set:

| # | Model | Trained in | Why this model |
| --- | --- | --- | --- |
| 1 | Logistic Regression | Part 4 | The proper probabilistic linear model for a binary target: outputs always lie in [0, 1], unlike Linear Regression. Direct upgrade over the baseline. |
| 2 | Random Forest | Part 5 | Ensemble of independent decision trees. Captures non-linear interactions ("speed advantage matters most when total stats are close") that no linear model can see. |
| 3 | Gradient Boosting | Part 5 | Sequential ensemble: each tree is trained specifically to fix the residuals of the previous trees. The strongest tabular performer in the lineup. |

For the two tree models, the probability output (.predict_proba()[:, 1]) plays the role of the continuous regression score. This avoids the structural weakness of Linear Regression on binary targets and gives a score that is always bounded in [0, 1].

🏆 7.1 The Winner: Gradient Boosting

After training and 5-fold cross-validation:

| Model | Accuracy | AUC-ROC |
| --- | --- | --- |
| Logistic Regression | 0.8857 | 0.9271 |
| Random Forest | 0.9450 | 0.9751 |
| Gradient Boosting ⭐ | 0.9472 | 0.9780 |

Gradient Boosting wins with 94.72% accuracy and AUC-ROC = 0.978. That is roughly 6 percentage points above Logistic Regression and a narrow margin (about 0.2 percentage points) over Random Forest. Cross-validation confirms the result is stable, not a lucky split.
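
A comparison of this kind can be sketched with scikit-learn's `cross_validate`. Everything specific here is an assumption: the data is synthetic (`make_classification`), the hyperparameters are library defaults, and the scores it prints are unrelated to the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 15 engineered features and the p1_won target.
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)

models = {
    # Scaling lives inside the pipeline so each CV fold fits its own scaler.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=("accuracy", "roc_auc"))
    results[name] = (cv["test_accuracy"].mean(), cv["test_roc_auc"].mean())
    print(f"{name:<20} accuracy={results[name][0]:.4f}  auc={results[name][1]:.4f}")
```

Reporting the mean over five folds, rather than a single train/test split, is what supports the "not a lucky split" claim.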

The winning model is saved at pokemon_gradient_boosting.pkl. Call .predict_proba(X)[:, 1] on it to obtain the probability in [0, 1].


8. Reframing as Classification

The same problem can also be framed as classification: not "what is P(P1 wins)?" but "will P1 win, yes or no?" The target p1_won is already binary (1 = win, 0 = loss), so the engineered features carry over directly.

Class balance check. Classes are nearly balanced (≈ 52.7% losses, 47.3% wins, ratio 1.11 : 1). Accuracy is therefore a trustworthy headline metric.
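
The balance check itself is a one-liner in pandas. The sketch below rebuilds the stated split with 1,000 invented labels (527 losses, 473 wins) to show where the 1.11 : 1 ratio comes from; the helper name `class_balance` is hypothetical.

```python
import pandas as pd

def class_balance(target: pd.Series) -> dict:
    """Return class shares and the majority-to-minority ratio (hypothetical helper)."""
    shares = target.value_counts(normalize=True)
    return {"shares": shares.to_dict(), "ratio": shares.max() / shares.min()}

# Invented labels that reproduce the stated 52.7% / 47.3% split.
p1_won = pd.Series([0] * 527 + [1] * 473)

balance = class_balance(p1_won)
print(balance["shares"])
print(f"majority : minority = {balance['ratio']:.2f} : 1")
```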

Class balance

The same three models (Logistic Regression, Random Forest, Gradient Boosting) are retrained on the engineered features in classification mode. Gradient Boosting wins again and is saved as pokemon_classification_model.pkl (a self-contained bundle: model + scaler + feature names).


9. Diagnostic Deep Dive

9.1 ROC curves: comparing the three models

ROC curves of all three classifiers

The Gradient Boosting curve hugs the top-left corner more tightly than the others: for a given false-positive rate it reaches a higher true-positive rate across the range of thresholds, not just at the default 0.5.
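
As a reminder of how such a curve and its area are computed, here is a minimal scikit-learn sketch on four invented predictions. The labels and scores are illustrative only and have nothing to do with the project's models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Four invented battles: true outcome and predicted P(P1 wins).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (false-positive rate, true-positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.2f}")
```

AUC equals the share of (win, loss) pairs that the model ranks correctly; here three of the four pairs are ordered correctly, giving 0.75.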

9.2 Feature importance: what the models actually learned

Random Forest and Gradient Boosting feature importance

Both tree models converge on the same top three drivers: total_diff, speed_gap, and legendary_advantage, the three signals that EDA pre-flagged as the most informative. The clustering features (cluster_id, dist_to_centroid, prob_cluster_*) contribute incrementally; they do not dominate, but they sharpen the borderline cases.

9.3 Confusion matrices: where the models still fail

Confusion matrices of all three classifiers

The remaining errors concentrate on near-tie battles: match-ups where total_diff is close to zero and legendary_advantage is also zero. These are genuinely ambiguous battles where even a domain expert could not be certain of the outcome. The model is not failing on easy cases; it is failing on too-close-to-call cases.
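
A check of this kind can be sketched as a simple filter over the misclassified rows. The helper name `near_tie_error_share`, the `predicted` column, the threshold of 20 stat points, and the six rows are all assumptions for illustration; the README does not state how "close to zero" was defined.

```python
import pandas as pd

def near_tie_error_share(df: pd.DataFrame, tie_threshold: float = 20) -> float:
    """Share of misclassified battles that are near-ties (hypothetical helper)."""
    errors = df[df["p1_won"] != df["predicted"]]
    if errors.empty:
        return float("nan")
    near_tie = (errors["total_diff"].abs() <= tie_threshold) & (
        errors["legendary_advantage"] == 0
    )
    return float(near_tie.mean())

# Six invented battles with true and predicted outcomes.
battles = pd.DataFrame({
    "total_diff":          [5, -10, 150, -8, 90, 12],
    "legendary_advantage": [0,   0,   1,  0,  0,  0],
    "p1_won":              [1,   0,   1,  1,  1,  0],
    "predicted":           [0,   1,   1,  0,  0,  0],
})

share = near_tie_error_share(battles)
print(f"{share:.0%} of the errors are near-tie battles")
```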


10. Why Gradient Boosting Wins

  1. Sequential error correction. Random Forest votes among independent trees. Gradient Boosting builds each new tree specifically to fix the previous trees' mistakes, which is exactly what the borderline near-tie cases need.
  2. Native non-linearity. Battle outcomes depend on interaction terms (speed advantage matters most when total stats are close). Tree splits encode these interactions without being told to look for them; linear models cannot.
  3. Direct loss optimization. Gradient Boosting optimizes log-loss at every step, the loss function aligned with the binary target.

The improvement over Random Forest is small (~0.2 pp) because the engineered features already do most of the work. This is the correct order of operations: good features first, fancy models second.


11. How to Use This Model

The repo ships two pickle files, each suited to a different use-case. Both are powered by the same Gradient Boosting model, but they are packaged differently and answer slightly different questions.

11.1 What each model returns

| File | Use-case | What you get back |
| --- | --- | --- |
| pokemon_gradient_boosting.pkl (regression pass) | "How confident are we that P1 wins?" | A probability between 0 and 1, e.g. 0.87 means "there is an 87% chance P1 wins." |
| pokemon_classification_model.pkl (classification pass) | "Will P1 win, yes or no?" | A binary label (1 = P1 wins, 0 = P1 loses) and, on request, the underlying probability. Also bundles the fitted StandardScaler and feature list, so it is fully self-contained. |

In short: the regression file gives you a percentage; the classification file gives you a decision (and the percentage if you ask for it). Both files describe the same underlying battle prediction; they just expose the answer at different levels of granularity.

11.2 Loading the regression model (probability output)

import pickle

with open("pokemon_gradient_boosting.pkl", "rb") as f:
    reg_model = pickle.load(f)

# X must already be scaled with the same StandardScaler used in training.
# Returns the probability of P1 winning (0.0 to 1.0).
probability = reg_model.predict_proba(X)[:, 1]
print(f"Chance of P1 winning: {probability[0]:.1%}")

11.3 Loading the classification model (decision + probability)

import pickle, pandas as pd

# This file is a SELF-CONTAINED bundle: model + scaler + feature names.
with open("pokemon_classification_model.pkl", "rb") as f:
    bundle = pickle.load(f)

model    = bundle["model"]      # GradientBoostingClassifier
scaler   = bundle["scaler"]     # fitted StandardScaler from training
features = bundle["features"]   # ordered list of expected feature names

# Build a one-row DataFrame containing the 15 engineered features
new_battle = pd.DataFrame([{
    "hp_gap": 30, "attack_gap": -15, "defense_gap": 10,
    "sp_atk_gap": 25, "sp_def_gap": -5, "speed_gap": 40,
    "total_diff": 85, "total_ratio": 1.18,
    "legendary_advantage": 1,
    "cluster_id": 2, "dist_to_centroid": 0.84,
    "prob_cluster_0": 0.05, "prob_cluster_1": 0.10,
    "prob_cluster_2": 0.78, "prob_cluster_3": 0.07,
}])

X = scaler.transform(new_battle[features])

decision    = model.predict(X)[0]                # 0 or 1
probability = model.predict_proba(X)[0, 1]       # P(P1 wins)

print(f"Outcome: {'P1 WINS' if decision == 1 else 'P1 LOSES'} "
      f"(confidence: {probability:.1%})")
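
If the bundle is used from several places, the steps above can be wrapped in a small helper that also checks for missing features. The sketch below is hypothetical: `predict_battle` is not part of the repo, and to stay runnable without the pickle file it builds a toy bundle with the same three keys but only two features and synthetic training data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

def predict_battle(bundle: dict, battle: dict) -> tuple[int, float]:
    """Return (decision, P(P1 wins)) for one battle (hypothetical helper)."""
    features = bundle["features"]
    missing = [name for name in features if name not in battle]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Select columns in the order the model was trained on, then scale.
    row = pd.DataFrame([battle])[features]
    X = bundle["scaler"].transform(row)
    decision = int(bundle["model"].predict(X)[0])
    probability = float(bundle["model"].predict_proba(X)[0, 1])
    return decision, probability

# Toy bundle with the same keys as the real one, trained on synthetic data.
rng = np.random.default_rng(0)
features = ["total_diff", "speed_gap"]
train = pd.DataFrame(rng.normal(scale=50, size=(200, 2)), columns=features)
target = (train["total_diff"] + train["speed_gap"] > 0).astype(int)
scaler = StandardScaler().fit(train)
model = GradientBoostingClassifier(random_state=0).fit(scaler.transform(train), target)
toy_bundle = {"model": model, "scaler": scaler, "features": features}

decision, probability = predict_battle(toy_bundle, {"total_diff": 85, "speed_gap": 40})
print(f"decision={decision}, probability={probability:.1%}")
```

With the real file, the same helper would be called on the dict returned by `pickle.load`, passing all 15 engineered features.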

11.4 Which file should I use?

Both files contain the same Gradient Boosting Classifier; the difference is only how it is packaged:

| File | What's inside | When to choose it |
| --- | --- | --- |
| pokemon_gradient_boosting.pkl | A bare GradientBoostingClassifier object | When you already have your own StandardScaler and engineered features and want minimal overhead |
| pokemon_classification_model.pkl | A bundle dict: {"model": …, "scaler": …, "features": […]} | When you want a self-contained, ready-to-use predictor: the scaler and ordered feature list are bundled in |

Either file supports both outputs:

| Method on the loaded model | What it returns |
| --- | --- |
| model.predict(X) | int64 array of 0/1: the hard decision |
| model.predict_proba(X)[:, 1] | float64 array in [0, 1]: the win probability |
| model.predict_proba(X) | (n, 2) array: [P(loss), P(win)] per row |

For a yes/no answer, call .predict(). For a confidence score, call .predict_proba(X)[:, 1]. For a label that comes with its probability, call both.


12. Repository Contents

RKugel/pokemon_battle_Dor/
├── README.md                                       ← you are here
├── Rkugel_assigment_2_,_AI_data_science.ipynb      ← full notebook (EDA + modeling)
├── pokemon_gradient_boosting.pkl                   ← regression model (probability output)
├── pokemon_classification_model.pkl                ← classification model + scaler + features
├── 01_eda_legendary_scenarios.png
├── 02_eda_correlation_heatmap.png
├── 03_eda_speed_gap_violin.png
├── 04_eda_generation_gap_no_signal.png
├── 05_eda_total_stat_kde.png
├── 06_baseline_diagnostics.png
├── 07_baseline_coefficients.png
├── 08_kmeans_elbow.png
├── 09_pca_clusters.png
├── 10_engineered_coefficients.png
├── 11_roc_curves.png
├── 12_feature_importance.png
├── 13_confusion_matrices_part5.png
├── 14_class_balance.png
└── 15_confusion_matrices_classifiers.png

13. Citation

@misc{kugel2026pokemon,
  title  = {Pokémon Battle Outcome Predictor},
  author = {Kugel, Roie},
  year   = {2026},
  school = {Reichman University},
  course = {AI \& Data Science},
  note   = {Assignment \#2}
}

Acknowledgements

  • Dataset: Diarmuid27/pokemon-battle-outcomes on Hugging Face.
  • Toolchain: scikit-learn, pandas, numpy, matplotlib, seaborn.
  • Course: AI & Data Science, Reichman University, Spring 2026.

If you read this far, thank you. Now go battle.

Dedicated to "Dor", my cousin's son. Thanks for the endless Pokémon stories and for giving me the best idea for this project.
