🎥 Project Video
▶️ Watch the project walkthrough video
Pokémon Battle Outcome Predictor
Given two Pokémon, who wins? A complete data-science pipeline, from raw battle logs to a deployed model, built as part of the AI & Data Science course at Reichman University.
Author: Roie Kugel · Course: AI & Data Science (Assignment #2) · Date: April 2026
TL;DR
| | |
|---|---|
| Task | Predict whether Player 1's Pokémon wins a 1-v-1 battle |
| Dataset | Diarmuid27/pokemon-battle-outcomes: 50,000 battles · 25 features |
| Best model | Gradient Boosting Classifier |
| Headline result | 94.72% accuracy · 0.978 AUC-ROC |
| Key insight | Relative feature gaps (P1 minus P2) carry the signal; raw per-Pokémon stats do not |
1. The Research Question
Can we predict, with high confidence, the outcome of a Pokémon battle from each combatant's base stats alone?
The project answers this question in two passes.
The regression pass estimates the probability that Player 1 wins: a continuous score between 0 and 1 that is interpretable as confidence. The classification pass turns that probability into a hard Win / Loss decision so the model can be deployed as a real predictor.
Both passes share the same feature pipeline. The split between regression and classification is methodological, not data-driven, and is explained in §7 and §8.
2. The Dataset
Source. Diarmuid27/pokemon-battle-outcomes on Hugging Face.
Size. 50,000 battles × 25 columns.
Schema.
| Group | Columns | Type |
|---|---|---|
| Identity | `Name_First`, `Name_Second`, `Type 1_*`, `Type 2_*` | text |
| Stats | `HP_*`, `Attack_*`, `Defense_*`, `Sp. Atk_*`, `Sp. Def_*`, `Speed_*` | numeric |
| Meta | `Generation_*`, `Legendary_*` | numeric / boolean |
| Battle | `First_pokemon`, `Second_pokemon`, `Winner` | numeric (IDs) |
The suffixes _First and _Second denote Player 1's and Player 2's Pokémon respectively.
2.1 What we cleaned, and why
| Issue found | Volume | Decision | Reasoning |
|---|---|---|---|
| `Name_First` or `Name_Second` is NaN | small subset of rows | Dropped these battles | A battle with an unknown combatant carries no meaningful signal: neither stats nor identity are recoverable. |
| `Type 2_*` is NaN | ≈ 50% of rows | Filled with "None" | This is not missingness: single-type Pokémon have no second type by design. Replacing with a sentinel preserves the information rather than discarding half the data. |
| Column-name inconsistency (Sp. Atk vs Sp. Def vs HP) | structural | Normalized to lowercase snake_case (e.g. `sp_atk_first`) | Prevents KeyErrors during feature engineering. |
After cleaning, every row in the dataset is a complete battle record.
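The three cleaning decisions above can be sketched in pandas. This is a minimal illustration, not the notebook's actual code: the three-row sample and the `clean_battles` helper are hypothetical, and only the column names follow the schema in §2.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample that mimics the raw schema.
raw = pd.DataFrame({
    "Name_First": ["Pikachu", None, "Onix"],
    "Name_Second": ["Onix", "Pikachu", "Gyarados"],
    "Type 2_First": [np.nan, np.nan, "Ground"],
    "Type 2_Second": ["Ground", np.nan, "Flying"],
    "Sp. Atk_First": [50, 50, 30],
    "Sp. Atk_Second": [30, 50, 60],
})

def clean_battles(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop battles with an unknown combatant.
    out = df.dropna(subset=["Name_First", "Name_Second"]).copy()
    # 2. A missing second type means "single-type", so fill with a sentinel.
    type2_cols = [c for c in out.columns if c.startswith("Type 2_")]
    out[type2_cols] = out[type2_cols].fillna("None")
    # 3. Normalize names: "Sp. Atk_First" -> "sp_atk_first".
    out.columns = (
        out.columns.str.lower()
        .str.replace(".", "", regex=False)
        .str.replace(" ", "_", regex=False)
    )
    return out.reset_index(drop=True)

clean = clean_battles(raw)
print(clean.columns.tolist())
```

Filling `Type 2_*` before renaming keeps the prefix match simple; the order of the three steps is otherwise a matter of taste.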
3. The Story the Data Tells (EDA)
Five research questions guided the exploration. Each one is paired with a single chart and a one-paragraph answer.
3.1 Does Legendary status decide battles?
Answer: yes, decisively. When Player 1 fields a Legendary against a Common Pokémon, the win rate jumps to roughly 78%. The mirror scenario (Common vs. Legendary) collapses to 17%. Within the same tier (both Legendary, both Common) the win rate sits near 50%, a fair fight. Legendary status is the single most predictive feature in the raw dataset, and it motivates the engineered legendary_advantage feature in §5.
3.2 Which stats correlate most strongly with winning?
Answer: Speed is the top single correlate, but no individual stat is enough on its own.
Note: total_first is a feature derived during EDA. It sums all six base stats of P1's Pokémon (hp_first + attack_first + defense_first + sp_atk_first + sp_def_first + speed_first). It is the standard "Base Stat Total" from the Pokémon games, used here as a single-number summary of overall power. total_second is the same sum for P2.
Reading the bottom row of the heatmap (correlation of each P1 feature with p1_won):
| Feature | Correlation with p1_won |
|---|---|
| `speed_first` | 0.49 |
| `total_first` | 0.34 |
| `attack_first` | 0.26 |
| `sp_atk_first` | 0.25 |
| `sp_def_first` | 0.16 |
| `hp_first` | 0.14 |
| `defense_first` | 0.06 |
Of all single features, speed explains the most variance in the outcome, consistent with the §3.3 finding that whoever strikes first usually wins. total_first is second, ahead of attack. The heatmap also shows heavy multicollinearity between the offensive stats (attack ↔ sp_atk = 0.40, total ↔ each stat = 0.58 to 0.74), meaning a linear model fed all six stats would suffer from redundant signals.
This motivates engineering total_diff = total_first − total_second (the power gap between the two combatants) and total_ratio = total_first / total_second (the proportional advantage). The reason is not that total is the strongest individual signal, but that aggregating the six correlated raw stats into one compact, multicollinearity-resistant feature gives the model a clean overall-power signal without six overlapping inputs.
3.3 Does being faster than your opponent translate into victory?
Answer: yes, with a clean separation. When P1 wins, the distribution of speed_first − speed_second skews positive. When P1 loses, it skews negative. The dashed line at zero (equal speed) sits cleanly between the two distributions. Speed advantage is the second-most reliable single signal after total stats.
3.4 Does the Pokémon's generation matter?
Answer: no, and showing this matters. Across every generation gap from −5 to +5, the win rate stays inside a narrow band of ≈ 0.45 to 0.55, within the noise zone. The right-hand panel confirms sample sizes are large enough for the result to be trusted. Generation is dropped from the feature set. Including features that don't predict the target is a common student mistake; ruling them out explicitly is part of rigorous modeling.
3.5 Does total power gap separate winners from losers?
Answer: yes, this is the strongest engineered signal. The two KDE distributions barely overlap. P1 wins cluster on the positive side of zero (P1 had more total stats), losses cluster on the negative side. The vertical line at zero acts as a natural decision boundary; indeed, an "always pick the higher-total Pokémon" heuristic would already achieve roughly 80% accuracy. The model's job is to push past that baseline on the borderline matchups, where total stats are close.
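The "always pick the higher-total Pokémon" heuristic is easy to express as a reference baseline. The sketch below uses a hypothetical four-row sample, so the accuracy it prints describes only those toy rows and is not a reproduction of the ~80% figure; `higher_total_heuristic` is an illustrative name.

```python
import pandas as pd

# Hypothetical mini-sample; real values would come from the cleaned dataset.
battles = pd.DataFrame({
    "total_first":  [320, 600, 405, 500],
    "total_second": [500, 420, 410, 495],
    "p1_won":       [0,   1,   1,   1],
})

def higher_total_heuristic(df: pd.DataFrame) -> pd.Series:
    """Predict a P1 win (1) whenever P1's base stat total is strictly higher."""
    return (df["total_first"] > df["total_second"]).astype(int)

pred = higher_total_heuristic(battles)
accuracy = (pred == battles["p1_won"]).mean()
print(f"Heuristic accuracy on the toy sample: {accuracy:.0%}")
```

The third row (405 vs. 410, yet P1 wins) is the kind of near-tie the heuristic gets wrong and a trained model is meant to handle.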
4. Phase 1: The Baseline (Linear Regression on Raw Features)
Before any feature engineering, we establish an honest reference point: Linear Regression trained on the raw per-Pokémon stats only. No gaps, no totals, no clusters. Just the 14 raw inputs the dataset hands us (12 stats + 2 legendary flags).
Result: R² ≈ 0.47, MAE ≈ 0.30.
That's decent, but the diagnostics show why it is a ceiling, not an answer.
The scatter forms two horizontal bands rather than tracking the diagonal, a tell-tale sign that a linear model is the wrong shape for a binary target. Residuals are bimodal, not normal.
Even at the baseline level, the model already detects the right signals: the legendary flags carry the strongest coefficients, followed by speed. This previews what later models will confirm at scale, but a linear model can never fully express the non-linear physics of a battle.
This baseline motivates everything that follows: feature engineering (§5), clustering (§6), and the three real models (§7).
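Why a straight line is the wrong shape for a 0/1 target can be seen on synthetic data. The sketch below is an illustration only: the single `gap` feature, the assumed logistic win probability, and the sample size are all invented here, and ordinary least squares via NumPy stands in for the notebook's Linear Regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: one feature (a stat gap) and a binary win label.
gap = rng.normal(0, 50, size=2000)
p_win = 1 / (1 + np.exp(-gap / 15))            # assumed "true" win probability
won = (rng.random(2000) < p_win).astype(float)

# Ordinary least squares: won ~ intercept + slope * gap
X = np.column_stack([np.ones_like(gap), gap])
coef, *_ = np.linalg.lstsq(X, won, rcond=None)
pred = X @ coef

# A line is unbounded, so large gaps produce "probabilities" below 0 or above 1.
outside = np.mean((pred < 0) | (pred > 1))
print(f"share of predictions outside [0, 1]: {outside:.1%}")
```

Because the labels are exactly 0 or 1 while the fitted line is continuous, the residuals fall into two groups, which matches the bimodal pattern described above.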
5. Feature Engineering: Rebuilding the Feature Space
Raw per-Pokémon stats give the model 14 numeric columns. But the underlying physics of a battle is not "How big is P1's HP?" but "How does P1's HP compare to P2's?" Feature engineering rebuilds the dataset around relative signals.
5.1 The four families of engineered features
| Family | Features | What it captures |
|---|---|---|
| Gaps | `hp_gap`, `attack_gap`, `defense_gap`, `sp_atk_gap`, `sp_def_gap`, `speed_gap` | Per-stat relative advantage (P1 − P2). |
| Totals | `total_first`, `total_second`, `total_diff`, `total_ratio` | Overall power balance, both as a difference and as a proportional ratio. |
| Legendary | `legendary_advantage` ∈ {−1, 0, +1} | Replaces two raw flags with a single mismatch signal. |
| Cluster | `cluster_id`, `dist_to_centroid`, `prob_cluster_0..3` | Battle archetype (see §6). |
The engineered set replaces the 14 raw features with 15 richer, model-ready signals. The count of 15 matches the feature list in §11.3: six gaps, total_diff, total_ratio, legendary_advantage, and six cluster features; total_first and total_second do not appear in that list.
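The gap, total, and legendary families can be derived in a few lines of pandas. This is a sketch under assumptions: the `engineer_features` helper and the sample battle are hypothetical, the input column names assume the lowercase naming from §2.1, and computing legendary_advantage as the difference of the two flags is one plausible reading of the {−1, 0, +1} encoding rather than the notebook's confirmed definition. Cluster features are left to §6.

```python
import pandas as pd

STATS = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    # Per-stat gaps: P1 minus P2.
    for stat in STATS:
        out[f"{stat}_gap"] = df[f"{stat}_first"] - df[f"{stat}_second"]
    # Base stat totals, then their difference and ratio.
    total_first = df[[f"{s}_first" for s in STATS]].sum(axis=1)
    total_second = df[[f"{s}_second" for s in STATS]].sum(axis=1)
    out["total_diff"] = total_first - total_second
    out["total_ratio"] = total_first / total_second
    # Assumed encoding: +1 if only P1 is Legendary, -1 if only P2 is, else 0.
    out["legendary_advantage"] = (
        df["legendary_first"].astype(int) - df["legendary_second"].astype(int)
    )
    return out

# Hypothetical single battle.
battle = pd.DataFrame([{
    "hp_first": 100, "attack_first": 110, "defense_first": 90,
    "sp_atk_first": 150, "sp_def_first": 90, "speed_first": 130,
    "hp_second": 35, "attack_second": 55, "defense_second": 40,
    "sp_atk_second": 50, "sp_def_second": 50, "speed_second": 90,
    "legendary_first": True, "legendary_second": False,
}])
features = engineer_features(battle)
print(features.T)
```

The result has nine columns, which lines up with the "9-dimensional engineered feature space" mentioned in §6.2.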
5.2 Visualizing the lift from engineering alone
Before introducing tree models, the same Linear Regression is retrained on the engineered features to isolate the lift produced by feature engineering by itself.
total_diff, legendary_advantage, and speed_gap rise to the top of the coefficient ranking, exactly the three signals EDA pre-flagged as most informative. This validates the engineering choices before the heavy artillery arrives.
6. Clustering: Discovering Battle Archetypes
A Pokémon battle is rarely "just stats." A glass-cannon vs. tank match plays differently than two balanced rivals. To capture this, we run K-Means clustering on the gap features and turn cluster identity into model inputs.
6.1 Choosing K
The elbow method on inertia from K = 2 to 10 lands cleanly at K = 4.
6.2 Visualizing the clusters with PCA
Reducing the 9-dimensional engineered feature space to two principal components (PC1 + PC2 ≈ 70% of variance) shows the four clusters as distinct regions of the space.
6.3 Three features derived from clustering
- cluster_id: which battle archetype does this match-up belong to?
- dist_to_centroid: how typical is this battle within its archetype? Small distance = textbook example. Large distance = ambiguous, on the boundary between archetypes.
- prob_cluster_k: soft membership scores derived from inverse distances. A battle split 0.45 / 0.40 / 0.10 / 0.05 is genuinely ambiguous; one at 0.95 / 0.02 / 0.02 / 0.01 is unambiguous. The model uses this to calibrate its own confidence.
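The elbow scan and the three cluster-derived features can be sketched with scikit-learn. Everything numeric here is synthetic, and the exact inverse-distance formula (normalized `1 / distance` with a small epsilon) is an assumption; the text above only says the scores are "derived from inverse distances".

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
gaps = rng.normal(size=(500, 6))          # synthetic stand-in for the gap features
X = StandardScaler().fit_transform(gaps)

# Elbow scan: inertia for K = 2..10 (plot these values to look for the bend).
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in range(2, 11)
}

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
dists = kmeans.transform(X)               # distance to each centroid, shape (500, 4)

cluster_id = kmeans.predict(X)            # index of the nearest centroid
dist_to_centroid = dists.min(axis=1)      # distance to that nearest centroid

# Assumed soft membership: normalized inverse distances (epsilon avoids 1/0).
inv = 1.0 / (dists + 1e-9)
prob_cluster = inv / inv.sum(axis=1, keepdims=True)
print(prob_cluster[:3].round(2))
```

Each row of `prob_cluster` sums to 1, and its largest entry always belongs to the nearest centroid, so the soft scores stay consistent with `cluster_id`.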
7. Phase 2: The Three Real Models
With the engineered + clustered features in hand, we move past linear models and train the three "real" candidates on the upgraded feature set:
| # | Model | Trained in | Why this model |
|---|---|---|---|
| 1 | Logistic Regression | Part 4 | The proper probabilistic linear model for a binary target: outputs always live in [0, 1], unlike Linear Regression. Direct upgrade over the baseline. |
| 2 | Random Forest | Part 5 | Ensemble of independent decision trees. Captures non-linear interactions ("speed advantage matters most when total stats are close") that no linear model can see. |
| 3 | Gradient Boosting | Part 5 | Sequential ensemble: each tree is trained specifically to fix the residuals of the previous trees. The strongest tabular performer in the lineup. |
For the two tree models, the probability output (.predict_proba()[:, 1]) plays the role of the continuous regression score. This avoids the structural weakness of Linear Regression on binary targets and guarantees a value in [0, 1].
7.1 The Winner: Gradient Boosting
After training and 5-fold cross-validation:
| Model | Accuracy | AUC-ROC |
|---|---|---|
| Logistic Regression | 0.8857 | 0.9271 |
| Random Forest | 0.9450 | 0.9751 |
| Gradient Boosting (best) | 0.9472 | 0.9780 |
Gradient Boosting wins with 94.72% accuracy and AUC-ROC = 0.978. That is roughly 6 percentage points above Logistic Regression and a small edge over Random Forest (about 0.2 percentage points in accuracy, 0.003 in AUC-ROC). Cross-validation confirms the result is stable, not a lucky split.
The winning model is saved at pokemon_gradient_boosting.pkl. Call .predict_proba(X)[:, 1] on it to obtain the probability in [0, 1].
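A comparison of this kind can be run with scikit-learn's `cross_validate`. The sketch below is not the notebook's code: it uses a synthetic 15-feature dataset from `make_classification` and small, arbitrarily chosen hyperparameters, so its scores will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 15 features, mirroring the width of the engineered set.
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=6, random_state=0
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation, scoring accuracy and AUC-ROC on each held-out fold.
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "roc_auc"])
    results[name] = (cv["test_accuracy"].mean(), cv["test_roc_auc"].mean())
    print(f"{name:<20} acc={results[name][0]:.3f}  auc={results[name][1]:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which keeps information from the held-out fold out of the preprocessing.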
8. Reframing as Classification
The same problem can also be framed as classification: not "what is P(P1 wins)?" but "will P1 win, yes or no?" The target p1_won is already binary (1 = win, 0 = loss), so the engineered features carry over directly.
Class balance check. Classes are nearly balanced (≈ 52.7% losses, 47.3% wins, ratio 1.11 : 1). Accuracy is therefore a trustworthy headline metric.
The same three models (Logistic Regression, Random Forest, Gradient Boosting) are retrained on the engineered features in classification mode. Gradient Boosting wins again and is saved as pokemon_classification_model.pkl (a self-contained bundle: model + scaler + feature names).
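A bundle with that shape could be produced roughly as follows. This is a hedged sketch, not the notebook's export code: the training data is synthetic, the feature list is shortened to three names for brevity, and the round trip goes through `pickle.dumps` / `pickle.loads` in memory instead of writing a file. Only the three dictionary keys are taken from §11.3.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["speed_gap", "total_diff", "legendary_advantage"]  # shortened, illustrative
X_train = rng.normal(size=(200, len(feature_names)))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)           # synthetic labels

scaler = StandardScaler().fit(X_train)
model = GradientBoostingClassifier(n_estimators=30, random_state=0)
model.fit(scaler.transform(X_train), y_train)

# Same three keys that the loading code in §11.3 reads back.
bundle = {"model": model, "scaler": scaler, "features": feature_names}
blob = pickle.dumps(bundle)   # pickle.dump(bundle, f) would write it to a file instead

restored = pickle.loads(blob)
proba = restored["model"].predict_proba(
    restored["scaler"].transform(X_train[:5])
)[:, 1]
print(proba.round(3))
```

Storing the fitted scaler and the ordered feature names next to the model is what lets a consumer reproduce the training-time preprocessing without access to the notebook.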
9. Diagnostic Deep Dive
9.1 ROC curves β comparing the three models
The Gradient Boosting curve hugs the top-left corner more tightly than the others: for a given false-positive rate it reaches a higher true-positive rate across the threshold range, not just at the default 0.5.
9.2 Feature importance: what the models actually learned
Both tree models converge on the same top three drivers: total_diff, speed_gap, and legendary_advantage, the three signals that EDA pre-flagged as the most informative. The clustering features (cluster_id, dist_to_centroid, prob_cluster_*) contribute incrementally; they do not dominate, but they sharpen the borderline cases.
9.3 Confusion matrices: where the models still fail
The remaining errors concentrate on near-tie battles: match-ups where total_diff is close to zero and legendary_advantage is also zero. These are genuinely ambiguous battles where even a domain expert could not be certain of the outcome. The model is not failing on easy cases; it is failing on unfair-to-call cases.
10. Why Gradient Boosting Wins
- Sequential error correction. Random Forest votes among independent trees. Gradient Boosting builds each new tree specifically to fix the previous trees' mistakes, which is exactly what the borderline near-tie cases need.
- Native non-linearity. Battle outcomes depend on interaction terms (speed advantage matters most when total stats are close). Tree splits encode these interactions without being told to look for them; linear models cannot.
- Direct loss optimization. Gradient Boosting optimizes log-loss at every step, the loss function aligned with the binary target.
The improvement over Random Forest is small (~0.2 pp) because the engineered features already do most of the work. This is the correct order of operations: good features first, fancy models second.
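For reference, the "sequential error correction" and "log-loss" points can be written out in textbook form. The notation below (raw score F, tree h_m, learning rate ν) is introduced here for illustration and is not taken from the notebook.

```latex
% Binary log-loss for one battle: label y in {0, 1}, predicted win probability p
L(y, p) = -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr],
\qquad p = \sigma\bigl(F(x)\bigr) = \frac{1}{1 + e^{-F(x)}}

% Negative gradient with respect to the raw score F(x):
% this is the "residual" that the next tree is fit to
-\frac{\partial L}{\partial F(x)} = y - p

% Stage-wise update with learning rate \nu and the newly fitted tree h_m
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
```

On a near-tie battle p sits close to 0.5, so the residual y − p stays large in magnitude, and later trees keep concentrating on exactly those cases.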
11. How to Use This Model
The repo ships two pickle files, each suited to a different use-case. Both are powered by the same Gradient Boosting model, but they are packaged differently and answer slightly different questions.
11.1 What each model returns
| File | Use-case | What you get back |
|---|---|---|
| `pokemon_gradient_boosting.pkl` (regression pass) | "How confident are we that P1 wins?" | A probability between 0 and 1, e.g. 0.87 means "there is an 87% chance P1 wins." |
| `pokemon_classification_model.pkl` (classification pass) | "Will P1 win, yes or no?" | A binary label (1 = P1 wins, 0 = P1 loses) and, on request, the underlying probability. Also bundles the fitted StandardScaler and feature list, so it is fully self-contained. |
In short: the regression file gives you a percentage; the classification file gives you a decision (and the percentage if you ask for it). Both files describe the same underlying battle prediction; they just expose the answer at different levels of granularity.
11.2 Loading the regression model (probability output)
```python
import pickle

with open("pokemon_gradient_boosting.pkl", "rb") as f:
    reg_model = pickle.load(f)

# X must already be scaled with the same StandardScaler used in training.
# Returns the probability of P1 winning (0.0 to 1.0).
probability = reg_model.predict_proba(X)[:, 1]
print(f"Chance of P1 winning: {probability[0]:.1%}")
```
11.3 Loading the classification model (decision + probability)
```python
import pickle

import pandas as pd

# This file is a SELF-CONTAINED bundle: model + scaler + feature names.
with open("pokemon_classification_model.pkl", "rb") as f:
    bundle = pickle.load(f)

model = bundle["model"]        # GradientBoostingClassifier
scaler = bundle["scaler"]      # fitted StandardScaler from training
features = bundle["features"]  # ordered list of expected feature names

# Build a one-row DataFrame containing the 15 engineered features
new_battle = pd.DataFrame([{
    "hp_gap": 30, "attack_gap": -15, "defense_gap": 10,
    "sp_atk_gap": 25, "sp_def_gap": -5, "speed_gap": 40,
    "total_diff": 85, "total_ratio": 1.18,
    "legendary_advantage": 1,
    "cluster_id": 2, "dist_to_centroid": 0.84,
    "prob_cluster_0": 0.05, "prob_cluster_1": 0.10,
    "prob_cluster_2": 0.78, "prob_cluster_3": 0.07,
}])

# Reorder the columns to the training order, then scale.
X = scaler.transform(new_battle[features])
decision = model.predict(X)[0]              # 0 or 1
probability = model.predict_proba(X)[0, 1]  # P(P1 wins)
print(f"Outcome: {'P1 WINS' if decision == 1 else 'P1 LOSES'} "
      f"(confidence: {probability:.1%})")
```
11.4 Which file should I use?
Both files contain the same Gradient Boosting Classifier; the difference is only how it is packaged:
| File | What's inside | When to choose it |
|---|---|---|
| `pokemon_gradient_boosting.pkl` | A bare GradientBoostingClassifier object | When you already have your own StandardScaler and engineered features and want minimal overhead |
| `pokemon_classification_model.pkl` | A bundle dict: `{"model": …, "scaler": …, "features": […]}` | When you want a self-contained, ready-to-use predictor: the scaler and ordered feature list are bundled in |
Either file supports both outputs:
| Method on the loaded model | What it returns |
|---|---|
| `model.predict(X)` | int64 array of 0/1: the hard decision |
| `model.predict_proba(X)[:, 1]` | float64 array in [0, 1]: the win probability |
| `model.predict_proba(X)` | (n, 2) array: [P(loss), P(win)] per row |
For a yes/no answer, call .predict(). For a confidence score, call .predict_proba(X)[:, 1]. For a label that comes with its probability, call both.
12. Repository Contents
```text
RKugel/pokemon_battle_Dor/
├── README.md                                    ← you are here
├── Rkugel_assigment_2_,_AI_data_science.ipynb   ← full notebook (EDA + modeling)
├── pokemon_gradient_boosting.pkl                ← regression model (probability output)
├── pokemon_classification_model.pkl             ← classification model + scaler + features
├── 01_eda_legendary_scenarios.png
├── 02_eda_correlation_heatmap.png
├── 03_eda_speed_gap_violin.png
├── 04_eda_generation_gap_no_signal.png
├── 05_eda_total_stat_kde.png
├── 06_baseline_diagnostics.png
├── 07_baseline_coefficients.png
├── 08_kmeans_elbow.png
├── 09_pca_clusters.png
├── 10_engineered_coefficients.png
├── 11_roc_curves.png
├── 12_feature_importance.png
├── 13_confusion_matrices_part5.png
├── 14_class_balance.png
└── 15_confusion_matrices_classifiers.png
```
13. Citation
```bibtex
@misc{kugel2026pokemon,
  title  = {Pokémon Battle Outcome Predictor},
  author = {Kugel, Roie},
  year   = {2026},
  school = {Reichman University},
  course = {AI \& Data Science},
  note   = {Assignment \#2}
}
```
Acknowledgements
- Dataset: Diarmuid27/pokemon-battle-outcomes on Hugging Face.
- Toolchain: scikit-learn, pandas, numpy, matplotlib, seaborn.
- Course: AI & Data Science, Reichman University, Spring 2026.
If you read this far, thank you. Now go battle.
Dedicated to "Dor", my cousin's son. Thanks for the endless PokΓ©mon stories and for giving me the best idea for this project.
Evaluation results (self-reported, on pokemon-battle-outcomes)
- accuracy: 0.947
- roc_auc: 0.978
- f1: 0.945













