AI & Data Salary Prediction

🎬 Project Presentation & Walkthrough

Try the model yourself!

You can interact with this model directly using the web interface built with Gradio: Click here to use the Salary Predictor App

1. Project Overview

This project walks through the global AI & Data Science job market. The central question I wanted to answer is:

Can we accurately predict a professional's salary based on their experience, role, location, and working conditions?

To get the best results, I started with a deep EDA to really understand the data through visual research questions. After cleaning and handling outliers, I moved into advanced Feature Engineering—which was the most interesting part. I used K-Means clustering for salary profiles and added polynomial features to catch non-linear patterns. I then compared three regression models, before pivoting to a classification task where I split salaries into three tiers (Low, Medium, High). Every step was backed by metrics and visualizations, and I saved the winning models as pickle files for future use.

2. Dataset

Source: Global AI & Data Jobs Salary Dataset (Kaggle / open-source).

Size: ~90,000 rows × 19 selected features (the original dataset had more columns but I kept what was most relevant).

Time Range: 2020–2026.

Target Variables: salary_usd (continuous, used for regression) and Salary Tier (Low / Medium / High, used for classification).

Selected Features

Feature	Type	Description
salary_usd	Numeric (Target)	Annual salary in USD
experience_level	Categorical	Entry / Mid / Senior / Lead
experience_years	Numeric	Years of professional experience
education_required	Categorical	Minimum education level for the role
weekly_hours	Numeric	Average weekly working hours
promotion_speed	Numeric	Average months between promotions
bonus_usd	Numeric	Annual bonus in USD
job_role	Categorical	Job title (e.g., ML Engineer, Data Analyst)
ai_specialization	Categorical	Sub-field within AI/Data (e.g., NLP, Computer Vision)
work_mode	Categorical	Remote / Hybrid / Onsite
industry	Categorical	Business sector
company_size	Categorical	Small / Medium / Large / etc.
company_rating	Numeric	Employee rating of the company
hiring_difficulty_score	Numeric	How hard it is to fill the role
layoff_risk	Numeric	Estimated layoff risk score
employee_satisfaction	Numeric	Employee satisfaction score
country	Categorical	Country of employment (12 countries)
tax_rate_percent	Numeric	Local income tax rate
cost_of_living_index	Numeric	Local cost of living relative index

Categorical Feature Cardinality

Feature	Unique Values
experience_level	4 (entry, mid, senior, lead)
education_required	5
job_role	8 (top: NLP Engineer, Software Engineer AI, ...)
ai_specialization	8
work_mode	3 (remote, hybrid, onsite)
industry	10
company_size	5
country	12 (top: Canada, Australia, Singapore, ...)

Exploratory Data Analysis (EDA)

Data Cleaning

Before doing any analysis, I needed to make sure the data was clean and consistent. Here's what I did step by step.

Step 1 — Duplicate Removal

I checked for duplicate rows and removed any that existed. Turns out the dataset was already clean — zero duplicates were found.

Step 2 — String Standardization

All categorical columns were lowercased and stripped of extra whitespace. This is important because things like "Remote" and "remote" would be treated as different categories by the model, which would be wrong.
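A minimal sketch of this standardization step with pandas. The toy values are made up for illustration; only the column names come from the feature table.

```python
import pandas as pd

# Toy frame: column names follow the feature table, values are invented.
df = pd.DataFrame({
    "work_mode": [" Remote", "remote ", "Hybrid", "ONSITE"],
    "experience_level": ["Senior", "senior", " Lead ", "entry"],
})

# Strip surrounding whitespace and lowercase each categorical column.
categorical_cols = ["work_mode", "experience_level"]
for col in categorical_cols:
    df[col] = df[col].str.strip().str.lower()

print(df["work_mode"].unique())
```

After this step " Remote" and "remote " collapse into a single category.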

Step 3 — Salary Outlier Removal (IQR Method)

Extreme salary values were removed to make model training more stable. I used the standard IQR method: anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR was considered an outlier and removed. In total, 1,007 salary outliers were removed, leaving 88,993 clean rows.
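The IQR rule described above can be sketched as follows. The helper name remove_iqr_outliers and the toy salaries are hypothetical; the notebook's actual code may differ.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)].copy()

# Toy data: five ordinary salaries and one extreme value.
toy = pd.DataFrame({"salary_usd": [60_000, 70_000, 80_000, 90_000, 100_000, 500_000]})
clean = remove_iqr_outliers(toy, "salary_usd")
print(len(toy) - len(clean), "outlier(s) removed")
```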

Step 4 — Logical Consistency Checks

I also filtered out any rows where experience years were negative or weekly hours were zero — those would clearly be data entry errors.

Descriptive Statistics

After cleaning, here are the key numerical statistics for the dataset:

Metric	Value
Mean Salary	$95,013.69
Median Salary	$86,928.00
Mean–Median Gap	$8,086 (small → balanced distribution)
Avg. Experience Years	6.9 years
75th Percentile Experience	≤ 11 years
Avg. Weekly Hours	45.5 hrs/week
Avg. Promotion Speed	~38 months (3.2 years)
Cost of Living Index Range	0.5 – 2.5 (high global diversity)

The small gap between mean and median salary tells us the distribution is fairly balanced: it has a slight right skew but nothing extreme. This suits linear regression, because the target can be modeled directly without a log transformation.

Salary Distribution

📊 Histogram of Salary Distribution

The histogram shows a moderately right-skewed distribution, where most salaries are clustered between $60,000 and $100,000. The mode (the most common salary) is around $75,000. The KDE (kernel density estimate) curve sits smoothly over the bars, confirming this is a well-behaved continuous distribution. The mean sits slightly to the right of the peak because of those higher-end salaries pulling the average up — but not by too much, which is a good sign.

Research Questions & Visual Insights

❓ Research Question 1: Does experience linearly drive salary?

📊 Scatter Plot — Experience Years vs. Salary with Regression Line

This scatter plot shows a strong upward trend between years of experience and salary. Entry-level salaries (0–2 years) are tightly packed around $50k–$75k, while senior professionals (15+ years) show much more variation — which makes sense because at higher levels, things like specialization and industry start to matter more. The Pearson correlation is r = 0.71, which confirms a strong positive linear relationship.

Answer: Yes — experience is the single strongest linear predictor of salary in this dataset. The relationship holds up consistently across the full career span, although the spread gets wider at the top end.

❓ Research Question 2: Does the career stage create salary jumps?

📊 Box Plot — Salary by Career Stage (entry → mid → senior → lead)

The box plot makes it very clear that each career stage comes with a meaningful salary increase. The median salary climbs at every step, and the most dramatic jump happens at the Senior → Lead transition.

Career Stage	Median Salary
Entry	$63,542
Mid	$79,970
Senior	$107,608
Lead	$144,666

Answer: Yes — every transition brings a significant raise. The entry→mid jump is about $16k, mid→senior is about $28k, and the largest is senior→lead at approximately $37k (+34%). The IQR also widens at higher levels, which means pay becomes more variable the more senior you get.
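As a quick check of the arithmetic, the jumps can be recomputed directly from the medians in the table. This is only a worked calculation, not code from the notebook.

```python
# Median salary per career stage, copied from the table above.
medians = {"entry": 63_542, "mid": 79_970, "senior": 107_608, "lead": 144_666}

stages = list(medians)
jumps = {}
for prev, nxt in zip(stages, stages[1:]):
    diff = medians[nxt] - medians[prev]        # absolute raise in USD
    pct = 100 * diff / medians[prev]           # raise relative to the previous stage
    jumps[f"{prev}->{nxt}"] = (diff, pct)
    print(f"{prev} -> {nxt}: +${diff:,} (+{pct:.1f}%)")
```

In dollar terms the senior to lead step is the largest; in relative terms the mid to senior and senior to lead steps are about the same size.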

❓ Research Question 3: Do more working hours mean higher pay?

📊 Joint Hex Plot — Weekly Hours vs. Salary

This one genuinely surprised me. The joint plot shows a completely flat distribution — there's no trend at all between hours worked and salary. The density looks the same no matter how many hours per week someone works. The correlation coefficient is r = 0.00.

Answer: Surprisingly, no. The linear correlation between weekly hours and salary is effectively zero (r = 0.00). In this dataset the AI/Data Science market looks results-oriented: it rewards what you know and what level you're at, not how many hours you clock in. That's an interesting finding from a labor economics perspective.

❓ Research Question 4: Does a higher cost of living mean higher pay?

📊 Scatter Plot — Cost of Living Index vs. Salary with Trendline

Another surprising result. The trendline is essentially flat, and the scatter points are uniformly distributed across all cost-of-living values. The correlation is also r = 0.00.

Answer: No. In this dataset the AI talent market appears to be globalized: salaries track skill demand and seniority at an international level, not what things cost locally. This means a data scientist in Brazil and one in Switzerland doing the same role can earn similar USD salaries. It's a really interesting finding, and it suggests that the market for AI skills is one of the few labor markets that have truly gone global.

Baseline Regression Model

Goal & Feature Selection

The goal of this section was to build a simple, interpretable starting point before adding more complexity. I picked the five features that seemed most intuitively important based on the EDA — years of experience, career level, job role, hours worked, and cost of living. After one-hot encoding the categorical columns, this gave me 13 total features for the baseline model.
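One way to arrive at 13 columns is sketched below. It assumes one reference level was dropped per categorical column (drop_first=True), which is consistent with the count (3 numeric + 3 experience-level dummies + 7 job-role dummies) but is not confirmed by the text. The role names are placeholders.

```python
import pandas as pd

levels = ["entry", "mid", "senior", "lead"]
roles = [f"role_{i}" for i in range(8)]   # placeholder names for the 8 job roles

n = 8
toy = pd.DataFrame({
    "experience_years": list(range(n)),
    "experience_level": [levels[i % 4] for i in range(n)],
    "job_role": roles,
    "weekly_hours": [40 + i for i in range(n)],
    "cost_of_living_index": [0.5 + 0.25 * i for i in range(n)],
})

# One-hot encode the two categorical columns, dropping one reference level each
# (assumption) so the dummies are not perfectly collinear in a linear model.
X = pd.get_dummies(toy, columns=["experience_level", "job_role"], drop_first=True)
print(X.shape[1], "features")
```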

Train-Test Split

I split the data into 80% training (71,194 samples) and 20% testing (17,799 samples). I used random_state=42 throughout the entire project to make sure every run gives the same results.
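A minimal sketch of the split with scikit-learn's train_test_split, using placeholder arrays of the same length as the cleaned dataset. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 88_993                               # rows left after cleaning
X = np.arange(n_rows).reshape(-1, 1)          # placeholder feature matrix
y = np.zeros(n_rows)                          # placeholder target

# 80/20 split with a fixed seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```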

Baseline Evaluation Results

Metric	Value	Interpretation
R² Score	0.5780	Explains 57.8% of salary variance
MAE	$20,009.46	Average prediction error of ~$20k
RMSE	$27,243.05	Larger errors exist for high-salary predictions

Not bad for a starting point, but there's clearly a lot of room for improvement — especially for high earners.
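For reference, the three metrics can be computed by hand as in the sketch below. The toy predictions are made up; they include one large miss on a high salary to show why RMSE ends up above MAE.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R², MAE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                       # average absolute error
    rmse = np.sqrt(np.mean(err ** 2))                # penalizes large errors more
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                         # share of variance explained
    return {"r2": r2, "mae": mae, "rmse": rmse}

# Toy example: three small errors and one large miss on a high earner.
y_true = [50_000, 80_000, 110_000, 160_000]
y_pred = [55_000, 75_000, 105_000, 130_000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```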

Visual: Actual vs. Predicted (Baseline)

📊 Scatter Plot — Actual vs. Predicted Salary (Baseline)

The scatter cloud roughly follows the red diagonal line (which would represent a perfect model), which tells us the model is at least directionally correct. But the cloud widens noticeably above $125k — the model consistently struggles with high earners. You can also see distinct "stripes" of points, which come from the categorical jumps introduced by one-hot encoding. This is expected at the baseline stage and will improve once we engineer better features.

Feature Importance — Baseline Coefficients

📊 Horizontal Bar Chart — Baseline Model Coefficients

The bar chart shows how much each feature pushes the predicted salary up or down (in dollars). Some highlights:

Feature	Coefficient	Interpretation
experience_years	+$4,719	Each extra year adds ~$4.7k
job_role_data_analyst	−$26,854	Data Analysts earn ~$27k below AI engineering roles
job_role_research_scientist	+$10,997	Research Scientists earn a ~$11k premium
experience_level_lead	+$6,594	Lead title adds ~$6.6k on top of experience
weekly_hours	~$0	Hours worked have no salary impact
cost_of_living_index	~$0	Local economics don't drive nominal salary

This confirms what we saw in the EDA — hours and cost of living are essentially irrelevant to salary. The model is already picking that up on its own.

Feature Engineering & Clustering

This section was the biggest turning point of the whole project. By engineering smarter features without changing the algorithm at all, I was able to dramatically improve performance. Here's what I did.

K-Means Clustering (New Feature: salary_cluster)

I applied K-Means clustering with k=4 on three standardized features — experience years, salary, and cost of living index — to identify distinct "career archetypes" in the data. The idea is that knowing which cluster a person belongs to gives the model extra context about where they sit in the overall labor market.

Cluster	Avg. Experience	Avg. Salary	Profile
0	13.0 years	$134,357	Senior High-Earner
1	3.3 years	$71,150	Junior Mid-Earner
2	3.1 years	$70,157	Junior Low-Earner
3	13.3 years	$138,240	Senior Top-Earner

The algorithm naturally split the workforce into two main tiers: high-experience/high-pay (clusters 0 and 3) and low-experience/lower-pay (clusters 1 and 2). This new cluster label gives the regression model a kind of "career archetype anchor" that a raw experience count can't fully capture.
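A minimal sketch of the clustering step on synthetic data, assuming scikit-learn's StandardScaler and KMeans. The synthetic columns only mimic the three inputs named above; n_init=10 is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins for experience_years, salary_usd and cost_of_living_index.
experience = rng.uniform(0, 20, n)
salary = 50_000 + 5_000 * experience + rng.normal(0, 8_000, n)
cost_of_living = rng.uniform(0.5, 2.5, n)
features = np.column_stack([experience, salary, cost_of_living])

# Standardize first so that salary (tens of thousands) does not dominate the distances.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
salary_cluster = kmeans.fit_predict(scaled)   # one label in {0, 1, 2, 3} per row
print(np.bincount(salary_cluster))
```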

PCA Visualization of Clusters

📊 PCA Scatter Plot — 2D Cluster Visualization

To check that the clusters actually make sense, I used PCA (Principal Component Analysis) to reduce the 3D clustering space down to 2 dimensions so we can visualize it. The four clusters show up as four clearly separated groups, which means the K-Means algorithm found real, meaningful structure in the data and not just arbitrary divisions. The horizontal axis (PC1) mainly captures the seniority/income dimension, while the vertical axis (PC2) captures differences in economic context.

DBSCAN Outlier Detection

📊 PCA Scatter Plot — DBSCAN Outlier Map

I also ran DBSCAN — a density-based clustering algorithm that can flag outliers as "noise" points — to double-check for unusual data points. The result: zero outliers detected. Every single data point was classified as a cluster member, which confirms that our dataset is highly consistent and well-structured. The patterns captured by our features hold globally across all samples.
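A small illustration of how DBSCAN flags noise, on made-up 2D data. The eps and min_samples values are arbitrary choices for this toy example, not the settings used in the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(0, 0.3, size=(200, 2))     # one tight blob
far_point = np.array([[8.0, 8.0]])            # an isolated point
X = np.vstack([dense, far_point])

# DBSCAN labels points that belong to no dense region as -1 ("noise").
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_noise = int(np.sum(labels == -1))
print("noise points:", n_noise)
```

A result of zero noise points, as reported above, means no label equals -1.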

Distance to Centroid Feature

On top of the cluster label itself, I added another engineered feature that measures each data point's Euclidean distance to the center of its nearest cluster. This gives the model a continuous measure of how "typical" a salary profile is — someone sitting right in the middle of a cluster is very representative, while someone near the edge is more unusual. This kind of nuance is hard to capture any other way.
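One way to compute such a feature, sketched on random data: KMeans.transform returns the distance from every point to every centroid, and the row-wise minimum is the distance to the nearest one. The name dist_to_centroid is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # stand-in for the standardized features

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# transform() gives an (n_samples, n_clusters) matrix of Euclidean distances.
distances = kmeans.transform(X)
dist_to_centroid = distances.min(axis=1)      # distance to the nearest centroid
print(dist_to_centroid[:5])
```

Small values mark "typical" members of a cluster, large values mark points near its edge.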

Polynomial Features

To capture the fact that salary growth accelerates with seniority (the relationship is not a straight line: each additional year is worth more at the senior end than at the junior end), I applied a degree-2 polynomial transformation to the experience and cost-of-living features. This produced five columns from the original two, the most important being experience_years², which captures the accelerating (quadratic) return on seniority.

Encoding & Scaling

All categorical variables were one-hot encoded, and the entire enhanced feature set was standardized using StandardScaler before training. This is necessary because algorithms like K-Means and regularized models are sensitive to feature scale — without standardization, a feature measured in thousands (like salary) would dominate over one measured in single digits (like a rating score).

Enhanced Model Performance (Linear Regression on Engineered Features)

Metric	Baseline	Enhanced	Change
R² Score	0.578	0.909	+57.27%
MAE	$20,009	$9,842	−50.8%

Just by engineering better features — without changing the algorithm at all — the MAE was cut roughly in half and R² jumped from 0.578 to 0.909. That's a massive improvement and shows why feature engineering is considered one of the most important skills in data science.
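The percentages in the table are relative changes, which can be checked from the rounded values:

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100 * (new - old) / old

r2_gain = pct_change(0.578, 0.909)        # R² baseline -> enhanced
mae_drop = pct_change(20_009, 9_842)      # MAE baseline -> enhanced

print(f"R²: {r2_gain:+.2f}%  MAE: {mae_drop:+.1f}%")
```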

Improved Regression Models & Winner

With the fully engineered feature set ready, I trained three different regression models and compared them head-to-head.

Models Trained

The three models I compared were: Linear Regression (re-trained on the enhanced features as a strong benchmark), Random Forest Regressor (an ensemble of decision trees), and Gradient Boosting Regressor (a sequential ensemble that corrects its own errors step by step).

Comparison Results

Model	MAE	R² Score
Baseline Linear Regression (raw features)	$20,009	0.578
Linear Regression (engineered features)	$9,842	0.909
Gradient Boosting Regressor	$8,632	0.928
🏆 Random Forest Regressor	$7,999	0.938

Total improvement over baseline: +62.29% in R²

Regression Winner: Random Forest Regressor

The Random Forest came out on top with the lowest MAE ($7,999) and highest R² (0.938). In other words, the model's salary predictions miss the true value by about $8,000 on average, which is impressive given that this is a global dataset with 12 different countries and enormous variation in role types.

The Random Forest handled the high-dimensional one-hot encoded data (with many country and role dummy columns) without overfitting. It naturally captured non-linear thresholds, like the fact that the "lead" job title brings a sudden salary jump only after a certain experience level. It's also robust to the multicollinearity that inevitably exists between our polynomial features, which is something Linear Regression struggles with.

Random Forest Feature Importance

📊 Horizontal Bar Chart — Top 15 Most Important Features (Random Forest)

The chart shows impurity-based importance scores (often called Gini importance; for a regressor the impurity is the variance of the target) for the top 15 features. The key finding is that the engineered experience_years² feature came out as the single most important variable in the entire model, beating even the original experience_years. This supports the hypothesis that salary growth is non-linear and accelerates with seniority.

Rank	Feature	Importance	Interpretation
1	experience_years²	0.27	Non-linear seniority effect is #1 predictor
2	experience_years	0.22	Linear seniority also highly important
3	country_india	~0.07	India shows a distinct downward salary adjustment
4	country_brazil	~0.06	Brazil similarly adjusts salary down
5	bonus_usd	~0.05	Total compensation package matters
6	country_usa	~0.04	USA carries a premium above global baseline
7–8	salary_cluster_1/2	~0.03 each	Cluster features provide tier anchoring
9	employee_satisfaction	~0.03	Higher pay correlates with better satisfaction

Regression-to-Classification

Class Creation Strategy

For the second half of the project, I reframed the salary prediction problem as a 3-class classification task. Instead of predicting an exact number, the model has to decide whether someone's salary falls in the Low, Medium, or High bracket.

To create these classes, I used quantile binning, which divides the salary distribution into three equally sized groups. Each class therefore contains about 33.3% of the data, which prevents the model from getting "lazy" and just predicting the most common class.

Class	Range	Label
Class 0	$28,000 – $71,688	Low
Class 1	$71,688 – $108,154	Medium
Class 2	$108,154 – $212,748	High

This framing also has real business value — it's often more useful to know "this person is a high earner" than to know an exact predicted salary that might be off by $8k anyway.
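Quantile binning of this kind can be sketched with pandas.qcut. The salaries below are random placeholders, so the bin edges will not match the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
salary = pd.Series(rng.uniform(28_000, 212_000, size=3_000), name="salary_usd")

# q=3 cuts at the 1/3 and 2/3 quantiles, so each tier gets about a third of the rows.
tier, edges = pd.qcut(salary, q=3, labels=["Low", "Medium", "High"], retbins=True)
counts = tier.value_counts()

print(counts)
print(edges)   # the four bin edges: min, 1/3 quantile, 2/3 quantile, max
```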

Class Balance Check

📊 Count Plot — Distribution of Salary Classes

Class	Count	Percentage
Low	23,731	33.33%
Medium	23,729	33.33%
High	23,734	33.34%

The three bars are almost identical in height, which is exactly what we want. (The counts sum to 71,194, the size of the training split.) With classes this evenly balanced, accuracy is a reliable and honest metric: there's no majority class that would artificially inflate it.

Classification Models & Winner

Precision vs. Recall — Why It Matters Here

Before jumping into results, I think it's worth explaining the model evaluation context. In this salary classification task, Precision matters more than Recall.

Here's my reasoning: if the model predicts someone falls in the "High" salary bracket, that person is going to make real career decisions based on that — setting salary expectations, negotiating offers, maybe even turning down jobs. A False Positive (predicting "High" when it's actually "Medium") creates unrealistic expectations that lead to failed negotiations and disappointment. On the employer side, this could lead to systematic overpayment when trying to match predictions.

A False Negative (predicting "Medium" when it's actually "High") is less damaging — the person might be pleasantly surprised when reality exceeds the prediction. So across all models, I paid special attention to precision on the "High" class.
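To make the two metrics concrete, here is how per-class precision and recall fall out of a confusion matrix. The counts are invented for illustration and are not taken from any of the models below.

```python
import numpy as np

labels = ["Low", "Medium", "High"]

# Rows = actual class, columns = predicted class (made-up counts).
cm = np.array([
    [90, 10,  0],
    [ 8, 80, 12],
    [ 0,  6, 94],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # of everything predicted as X, how much was X
recall = true_positives / cm.sum(axis=1)      # of everything that was X, how much was found

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

In this toy matrix the 12 Medium cases predicted as High are the false positives that lower the "High" precision to 94/106.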

Three Classification Models

I trained three models: Logistic Regression (a linear multi-class baseline), an Optimized Random Forest with GridSearchCV hyperparameter tuning, and K-Nearest Neighbors (KNN).

Classification Reports

Model 1 — Logistic Regression

Class	Precision	Recall	F1
Low	~0.91	~0.93	~0.92
Medium	~0.79	~0.77	~0.78
High	~0.92	~0.93	~0.92
Overall Accuracy			~0.87

Model 2 — Optimized Random Forest

Class	Precision	Recall	F1
Low	~0.93	~0.95	~0.94
Medium	~0.81	~0.78	~0.80
High	~0.94	~0.94	~0.94
Overall Accuracy			~0.89

Model 3 — K-Nearest Neighbors

Class	Precision	Recall	F1
Low	~0.88	~0.88	~0.88
Medium	~0.74	~0.73	~0.74
High	~0.87	~0.88	~0.87
Overall Accuracy			~0.83

Confusion Matrix Analysis

📊 Confusion Matrix — Logistic Regression

📊 Confusion Matrix — Optimized Random Forest

📊 Confusion Matrix — K-Nearest Neighbors

The most interesting pattern across all three models is that the "Medium" salary class is the hardest to classify correctly. All three classifiers make most of their mistakes on that bracket. This makes sense: the Medium class ($71k–$108k) sits between two well-defined extremes, and the boundary between "is this person High or Medium?" is naturally more ambiguous where salary distributions overlap.

Importantly, no model ever confused Low with High directly — all misclassifications were to or from the adjacent Medium class. This means the models correctly understand the overall salary structure even when they make mistakes. The confusion matrices all show a dominant green diagonal (correct predictions), with the only off-diagonal values concentrated around the Medium row and column.

🏆 Classification Winner: Optimized Random Forest

The GridSearchCV-tuned Random Forest was the clear winner with the best overall accuracy (~89%) and the best precision on the "High" class — which is the most critical metric for this use case. The best hyperparameters found by cross-validation were: max_depth=20, min_samples_split=5, n_estimators=100.
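A minimal sketch of this kind of tuning on synthetic data. The parameter grid, cv=3 and the accuracy scoring are assumptions chosen to keep the example fast; the grid actually searched in the notebook is not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the salary tiers.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Hypothetical grid that contains the reported best values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                 # 3-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```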

It outperformed Logistic Regression because salary classification involves non-linear interactions between features (like experience × job role × country) that a linear boundary can't represent well. It outperformed KNN because KNN suffers from the curse of dimensionality in our high-dimensional one-hot encoded feature space — as you add more dimensions, the concept of "nearest neighbor" breaks down.

Model Files

This repository contains the following files:

File	Description
winning_salary_model.pkl	Regression model — Random Forest Regressor (R² = 0.938, MAE = $7,999)
salary_classifier.pkl	Classification model — Optimized Random Forest (Accuracy ~89%)
*.ipynb	Full Python notebook with all code, outputs, and visualizations

To load and use the saved models, open the pickle files with Python's built-in pickle library. The regression model pickle contains the trained model, the fitted scaler, and the feature list all bundled together. Always preprocess new input data with the same scaler before making predictions — otherwise the model will produce garbage outputs because the feature scales will be wrong.
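The loading pattern could look like the sketch below. The dictionary keys "model", "scaler" and "features" are hypothetical, since the exact layout of the pickle is not documented here; the example builds a tiny stand-in bundle in memory so it can run without the real file.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Build a tiny stand-in bundle (toy data, hypothetical key names).
feature_list = ["experience_years", "weekly_hours"]
X = np.array([[1, 40], [5, 45], [10, 42], [15, 50]], dtype=float)
y = np.array([60_000, 80_000, 110_000, 140_000], dtype=float)
scaler = StandardScaler().fit(X)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(scaler.transform(X), y)
blob = pickle.dumps({"model": model, "scaler": scaler, "features": feature_list})

# Loading side. With the real file this would be pickle.load() on the opened .pkl.
bundle = pickle.loads(blob)

# Order the inputs by the saved feature list, scale them, then predict.
new_row = {"experience_years": 8, "weekly_hours": 44}
x = np.array([[new_row[name] for name in bundle["features"]]], dtype=float)
prediction = bundle["model"].predict(bundle["scaler"].transform(x))[0]
print(round(prediction))
```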

Key Takeaways & Lessons Learned

What Worked

Feature Engineering was the biggest unlock. I honestly didn't realize how much it would matter until I saw the results. Moving from raw features to engineered ones (polynomial experience_years², K-Means salary clusters, country encoding) improved the Linear Regression R² from 0.578 to 0.909. That's a 57% relative improvement, and I didn't change the algorithm at all. It really drove home the idea that data representation matters more than model choice in many cases.

The globalized AI market insight was genuinely surprising. Both weekly hours and cost-of-living index showed zero correlation with salary (r = 0.00 for both). This strongly suggests that the AI talent market operates on global skill demand rather than local economic conditions. I went in expecting cost of living to matter a lot, so this was a real "wait, what?" moment when I saw the scatter plot.

Clustering added structural context the model couldn't get any other way. The K-Means clusters gave the regression model a "career archetype" signal that a simple experience number couldn't fully capture — it distinguishes between a junior in a high-cost city and a senior in a low-cost region in a way that was actually meaningful to the model.

Ensemble models outperformed linear models because salary determination involves non-linear thresholds (a job title premium that only kicks in past a certain experience level), multiplicative interactions (country × seniority), and complex categorical combinations that a linear model cannot represent efficiently.

What Was Challenging

The "Medium" salary bracket was consistently the hardest class to predict in every single classification model. This is an inherent limitation of quantile binning: the middle class occupies a fuzzy zone between two well-defined extremes. There's no clean feature that says "this person is Medium and not High" — the signal-to-noise ratio at that boundary is just lower.

Choosing the right k for K-Means required actual justification rather than just trying values. I chose k=4 to reflect the four natural career stages (entry, mid, senior, lead) and validated it visually with PCA. But it was a judgment call that I had to think carefully about.

Iterative Process Summary

Starting from a baseline Linear Regression with R² of 0.578 and MAE of $20,009, adding Feature Engineering (clustering, polynomials, better encoding) brought the Enhanced Linear Regression up to R² of 0.909 and MAE of $9,842. Then switching to a Random Forest ensemble model pushed it further to the Final Regression Model at R² of 0.938 and MAE of $7,999 — a total improvement of 62.3% in R² from start to finish.

On the classification side, the weakest model (KNN) reached about 83% accuracy and the Logistic Regression baseline about 87%. The final Random Forest classifier, tuned with GridSearchCV, reached approximately 89% accuracy.

Dataset: Global AI & Data Jobs Salary Dataset (2020–2026)

Created by Maya Cheruty | Reichman University | 2026
