AI & Data Salary Prediction

🎬 Project Presentation & Walkthrough

Try the model yourself!

You can interact with this model directly using the web interface built with Gradio: Click here to use the Salary Predictor App

1. Project Overview

This project walks through the global AI & Data Science job market. The central question I wanted to answer is:

Can we accurately predict a professional's salary based on their experience, role, location, and working conditions?

To get the best results, I started with a deep EDA to really understand the data through visual research questions. After cleaning and handling outliers, I moved into advanced Feature Engineering, which was the most interesting part. I used K-Means clustering for salary profiles and added polynomial features to catch non-linear patterns. I then compared three regression models before pivoting to a classification task where I split salaries into three tiers (Low, Medium, High). Every step was backed by metrics and visualizations, and I saved the winning models as pickle files for future use.


2. Dataset

Source: Global AI & Data Jobs Salary Dataset (Kaggle / open-source).

Size: ~90,000 rows × 19 selected features (the original dataset had more columns, but I kept the most relevant ones).

Time Range: 2020–2026.

Target Variable: salary_usd (continuous, used for regression) / Salary Tier: Low / Medium / High (used for classification).

Selected Features

| Feature | Type | Description |
|---|---|---|
| salary_usd | Numeric (Target) | Annual salary in USD |
| experience_level | Categorical | Entry / Mid / Senior / Lead |
| experience_years | Numeric | Years of professional experience |
| education_required | Categorical | Minimum education level for the role |
| weekly_hours | Numeric | Average weekly working hours |
| promotion_speed | Numeric | Average months between promotions |
| bonus_usd | Numeric | Annual bonus in USD |
| job_role | Categorical | Job title (e.g., ML Engineer, Data Analyst) |
| ai_specialization | Categorical | Sub-field within AI/Data (e.g., NLP, Computer Vision) |
| work_mode | Categorical | Remote / Hybrid / Onsite |
| industry | Categorical | Business sector |
| company_size | Categorical | Small / Medium / Large / etc. |
| company_rating | Numeric | Employee rating of the company |
| hiring_difficulty_score | Numeric | How hard it is to fill the role |
| layoff_risk | Numeric | Estimated layoff risk score |
| employee_satisfaction | Numeric | Employee satisfaction score |
| country | Categorical | Country of employment (12 countries) |
| tax_rate_percent | Numeric | Local income tax rate |
| cost_of_living_index | Numeric | Local cost of living relative index |

Categorical Feature Cardinality

| Feature | Unique Values |
|---|---|
| experience_level | 4 (entry, mid, senior, lead) |
| education_required | 5 |
| job_role | 8 (top: NLP Engineer, Software Engineer AI, ...) |
| ai_specialization | 8 |
| work_mode | 3 (remote, hybrid, onsite) |
| industry | 10 |
| company_size | 5 |
| country | 12 (top: Canada, Australia, Singapore, ...) |

Exploratory Data Analysis (EDA)

Data Cleaning

Before doing any analysis, I needed to make sure the data was clean and consistent. Here's what I did step by step.

Step 1: Duplicate Removal

I checked for duplicate rows and removed any that existed. The dataset turned out to be clean already: zero duplicates were found.

Step 2: String Standardization

All categorical columns were lowercased and stripped of extra whitespace. This matters because values like "Remote" and "remote" would otherwise be encoded as two different categories.

Step 3: Salary Outlier Removal (IQR Method)

Extreme salary values were removed to make model training more stable. I used the standard IQR method: anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR was considered an outlier and removed. In total, 1,007 salary outliers were removed, leaving 88,993 clean rows.

Step 4: Logical Consistency Checks

I also filtered out any rows where experience years were negative or weekly hours were zero, since those would clearly be data entry errors.
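The four steps above can be sketched as one small function. This is a minimal illustration on a made-up demo frame, not the notebook's actual code: the function name `clean_salaries`, the `text_cols` parameter, and the demo values are all assumptions.

```python
import pandas as pd

def clean_salaries(df: pd.DataFrame, text_cols: list[str]) -> pd.DataFrame:
    """Apply the four cleaning steps in order and return a new DataFrame."""
    # Step 1: drop exact duplicate rows
    out = df.drop_duplicates().copy()

    # Step 2: lowercase and strip whitespace in the categorical columns
    for col in text_cols:
        out[col] = out[col].str.strip().str.lower()

    # Step 3: keep salaries inside the IQR fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = out["salary_usd"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out["salary_usd"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Step 4: logical checks (no negative experience, no zero-hour weeks)
    valid = (out["experience_years"] >= 0) & (out["weekly_hours"] > 0)

    return out[in_range & valid].reset_index(drop=True)

# Hypothetical demo data: one extreme salary and one negative experience value
demo = pd.DataFrame({
    "work_mode": [" Remote", "remote", "Hybrid", "onsite", "onsite", "hybrid", "remote"],
    "salary_usd": [60_000, 70_000, 80_000, 90_000, 100_000, 1_000_000, 85_000],
    "experience_years": [1, 3, 5, 7, 9, 10, -2],
    "weekly_hours": [40, 42, 45, 40, 38, 50, 40],
})
cleaned = clean_salaries(demo, text_cols=["work_mode"])
print(cleaned)
```

On the demo frame, the $1,000,000 row falls outside the upper fence and the row with negative experience fails the logical check, so five rows remain.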


Descriptive Statistics

After cleaning, here are the key numerical statistics for the dataset:

| Metric | Value |
|---|---|
| Mean Salary | $95,013.69 |
| Median Salary | $86,928.00 |
| Mean–Median Gap | $8,086 (small → balanced distribution) |
| Avg. Experience Years | 6.9 years |
| 75th Percentile Experience | ≤ 11 years |
| Avg. Weekly Hours | 45.5 hrs/week |
| Avg. Promotion Speed | ~38 months (3.2 years) |
| Cost of Living Index Range | 0.5 – 2.5 (high global diversity) |

The small gap between mean and median salary tells us the distribution is fairly balanced: it has a slight right skew but nothing extreme. This is ideal for linear regression because we don't need to apply any log transformation to the target variable.


Salary Distribution

📊 Figure: Histogram of Salary Distribution


The histogram shows a mildly right-skewed distribution, with most salaries clustered between $60,000 and $100,000. The mode (the most common salary) is around $75,000. The KDE (kernel density estimate) curve follows the bars smoothly, which indicates a well-behaved continuous distribution. The mean sits slightly to the right of the peak because the higher-end salaries pull the average up, but not by much, which is a good sign.


Research Questions & Visual Insights

❓ Research Question 1: Does experience linearly drive salary?

📊 Figure: Scatter Plot of Experience Years vs. Salary with Regression Line


This scatter plot shows a strong upward trend between years of experience and salary. Entry-level salaries (0–2 years) are tightly packed around $50k–$75k, while senior professionals (15+ years) show much more variation, which makes sense because at higher levels, things like specialization and industry start to matter more. The Pearson correlation is r = 0.71, which confirms a strong positive linear relationship.

Answer: Yes. Experience is the single strongest linear predictor of salary in this dataset. The relationship holds up consistently across the full career span, although the spread gets wider at the top end.


❓ Research Question 2: Does the career stage create salary jumps?

📊 Figure: Box Plot of Salary by Career Stage (entry → mid → senior → lead)


The box plot makes it very clear that each career stage comes with a meaningful salary increase. The median salary climbs at every step, and the largest absolute jump happens at the Senior → Lead transition.

| Career Stage | Median Salary |
|---|---|
| Entry | $63,542 |
| Mid | $79,970 |
| Senior | $107,608 |
| Lead | $144,666 |

Answer: Yes. Every transition brings a significant raise. The entry→mid jump is about $16k, mid→senior is about $28k, and the largest is senior→lead at approximately $37k (+34%). The IQR also widens at higher levels, which means pay becomes more variable the more senior you get.
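The jump sizes quoted in the answer follow directly from the median table. A short sketch of the arithmetic, using only the four medians reported above (the variable names are illustrative):

```python
# Median salary per career stage, copied from the table above
medians = {"entry": 63_542, "mid": 79_970, "senior": 107_608, "lead": 144_666}

stages = list(medians)
jumps = {}
for prev, nxt in zip(stages, stages[1:]):
    diff = medians[nxt] - medians[prev]                # absolute raise in USD
    pct = round(100 * diff / medians[prev], 1)         # raise relative to the lower stage
    jumps[f"{prev}->{nxt}"] = (diff, pct)

for name, (diff, pct) in jumps.items():
    print(f"{name}: +${diff:,} (+{pct}%)")
```

In absolute dollars the senior→lead step is the largest, while in relative terms the mid→senior and senior→lead steps come out nearly equal.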


❓ Research Question 3: Do more working hours mean higher pay?

📊 Figure: Joint Hex Plot of Weekly Hours vs. Salary


This one genuinely surprised me. The joint plot shows a completely flat distribution: there's no trend at all between hours worked and salary. The density looks the same no matter how many hours per week someone works. The correlation coefficient is r = 0.00.

Answer: Surprisingly, no. There is no measurable linear correlation between weekly hours and salary. The AI/Data Science industry appears to be a results-oriented market: it rewards what you know and what level you're at, not how many hours you clock in. That's an interesting finding from a labor economics perspective.


❓ Research Question 4: Does a higher cost of living mean higher pay?

📊 Figure: Scatter Plot of Cost of Living Index vs. Salary with Trendline


Another surprising result. The trendline is essentially flat, and the scatter points are uniformly distributed across all cost-of-living values. The correlation is also r = 0.00.

Answer: No. In this dataset, the AI talent market appears to be largely globalized: salaries track skill demand and seniority at an international level rather than local prices. This means a data scientist in Brazil and one in Switzerland doing the same role can earn similar USD salaries. It's a really interesting finding that suggests AI skills are one of the few labor markets that have truly gone global.


Baseline Regression Model

Goal & Feature Selection

The goal of this section was to build a simple, interpretable starting point before adding more complexity. I picked the five features that seemed most intuitively important based on the EDA: years of experience, career level, job role, hours worked, and cost of living. After one-hot encoding the categorical columns, this gave me 13 total features for the baseline model.

Train-Test Split

I split the data into 80% training (71,194 samples) and 20% testing (17,799 samples). I used random_state=42 throughout the entire project to make sure every run gives the same results.
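Here is a rough sketch of how five raw features can turn into 13 columns and then get split 80/20. It runs on synthetic data, the role names are placeholders, and `drop_first=True` is my assumption: it is the setting under which 3 numeric columns + 3 level dummies + 7 role dummies add up to 13.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1_000
levels = ["entry", "mid", "senior", "lead"]
roles = [f"role_{i}" for i in range(8)]  # placeholder names for the 8 job roles

# Synthetic stand-in for the cleaned dataset (values are made up)
df = pd.DataFrame({
    "experience_years": rng.integers(0, 21, n),
    "weekly_hours": rng.integers(35, 56, n),
    "cost_of_living_index": rng.uniform(0.5, 2.5, n),
    "experience_level": rng.choice(levels, n),
    "job_role": rng.choice(roles, n),
    "salary_usd": rng.normal(95_000, 30_000, n),
})

# One-hot encode the two categorical columns.
# Assumption: one reference category is dropped per column (drop_first=True).
X = pd.get_dummies(
    df.drop(columns="salary_usd"),
    columns=["experience_level", "job_role"],
    drop_first=True,
)
y = df["salary_usd"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X.shape[1], len(X_train), len(X_test))
```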

Baseline Evaluation Results

| Metric | Value | Interpretation |
|---|---|---|
| R² Score | 0.5780 | Explains 57.8% of salary variance |
| MAE | $20,009.46 | Average prediction error of ~$20k |
| RMSE | $27,243.05 | Larger errors exist for high-salary predictions |

Not bad for a starting point, but there's clearly a lot of room for improvement, especially for high earners.
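For reference, the three metrics in the table can be computed as sketched below. The data is synthetic (a single experience feature with noise), so the printed numbers will not match the table. The point is how the metrics are obtained, and why an RMSE well above the MAE hints at a few large errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2_000
experience = rng.uniform(0, 20, n)
# Synthetic target: a linear trend plus noise (not the real dataset)
salary = 50_000 + 4_700 * experience + rng.normal(0, 20_000, n)

X = experience.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, salary, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)                        # share of variance explained
mae = mean_absolute_error(y_test, pred)            # average absolute error in USD
rmse = np.sqrt(mean_squared_error(y_test, pred))   # squares errors, so large misses weigh more
print(f"R2={r2:.3f}  MAE=${mae:,.0f}  RMSE=${rmse:,.0f}")
```

RMSE can never be smaller than MAE; the wider the gap between them, the more the error is concentrated in a minority of badly missed predictions.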

Visual: Actual vs. Predicted (Baseline)

📊 Figure: Scatter Plot of Actual vs. Predicted Salary (Baseline)

The scatter cloud roughly follows the red diagonal line (which would represent a perfect model), which tells us the model is at least directionally correct. But the cloud widens noticeably above $125k: the model consistently struggles with high earners. You can also see distinct "stripes" of points, which come from the categorical jumps introduced by one-hot encoding. This is expected at the baseline stage and will improve once we engineer better features.

Feature Importance β€” Baseline Coefficients

📊 Figure: Horizontal Bar Chart of Baseline Model Coefficients

The bar chart shows how much each feature pushes the predicted salary up or down (in dollars). Some highlights:

| Feature | Coefficient | Interpretation |
|---|---|---|
| experience_years | +$4,719 | Each extra year adds ~$4.7k |
| job_role_data_analyst | −$26,854 | Data Analysts earn ~$27k below AI engineering roles |
| job_role_research_scientist | +$10,997 | Research Scientists earn a ~$11k premium |
| experience_level_lead | +$6,594 | Lead title adds ~$6.6k on top of experience |
| weekly_hours | ~$0 | Hours worked have no salary impact |
| cost_of_living_index | ~$0 | Local economics don't drive nominal salary |

This confirms what we saw in the EDA: hours and cost of living are essentially irrelevant to salary. The model is already picking that up on its own.


Feature Engineering & Clustering

This section was the biggest turning point of the whole project. By engineering smarter features without changing the algorithm at all, I was able to dramatically improve performance. Here's what I did.


K-Means Clustering (New Feature: salary_cluster)

I applied K-Means clustering with k=4 on three standardized features (experience years, salary, and cost of living index) to identify distinct "career archetypes" in the data. The idea is that knowing which cluster a person belongs to gives the model extra context about where they sit in the overall labor market.

| Cluster | Avg. Experience | Avg. Salary | Profile |
|---|---|---|---|
| 0 | 13.0 years | $134,357 | Senior High-Earner |
| 1 | 3.3 years | $71,150 | Junior Mid-Earner |
| 2 | 3.1 years | $70,157 | Junior Low-Earner |
| 3 | 13.3 years | $138,240 | Senior Top-Earner |

The algorithm naturally split the workforce into two main tiers: high-experience/high-pay (clusters 0 and 3) and low-experience/lower-pay (clusters 1 and 2). This new cluster label gives the regression model a kind of "career archetype anchor" that a raw experience count can't fully capture.
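The clustering step described above can be sketched roughly like this. The data is synthetic and `n_init=10` is an assumption; only k=4, the three input features, and the standardization come from the description.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2_000
# Synthetic stand-ins for the three clustering inputs
experience = rng.uniform(0, 20, n)
salary = 50_000 + 4_700 * experience + rng.normal(0, 15_000, n)
col_index = rng.uniform(0.5, 2.5, n)

# Standardize first so salary (tens of thousands) does not dominate the distances
features = np.column_stack([experience, salary, col_index])
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)
salary_cluster = kmeans.labels_   # one label in {0, 1, 2, 3} per row

# Per-cluster profile, comparable in spirit to the table above
for c in range(4):
    mask = salary_cluster == c
    print(c, round(experience[mask].mean(), 1), round(salary[mask].mean()))
```

Note that salary itself is one of the clustering inputs here, as described above, so assigning a cluster to a new record requires a salary value.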


PCA Visualization of Clusters

📊 Figure: PCA Scatter Plot, 2D Cluster Visualization


To check that the clusters actually make sense, I used PCA (Principal Component Analysis) to reduce the 3D clustering space down to 2 dimensions so we can visualize it. The four clusters show up as four clearly separated groups, which means the K-Means algorithm found real, meaningful structure in the data and not just arbitrary divisions. The horizontal axis (PC1) mainly captures the seniority/income dimension, while the vertical axis (PC2) captures differences in economic context.


DBSCAN Outlier Detection

📊 Figure: PCA Scatter Plot, DBSCAN Outlier Map


I also ran DBSCAN, a density-based clustering algorithm that can flag outliers as "noise" points, to double-check for unusual data points. The result: zero outliers detected. Every single data point was classified as a cluster member, which indicates that our dataset is highly consistent and well-structured at the chosen density settings. The patterns captured by our features hold globally across all samples.
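To show what "noise" means in DBSCAN terms, here is a small sketch on synthetic data with one deliberately planted far-away point. The `eps` and `min_samples` values are illustrative assumptions, not the settings used in the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
dense = rng.normal(0, 1, size=(500, 3))        # a dense cloud of ordinary points
far_point = np.array([[25.0, 25.0, 25.0]])     # one planted outlier
scaled = StandardScaler().fit_transform(np.vstack([dense, far_point]))

# eps and min_samples are illustrative; results depend heavily on both
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(scaled)

# DBSCAN marks points that belong to no dense region with the label -1
n_noise = int((labels == -1).sum())
print("noise points:", n_noise, "| planted point label:", labels[-1])
```

A result of zero noise points, as reported above, means no row ended up with the label -1 under the chosen `eps` and `min_samples`.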


Distance to Centroid Feature

On top of the cluster label itself, I added another engineered feature that measures each data point's Euclidean distance to the center of its nearest cluster. This gives the model a continuous measure of how "typical" a salary profile is: someone sitting right in the middle of a cluster is very representative, while someone near the edge is more unusual. This kind of nuance is hard to capture any other way.
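One way to compute such a distance feature with scikit-learn is through `KMeans.transform`, which returns the distance from every row to every centroid. This is a sketch on random data; the variable name `dist_to_centroid` is my own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
scaled = StandardScaler().fit_transform(rng.normal(size=(1_000, 3)))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)

# transform() gives the Euclidean distance from each row to each of the 4 centroids
all_distances = kmeans.transform(scaled)        # shape (n_samples, 4)

# The new feature: distance to the nearest centroid (small = typical profile)
dist_to_centroid = all_distances.min(axis=1)

# Cross-check against a manual computation using each row's assigned centroid
manual = np.linalg.norm(scaled - kmeans.cluster_centers_[kmeans.labels_], axis=1)
print(np.allclose(dist_to_centroid, manual))
```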


Polynomial Features

To capture the fact that salary growth accelerates with seniority (it's not a straight line: pay rises disproportionately at senior levels), I applied a degree-2 polynomial transformation to the experience and cost-of-living features. This produced five columns from the original two (for two inputs, a degree-2 expansion without a bias term gives the two original terms, their squares, and their interaction). The most important is experience_years², which captures the accelerating return on seniority.


Encoding & Scaling

All categorical variables were one-hot encoded, and the entire enhanced feature set was standardized using StandardScaler before training. This is necessary because algorithms like K-Means and regularized models are sensitive to feature scale: without standardization, a feature measured in thousands (like salary) would dominate over one measured in single digits (like a rating score).


Enhanced Model Performance (Linear Regression on Engineered Features)

| Metric | Baseline | Enhanced | Change |
|---|---|---|---|
| R² Score | 0.578 | 0.909 | +57.27% |
| MAE | $20,009 | $9,842 | −50.8% |

Just by engineering better features, without changing the algorithm at all, the MAE was cut roughly in half and R² jumped from 0.578 to 0.909. That's a massive improvement and shows why feature engineering is considered one of the most important skills in data science.


Improved Regression Models & Winner

With the fully engineered feature set ready, I trained three different regression models and compared them head-to-head.


Models Trained

The three models I compared were: Linear Regression (re-trained on the enhanced features as a strong benchmark), Random Forest Regressor (an ensemble of decision trees), and Gradient Boosting Regressor (a sequential ensemble that corrects its own errors step by step).


Comparison Results

| Model | MAE | R² Score |
|---|---|---|
| Baseline Linear Regression (raw features) | $20,009 | 0.578 |
| Linear Regression (engineered features) | $9,842 | 0.909 |
| Gradient Boosting Regressor | $8,632 | 0.928 |
| 🏆 Random Forest Regressor | $7,999 | 0.938 |

Total improvement over baseline: +62.29% in R²


Regression Winner: Random Forest Regressor

The Random Forest came out on top with the lowest MAE ($7,999) and highest R² (0.938). In other words, the model's salary predictions are off by about $8,000 on average, which is impressive given this is a global dataset with 12 different countries and enormous variation in role types.

Three things explain the win. It handled the high-dimensional one-hot encoded data (with many country and role dummy columns) without overfitting. It naturally captured non-linear thresholds, like the fact that the "lead" job title brings a sudden salary jump only after a certain experience level. It's also robust to the multicollinearity that inevitably exists between our polynomial features, which is something Linear Regression struggles with.


Random Forest Feature Importance

📊 Figure: Horizontal Bar Chart of the Top 15 Most Important Features (Random Forest)


The chart shows impurity-based (Gini) importance scores for the top 15 features. The key finding is that the engineered experience_years² feature came out as the single most important variable in the entire model, beating even the original experience_years. This supports the hypothesis that salary growth is non-linear and accelerates with seniority.

| Rank | Feature | Importance | Interpretation |
|---|---|---|---|
| 1 | experience_years² | 0.27 | Non-linear seniority effect is #1 predictor |
| 2 | experience_years | 0.22 | Linear seniority also highly important |
| 3 | country_india | ~0.07 | India shows a distinct downward salary adjustment |
| 4 | country_brazil | ~0.06 | Brazil similarly adjusts salary down |
| 5 | bonus_usd | ~0.05 | Total compensation package matters |
| 6 | country_usa | ~0.04 | USA carries a premium above global baseline |
| 7–8 | salary_cluster_1/2 | ~0.03 each | Cluster features provide tier anchoring |
| 9 | employee_satisfaction | ~0.03 | Higher pay correlates with better satisfaction |
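A ranking like the one above can be read off a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data with three features, where only experience actually drives the target, so the numbers are not comparable to the table.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 1_500
# Synthetic features; only experience_years influences the synthetic target
X = pd.DataFrame({
    "experience_years": rng.uniform(0, 20, n),
    "weekly_hours": rng.uniform(35, 55, n),
    "cost_of_living_index": rng.uniform(0.5, 2.5, n),
})
y = 50_000 + 4_700 * X["experience_years"] + rng.normal(0, 10_000, n)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across all features
importance = (
    pd.Series(forest.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importance.round(3))
```

Because the scores sum to 1, a value such as 0.27 can be read as roughly a quarter of the total impurity reduction across the forest.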

Regression-to-Classification

Class Creation Strategy

For the second half of the project, I reframed the salary prediction problem as a 3-class classification task. Instead of predicting an exact number, the model has to decide whether someone's salary falls in the Low, Medium, or High bracket.

To create these classes, I used quantile binning, which divides the salary distribution into three equally sized groups. This ensures each class contains roughly 33.3% of the data, which prevents the model from getting "lazy" and just predicting the most common class.

| Class | Range | Label |
|---|---|---|
| Class 0 | $28,000 – $71,688 | Low |
| Class 1 | $71,688 – $108,154 | Medium |
| Class 2 | $108,154 – $212,748 | High |

This framing also has real business value: it's often more useful to know "this person is a high earner" than to know an exact predicted salary that might be off by $8k anyway.
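Quantile binning into three tiers can be sketched with `pandas.qcut`. The salaries below are synthetic (drawn from a log-normal distribution), so the bin edges will differ from the table; the labels match the ones used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed salaries, not the real dataset
salary = pd.Series(
    rng.lognormal(mean=11.4, sigma=0.35, size=9_000), name="salary_usd"
)

# q=3 cuts at the 1/3 and 2/3 quantiles; retbins=True also returns the edges
tier, edges = pd.qcut(
    salary, q=3, labels=["Low", "Medium", "High"], retbins=True
)

print(edges.round(0))                                # 4 edges define 3 bins
print(tier.value_counts(normalize=True).round(4))    # share of rows per tier
```

Keeping the returned `edges` makes it possible to assign tiers to new salaries later with the same cut points.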


Class Balance Check

📊 Figure: Count Plot of the Distribution of Salary Classes


| Class | Count | Percentage |
|---|---|---|
| Low | 23,731 | 33.33% |
| Medium | 23,729 | 33.33% |
| High | 23,734 | 33.34% |

The three bars are almost identical in height, which is exactly what we want. With classes this closely balanced, accuracy is a reliable and honest metric: there's no majority class that would artificially inflate it.


Classification Models & Winner

Precision vs. Recall: Why It Matters Here

Before jumping into results, I think it's worth explaining the model evaluation context. In this salary classification task, Precision matters more than Recall.

Here's my reasoning: if the model predicts someone falls in the "High" salary bracket, that person is going to make real career decisions based on that, such as setting salary expectations, negotiating offers, maybe even turning down jobs. A False Positive (predicting "High" when it's actually "Medium") creates unrealistic expectations that lead to failed negotiations and disappointment. On the employer side, this could lead to systematic overpayment when trying to match predictions.

A False Negative (predicting "Medium" when it's actually "High") is less damaging: the person might be pleasantly surprised when reality exceeds the prediction. So across all models, I paid special attention to precision on the "High" class.
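To make the distinction concrete, here is a tiny hand-made example of precision and recall for the "High" class only. The eight labels are invented for illustration and have nothing to do with the real predictions.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels: three true "High", three true "Medium", two true "Low"
y_true = ["High", "High", "High", "Medium", "Medium", "Medium", "Low", "Low"]
y_pred = ["High", "High", "Medium", "High", "High", "Medium", "Low", "Low"]

# labels=["High"] restricts scoring to that class;
# average=None returns one value per requested label
high_precision = precision_score(y_true, y_pred, labels=["High"], average=None)[0]
high_recall = recall_score(y_true, y_pred, labels=["High"], average=None)[0]

print(f"precision(High)={high_precision:.3f}  recall(High)={high_recall:.3f}")
```

Here the model predicts "High" four times but is right only twice, so precision is 0.5. It finds two of the three true "High" cases, so recall is about 0.667. The two false positives are exactly the costly errors described above.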


Three Classification Models

I trained three models: Logistic Regression (a linear multi-class baseline), an Optimized Random Forest with GridSearchCV hyperparameter tuning, and K-Nearest Neighbors (KNN).


Classification Reports

Model 1: Logistic Regression

| Class | Precision | Recall | F1 |
|---|---|---|---|
| Low | ~0.91 | ~0.93 | ~0.92 |
| Medium | ~0.79 | ~0.77 | ~0.78 |
| High | ~0.92 | ~0.93 | ~0.92 |

Overall Accuracy: ~0.87

Model 2: Optimized Random Forest

| Class | Precision | Recall | F1 |
|---|---|---|---|
| Low | ~0.93 | ~0.95 | ~0.94 |
| Medium | ~0.81 | ~0.78 | ~0.80 |
| High | ~0.94 | ~0.94 | ~0.94 |

Overall Accuracy: ~0.89

Model 3: K-Nearest Neighbors

| Class | Precision | Recall | F1 |
|---|---|---|---|
| Low | ~0.88 | ~0.88 | ~0.88 |
| Medium | ~0.74 | ~0.73 | ~0.74 |
| High | ~0.87 | ~0.88 | ~0.87 |

Overall Accuracy: ~0.83

Confusion Matrix Analysis

📊 Figure: Confusion Matrix, Logistic Regression


📊 Figure: Confusion Matrix, Optimized Random Forest


📊 Figure: Confusion Matrix, K-Nearest Neighbors


The most interesting pattern across all three models is that the "Medium" salary class is the hardest to classify correctly. All three classifiers make most of their mistakes on that bracket. This makes complete sense: the Medium class ($71k–$108k) sits between two well-defined extremes, and the boundary between "is this person High or Medium?" is naturally more ambiguous where salary distributions overlap.

Importantly, no model ever confused Low with High directly; all misclassifications were to or from the adjacent Medium class. This means the models correctly understand the overall salary structure even when they make mistakes. The confusion matrices all show a dominant green diagonal (correct predictions), with the only off-diagonal values concentrated around the Medium row and column.


🏆 Classification Winner: Optimized Random Forest

The GridSearchCV-tuned Random Forest was the clear winner with the best overall accuracy (~89%) and the best precision on the "High" class, which is the most critical metric for this use case. The best hyperparameters found by cross-validation were: max_depth=20, min_samples_split=5, n_estimators=100.
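A grid search of this kind might look like the sketch below. The parameter grid, the 3-fold CV, and the accuracy scoring are my assumptions; the grid is built to include the reported best values. The data is a synthetic 3-class problem from `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class problem standing in for the Low / Medium / High task
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical grid; it contains the reported best values (20, 5, 100)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                  # assumption: 3-fold cross-validation
    scoring="accuracy",    # assumption: accuracy as the selection metric
)
search.fit(X_train, y_train)

print(search.best_params_)
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```

Since precision on the "High" class is the priority here, a precision-oriented scorer would be a reasonable alternative to plain accuracy as the selection metric.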

It outperformed Logistic Regression because salary classification involves non-linear interactions between features (like experience × job role × country) that a linear boundary can't represent well. It outperformed KNN because KNN suffers from the curse of dimensionality in our high-dimensional one-hot encoded feature space: as you add more dimensions, the concept of "nearest neighbor" breaks down.


Model Files

This repository contains the following files:

| File | Description |
|---|---|
| winning_salary_model.pkl | Regression model: Random Forest Regressor (R² = 0.938, MAE = $7,999) |
| salary_classifier.pkl | Classification model: Optimized Random Forest (Accuracy ~89%) |
| *.ipynb | Full Python notebook with all code, outputs, and visualizations |

To load and use the saved models, open the pickle files with Python's built-in pickle library. The regression model pickle contains the trained model, the fitted scaler, and the feature list all bundled together. Always preprocess new input data with the same scaler before making predictions; otherwise the model will produce garbage outputs because the feature scales will be wrong.
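The load-and-predict flow might look roughly like this. The dictionary keys (`"model"`, `"scaler"`, `"features"`) are hypothetical, since the actual layout of the pickle is not shown here, so inspect the loaded object first. To stay self-contained, the sketch builds a small stand-in bundle in memory instead of reading winning_salary_model.pkl.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# --- Build and serialize a stand-in bundle (keys are hypothetical) ---
rng = np.random.default_rng(42)
feature_list = ["experience_years", "cost_of_living_index"]
X = rng.uniform([0, 0.5], [20, 2.5], size=(300, 2))
y = 50_000 + 4_700 * X[:, 0] + rng.normal(0, 5_000, 300)

scaler = StandardScaler().fit(X)
model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(scaler.transform(X), y)

blob = pickle.dumps({"model": model, "scaler": scaler, "features": feature_list})
# With a real file you would use: pickle.load(open(path, "rb"))

# --- Load, then predict with the SAME scaler and column order ---
bundle = pickle.loads(blob)
new_row = np.array([[8.0, 1.2]])   # values ordered as in bundle["features"]
scaled_row = bundle["scaler"].transform(new_row)
prediction = bundle["model"].predict(scaled_row)[0]
print(f"predicted salary: ${prediction:,.0f}")
```

Only unpickle files from a source you trust, because loading a pickle can execute arbitrary code.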


Key Takeaways & Lessons Learned

What Worked

Feature Engineering was the biggest unlock. I honestly didn't realize how much it would matter until I saw the results. Moving from raw features to engineered ones (polynomial experience_years², K-Means salary clusters, country encoding) improved the Linear Regression R² from 0.578 to 0.909. That's a 57% relative improvement, and I didn't change the algorithm at all. It really drove home the idea that data representation matters more than model choice in many cases.

The globalized AI market insight was genuinely surprising. Both weekly hours and cost-of-living index showed zero correlation with salary (r = 0.00 for both). This suggests that the AI talent market operates on global skill demand rather than local economic conditions. I went in expecting cost of living to matter a lot, so this was a real "wait, what?" moment when I saw the scatter plot.

Clustering added structural context the model couldn't get any other way. The K-Means clusters gave the regression model a "career archetype" signal that a simple experience number couldn't fully capture: it distinguishes between a junior in a high-cost city and a senior in a low-cost region in a way that was actually meaningful to the model.

Ensemble models outperformed linear models because salary determination involves non-linear thresholds (a job title premium that only kicks in past a certain experience level), multiplicative interactions (country × seniority), and complex categorical combinations that a linear model cannot represent efficiently.

What Was Challenging

The "Medium" salary bracket was consistently the hardest class to predict in every single classification model. This is an inherent limitation of quantile binning: the middle class occupies a fuzzy zone between two well-defined extremes. There's no clean feature that says "this person is Medium and not High"; the signal-to-noise ratio at that boundary is just lower.

Choosing the right k for K-Means required actual justification rather than just trying values. I chose k=4 to reflect the four natural career stages (entry, mid, senior, lead) and validated it visually with PCA. But it was a judgment call that I had to think carefully about.


Iterative Process Summary

Starting from a baseline Linear Regression with R² of 0.578 and MAE of $20,009, adding Feature Engineering (clustering, polynomials, better encoding) brought the Enhanced Linear Regression up to R² of 0.909 and MAE of $9,842. Then switching to a Random Forest ensemble model pushed it further to the Final Regression Model at R² of 0.938 and MAE of $7,999, a total relative improvement of 62.3% in R² from start to finish.

On the classification side, the weakest model (KNN) achieved about 83% accuracy and the Logistic Regression baseline about 87%. The final Random Forest, tuned with GridSearchCV, reached approximately 89% accuracy.

Dataset: Global AI & Data Jobs Salary Dataset (2020–2026)

Created by Maya Cheruty | Reichman University | 2026
