AI & Data Salary Prediction
Project Presentation & Walkthrough
Try the model yourself!
You can interact with this model directly using the web interface built with Gradio: Click here to use the Salary Predictor App
1. Project Overview
This project walks through the global AI & Data Science job market. The central question I wanted to answer is:
Can we accurately predict a professional's salary based on their experience, role, location, and working conditions?
To get the best results, I started with a deep EDA to understand the data through visual research questions. After cleaning the data and handling outliers, I moved on to feature engineering, which was the most interesting part: I used K-Means clustering to build salary profiles and added polynomial features to capture non-linear patterns. I then compared three regression models before pivoting to a classification task in which I split salaries into three tiers (Low, Medium, High). Every step is backed by metrics and visualizations, and I saved the winning models as pickle files for future use.
2. Dataset
Source: Global AI & Data Jobs Salary Dataset (Kaggle / open-source).
Size: ~90,000 rows × 19 selected features (the original dataset had more columns, but I kept the most relevant ones).
Time Range: 2020–2026.
Target Variable: salary_usd (continuous, used for regression) / Salary Tier: Low / Medium / High (used for classification).
Selected Features
| Feature | Type | Description |
|---|---|---|
| salary_usd | Numeric (Target) | Annual salary in USD |
| experience_level | Categorical | Entry / Mid / Senior / Lead |
| experience_years | Numeric | Years of professional experience |
| education_required | Categorical | Minimum education level for the role |
| weekly_hours | Numeric | Average weekly working hours |
| promotion_speed | Numeric | Average months between promotions |
| bonus_usd | Numeric | Annual bonus in USD |
| job_role | Categorical | Job title (e.g., ML Engineer, Data Analyst) |
| ai_specialization | Categorical | Sub-field within AI/Data (e.g., NLP, Computer Vision) |
| work_mode | Categorical | Remote / Hybrid / Onsite |
| industry | Categorical | Business sector |
| company_size | Categorical | Small / Medium / Large / etc. |
| company_rating | Numeric | Employee rating of the company |
| hiring_difficulty_score | Numeric | How hard it is to fill the role |
| layoff_risk | Numeric | Estimated layoff risk score |
| employee_satisfaction | Numeric | Employee satisfaction score |
| country | Categorical | Country of employment (12 countries) |
| tax_rate_percent | Numeric | Local income tax rate |
| cost_of_living_index | Numeric | Local cost of living relative index |
Categorical Feature Cardinality
| Feature | Unique Values |
|---|---|
| experience_level | 4 (entry, mid, senior, lead) |
| education_required | 5 |
| job_role | 8 (top: NLP Engineer, Software Engineer AI, ...) |
| ai_specialization | 8 |
| work_mode | 3 (remote, hybrid, onsite) |
| industry | 10 |
| company_size | 5 |
| country | 12 (top: Canada, Australia, Singapore, ...) |
Exploratory Data Analysis (EDA)
Data Cleaning
Before doing any analysis, I needed to make sure the data was clean and consistent. Here's what I did step by step.
Step 1: Duplicate Removal
I checked for duplicate rows and removed any that existed. It turned out the dataset was already clean: zero duplicates were found.
Step 2: String Standardization
All categorical columns were lowercased and stripped of extra whitespace. This is important because things like "Remote" and "remote" would be treated as different categories by the model, which would be wrong.
Step 3: Salary Outlier Removal (IQR Method)
Extreme salary values were removed to make model training more stable. I used the standard IQR method: anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR was treated as an outlier and removed. In total, 1,007 salary outliers were removed, leaving 88,993 clean rows.
Step 4: Logical Consistency Checks
I also filtered out any rows where experience years were negative or weekly hours were zero, since those would clearly be data entry errors.
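The four steps above can be sketched as a single pandas function. This is a minimal illustration, not the notebook's code: the helper name `clean_salaries`, the `text_cols` argument, and the tiny made-up frame are assumptions, while the column names come from the feature table.

```python
import pandas as pd

def clean_salaries(df: pd.DataFrame, text_cols: list[str]) -> pd.DataFrame:
    """Hypothetical helper applying the four cleaning steps to a raw jobs table."""
    df = df.drop_duplicates().copy()

    # Lowercase and strip text columns so "Remote" and "remote " become one category.
    for col in text_cols:
        df[col] = df[col].str.strip().str.lower()

    # IQR fence on the target: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = df["salary_usd"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_fence = df["salary_usd"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Logical checks: no negative experience, no zero-hour weeks.
    plausible = (df["experience_years"] >= 0) & (df["weekly_hours"] > 0)
    return df[in_fence & plausible].reset_index(drop=True)

# Made-up rows for illustration only.
raw = pd.DataFrame({
    "work_mode": ["Remote", "remote ", "Onsite", "Hybrid", "hybrid", "Onsite", "Onsite"],
    "salary_usd": [60_000, 65_000, 70_000, 75_000, 80_000, 900_000, 72_000],
    "experience_years": [1, 2, 3, 4, 5, 6, -1],
    "weekly_hours": [40, 40, 45, 45, 50, 50, 40],
})
clean = clean_salaries(raw, text_cols=["work_mode"])
```

In this toy frame the $900,000 row falls outside the IQR fence and the row with negative experience fails the consistency check, so five rows remain.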
Descriptive Statistics
After cleaning, here are the key numerical statistics for the dataset:
| Metric | Value |
|---|---|
| Mean Salary | $95,013.69 |
| Median Salary | $86,928.00 |
| Mean–Median Gap | $8,086 (small; balanced distribution) |
| Avg. Experience Years | 6.9 years |
| 75th Percentile Experience | ≤ 11 years |
| Avg. Weekly Hours | 45.5 hrs/week |
| Avg. Promotion Speed | ~38 months (3.2 years) |
| Cost of Living Index Range | 0.5 – 2.5 (high global diversity) |
The small gap between mean and median salary tells us the distribution is fairly balanced: it has a slight right skew but nothing extreme. This suits linear regression, because the target variable does not need a log transformation.
Salary Distribution
Histogram of Salary Distribution
The histogram shows a moderately right-skewed distribution, where most salaries are clustered between $60,000 and $100,000. The mode (the most common salary) is around $75,000. The KDE (kernel density estimate) curve sits smoothly over the bars, confirming this is a well-behaved continuous distribution. The mean sits slightly to the right of the peak because the higher-end salaries pull the average up, but not by much, which is a good sign.
Research Questions & Visual Insights
Research Question 1: Does experience linearly drive salary?
Scatter Plot: Experience Years vs. Salary with Regression Line
This scatter plot shows a strong upward trend between years of experience and salary. Entry-level salaries (0–2 years) are tightly packed around $50k–$75k, while senior professionals (15+ years) show much more variation, which makes sense because at higher levels, things like specialization and industry start to matter more. The Pearson correlation is r = 0.71, which confirms a strong positive linear relationship.
Answer: Yes. Experience is the single strongest linear predictor of salary in this dataset. The relationship holds up consistently across the full career span, although the spread gets wider at the top end.
Research Question 2: Does the career stage create salary jumps?
Box Plot: Salary by Career Stage (entry → mid → senior → lead)
The box plot makes it very clear that each career stage comes with a meaningful salary increase. The median salary climbs at every step, and the largest jump in dollar terms happens at the Senior → Lead transition.
| Career Stage | Median Salary |
|---|---|
| Entry | $63,542 |
| Mid | $79,970 |
| Senior | $107,608 |
| Lead | $144,666 |
Answer: Yes, every transition brings a significant raise. The entry→mid jump is about $16k (+26%), mid→senior is about $28k (+35%), and the largest in dollar terms is senior→lead at approximately $37k (+34%). The IQR also widens at higher levels, which means pay becomes more variable the more senior you get.
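The jumps between stages follow directly from the median table above. As a quick check, this snippet recomputes the absolute and relative differences from those four medians; only the variable names are made up.

```python
# Median salaries per career stage, copied from the table above.
medians = {"entry": 63_542, "mid": 79_970, "senior": 107_608, "lead": 144_666}

stages = list(medians)
jumps = {}
for lower, upper in zip(stages, stages[1:]):
    dollars = medians[upper] - medians[lower]
    percent = (medians[upper] / medians[lower] - 1) * 100
    jumps[f"{lower}->{upper}"] = (dollars, percent)

for step, (dollars, percent) in jumps.items():
    print(f"{step}: +${dollars:,} ({percent:.1f}%)")
```

Senior→lead is the largest step in dollars, while mid→senior and senior→lead are nearly identical in relative terms.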
Research Question 3: Do more working hours mean higher pay?
Joint Hex Plot: Weekly Hours vs. Salary
This one genuinely surprised me. The joint plot shows a flat distribution: there is no visible trend between hours worked and salary, and the density looks the same no matter how many hours per week someone works. The correlation coefficient is r = 0.00.
Answer: Surprisingly, no. There is essentially zero linear correlation between weekly hours and salary. The AI/Data Science industry appears to be a results-oriented market: it rewards what you know and what level you're at, not how many hours you clock in. That's an interesting finding from a labor economics perspective.
Research Question 4: Does a higher cost of living mean higher pay?
Scatter Plot: Cost of Living Index vs. Salary with Trendline
Another surprising result. The trendline is essentially flat, and the scatter points are uniformly distributed across all cost-of-living values. The correlation is also r = 0.00.
Answer: No. The AI talent market appears to be globalized: salaries seem to be set by skill demand and seniority at an international level, not by what things cost locally. This means a data scientist in Brazil and one in Switzerland doing the same role can earn similar USD salaries. It's a really interesting finding that suggests AI skills are one of the few labor markets that have truly gone global.
Baseline Regression Model
Goal & Feature Selection
The goal of this section was to build a simple, interpretable starting point before adding more complexity. I picked the five features that seemed most intuitively important based on the EDA: years of experience, career level, job role, hours worked, and cost of living. After one-hot encoding the categorical columns, this gave me 13 total features for the baseline model.
Train-Test Split
I split the data into 80% training (71,194 samples) and 20% testing (17,799 samples). I used random_state=42 throughout the entire project to make sure every run gives the same results.
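A rough sketch of the baseline setup on synthetic stand-in data. The rows and the salary formula are made up, and `drop_first=True` is an assumption; it is consistent with the 13 features reported above (3 numeric + 3 level dummies + 7 role dummies), but the notebook may encode differently. The demo uses only three roles, so it ends up with 8 columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (not the real data).
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "experience_years": rng.integers(0, 20, n),
    "experience_level": rng.choice(["entry", "mid", "senior", "lead"], n),
    "job_role": rng.choice(["ml engineer", "data analyst", "research scientist"], n),
    "weekly_hours": rng.integers(35, 56, n),
    "cost_of_living_index": rng.uniform(0.5, 2.5, n),
})
demo["salary_usd"] = 50_000 + 4_700 * demo["experience_years"] + rng.normal(0, 5_000, n)

# One-hot encode the two categorical columns, dropping one level per column.
X = pd.get_dummies(
    demo.drop(columns="salary_usd"),
    columns=["experience_level", "job_role"],
    drop_first=True,
    dtype=float,
)
y = demo["salary_usd"]

# 80/20 split with the fixed seed used throughout the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
baseline = LinearRegression().fit(X_train, y_train)
```

Fixing `random_state` makes the split reproducible, so later models are compared on the same held-out rows.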
Baseline Evaluation Results
| Metric | Value | Interpretation |
|---|---|---|
| R² Score | 0.5780 | Explains 57.8% of salary variance |
| MAE | $20,009.46 | Average prediction error of ~$20k |
| RMSE | $27,243.05 | Larger errors exist for high-salary predictions |
Not bad for a starting point, but there's clearly a lot of room for improvement, especially for high earners.
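For reference, the three metrics in the table can be computed with scikit-learn as below. The four salary pairs are invented to show how one large miss on a high salary pushes RMSE well above MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values: three small errors and one large miss on the top salary.
y_true = np.array([60_000, 80_000, 100_000, 150_000])
y_pred = np.array([65_000, 78_000, 95_000, 120_000])

mae = mean_absolute_error(y_true, y_pred)            # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large errors more
r2 = r2_score(y_true, y_pred)                        # share of variance explained

print(f"MAE=${mae:,.0f}  RMSE=${rmse:,.0f}  R2={r2:.3f}")
```

A gap between RMSE and MAE like the one in the baseline table ($27k vs. $20k) is the usual sign that a minority of predictions are off by much more than the average.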
Visual: Actual vs. Predicted (Baseline)
Scatter Plot: Actual vs. Predicted Salary (Baseline)
The scatter cloud roughly follows the red diagonal line (which would represent a perfect model), which tells us the model is at least directionally correct. But the cloud widens noticeably above $125k, where the model consistently struggles with high earners. You can also see distinct "stripes" of points, which come from the categorical jumps introduced by one-hot encoding. This is expected at the baseline stage and will improve once we engineer better features.
Feature Importance: Baseline Coefficients
Horizontal Bar Chart: Baseline Model Coefficients
The bar chart shows how much each feature pushes the predicted salary up or down (in dollars). Some highlights:
| Feature | Coefficient | Interpretation |
|---|---|---|
| experience_years | +$4,719 | Each extra year adds ~$4.7k |
| job_role_data_analyst | −$26,854 | Data Analysts earn ~$27k below AI engineering roles |
| job_role_research_scientist | +$10,997 | Research Scientists earn a ~$11k premium |
| experience_level_lead | +$6,594 | Lead title adds ~$6.6k on top of experience |
| weekly_hours | ~$0 | Hours worked have no salary impact |
| cost_of_living_index | ~$0 | Local economics don't drive nominal salary |
This confirms what we saw in the EDA: hours and cost of living are essentially irrelevant to salary. The model is already picking that up on its own.
Feature Engineering & Clustering
This section was the biggest turning point of the whole project. By engineering smarter features without changing the algorithm at all, I was able to dramatically improve performance. Here's what I did.
K-Means Clustering (New Feature: salary_cluster)
I applied K-Means clustering with k=4 on three standardized features (experience years, salary, and cost of living index) to identify distinct "career archetypes" in the data. The idea is that knowing which cluster a person belongs to gives the model extra context about where they sit in the overall labor market.
| Cluster | Avg. Experience | Avg. Salary | Profile |
|---|---|---|---|
| 0 | 13.0 years | $134,357 | Senior High-Earner |
| 1 | 3.3 years | $71,150 | Junior Mid-Earner |
| 2 | 3.1 years | $70,157 | Junior Low-Earner |
| 3 | 13.3 years | $138,240 | Senior Top-Earner |
The algorithm naturally split the workforce into two main tiers: high-experience/high-pay (clusters 0 and 3) and low-experience/lower-pay (clusters 1 and 2). This new cluster label gives the regression model a kind of "career archetype anchor" that a raw experience count can't fully capture.
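A minimal sketch of the clustering step described above, run on synthetic data. The three inputs and k=4 come from the text; `n_init`, the seed, and the generated values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: a junior group and a senior group (not the real data).
rng = np.random.default_rng(42)
n = 400
experience = np.concatenate([rng.normal(3, 1, n // 2), rng.normal(13, 2, n // 2)])
salary = 55_000 + 6_000 * experience + rng.normal(0, 8_000, n)
cost_of_living = rng.uniform(0.5, 2.5, n)

# Standardize first so salary (tens of thousands) does not dominate the distances.
# Note: salary_usd is also the regression target, so the resulting label
# carries information derived from the target itself.
cluster_input = np.column_stack([experience, salary, cost_of_living])
cluster_scaler = StandardScaler().fit(cluster_input)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(cluster_scaler.transform(cluster_input))

salary_cluster = kmeans.labels_  # one integer label (0-3) per row
```

The label array can then be attached to the feature table as a new categorical column and one-hot encoded like the others.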
PCA Visualization of Clusters
PCA Scatter Plot: 2D Cluster Visualization
To check that the clusters actually make sense, I used PCA (Principal Component Analysis) to reduce the 3D clustering space down to 2 dimensions so we can visualize it. The four clusters show up as four clearly separated groups, which means the K-Means algorithm found real, meaningful structure in the data and not just arbitrary divisions. The horizontal axis (PC1) mainly captures the seniority/income dimension, while the vertical axis (PC2) captures differences in economic context.
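The projection itself is a couple of lines with scikit-learn. The sketch below uses random stand-in data with three columns; in the project the input would be the three standardized clustering features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the 3-feature clustering space (not the real data).
rng = np.random.default_rng(42)
features_3d = rng.normal(size=(300, 3))
features_3d[:, 1] += 2.0 * features_3d[:, 0]  # make two columns correlated

scaled = StandardScaler().fit_transform(features_3d)
pca = PCA(n_components=2).fit(scaled)
coords_2d = pca.transform(scaled)  # columns are PC1 and PC2 for the scatter plot

print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

Checking `explained_variance_ratio_` shows how much of the original structure survives the reduction to two axes.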
DBSCAN Outlier Detection
PCA Scatter Plot: DBSCAN Outlier Map
I also ran DBSCAN, a density-based clustering algorithm that can flag outliers as "noise" points, to double-check for unusual data points. The result: zero outliers detected. Every data point was assigned to a cluster, which suggests that, at the density settings used, the dataset is consistent and well-structured and that the patterns captured by our features hold across all samples.
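For readers unfamiliar with how DBSCAN reports outliers: noise points get the label `-1`. In this sketch the `eps` and `min_samples` values are illustrative assumptions (the source does not state the settings used), and the data is a synthetic blob with one deliberately planted stray point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: one dense blob plus a single far-away point.
rng = np.random.default_rng(1)
dense = rng.normal(0, 0.2, (200, 2))
stray = np.array([[8.0, 8.0]])
data = np.vstack([dense, stray])

# eps and min_samples are illustrative; results depend strongly on both.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(data)

n_noise = int((labels == -1).sum())  # DBSCAN marks noise with label -1
print(f"noise points: {n_noise}")
```

Because the noise count depends on `eps` and `min_samples`, a result of zero outliers is a statement about those settings as much as about the data.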
Distance to Centroid Feature
On top of the cluster label itself, I added another engineered feature that measures each data point's Euclidean distance to the center of its nearest cluster. This gives the model a continuous measure of how "typical" a salary profile is: someone sitting right in the middle of a cluster is very representative, while someone near the edge is more unusual. This kind of nuance is hard to capture any other way.
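One way to compute such a feature with scikit-learn is `KMeans.transform`, which returns the distance from every row to every centroid. This is a sketch on two synthetic blobs, not necessarily how the notebook does it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs stand in for the standardized clustering features.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

# transform() gives an (n_rows, n_clusters) matrix of Euclidean distances;
# the row-wise minimum is the distance to the nearest centroid.
dist_to_centroid = km.transform(points).min(axis=1)
```

Small values mark rows close to a cluster center, large values mark rows near the edge of their cluster.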
Polynomial Features
To capture the fact that salary growth accelerates with seniority (the relationship is not a straight line, so senior engineers earn disproportionately more than a linear trend would predict), I applied a degree-2 polynomial transformation to the experience and cost-of-living features. This turned the original two columns into five (the two originals plus two squared terms and one interaction term), the most important being experience_years², which captures the accelerating, quadratic return on seniority.
Encoding & Scaling
All categorical variables were one-hot encoded, and the entire enhanced feature set was standardized using StandardScaler before training. This is necessary because algorithms like K-Means and regularized models are sensitive to feature scale: without standardization, a feature measured in thousands (like salary) would dominate one measured in single digits (like a rating score).
Enhanced Model Performance (Linear Regression on Engineered Features)
| Metric | Baseline | Enhanced | Change |
|---|---|---|---|
| R² Score | 0.578 | 0.909 | +57.27% |
| MAE | $20,009 | $9,842 | −50.8% |
Just by engineering better features, without changing the algorithm at all, the MAE was cut roughly in half and R² jumped from 0.578 to 0.909. That's a massive improvement and shows why feature engineering is considered one of the most important skills in data science.
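The Change column reports relative change, not percentage points. A short check using the figures from the tables above (the helper name `pct_change` is made up):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

r2_gain = pct_change(0.5780, 0.909)      # baseline R2 -> enhanced R2
mae_change = pct_change(20_009, 9_842)   # baseline MAE -> enhanced MAE

print(f"R2: {r2_gain:+.2f}%   MAE: {mae_change:+.1f}%")
```

In absolute terms R² rose by about 0.33, which is the same result expressed as a difference rather than a ratio.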
Improved Regression Models & Winner
With the fully engineered feature set ready, I trained three different regression models and compared them head-to-head.
Models Trained
The three models I compared were: Linear Regression (re-trained on the enhanced features as a strong benchmark), Random Forest Regressor (an ensemble of decision trees), and Gradient Boosting Regressor (a sequential ensemble that corrects its own errors step by step).
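A head-to-head comparison like this can be run with one loop. The sketch uses synthetic data with a quadratic effect and default-ish hyperparameters, both of which are assumptions; it illustrates the procedure and does not reproduce the reported numbers.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a non-linear (squared) effect in the first feature.
rng = np.random.default_rng(42)
X = rng.uniform(0, 20, size=(600, 3))
y = 50_000 + 300 * X[:, 0] ** 2 + 1_000 * X[:, 1] + rng.normal(0, 3_000, 600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model on the same split and score it on the same held-out rows.
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {"mae": mean_absolute_error(y_te, pred), "r2": r2_score(y_te, pred)}

for name, scores in results.items():
    print(f"{name:>18}: MAE=${scores['mae']:,.0f}  R2={scores['r2']:.3f}")
```

Using one shared train/test split keeps the comparison fair, since every model sees identical training rows and is judged on identical test rows.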
Comparison Results
| Model | MAE | R² Score |
|---|---|---|
| Baseline Linear Regression (raw features) | $20,009 | 0.578 |
| Linear Regression (engineered features) | $9,842 | 0.909 |
| Gradient Boosting Regressor | $8,632 | 0.928 |
| Random Forest Regressor (winner) | $7,999 | 0.938 |
Total improvement over baseline: +62.29% in R²
Regression Winner: Random Forest Regressor
The Random Forest came out on top with the lowest MAE ($7,999) and highest R² (0.938). In other words, the model's salary predictions are off by about $8,000 on average, which is impressive given this is a global dataset with 12 different countries and enormous variation in role types.
It handled the high-dimensional one-hot encoded data (with many country and role dummy columns) without overfitting. It naturally captured non-linear thresholds, like the fact that the "lead" job title brings a sudden salary jump only after a certain experience level. It's also robust to the multicollinearity that inevitably exists between our polynomial features, which is something Linear Regression struggles with.
Random Forest Feature Importance
Horizontal Bar Chart: Top 15 Most Important Features (Random Forest)
The chart shows impurity-based importance scores for the top 15 features. The key finding is that the engineered experience_years² feature came out as the single most important variable in the entire model, beating even the original experience_years. This supports the hypothesis that salary growth is non-linear and accelerates with seniority.
| Rank | Feature | Importance | Interpretation |
|---|---|---|---|
| 1 | experience_years² | 0.27 | Non-linear seniority effect is #1 predictor |
| 2 | experience_years | 0.22 | Linear seniority also highly important |
| 3 | country_india | ~0.07 | India shows a distinct downward salary adjustment |
| 4 | country_brazil | ~0.06 | Brazil similarly adjusts salary down |
| 5 | bonus_usd | ~0.05 | Total compensation package matters |
| 6 | country_usa | ~0.04 | USA carries a premium above global baseline |
| 7–8 | salary_cluster_1/2 | ~0.03 each | Cluster features provide tier anchoring |
| 9 | employee_satisfaction | ~0.03 | Higher pay correlates with better satisfaction |
Regression-to-Classification
Class Creation Strategy
For the second half of the project, I reframed the salary prediction problem as a 3-class classification task. Instead of predicting an exact number, the model has to decide whether someone's salary falls in the Low, Medium, or High bracket.
To create these classes, I used quantile binning, which divides the salary distribution into three equally sized groups. Each class therefore contains roughly a third (~33.3%) of the data, which prevents the model from getting "lazy" and just predicting the most common class.
| Class | Range | Label |
|---|---|---|
| Class 0 | $28,000 – $71,688 | Low |
| Class 1 | $71,688 – $108,154 | Medium |
| Class 2 | $108,154 – $212,748 | High |
This framing also has real business value: it's often more useful to know "this person is a high earner" than to know an exact predicted salary that might be off by $8k anyway.
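Quantile binning of this kind is available in pandas as `pd.qcut`. The nine salaries below are made up; `retbins=True` also returns the bin edges, which is handy if the same cut points need to be reused on new data.

```python
import pandas as pd

# Nine made-up salaries, so each tercile gets exactly three values.
salaries = pd.Series(
    [30_000, 45_000, 60_000, 72_000, 85_000, 99_000, 110_000, 150_000, 210_000]
)

# q=3 splits at the 1/3 and 2/3 quantiles; labels name the bins low to high.
tier, edges = pd.qcut(salaries, q=3, labels=["Low", "Medium", "High"], retbins=True)
counts = tier.value_counts()

print(counts.to_dict())
print(edges)  # four edges define the three bins
```

Because the cut points are quantiles of the data itself, the classes come out balanced by construction.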
Class Balance Check
Count Plot: Distribution of Salary Classes
| Class | Count | Percentage |
|---|---|---|
| Low | 23,731 | 33.33% |
| Medium | 23,729 | 33.33% |
| High | 23,734 | 33.34% |
The three bars are almost identical in height, which is exactly what we want. With classes this closely balanced, accuracy is a reliable and honest metric, because there's no majority class that would artificially inflate it.
Classification Models & Winner
Precision vs. Recall: Why It Matters Here
Before jumping into results, I think it's worth explaining the model evaluation context. In this salary classification task, Precision matters more than Recall.
Here's my reasoning: if the model predicts someone falls in the "High" salary bracket, that person is going to make real career decisions based on that, such as setting salary expectations, negotiating offers, maybe even turning down jobs. A False Positive (predicting "High" when it's actually "Medium") creates unrealistic expectations that lead to failed negotiations and disappointment. On the employer side, this could lead to systematic overpayment when trying to match predictions.
A False Negative (predicting "Medium" when it's actually "High") is less damaging, since the person might be pleasantly surprised when reality exceeds the prediction. So across all models, I paid special attention to precision on the "High" class.
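To illustrate the distinction, the snippet below computes precision and recall for the "High" class only, on eight invented labels where the model over-predicts "High" once.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels: one Medium earner is wrongly predicted as High.
y_true = ["High", "High", "High", "Medium", "Medium", "Medium", "Low", "Low"]
y_pred = ["High", "High", "High", "High", "Medium", "Medium", "Low", "Low"]

# Restricting `labels` to one class returns that class's score.
high_precision = precision_score(y_true, y_pred, labels=["High"], average="macro")
high_recall = recall_score(y_true, y_pred, labels=["High"], average="macro")

print(f"High precision={high_precision:.2f}, recall={high_recall:.2f}")
```

Here every true high earner is found (recall 1.0), but one in four "High" predictions is wrong (precision 0.75), which is exactly the kind of error the reasoning above treats as costly.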
Three Classification Models
I trained three models: Logistic Regression (a linear multi-class baseline), an Optimized Random Forest with GridSearchCV hyperparameter tuning, and K-Nearest Neighbors (KNN).
Classification Reports
Model 1: Logistic Regression
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Low | ~0.91 | ~0.93 | ~0.92 |
| Medium | ~0.79 | ~0.77 | ~0.78 |
| High | ~0.92 | ~0.93 | ~0.92 |
| Overall Accuracy | ~0.87 | | |
Model 2: Optimized Random Forest
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Low | ~0.93 | ~0.95 | ~0.94 |
| Medium | ~0.81 | ~0.78 | ~0.80 |
| High | ~0.94 | ~0.94 | ~0.94 |
| Overall Accuracy | ~0.89 | | |
Model 3: K-Nearest Neighbors
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Low | ~0.88 | ~0.88 | ~0.88 |
| Medium | ~0.74 | ~0.73 | ~0.74 |
| High | ~0.87 | ~0.88 | ~0.87 |
| Overall Accuracy | ~0.83 | | |
Confusion Matrix Analysis
Confusion Matrix: Logistic Regression
Confusion Matrix: Optimized Random Forest
Confusion Matrix: K-Nearest Neighbors
The most interesting pattern across all three models is that the "Medium" salary class is the hardest to classify correctly. All three classifiers make most of their mistakes on that bracket. This makes sense: the Medium class ($71k–$108k) sits between two well-defined extremes, and the boundary between "is this person High or Medium?" is naturally more ambiguous where salary distributions overlap.
Importantly, no model ever confused Low with High directly; all misclassifications were to or from the adjacent Medium class. This means the models correctly understand the overall salary structure even when they make mistakes. The confusion matrices all show a dominant green diagonal (correct predictions), with the only off-diagonal values concentrated around the Medium row and column.
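The "adjacent errors only" pattern can be read off the corners of the confusion matrix. A small sketch with invented labels, ordered Low, Medium, High:

```python
from sklearn.metrics import confusion_matrix

labels = ["Low", "Medium", "High"]
# Invented predictions where every mistake lands in a neighbouring class.
y_true = ["Low", "Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High"]
y_pred = ["Low", "Low", "Medium", "Low", "Medium", "High", "High", "High", "Medium"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Top-right and bottom-left corners count direct Low<->High confusions.
low_high_confusions = int(cm[0, 2] + cm[2, 0])
print(cm)
print("Low/High confusions:", low_high_confusions)
```

Zeros in both corners mean all errors involve the Medium row or column, which is the pattern described above.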
Classification Winner: Optimized Random Forest
The GridSearchCV-tuned Random Forest was the clear winner with the best overall accuracy (~89%) and the best precision on the "High" class, which is the most critical metric for this use case. The best hyperparameters found by cross-validation were: max_depth=20, min_samples_split=5, n_estimators=100.
It outperformed Logistic Regression because salary classification involves non-linear interactions between features (like experience × job role × country) that a linear boundary can't represent well. It outperformed KNN because KNN suffers from the curse of dimensionality in our high-dimensional one-hot encoded feature space: as you add more dimensions, the concept of "nearest neighbor" breaks down.
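A sketch of how such a search can be set up. The grid below is an assumption built around the reported best values (the source lists only the winners, not the candidates), and the data is a synthetic 3-class problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem standing in for the salary tiers.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Hypothetical candidate grid; only the best values are given in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                 # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)
best_params = search.best_params_

print(best_params, round(search.best_score_, 3))
```

Since precision on the "High" class is the priority here, a custom scorer could be swapped in for `scoring="accuracy"`; the source does not say which scoring was used.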
Model Files
This repository contains the following files:
| File | Description |
|---|---|
| winning_salary_model.pkl | Regression model: Random Forest Regressor (R² = 0.938, MAE = $7,999) |
| salary_classifier.pkl | Classification model β Optimized Random Forest (Accuracy ~89%) |
| *.ipynb | Full Python notebook with all code, outputs, and visualizations |
To load and use the saved models, open the pickle files with Python's built-in pickle library. The regression model pickle contains the trained model, the fitted scaler, and the feature list all bundled together. Always preprocess new input data with the same scaler before making predictions; otherwise the model will produce garbage outputs because the feature scales will be wrong.
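The exact layout of the bundle is not documented here, so the sketch below assumes a dictionary with hypothetical keys `"model"`, `"scaler"`, and `"features"`; inspect the real object after loading and adjust. To stay self-contained, it builds and pickles a tiny stand-in bundle in memory instead of reading the actual file.

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Build a tiny stand-in bundle (hypothetical keys and toy data).
feature_list = ["experience_years", "weekly_hours"]
train = pd.DataFrame({"experience_years": [1, 5, 10, 15], "weekly_hours": [40, 45, 45, 50]})
scaler = StandardScaler().fit(train[feature_list])
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(
    scaler.transform(train[feature_list]), [60_000, 80_000, 110_000, 140_000]
)
blob = pickle.dumps({"model": model, "scaler": scaler, "features": feature_list})

# Load side. With the real file this would be:
#   with open("winning_salary_model.pkl", "rb") as f:
#       bundle = pickle.load(f)
bundle = pickle.loads(blob)

# Select columns in the stored order, scale with the stored scaler, then predict.
new_row = pd.DataFrame([{"weekly_hours": 45, "experience_years": 8}])
X_new = bundle["scaler"].transform(new_row[bundle["features"]])
prediction = bundle["model"].predict(X_new)
print(f"Predicted salary: ${prediction[0]:,.0f}")
```

Indexing the new row by the stored feature list guards against column-order mistakes. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.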
Key Takeaways & Lessons Learned
What Worked
Feature Engineering was the biggest unlock. I honestly didn't realize how much it would matter until I saw the results. Moving from raw features to engineered ones (polynomial experience_years², K-Means salary clusters, country encoding) improved the Linear Regression R² from 0.578 to 0.909. That's a 57% relative improvement, and I didn't change the algorithm at all. It really drove home the idea that data representation matters more than model choice in many cases.
The globalized AI market insight was genuinely surprising. Both weekly hours and cost-of-living index showed zero correlation with salary (r = 0.00 for both). This strongly suggests that the AI talent market operates on global skill demand rather than local economic conditions. I went in expecting cost of living to matter a lot, so this was a real "wait, what?" moment when I saw the scatter plot.
Clustering added structural context the model couldn't get any other way. The K-Means clusters gave the regression model a "career archetype" signal that a simple experience number couldn't fully capture: it distinguishes between a junior in a high-cost city and a senior in a low-cost region in a way that was actually meaningful to the model.
Ensemble models outperformed linear models because salary determination involves non-linear thresholds (a job title premium that only kicks in past a certain experience level), multiplicative interactions (country × seniority), and complex categorical combinations that a linear model cannot represent efficiently.
What Was Challenging
The "Medium" salary bracket was consistently the hardest class to predict in every single classification model. This is an inherent limitation of quantile binning: the middle class occupies a fuzzy zone between two well-defined extremes. There's no clean feature that says "this person is Medium and not High"; the signal-to-noise ratio at that boundary is just lower.
Choosing the right k for K-Means required actual justification rather than just trying values. I chose k=4 to reflect the four natural career stages (entry, mid, senior, lead) and validated it visually with PCA. But it was a judgment call that I had to think carefully about.
Iterative Process Summary
Starting from a baseline Linear Regression with R² of 0.578 and MAE of $20,009, adding Feature Engineering (clustering, polynomials, better encoding) brought the Enhanced Linear Regression up to R² of 0.909 and MAE of $9,842. Then switching to a Random Forest ensemble model pushed it further to the Final Regression Model at R² of 0.938 and MAE of $7,999, a total improvement of 62.3% in R² from start to finish.
On the classification side, the weakest model, KNN, achieved about 83% accuracy and the Logistic Regression baseline about 87%. After applying GridSearchCV hyperparameter tuning, the Final Optimized Random Forest Classification Model reached approximately 89% accuracy.
Dataset: Global AI & Data Jobs Salary Dataset (2020–2026)
Created by Maya Cheruty | Reichman University | 2026