---
license: mit
language:
- en
tags:
- regression
- classification
- salary-prediction
- stack-overflow
- gradient-boosting
- random-forest
- logistic-regression
- clustering
- feature-engineering
- tabular
datasets:
- stackoverflow/stack-overflow-2023-developers-survey
metrics:
- accuracy
- f1
- r2
library_name: sklearn
pipeline_tag: tabular-classification
---
# Stack Overflow Developer Salary Predictor
**Author:** Rotem Vahava
**Dataset:** [Stack Overflow Developer Survey 2023](https://www.kaggle.com/datasets/stackoverflow/stack-overflow-2023-developers-survey)
**Date:** April 2026
---
## Executive Summary
I built two machine learning models that predict developer salary from the Stack Overflow 2023 survey: a **regression model** that predicts the exact salary in dollars (R² = 0.546, MAE ≈ $29K), and a **classification model** that predicts which salary tier a developer falls into (Low, Mid, or High; accuracy 70.2%, F1-macro 0.70). Both winning models are Gradient Boosting, trained on 45,804 developers across 51 engineered features.
The most surprising finding was that all three classification algorithms (Logistic Regression, Random Forest, Gradient Boosting) landed within about 1.6 percentage points of each other in accuracy, strong evidence that the salary signal in this dataset has a natural ceiling around 70% accuracy. The features that would push beyond this (specific company, exact role level, negotiation skill) simply aren't in the survey.
The biggest single driver of salary turned out to be **Country**, accounting for ~33% of the model's feature importance.
---
## Presentation Video
<video src="https://huggingface.co/rotemvahava/stackoverflow-salary-predictor/resolve/main/rotem_video_final.mp4" controls width="720"></video>
---
## Notebook
[Open in Colab](https://colab.research.google.com/drive/1Wm4u7864Gq6GQOiJLNCwqjJSvXtEqAPC)
The complete project notebook with all code, outputs, visualizations, and explanations is included in this repository. [Download the notebook](https://huggingface.co/rotemvahava/stackoverflow-salary-predictor/blob/main/Assignment_2_Rotem_Vahava.ipynb) to see every step of the analysis end-to-end.
## Project Overview
This project builds a complete end-to-end machine learning pipeline that predicts developer compensation using the Stack Overflow Developer Survey 2023, a dataset of ~89,000 developers worldwide with 84 raw features. From those, I selected 16 features most relevant to salary prediction and ended up with 45,804 developers after cleaning.
The same dataset is used for two prediction tasks:
- **Regression** - predicting the exact annual salary in USD.
- **Classification** - predicting which salary tier (Low / Mid / High) a developer belongs to.
---
## Part 2: Exploratory Data Analysis
The EDA started with cleaning the data: imputing missing values (median for numeric, "Unknown" for categorical), removing extreme outliers, and visualizing distributions of key features.
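A minimal sketch of the imputation rule described above (median for numeric columns, "Unknown" for categorical ones). The helper name `impute_missing` and the toy rows are illustrative assumptions, and `RemoteWork` is assumed to be the survey's work-arrangement column; the notebook may implement this step differently.

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Median for numeric columns, the literal "Unknown" for all other columns."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("Unknown")
    return out


# Toy rows: the values are made up for illustration only.
toy = pd.DataFrame({
    "YearsCodePro": [2.0, np.nan, 10.0],
    "RemoteWork": ["Remote", None, "Hybrid"],
})
imputed = impute_missing(toy)
print(imputed)
```

The input frame is copied rather than modified in place, so the raw data stays available for comparison.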
### Outlier removal
The raw salary column had extreme values that would have distorted any model: entries below $5K (probably typos or freelance side gigs) and above $500K (likely C-level executives or data entry errors). I restricted the salary range to $5K–$500K and dropped the rows outside it, which removed the noise while keeping the meaningful tail of high earners.

After this filtering, I was left with 45,804 developers with reliable salary data.
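The range filter can be sketched as follows. The column name `ConvertedCompYearly` and the inclusive bounds are assumptions on my part (this README does not state either), and the toy salaries are invented.

```python
import pandas as pd

SALARY_MIN_USD = 5_000
SALARY_MAX_USD = 500_000


def keep_plausible_salaries(df: pd.DataFrame, salary_col: str = "ConvertedCompYearly") -> pd.DataFrame:
    """Keep rows whose salary lies in [SALARY_MIN_USD, SALARY_MAX_USD]."""
    # Series.between is inclusive on both ends by default; NaN salaries are dropped.
    mask = df[salary_col].between(SALARY_MIN_USD, SALARY_MAX_USD)
    return df.loc[mask].reset_index(drop=True)


# Toy salaries: two fall outside the range and are removed.
toy_salaries = pd.DataFrame({"ConvertedCompYearly": [1_200, 5_000, 85_000, 500_000, 750_000]})
kept = keep_plausible_salaries(toy_salaries)
print(kept)
```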
### Salary distribution
The target variable is heavily right-skewed: most developers earn between $30K and $100K, but a long tail extends to $500K. This skew motivated the log transform applied in Part 4.

### Five research questions
The EDA was structured around five focused questions, each answered with a specific visualization.
---
### Q1: Does formal education actually pay off?
I examined whether developers with advanced degrees (Master's, PhD) earn meaningfully more than those without formal education.

**Finding:** Education has only a moderate effect on salary in the tech industry. While Master's and PhD holders show slightly higher medians, the spread within each education level is enormous. The boxplot reveals that self-taught developers can out-earn PhD holders, suggesting that formal education is a stepping stone but not a salary ceiling.
---
### Q2: Is there a limit to how much experience pays off?
I plotted years of professional coding against salary with a LOWESS trendline to see if the relationship is linear or saturates at some point.

**Finding:** The relationship is clearly non-linear. Salary grows steeply for the first 10–15 years of professional experience, then plateaus. After ~20 years, the median salary barely increases. This non-linearity motivated the inclusion of tree-based models (Random Forest, Gradient Boosting), which can capture non-linear relationships without manual polynomial terms, in Part 5.
---
### Q3: Does remote work affect earning potential?
I compared salary distributions across three work arrangements: Remote, Hybrid, and In-person.

**Finding:** Fully remote developers show the highest median salary, with Hybrid in the middle and In-person at the bottom. The gap is meaningful: remote workers earn roughly 20–30% more at the median. This likely reflects two effects: senior developers get more remote flexibility, and remote work allows access to higher-paying global markets.
---
### Q4: Does age (and seniority) keep paying through retirement age?
I plotted median salary by age group to see whether earnings keep growing or plateau in later career stages.

**Finding:** Salary grows steeply from "18–24" through "35–44", the prime career-building years, and then plateaus. The "55–64" and "65+" groups do not show further increases, suggesting that seniority benefits cap once developers hit senior/staff levels.
---
### Q5: Does starting to code early translate into higher pay later?
I compared total years of coding (including hobby) against years of professional coding to see if early starters earn more later.

**Finding:** Professional years matter much more than total years. A developer who started coding as a teenager but has 5 years of professional experience earns roughly the same as someone who started coding professionally at age 30 with 5 years of experience. The "hobby head start" doesn't translate into a measurable salary advantage at the same level of professional tenure.
### Full feature correlation
The correlation heatmap below shows how all 16 features relate to each other and to salary.

The "experience cluster" (Age, YearsCode, YearsCodePro, WorkExp) is heavily intercorrelated, which I addressed later through derived features and tree-based models that handle multicollinearity better than linear ones.
---
## Part 3: Baseline Linear Regression
The baseline used 13 features (3 numeric + 10 categorical), trained with default parameters and evaluated on a held-out 20% test set.
**Results:**
| Metric | Value |
| --- | --- |
| MAE | $32,583 |
| RMSE | $50,086 |
| R² (test) | 0.498 |
| R² (train) | 0.518 |
| Train-Test gap | 0.020 |
The small gap between train and test R² indicates little to no overfitting: the model generalized well to unseen data.
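The train/test gap check can be sketched with a small scikit-learn pipeline. Everything below is an illustration on synthetic data: the two columns, the salary formula, the one-hot encoding choice, and `random_state=42` are assumptions, not the notebook's actual setup. Only the 20% hold-out split and the default `LinearRegression` come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the survey: one numeric and one categorical feature.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "YearsCodePro": rng.integers(0, 30, n).astype(float),
    "Country": rng.choice(["USA", "Germany", "India"], n),
})
country_effect = X["Country"].map({"USA": 60_000, "Germany": 30_000, "India": 0})
y = 40_000 + 2_000 * X["YearsCodePro"] + country_effect + rng.normal(0, 10_000, n)

# One-hot encode the categorical column, pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"])],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

# Hold out 20% of the rows, as in the baseline described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
gap = r2_train - r2_test
print(f"R2 train={r2_train:.3f}  test={r2_test:.3f}  gap={gap:.3f}")
```

A gap close to zero is what the results table reports (0.020); a large positive gap would instead point to overfitting.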
### Diagnostic plots

Three views of model performance: Actual vs Predicted (deviations from the diagonal show the model under-predicts top earners), Residuals vs Predicted (a slight funnel shape suggests heteroscedasticity), and Distribution of Residuals (right-skewed tail confirms the model misses high salaries by a lot).
### Feature importance: coefficients

Country (especially the USA), professional experience, and senior role indicators dominated the top of the ranking, confirming the EDA story.
---
## Part 4: Feature Engineering
Five engineering steps were applied to address the weaknesses observed in the baseline.
### Log transform of the target
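The skewed salary distribution was log-transformed to make it closer to normal. This helps Linear Regression in particular, because a heavy right tail in the target tends to produce skewed, heteroscedastic residuals like those seen in the baseline diagnostics.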
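A small sketch of the transform and its inverse. The usage snippet later in this README converts predictions back with `np.expm1`, which suggests the forward transform was `np.log1p`; that pairing is assumed here, and the salaries are made-up examples within the $5K–$500K range.

```python
import numpy as np

# Example salaries spanning the cleaned range (illustrative values).
salaries = np.array([5_000.0, 60_000.0, 105_000.0, 500_000.0])

# Forward transform: log(1 + x) compresses the long right tail.
log_salaries = np.log1p(salaries)

# Inverse transform: exp(x) - 1 brings predictions back to dollars.
restored = np.expm1(log_salaries)

print(log_salaries.round(3))
print(restored.round(2))
```

On the log scale the 100x spread between the smallest and largest salary shrinks to a difference of under 5 units, which is why the tail stops dominating the loss.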

### Multi-hot encoding for tech stack
The semicolon-separated columns `LanguageHaveWorkedWith` and `DatabaseHaveWorkedWith` were converted into 30 binary features (top 15 languages + top 15 databases).
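A sketch of how a semicolon-separated column can be turned into top-k binary features. The helper `multi_hot` is hypothetical; the `lang_` prefix follows the `lang_PHP` feature name mentioned under Key Insights, and the toy rows are invented. The notebook may build these columns differently.

```python
import pandas as pd


def multi_hot(series: pd.Series, top_k: int, prefix: str) -> pd.DataFrame:
    """One 0/1 column per technology, limited to the top_k most frequent ones."""
    lists = series.fillna("").str.split(";")
    # Count how often each technology appears across all respondents.
    counts = lists.explode().value_counts().drop("", errors="ignore")
    top = counts.head(top_k).index
    data = {
        f"{prefix}_{name}": lists.apply(lambda items, n=name: int(n in items))
        for name in top
    }
    return pd.DataFrame(data, index=series.index)


# Toy column: Python appears 3 times, SQL twice, Rust and PHP once each.
langs = pd.Series(["Python;SQL", "Python;Rust", None, "PHP;Python;SQL"])
encoded = multi_hot(langs, top_k=2, prefix="lang")
print(encoded)
```

With `top_k=15` applied once per column, this would yield the 30 binary features described above.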
### Derived numeric features
- `NumLanguages` and `NumDatabases` - count of technologies each developer uses.
- `ExperienceRatio` - proportion of total coding years spent professionally.
- `HobbyYears` - years coding before going professional.
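The derived features listed above could be computed roughly like this. The exact formulas are my assumptions based on the one-line descriptions (for example, clipping `HobbyYears` at zero and treating a zero `YearsCode` as a ratio of 0); the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def count_items(cell) -> int:
    """Number of semicolon-separated entries; missing or empty cells count as 0."""
    if not isinstance(cell, str) or not cell:
        return 0
    return len(cell.split(";"))


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["NumLanguages"] = out["LanguageHaveWorkedWith"].apply(count_items)
    out["NumDatabases"] = out["DatabaseHaveWorkedWith"].apply(count_items)
    # Years of coding before the first professional year (assumed never negative).
    out["HobbyYears"] = (out["YearsCode"] - out["YearsCodePro"]).clip(lower=0)
    # Share of total coding years spent professionally; avoid division by zero.
    ratio = out["YearsCodePro"] / out["YearsCode"].replace(0, np.nan)
    out["ExperienceRatio"] = ratio.fillna(0.0)
    return out


# Toy rows with invented values.
toy_profile = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;SQL;Rust", None],
    "DatabaseHaveWorkedWith": ["PostgreSQL", None],
    "YearsCode": [10.0, 0.0],
    "YearsCodePro": [4.0, 0.0],
})
featured = add_derived_features(toy_profile)
print(featured[["NumLanguages", "NumDatabases", "HobbyYears", "ExperienceRatio"]])
```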
### K-Means clustering
K-Means was applied to the developer profile features (experience + tech versatility). The elbow method suggested K = 4 clusters.

The clusters were validated by visualizing them in 2D using PCA.

The cluster assignment (`Cluster`) and the distance from each developer to their cluster's centroid (`DistToCentroid`) were added as new features.
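A sketch of how the two cluster features can be derived with scikit-learn. K = 4 comes from the elbow analysis above; standardizing the inputs first, `n_init=10`, the seed, and the synthetic two-column profile are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def add_cluster_features(profile: np.ndarray, n_clusters: int = 4, seed: int = 42):
    """Return (cluster label, Euclidean distance to the assigned centroid) per row."""
    scaled = StandardScaler().fit_transform(profile)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(scaled)
    labels = kmeans.labels_
    # Distance from each point to the centroid of its own cluster, in scaled space.
    dist_to_centroid = np.linalg.norm(scaled - kmeans.cluster_centers_[labels], axis=1)
    return labels, dist_to_centroid


# Synthetic profile data: four loose groups in a 2-feature space.
rng = np.random.default_rng(0)
centers = np.array([[2, 3], [8, 10], [20, 4], [25, 12]])
profile = np.vstack([c + rng.normal(0, 1.0, size=(50, 2)) for c in centers])

labels, dist = add_cluster_features(profile)
print(np.bincount(labels), dist[:5].round(3))
```

`DistToCentroid` captures how typical a developer is for their cluster, which is information the label alone does not carry.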
### Final engineered dataset
The original 16 columns grew into 51 informative features, 35 new features in total.
---
## Part 5: Three Improved Regression Models
Three different regression algorithms were trained and compared on the engineered dataset, with all metrics computed in dollars after reversing the log-transform.
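Computing metrics in dollars means inverting the log transform before measuring error. A minimal sketch, assuming the `log1p`/`expm1` pairing used in the usage snippet of this README; the helper `dollar_metrics` and the three example salaries are illustrative.

```python
import numpy as np


def dollar_metrics(y_true_usd, y_pred_log) -> dict:
    """MAE, RMSE and R2 in dollar space, given predictions on the log1p scale."""
    y_true_usd = np.asarray(y_true_usd, dtype=float)
    y_pred_usd = np.expm1(np.asarray(y_pred_log, dtype=float))
    err = y_true_usd - y_pred_usd
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true_usd - y_true_usd.mean()) ** 2)
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
    }


# Made-up example: predictions are off by $10K on two of three developers.
y_true = np.array([50_000.0, 100_000.0, 150_000.0])
y_pred_log = np.log1p([60_000.0, 100_000.0, 140_000.0])
metrics = dollar_metrics(y_true, y_pred_log)
print(metrics)
```

Evaluating on the log scale instead would understate errors on high salaries, so the dollar-space numbers are the ones comparable with the baseline.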
### Performance comparison
| Model | MAE | RMSE | R² |
| --- | --- | --- | --- |
| Linear Regression (Baseline) | $32,583 | $50,086 | 0.498 |
| Linear Regression (Engineered) | $29,925 | $48,432 | 0.530 |
| Random Forest | $29,786 | $48,840 | 0.523 |
| **Gradient Boosting (Winner)** | **$28,898** | **$47,638** | **0.546** |

### Predictions vs reality

All three models show similar patterns: predictions cluster well in the middle range but struggle to reach very high or very low salaries, the same data ceiling effect that limits accuracy.
### Feature importance

Both tree-based models agreed on the top drivers: Country (especially USA), professional experience, and the engineered cluster features.
---
## Part 6: Upload Best Regression Model
The Gradient Boosting Regressor pipeline was saved as `gradient_boosting_salary_regressor.pkl` and uploaded to this repository.
---
## Part 7: Regression to Classification
The continuous salary target was converted into 3 ordinal classes using **tertile binning** (33rd and 67th percentiles):
- **Low** - salaries below $57,249 (15,110 developers)
- **Mid** - salaries between $57,249 and $105,517 (15,575 developers)
- **High** - salaries above $105,579 (15,119 developers)
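Tertile binning at the 33rd and 67th percentiles can be sketched with `pandas.qcut`. Whether the notebook used `qcut` or manual thresholds, and how it treated values exactly on a cut point, is not stated here, so treat this as one possible implementation on invented salaries.

```python
import pandas as pd


def to_tiers(salary: pd.Series) -> pd.Series:
    """Bin salaries into Low / Mid / High at the 33rd and 67th percentiles."""
    return pd.qcut(salary, q=[0, 0.33, 0.67, 1.0], labels=["Low", "Mid", "High"])


# Nine evenly spaced toy salaries, so each tier receives three of them.
toy_salary = pd.Series([10_000, 20_000, 30_000, 40_000, 50_000,
                        60_000, 70_000, 80_000, 90_000], dtype=float)
tiers = to_tiers(toy_salary)
cut_points = toy_salary.quantile([0.33, 0.67])

print(tiers.value_counts().sort_index())
print(cut_points)
```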
### Class balance

The classes ended up nearly perfectly balanced (33% / 34% / 33%), which made accuracy a meaningful metric without needing rebalancing techniques.
---
## Part 8: Three Classification Models
### Performance comparison
| Model | Accuracy | F1-macro |
| --- | :-: | :-: |
| Logistic Regression | 70.11% | 0.7018 |
| Random Forest | 68.54% | 0.6886 |
| **Gradient Boosting (Winner)** | **70.18%** | **0.7034** |
### Confusion matrices

A reassuring pattern emerges across all three models: most mistakes happen between adjacent classes (Low ↔ Mid or Mid ↔ High). The dangerous "extreme" misclassifications (Low ↔ High) only happen in ~3% of test predictions. This means the models inherently grasp the ordinal structure of salary tiers, even though they were never explicitly told the classes are ordered.
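The split between adjacent and extreme errors can be read directly off a confusion matrix whose rows and columns are ordered Low, Mid, High. The helper `error_breakdown` is hypothetical and the matrix below is invented for illustration; it is not the confusion matrix of any model in this repository.

```python
import numpy as np


def error_breakdown(cm) -> dict:
    """Share of predictions that are correct, off by one tier, or off by two tiers."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    true_idx, pred_idx = np.indices(cm.shape)
    gap = np.abs(true_idx - pred_idx)  # 0 = correct, 1 = adjacent, 2 = extreme
    return {
        "correct": cm[gap == 0].sum() / total,
        "adjacent": cm[gap == 1].sum() / total,
        "extreme": cm[gap >= 2].sum() / total,
    }


# Invented 3x3 matrix (rows = true tier, columns = predicted tier).
toy_cm = [
    [80, 18, 2],
    [15, 65, 20],
    [3, 17, 80],
]
breakdown = error_breakdown(toy_cm)
print({k: round(v, 4) for k, v in breakdown.items()})
```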
### ROC curves

Gradient Boosting wins consistently across all three classes: AUC = 0.909 for both Low and High, 0.796 for Mid. The pattern across all models is the same: Low and High have AUCs around 0.90 (clean extremes that are easy to identify), while Mid sits noticeably lower at ~0.79 (the in-between class without clean boundaries).
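Per-class AUCs like these are typically computed one-vs-rest: each class in turn is treated as the positive label and scored with its own probability column. The sketch below assumes that approach (the README does not spell it out) and uses six invented predictions, chosen so that only the Mid class is imperfectly ranked.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def per_class_auc(y_true, proba, classes) -> dict:
    """One-vs-rest ROC AUC for each class, using that class's probability column."""
    y_true = np.asarray(y_true)
    return {
        cls: float(roc_auc_score((y_true == cls).astype(int), proba[:, k]))
        for k, cls in enumerate(classes)
    }


classes = ["Low", "Mid", "High"]
y_true = ["Low", "Low", "Mid", "Mid", "High", "High"]
# Invented probabilities (columns ordered Low, Mid, High); each row sums to 1.
proba = np.array([
    [0.80, 0.15, 0.05],
    [0.50, 0.45, 0.05],
    [0.30, 0.40, 0.30],
    [0.20, 0.50, 0.30],
    [0.10, 0.30, 0.60],
    [0.05, 0.15, 0.80],
])
aucs = per_class_auc(y_true, proba, classes)
print(aucs)
```

With a fitted classifier, `proba` would come from `predict_proba` and the column order from `classes_`.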
### Precision-Recall curves
Precision-Recall curves complement the ROC analysis with another view on model quality, especially useful when looking at the tradeoff between catching actual positives (recall) and being right when predicting positives (precision).

The pattern matches what I saw in ROC: Gradient Boosting wins across all three classes, with AP of 0.832 for Low, 0.635 for Mid, and 0.858 for High. Logistic Regression is right behind, and Random Forest comes in last but only by a small margin. The Mid class consistently has the lowest AP (~0.61–0.64) across all models, confirming the same pattern from ROC and the confusion matrices: Mid sits between Low and High without clean boundaries, so it's harder for any model to be both precise and complete about it. Even the worst Mid curve at AP = 0.61 is nearly twice the no-skill baseline of 0.33, confirming the models add real predictive value.
### Feature importance

The top features for classification mirror the regression task: Country dominates, followed by `YearsCodePro` and the engineered clustering features. This consistency across both prediction tasks confirms the salary signal is robust: the same factors that determine the dollar amount also determine the salary tier.
---
## Key Insights
### Top salary drivers
1. **Country (especially USA)** - by far the biggest driver, ~33% of feature importance. Being in the US matters more than almost anything else.
2. **YearsCodePro** - the strongest numeric predictor, confirming the "experience pays" intuition.
3. **DistToCentroid** - the K-Means cluster feature made it into the top 6 features for Random Forest, validating that the unsupervised clustering work added real signal.
4. **Specific languages** - `lang_PHP` shows up as a negative predictor (PHP correlates with lower-paying roles); other languages like Rust appear as positive markers.
### Why ~70% accuracy is the realistic ceiling
All three classification algorithms (Logistic Regression, Random Forest, and Gradient Boosting) landed within about 1.6 percentage points of each other in accuracy (68.5% to 70.2%). This consistency strongly suggests a **data ceiling, not an algorithm ceiling**. The biggest predictors of salary are not in the survey: specific company name, exact role level (Junior/Senior/Staff), negotiation skill, and individual performance reviews.
With the available features, ~70% accuracy and R² ~0.55 are the realistic best: well above the 34% naive baseline (always-predict-majority) and likely well below what would be achievable if company-level signals were available.
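The 34% naive baseline follows from the class counts reported in Part 7. The short calculation below uses those full-dataset counts; the test split's proportions may differ slightly, so the result is an approximation of the baseline on held-out data.

```python
# Class counts as reported in Part 7 of this README.
counts = {"Low": 15_110, "Mid": 15_575, "High": 15_119}
total = sum(counts.values())

# Always predicting the largest class is right exactly as often as that class occurs.
majority_class = max(counts, key=counts.get)
majority_baseline = counts[majority_class] / total

# Class prevalence is also the no-skill reference level for average precision.
prevalence = {name: n / total for name, n in counts.items()}

print(f"{total=} {majority_class=} baseline={majority_baseline:.3f}")
print({name: round(p, 3) for name, p in prevalence.items()})
```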
---
## Repository Contents
| File | Description |
| --- | --- |
| `gradient_boosting_salary_regressor.pkl` | Winning regression model, predicts dollar salary |
| `salary_class_classifier.pkl` | Winning classification model, predicts salary tier |
| `README.md` | This file |
| `*.png` | All visualizations referenced in this README |
---
## How to Use the Models
### Loading the regression model
```python
import pickle
import numpy as np

# Only unpickle files from sources you trust.
with open("gradient_boosting_salary_regressor.pkl", "rb") as f:
    reg_model = pickle.load(f)

# X_new should be a DataFrame with the same columns as the training data
predicted_log_salary = reg_model.predict(X_new)
# The model predicts log-salary; expm1 converts it back to dollars
predicted_salary_usd = np.expm1(predicted_log_salary)
print(f"Predicted salary: ${predicted_salary_usd[0]:,.2f}")
```
### Loading the classification model
```python
import pickle
import pandas as pd  # X_new is expected to be a pandas DataFrame

with open("salary_class_classifier.pkl", "rb") as f:
    clf_model = pickle.load(f)

# X_new should be a DataFrame with the same columns as the training data
predicted_class = clf_model.predict(X_new)
predicted_proba = clf_model.predict_proba(X_new)
print(f"Predicted tier: {predicted_class[0]}")
print(f"Class probabilities: {dict(zip(clf_model.classes_, predicted_proba[0]))}")
```
---
## Dataset
**Source:** [Stack Overflow Developer Survey 2023](https://www.kaggle.com/datasets/stackoverflow/stack-overflow-2023-developers-survey) on Kaggle
**Original size:** ~89,000 respondents, 84 features in the raw dataset
**Selected for analysis:** 16 features chosen for relevance to salary prediction
**After cleaning:** 45,804 developers with valid salary data ($5K–$500K range)
---
## Tech Stack
```
pandas
numpy
scikit-learn
matplotlib
seaborn
```