---
license: mit
language:
  - en
tags:
  - regression
  - classification
  - salary-prediction
  - stack-overflow
  - gradient-boosting
  - random-forest
  - logistic-regression
  - clustering
  - feature-engineering
  - tabular
datasets:
  - stackoverflow/stack-overflow-2023-developers-survey
metrics:
  - accuracy
  - f1
  - r2
library_name: sklearn
pipeline_tag: tabular-classification
---

# Stack Overflow Developer Salary Predictor

**Author:** Rotem Vahava  
**Dataset:** [Stack Overflow Developer Survey 2023](https://www.kaggle.com/datasets/stackoverflow/stack-overflow-2023-developers-survey)  
**Date:** April 2026

---

## Executive Summary

I built two machine learning models that predict developer salary from the Stack Overflow 2023 survey: a **regression model** that predicts the exact salary in dollars (R² = 0.545, MAE ≈ $30K), and a **classification model** that predicts which salary tier a developer falls into: Low, Mid, or High (accuracy 70.2%, F1-macro 0.70). Both winning models are Gradient Boosting, trained on 45,804 developers across 51 engineered features.

The most surprising finding was that all three classification algorithms (Logistic Regression, Random Forest, Gradient Boosting) converged to within 1.5% of each other, which is strong evidence that the salary signal in this dataset has a natural ceiling around 70% accuracy. The features that would push beyond this (specific company, exact role level, negotiation skill) simply aren't in the survey.

The biggest single driver of salary turned out to be **Country**, accounting for ~33% of the model's predictive power.

---

## Presentation Video

<video src="https://huggingface.co/rotemvahava/stackoverflow-salary-predictor/resolve/main/rotem_video_final.mp4" controls width="720"></video>

---

## Notebook

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Wm4u7864Gq6GQOiJLNCwqjJSvXtEqAPC)

The complete project notebook with all code, outputs, visualizations, and explanations is included in this repository. [Download the notebook](https://huggingface.co/rotemvahava/stackoverflow-salary-predictor/blob/main/Assignment_2_Rotem_Vahava.ipynb) to see every step of the analysis end-to-end.

## Project Overview

This project builds a complete end-to-end machine learning pipeline that predicts developer compensation using the Stack Overflow Developer Survey 2023, a dataset of ~89,000 developers worldwide with 84 raw features. From those, I selected the 16 features most relevant to salary prediction and ended up with 45,804 developers after cleaning.

The same dataset is used for two prediction tasks:
- **Regression** - predicting the exact annual salary in USD.
- **Classification** - predicting which salary tier (Low / Mid / High) a developer belongs to.

---

## Part 2: Exploratory Data Analysis

The EDA started with cleaning the data: imputing missing values (median for numeric, "Unknown" for categorical), removing extreme outliers, and visualizing distributions of key features.
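
The imputation rule described above can be sketched in a few lines of pandas. This is a minimal illustration on a toy frame, not the notebook's actual code; the two column names are placeholders chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are illustrative, not the survey's full schema.
df = pd.DataFrame({
    "YearsCodePro": [2.0, np.nan, 10.0, 6.0],
    "RemoteWork": ["Remote", None, "Hybrid", "In-person"],
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.columns.difference(numeric_cols)

# Median for numeric columns, a literal "Unknown" category for everything else.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df[categorical_cols] = df[categorical_cols].fillna("Unknown")

print(df)
```

Keeping "Unknown" as its own category lets a model learn from the fact that a question was skipped, rather than discarding the row.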

### Outlier removal

The raw salary column had extreme values that would have distorted any model: entries below $5K (probably typos or freelance side gigs) and above $500K (likely C-level executives or data entry errors). I restricted the salary range to $5K–$500K by dropping rows outside it, which removed the noise while keeping the meaningful tail of high earners.


![image](https://cdn-uploads.huggingface.co/production/uploads/69d8c4ebac66ae8128c346d8/kbq48nMVx4QsTC4ppcNIj.png)


After this filtering, I was left with 45,804 developers with reliable salary data.
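
As a rough sketch, the range filter amounts to a single boolean mask. The column name `Salary` is a placeholder, and treating both bounds as inclusive is an assumption; the notebook may handle the edges differently.

```python
import pandas as pd

SALARY_MIN, SALARY_MAX = 5_000, 500_000

# "Salary" is a placeholder name for the annual USD compensation column.
df = pd.DataFrame({"Salary": [1_200, 48_000, 95_000, 510_000, 500_000]})

# between() is inclusive on both ends by default (an assumption about the edges).
mask = df["Salary"].between(SALARY_MIN, SALARY_MAX)
filtered = df.loc[mask].reset_index(drop=True)

print(f"kept {len(filtered)} of {len(df)} rows")
```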

### Salary distribution

The target variable is heavily right-skewed: most developers earn between $30K and $100K, but a long tail extends to $500K. This skew motivated the log transform applied in Part 4.

![Salary Distribution](salary_distribution.png)

### Five research questions

The EDA was structured around five focused questions, each answered with a specific visualization.

---

### Q1 — Does formal education actually pay off?

I examined whether developers with advanced degrees (Master's, PhD) earn meaningfully more than those without formal education.

![Education vs Salary](eda_education_vs_salary.png)

**Finding:** Education has only a moderate effect on salary in the tech industry. While Master's and PhD holders show slightly higher medians, the spread within each education level is enormous. The boxplot reveals that self-taught developers can out-earn PhD holders, suggesting that formal education is a stepping stone, not a salary ceiling.

---

### Q2 — Is there a limit to how much experience pays off?

I plotted years of professional coding against salary with a LOWESS trendline to see if the relationship is linear or saturates at some point.

![Experience vs Salary](eda_experience_vs_salary.png)

**Finding:** The relationship is clearly non-linear. Salary grows steeply for the first 10–15 years of professional experience, then plateaus. After ~20 years, the median salary barely increases. This non-linearity motivated the inclusion of tree-based models (Random Forest, Gradient Boosting) in Part 5, since they can fit a plateau without hand-crafted polynomial terms.

---

### Q3 — Does remote work affect earning potential?

I compared salary distributions across three work arrangements: Remote, Hybrid, and In-person.

![Remote Work vs Salary](eda_remote_vs_salary.png)

**Finding:** Fully remote developers show the highest median salary, with Hybrid in the middle and In-person at the bottom. The gap is meaningful: remote workers earn roughly 20–30% more at the median. This likely reflects two effects: senior developers get more remote flexibility, and remote work allows access to higher-paying global markets.

---

### Q4 — Does age (and seniority) keep paying through retirement age?

I plotted median salary by age group to see whether earnings keep growing or plateau in later career stages.

![Age vs Salary](eda_age_vs_salary.png)

**Finding:** Salary grows steeply from "18–24" through "35–44" (the prime career-building years) and then plateaus. The "55–64" and "65+" groups do not show further increases, suggesting that seniority benefits cap once developers hit senior/staff levels.

---

### Q5 — Does starting to code early translate into higher pay later?

I compared total years of coding (including hobby) against years of professional coding to see if early starters earn more later.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d8c4ebac66ae8128c346d8/5VYs9N1hcpv6Df-6UE4Wi.png)

**Finding:** Professional years matter much more than total years. A developer who started coding as a teenager but has 5 years of professional experience earns roughly the same as someone who started coding professionally at age 30 with 5 years of experience. The "hobby head start" doesn't translate into a measurable salary advantage at the same level of professional tenure.

### Full feature correlation

The correlation heatmap below shows how all 16 features relate to each other and to salary.

![Correlation Heatmap](correlation_heatmap.png)

The "experience cluster" (Age, YearsCode, YearsCodePro, WorkExp) is heavily intercorrelated, which I addressed later through derived features and tree-based models that handle multicollinearity better than linear ones.

---

## Part 3: Baseline Linear Regression

The baseline used 13 features (3 numeric + 10 categorical), trained with default parameters and evaluated on a held-out 20% test set.

**Results:**

| Metric | Value |
| --- | --- |
| MAE | $32,583 |
| RMSE | $50,086 |
| R² (test) | 0.498 |
| R² (train) | 0.518 |
| Train-Test gap | 0.020 |

The small gap between train and test R² confirmed there was no overfitting — the model generalized well to unseen data.
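
A baseline of this shape can be sketched as a scikit-learn pipeline: one-hot encode the categorical columns, pass numeric columns through, and fit `LinearRegression` on an 80/20 split. The example below runs on synthetic data with two illustrative columns, so its numbers are unrelated to the table above, and the `random_state` values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the survey: one numeric and one categorical feature.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "YearsCodePro": rng.integers(0, 30, n),
    "Country": rng.choice(["USA", "Germany", "India"], n),
})
base = X["Country"].map({"USA": 110_000, "Germany": 70_000, "India": 25_000})
y = base + 2_000 * X["YearsCodePro"] + rng.normal(0, 10_000, n)

preprocess = ColumnTransformer([
    ("num", "passthrough", ["YearsCodePro"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
mae = mean_absolute_error(y_test, model.predict(X_test))
gap = r2_train - r2_test  # a small gap suggests little overfitting

print(f"R2 train={r2_train:.3f} test={r2_test:.3f} gap={gap:.3f} MAE=${mae:,.0f}")
```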

### Diagnostic plots

![Baseline Diagnostics](baseline_diagnostics.png)

Three views of model performance: Actual vs Predicted (deviations from the diagonal show the model under-predicts top earners), Residuals vs Predicted (a slight funnel shape suggests heteroscedasticity), and Distribution of Residuals (right-skewed tail confirms the model misses high salaries by a lot).

### Feature importance — coefficients

![Baseline Coefficients](baseline_coefficients.png)

Country (especially the USA), professional experience, and senior role indicators dominated the top of the ranking, confirming the EDA story.

---

## Part 4: Feature Engineering

Five engineering steps were applied to address the weaknesses observed in the baseline.

### Log transform of the target

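The skewed salary distribution was log-transformed to bring it closer to normal. This suits Linear Regression: its error assumptions hold better on a roughly symmetric target, and the largest salaries no longer dominate the squared-error loss.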

![Log Transform](log_transform.png)
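
To make the effect concrete, here is a small sketch using `np.log1p` and its inverse `np.expm1` (the inverse is the call used in the usage section of this README). The five salaries are made-up values, and `skewness` is a helper written for this example only.

```python
import numpy as np

# Made-up salaries with a long right tail.
salary = np.array([12_000.0, 45_000.0, 80_000.0, 150_000.0, 480_000.0])

log_salary = np.log1p(salary)     # compress the right tail
recovered = np.expm1(log_salary)  # exact inverse, back to dollars

def skewness(x):
    """Population skewness: mean of cubed z-scores (helper for this example)."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

print(f"skew before: {skewness(salary):.2f}, after: {skewness(log_salary):.2f}")
```

The skewness moves much closer to zero after the transform, and the round trip through `expm1` returns the original dollar values.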

### Multi-hot encoding for tech stack

The semicolon-separated columns `LanguageHaveWorkedWith` and `DatabaseHaveWorkedWith` were converted into 30 binary features (top 15 languages + top 15 databases).
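
One way to do this kind of encoding in pandas is `Series.str.get_dummies`, followed by keeping only the most frequent columns. The sketch below uses a four-row toy column and `TOP_K = 2` instead of 15; the `lang_` prefix follows the `lang_PHP` feature name mentioned later in this README, and everything else is an assumption about how the notebook might have done it.

```python
import pandas as pd

df = pd.DataFrame({
    "LanguageHaveWorkedWith": [
        "Python;SQL",
        "JavaScript;Python;SQL",
        "Rust;Python",
        None,  # respondent skipped the question
    ],
})

TOP_K = 2  # the README keeps the top 15 per column

# One 0/1 column per distinct language; missing answers become all-zero rows.
dummies = (
    df["LanguageHaveWorkedWith"].fillna("").str.get_dummies(sep=";").astype(int)
)

# Keep only the K most frequently used languages.
top = dummies.sum().sort_values(ascending=False).head(TOP_K).index
lang_features = dummies[top].add_prefix("lang_")

print(lang_features)
```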

### Derived numeric features

- `NumLanguages` and `NumDatabases` - count of technologies each developer uses.
- `ExperienceRatio` - proportion of total coding years spent professionally.
- `HobbyYears` - years coding before going professional.
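
A possible reading of these definitions, sketched on a toy frame: `ExperienceRatio` as `YearsCodePro / YearsCode` and `HobbyYears` as `YearsCode - YearsCodePro`. The exact formulas, the zero-division guard, and the clipping at zero are assumptions made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;SQL;Go", "PHP", None],
    "YearsCode": [10.0, 4.0, 0.0],
    "YearsCodePro": [6.0, 4.0, 0.0],
})

def count_items(col):
    """Number of non-empty entries in a semicolon-separated string column."""
    return col.fillna("").apply(lambda s: len([t for t in s.split(";") if t]))

df["NumLanguages"] = count_items(df["LanguageHaveWorkedWith"])

# Years of coding before the first professional year (never negative).
df["HobbyYears"] = (df["YearsCode"] - df["YearsCodePro"]).clip(lower=0)

# Share of coding years spent professionally; 0 when YearsCode is 0.
df["ExperienceRatio"] = (
    df["YearsCodePro"] / df["YearsCode"].replace(0, np.nan)
).fillna(0.0)

print(df[["NumLanguages", "HobbyYears", "ExperienceRatio"]])
```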

### K-Means clustering

K-Means was applied to the developer profile features (experience + tech versatility). The elbow method suggested K = 4 clusters.

![K-Means Elbow](kmeans_elbow.png)

The clusters were validated by visualizing them in 2D using PCA.

![PCA Clusters](pca_clusters.png)

The cluster assignment (`Cluster`) and the distance from each developer to their cluster's centroid (`DistToCentroid`) were added as new features.
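
The two cluster features can be derived from a fitted `KMeans` object roughly as follows. The random matrix stands in for the real profile columns, and standardizing before clustering is an assumption of this sketch rather than something the README states.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in for the experience and tech-versatility columns.
rng = np.random.default_rng(0)
profile = rng.normal(size=(200, 3))

# Scaling first is an assumption; K-Means is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(profile)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

# `Cluster`: index of the nearest centroid.
cluster = kmeans.predict(X_scaled)

# `DistToCentroid`: transform() returns the distance to every centroid,
# so the row-wise minimum is the distance to the assigned one.
dist_to_centroid = kmeans.transform(X_scaled).min(axis=1)

print(np.bincount(cluster), dist_to_centroid[:3].round(3))
```

When the clusters feed a supervised model, fitting `KMeans` on the training split only and calling `predict` / `transform` on the test split avoids leaking test information into the features.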

### Final engineered dataset

The original 16 columns grew into 51 informative features — 35 new features in total.

---

## Part 5: Three Improved Regression Models

Three different regression algorithms were trained and compared on the engineered dataset, with all metrics computed in dollars after reversing the log-transform.
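
Scoring in dollars after training on the log scale could look like the sketch below: fit on `log1p(salary)`, invert predictions with `expm1`, then compute the metrics. The data is synthetic and the model uses default hyperparameters, so the printed numbers are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic log-normal "salaries" driven by two of four features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
salary = np.exp(11 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.3, 500))

X_train, X_test, y_train, y_test = train_test_split(
    X, salary, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, np.log1p(y_train))       # train on the log scale

pred_usd = np.expm1(model.predict(X_test))  # back to dollars before scoring

mae = mean_absolute_error(y_test, pred_usd)
rmse = float(np.sqrt(mean_squared_error(y_test, pred_usd)))
r2 = r2_score(y_test, pred_usd)

print(f"MAE=${mae:,.0f}  RMSE=${rmse:,.0f}  R2={r2:.3f}")
```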

### Performance comparison

| Model | MAE | RMSE | R² |
| --- | --- | --- | --- |
| Linear Regression (Baseline) | $32,583 | $50,086 | 0.498 |
| Linear Regression (Engineered) | $29,925 | $48,432 | 0.530 |
| Random Forest | $29,786 | $48,840 | 0.523 |
| **Gradient Boosting (Winner)** | **$28,898** | **$47,638** | **0.546** |

![Regression Comparison](regression_comparison.png)

### Predictions vs reality

![Actual vs Predicted - All Models](regression_actual_vs_predicted.png)

All three models show similar patterns: predictions cluster well in the middle range but struggle to reach very high or very low salaries — the same data ceiling effect that limits accuracy.

### Feature importance

![Feature Importance - Tree Models](regression_feature_importance.png)

Both tree-based models agreed on the top drivers: Country (especially USA), professional experience, and the engineered cluster features.

---

## Part 6: Upload Best Regression Model

The Gradient Boosting Regressor pipeline was saved as `gradient_boosting_salary_regressor.pkl` and uploaded to this repository.

---

## Part 7: Regression to Classification

The continuous salary target was converted into 3 ordinal classes using **tertile binning** (33rd and 67th percentiles):

- **Low** — salaries below $57,249 (15,110 developers)
- **Mid** — salaries between $57,249 and $105,517 (15,575 developers)
- **High** — salaries above $105,579 (15,119 developers)
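
Tertile binning of this kind can be sketched with `Series.quantile` and `pd.cut`. The salaries below are synthetic, so the cut points differ from the ones listed above, and assigning values exactly on a boundary to the lower tier is an assumption of the example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed salaries.
rng = np.random.default_rng(0)
salary = pd.Series(np.exp(rng.normal(11.2, 0.6, 3000)))

# 33rd and 67th percentiles become the tier boundaries.
q33, q67 = salary.quantile([0.33, 0.67])

tier = pd.cut(
    salary,
    bins=[-np.inf, q33, q67, np.inf],
    labels=["Low", "Mid", "High"],
)
shares = tier.value_counts(normalize=True)

print(f"cut points: ${q33:,.0f} / ${q67:,.0f}")
print(shares.round(3))
```

Because the cuts sit at 0.33 and 0.67 rather than at exact thirds, the Mid tier ends up slightly larger than the other two, which matches the 33% / 34% / 33% split reported here.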

### Class balance

![Class Distribution](class_distribution.png)

The classes ended up nearly perfectly balanced (33% / 34% / 33%), which made accuracy a meaningful metric without needing rebalancing techniques.

---

## Part 8: Three Classification Models

### Performance comparison

| Model | Accuracy | F1-macro |
| --- | :-: | :-: |
| Logistic Regression | 70.11% | 0.7018 |
| Random Forest | 68.54% | 0.6886 |
| **Gradient Boosting (Winner)** | **70.18%** | **0.7034** |

### Confusion matrices

![Confusion Matrices](confusion_matrices.png)

A reassuring pattern emerges across all three models: most mistakes happen between adjacent classes (Low ↔ Mid or Mid ↔ High). The dangerous "extreme" misclassifications (Low ↔ High) only happen in ~3% of test predictions. This means the models inherently grasp the ordinal structure of salary tiers, even though they were never explicitly told the classes are ordered.
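
The adjacent versus extreme error split can be read straight off a 3×3 confusion matrix when the labels are passed in ordinal order. A small sketch with ten hand-written predictions (not the model's actual output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["Low", "Mid", "High"]  # ordinal order matters for the indexing below

y_true = np.array(["Low", "Low", "Mid", "Mid", "Mid",
                   "High", "High", "High", "Low", "High"])
y_pred = np.array(["Low", "Mid", "Mid", "High", "Mid",
                   "High", "Mid", "Low", "Low", "High"])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

total = cm.sum()
correct = np.trace(cm)
adjacent = cm[0, 1] + cm[1, 0] + cm[1, 2] + cm[2, 1]  # off by one tier
extreme = cm[0, 2] + cm[2, 0]                         # Low <-> High

print(f"correct={correct/total:.0%} adjacent={adjacent/total:.0%} "
      f"extreme={extreme/total:.0%}")
```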

### ROC curves

![ROC Curves](roc_curves.png)

Gradient Boosting wins consistently across all three classes — AUC = 0.909 for both Low and High, 0.796 for Mid. The pattern across all models is the same: Low and High have AUCs around 0.90 (clean extremes that are easy to identify), while Mid sits noticeably lower at ~0.79 (the in-between class without clean boundaries).
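
Per-class AUCs like these are typically computed one-vs-rest: binarize the true labels and score each class's probability column separately. The sketch below does this on a synthetic three-class problem; the dataset and hyperparameters are placeholders, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic three-class data standing in for the salary tiers.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # columns follow clf.classes_

# One binary indicator column per class, in the same order as clf.classes_.
y_bin = label_binarize(y_test, classes=clf.classes_)

auc_per_class = {
    int(c): roc_auc_score(y_bin[:, i], proba[:, i])
    for i, c in enumerate(clf.classes_)
}
print({k: round(v, 3) for k, v in auc_per_class.items()})
```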

### Precision-Recall curves

Precision-Recall curves complement the ROC analysis with another view on model quality, especially useful when looking at the tradeoff between catching actual positives (recall) and being right when predicting positives (precision).

![image](https://cdn-uploads.huggingface.co/production/uploads/69d8c4ebac66ae8128c346d8/JZ8pzJDRpAPeoU9QyCsnX.png)

The pattern matches what I saw in ROC: Gradient Boosting wins across all three classes — AP of 0.832 for Low, 0.635 for Mid, and 0.858 for High. Logistic Regression is right behind, and Random Forest comes in last but only by a small margin. The Mid class consistently has the lowest AP (~0.61–0.64) across all models, confirming the same pattern from ROC and the confusion matrices: Mid sits between Low and High without clean boundaries, so it's harder for any model to be both precise and complete about it. Even the worst Mid curve at AP = 0.61 is nearly twice the no-skill baseline of 0.33, confirming the models add real predictive value.
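
The 0.33 no-skill baseline follows from the class prevalence: a scorer that ranks examples at random achieves an average precision close to the share of positives. A quick simulation with synthetic scores (no real model involved) illustrates this:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 3000

# Binary "is this developer in the Mid tier?" target, about one third positive.
y_is_mid = (rng.random(n) < 1 / 3).astype(int)

random_scores = rng.random(n)                        # carries no information
informative_scores = 0.3 * y_is_mid + rng.random(n)  # weak but real signal

prevalence = y_is_mid.mean()
ap_random = average_precision_score(y_is_mid, random_scores)
ap_informative = average_precision_score(y_is_mid, informative_scores)

print(f"prevalence={prevalence:.3f} AP(random)={ap_random:.3f} "
      f"AP(informative)={ap_informative:.3f}")
```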

### Feature importance

![Classification Feature Importance](classification_feature_importance.png)

The top features for classification mirror the regression task — Country dominates, followed by `YearsCodePro` and the engineered clustering features. This consistency across both prediction tasks confirms the salary signal is robust: the same factors that determine the dollar amount also determine the salary tier.

---

## Key Insights

### Top salary drivers

1. **Country (especially USA)** - by far the biggest driver, ~33% of feature importance. Being in the US matters more than almost anything else.
2. **YearsCodePro** - the strongest numeric predictor, confirming the "experience pays" intuition.
3. **DistToCentroid** - the K-Means cluster feature made it into the top 6 features for Random Forest, validating that the unsupervised clustering work added real signal.
4. **Specific languages** - `lang_PHP` shows up as a negative predictor (PHP correlates with lower-paying roles); other languages like Rust appear as positive markers.

### Why ~70% accuracy is the realistic ceiling

All three classification algorithms (Logistic Regression, Random Forest, and Gradient Boosting) converged to within 1.5% of each other. This consistency strongly suggests a **data ceiling, not an algorithm ceiling**. The biggest predictors of salary are not in the survey: specific company name, exact role level (Junior/Senior/Staff), negotiation skill, and individual performance reviews.

With the available features, ~70% accuracy and R² ~0.55 are the realistic best — well above the 34% naive baseline (always-predict-majority) and well below what would be achievable if company-level signals were available.
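
The 34% figure can be reproduced from the class counts reported in Part 7 with a `DummyClassifier`. This sketch scores the baseline on the full label set rather than on the held-out test split, which is a simplification.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts taken from the tertile split reported in this README.
y = np.array(["Low"] * 15110 + ["Mid"] * 15575 + ["High"] * 15119)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class ("Mid").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, baseline.predict(X))

print(f"majority-class accuracy: {acc:.1%}")
```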

---

## Repository Contents

| File | Description |
| --- | --- |
| `gradient_boosting_salary_regressor.pkl` | Winning regression model — predicts dollar salary |
| `salary_class_classifier.pkl` | Winning classification model — predicts salary tier |
| `README.md` | This file |
| `*.png` | All visualizations referenced in this README |

---

## How to Use the Models

### Loading the regression model

```python
import pickle
import numpy as np
...

# Only unpickle files from a source you trust: pickle can run code on load.
with open("gradient_boosting_salary_regressor.pkl", "rb") as f:
    reg_model = pickle.load(f)

# X_new should be a DataFrame with the same columns as the training data
predicted_log_salary = reg_model.predict(X_new)
# The model predicts on the log scale; expm1 converts back to dollars
predicted_salary_usd = np.expm1(predicted_log_salary)

print(f"Predicted salary: ${predicted_salary_usd[0]:,.2f}")
```

### Loading the classification model

```python
import pickle
import pandas as pd

with open("salary_class_classifier.pkl", "rb") as f:
    clf_model = pickle.load(f)

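# X_new should be a DataFrame with the same columns as the training data
predicted_class = clf_model.predict(X_new)
# One probability column per entry in clf_model.classes_
predicted_proba = clf_model.predict_proba(X_new)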

print(f"Predicted tier: {predicted_class[0]}")
print(f"Class probabilities: {dict(zip(clf_model.classes_, predicted_proba[0]))}")
```

---

## Dataset

**Source:** [Stack Overflow Developer Survey 2023](https://www.kaggle.com/datasets/stackoverflow/stack-overflow-2023-developers-survey) on Kaggle  
**Original size:** ~89,000 respondents, 84 features in the raw dataset  
**Selected for analysis:** 16 features chosen for relevance to salary prediction  
**After cleaning:** 45,804 developers with valid salary data ($5K–$500K range)

---

## Tech Stack

```
pandas
numpy
scikit-learn
matplotlib
seaborn
```