rotemvahava/stackoverflow-salary-predictor

@@ -71,7 +71,9 @@ The EDA started with cleaning the data: imputing missing values (median for nume
 The raw salary column had extreme values that would have distorted any model — entries below $5K (probably typos or freelance side gigs) and above $500K (likely C-level executives or data entry errors). I capped the salary range to $5K–$500K, which removed the noise while keeping the meaningful tail of high earners.
-![Outlier Cleaning](outliers_cleaning.png)
+![image](https://cdn-uploads.huggingface.co/production/uploads/69d8c4ebac66ae8128c346d8/kbq48nMVx4QsTC4ppcNIj.png)
 After this filtering, I was left with 45,804 developers with reliable salary data.
@@ -290,7 +292,7 @@ Gradient Boosting wins consistently across all three classes — AUC = 0.909 for
 Precision-Recall curves complement the ROC analysis with another view on model quality, especially useful when looking at the tradeoff between catching actual positives (recall) and being right when predicting positives (precision).
-![Precision-Recall Curves](precision_recall_curves.png)
+![image](https://cdn-uploads.huggingface.co/production/uploads/69d8c4ebac66ae8128c346d8/JZ8pzJDRpAPeoU9QyCsnX.png)
 The pattern matches what I saw in ROC: Gradient Boosting wins across all three classes — AP of 0.832 for Low, 0.635 for Mid, and 0.858 for High. Logistic Regression is right behind, and Random Forest comes in last but only by a small margin. The Mid class consistently has the lowest AP (~0.61–0.64) across all models, confirming the same pattern from ROC and the confusion matrices: Mid sits between Low and High without clean boundaries, so it's harder for any model to be both precise and complete about it. Even the worst Mid curve at AP = 0.61 is nearly twice the no-skill baseline of 0.33, confirming the models add real predictive value.
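
Note on the no-skill baseline quoted in the last context line: for a one-vs-rest Precision-Recall curve, a scorer with no information has an expected AP equal to the positive-class prevalence, which is where a figure like 0.33 would come from if the three salary classes are roughly balanced. The sketch below is a minimal illustration under that assumption, not the README's actual evaluation code. The function names `average_precision` and `no_skill_baseline` are hypothetical, and the labels and scores are invented toy values. It uses the step-wise definition of AP (the sum of recall increments weighted by precision), which is the same quantity that scikit-learn's `average_precision_score` reports.

```python
def average_precision(y_true, scores):
    """Step-wise AP: sum of (recall_k - recall_{k-1}) * precision_k over thresholds."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    ap, true_pos, prev_recall = 0.0, 0, 0.0
    i = 0
    while i < len(ranked):
        # Consume every example sharing this score so ties form one threshold.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_pos += ranked[j][1]
            j += 1
        precision = true_pos / j      # j examples are predicted positive so far
        recall = true_pos / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


def no_skill_baseline(y_true):
    """Positive-class prevalence, the AP an uninformative scorer converges to."""
    return sum(y_true) / len(y_true)


# Toy one-vs-rest labels for a single class (1 = in class); values are invented.
y_true = [1, 0, 1, 0, 0, 0]
informative_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
constant_scores = [0.5] * len(y_true)  # a scorer that cannot rank anything

ap_informative = average_precision(y_true, informative_scores)
ap_constant = average_precision(y_true, constant_scores)
baseline = no_skill_baseline(y_true)
print(f"AP={ap_informative:.3f}  constant-score AP={ap_constant:.3f}  baseline={baseline:.3f}")
```

With a constant score every example falls under a single threshold, so precision equals prevalence at recall 1 and the AP collapses to the baseline. Comparing a per-class AP against that class's own prevalence, rather than a shared 0.33, would be the safer check if the Low, Mid, and High groups turn out not to be equally sized.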