Tabular Classification
Scikit-learn
English
regression
classification
salary-prediction
stack-overflow
gradient-boosting
random-forest
logistic-regression
clustering
feature-engineering
tabular
Instructions to use rotemvahava/stackoverflow-salary-predictor with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Scikit-learn
How to use rotemvahava/stackoverflow-salary-predictor with Scikit-learn:
    from huggingface_hub import hf_hub_download
    import joblib
    model = joblib.load(
        hf_hub_download("rotemvahava/stackoverflow-salary-predictor", "sklearn_model.joblib")
    )
    # only load pickle files from sources you trust
    # read more about it here https://skops.readthedocs.io/en/stable/persistence.html
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -71,7 +71,9 @@ The EDA started with cleaning the data: imputing missing values (median for nume
The raw salary column had extreme values that would have distorted any model: entries below $5K (probably typos or freelance side gigs) and above $500K (likely C-level executives or data entry errors). I restricted the data to salaries between $5K and $500K and dropped everything outside that range, which removed the noise while keeping the meaningful tail of high earners.
-
+
After this filtering, I was left with 45,804 developers with reliable salary data.
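The range filter described above can be sketched in plain Python. This is a minimal illustration, not the code used for the project: the record layout, the `salary` key, the inclusive bounds, and the choice to drop missing salaries are all assumptions. With pandas, the same idea would typically be written with `Series.between` on the salary column.

```python
# Bounds come from the text above; everything else here is hypothetical.
SALARY_MIN = 5_000
SALARY_MAX = 500_000


def filter_salary_range(rows, key="salary", low=SALARY_MIN, high=SALARY_MAX):
    """Keep rows whose salary lies in [low, high].

    Rows with a missing salary are dropped as well (an assumption).
    """
    kept = []
    for row in rows:
        value = row.get(key)
        if value is not None and low <= value <= high:
            kept.append(row)
    return kept


# Made-up records, only to show which rows survive the filter.
sample = [
    {"id": 1, "salary": 1_200},    # below the lower bound, dropped
    {"id": 2, "salary": 85_000},   # kept
    {"id": 3, "salary": 750_000},  # above the upper bound, dropped
    {"id": 4, "salary": None},     # missing, dropped
    {"id": 5, "salary": 500_000},  # exactly on the upper bound, kept
]
filtered = filter_salary_range(sample)
print([row["id"] for row in filtered])  # [2, 5]
```

Note that this drops rows rather than clipping values to the bounds, which is consistent with the row count shrinking after the step.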
@@ -290,7 +292,7 @@ Gradient Boosting wins consistently across all three classes — AUC = 0.909 for
Precision-Recall curves complement the ROC analysis with another view on model quality, especially useful when looking at the tradeoff between catching actual positives (recall) and being right when predicting positives (precision).
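-![image](https://cdn-uploads.huggingface.co/production/uploads/69de0ae4e8c5ad1ffbe1ea8d/Hs7K3Hn3Vk0mXXW8GDxic.png)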
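+![image](https://cdn-uploads.huggingface.co/production/uploads/69de0ae4e8c5ad1ffbe1ea8d/Wx1-I7zWM4yJsN6_QOhEk.png)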
The pattern matches what I saw in ROC: Gradient Boosting wins across all three classes, with AP of 0.832 for Low, 0.635 for Mid, and 0.858 for High. Logistic Regression is right behind, and Random Forest comes in last but only by a small margin. The Mid class consistently has the lowest AP (~0.61–0.64) across all models, confirming the same pattern from ROC and the confusion matrices: Mid sits between Low and High without clean boundaries, so it's harder for any model to achieve both high precision and high recall on it. Even the worst Mid curve at AP = 0.61 is nearly twice (about 1.85×) the no-skill baseline of 0.33, so the models add real predictive value.
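To make the AP numbers and the no-skill baseline concrete, here is a small dependency-free sketch of how average precision can be computed for one class treated one-vs-rest. It is an illustration under stated assumptions, not the evaluation code behind the figures above: the toy labels and scores are made up, ties between scores are assumed not to occur, and the 0.33 baseline is read as the positive-class prevalence when the three classes are roughly equal in size.

```python
def average_precision(y_true, scores):
    """Average precision for a binary (one-vs-rest) problem.

    Samples are ranked by descending score; AP is the mean of the
    precision values at the rank of each positive. Assumes no tied scores.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_pos


def no_skill_baseline(y_true):
    """AP expected from a random ranking: the positive-class prevalence."""
    return sum(y_true) / len(y_true)


# Toy data (hypothetical): 6 samples, 2 positives, so prevalence is 1/3.
y_true = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

ap = average_precision(y_true, scores)
baseline = no_skill_baseline(y_true)
print(round(ap, 3), round(baseline, 3))  # 0.833 0.333

# Ratio quoted in the text: worst Mid AP against the no-skill baseline.
print(round(0.61 / 0.33, 2))  # 1.85
```

In practice, scikit-learn's `average_precision_score` computes the same quantity and also handles tied scores.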