GSK Copay Card Fraud Detection System: v4 Group-Aware
Product Focus: Trelegy Ellipta (configurable for Nucala / any GSK product)
Methodology: Hybrid Rules + Isolation Forest + SHAP Explainability + Hierarchical Summaries + Group-Aware Benefit Validation + EDA
Architecture: Drug-agnostic, ground-truth-optional, vendor-format-agnostic, production-ready
Analytical Levels: Transaction → Patient → HCP → Pharmacy
Group-Aware Validation: Group 8141 (Legacy) vs Group 8200 / 2025 benefit designs
Overview
This system detects fraudulent and suspicious copay card claims in GSK pharmaceutical transaction data using a 4-level hierarchical analytical framework:
| Level | What It Detects | Key Features |
|---|---|---|
| Transaction | Per-claim anomalies | Gap between fills, quantity, days supply, benefit amount, OOP cost, NDC switch |
| Patient | Behavioral patterns | One-and-done patients, active duration, avg gap between fills, short/long gap % |
| HCP | Prescriber-driven fraud | Suspicious specialty, one-and-done %, patient concentration, avg benefit per patient |
| Pharmacy | Pharmacy-centric rings | Active/closed flag, HCP concentration, one-and-done %, avg benefit, fraud risk score |
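As a rough illustration of the patient-level features in the table above, the sketch below derives fill gaps, active duration, and a one-and-done flag from (patient, fill date) pairs. The function name `patient_fill_features` and the output keys are assumptions made for this example, not the names used in feature_engineering_v2.py.

```python
from collections import defaultdict
from datetime import date


def patient_fill_features(claims):
    """claims: iterable of (patient_id, fill_date). Returns per-patient features."""
    fills = defaultdict(list)
    for patient_id, fill_date in claims:
        fills[patient_id].append(fill_date)

    features = {}
    for patient_id, dates in fills.items():
        dates.sort()
        # Gap in days between each pair of consecutive fills.
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        features[patient_id] = {
            "total_fills": len(dates),
            "one_and_done": len(dates) == 1,
            "active_duration_days": (dates[-1] - dates[0]).days,
            "avg_gap_days": sum(gaps) / len(gaps) if gaps else None,
        }
    return features


# Hypothetical claims, for illustration only.
claims = [
    ("P1", date(2025, 1, 1)),
    ("P1", date(2025, 1, 31)),
    ("P1", date(2025, 2, 10)),
    ("P2", date(2025, 3, 5)),
]
features = patient_fill_features(claims)
```

Aggregating the same per-patient values by prescriber or pharmacy would give the one-and-done percentages listed at the HCP and Pharmacy levels.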
Under the hood, the system combines:
- 35 hard-coded business rules (23 original + 12 v4 group-aware rules)
- Isolation Forest unsupervised anomaly detection trained on rule-clean data
- SHAP explainability for every flagged claim
- Hierarchical summary exports for investigative lens views
- Exploratory Data Analysis (EDA) module for pre-modeling data profiling
The pipeline supports multiple vendor formats: ELAAD, APLD, IQVIA, CMS DMR, and generic CSV. It auto-discovers column names via synonym mapping, handles missing columns gracefully, and produces a schema report showing what it found and what it missed.
Project Structure
gsk_copay_fraud/
├── config.py                        # Product config + column synonym mappings + FEATURE_DEPENDENCIES
├── data_ingestion.py                # Schema discovery + vendor-agnostic ingestion
├── feature_engineering_v2.py        # 60+ features with graceful degradation
├── fraud_detection_pipeline_v3.py   # Full pipeline (35 rules + IF + SHAP + hierarchical summaries)
├── eda.py                           # NEW: Exploratory Data Analysis module
├── run_all_v3.py                    # CLI runner (now supports --run-eda)
├── generate_elaad_test_data.py      # Synthetic ELAAD-style test data generator with embedded fraud
├── requirements.txt                 # Dependencies
├── README.md                        # This file
├── data/                            # Place raw GSK data here
│   └── elaad_test_trelegy.csv
└── results/                         # All outputs generated here
    ├── eda/                         # NEW: EDA outputs (JSON + PNG)
    │   ├── 00_data_quality.json
    │   ├── 01_temporal_analysis.png
    │   ├── 02_financial_analysis.png
    │   ├── 03_categorical_analysis.png
    │   ├── 04_geographic_analysis.png
    │   ├── 05_correlation_analysis.png
    │   ├── 06_hierarchical_analysis.png
    │   └── 07_feature_engineered_eda.png
    ├── investigation_queue_top500.csv
    ├── scored_claims_full.csv
    ├── transaction_level_summary.csv
    ├── hcp_level_summary.csv
    ├── pharmacy_level_summary.csv
    ├── patient_level_summary.csv
    ├── feature_importance_shap.csv
    ├── metrics.json
    ├── schema_report.json
    ├── 01_evaluation_metrics.png
    ├── 02_shap_summary.png
    ├── 03_shap_bar_importance.png
    ├── 04_rule_breakdown.png
    ├── 05_risk_tier_distribution.png
    ├── 06_hcp_summary.png
    ├── 07_pharmacy_summary.png
    └── 08_patient_summary.png
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Run EDA (Pre-Modeling Data Profiling)
# Standalone EDA
python eda.py --data-path data/elaad_test_trelegy.csv --file-type csv --vendor-format elaad_apld
# EDA + full fraud detection pipeline
python run_all_v3.py --data-path data/elaad_test_trelegy.csv --run-eda
# EDA + pipeline, skipping the feature-engineered EDA for speed
python run_all_v3.py --data-path data/elaad_test_trelegy.csv --run-eda --skip-post-engineering-eda
3. Run Fraud Detection Pipeline
# Auto-detect everything (recommended)
python run_all_v3.py \
--data-path data/vendor_file.txt.gz
# CSV file
python run_all_v3.py \
--data-path data/gsk_copay_transactions.csv \
--file-type csv
# Gzipped TXT (common GSK format)
python run_all_v3.py \
--data-path data/GSK_COPAY_TRANSACTION_DAILY_20250725.TXT.GZ \
--file-type txt.gz
# Force a specific vendor format (skips auto-discovery)
python run_all_v3.py \
--data-path data/vendor_file.csv \
--vendor-format gsk_iqvia \
--contamination 0.03
# Adjust anomaly rate for unusual datasets
python run_all_v3.py \
--data-path data/vendor_file.csv \
--contamination 0.05
4. Generate & Test on Synthetic ELAAD Data
# Generate test data
python generate_elaad_test_data.py
# Creates data/elaad_test_trelegy.csv with embedded fraud patterns
# Run pipeline on synthetic data (high contamination needed because ~40% fraud)
python run_all_v3.py \
--data-path data/elaad_test_trelegy.csv \
--file-type csv \
--contamination 0.40
EDA Module (eda.py)
The EDA module runs before the fraud detection pipeline to provide data quality assessment, trend analysis, and distribution profiling. It is critical for:
- Identifying data quality issues before modeling
- Understanding seasonal/temporal fraud patterns
- Validating financial distributions against expectations
- Detecting categorical imbalances (scenarios, specialties, groups)
- Finding geographic anomalies
- Establishing hierarchical relationship baselines
EDA Coverage
| Analysis | Output | Description |
|---|---|---|
| Data Quality | 00_data_quality.json | Missing values, cardinality, numeric ranges, outlier counts, cross-state summary |
| Temporal | 01_temporal_analysis.png + .json | Monthly claim volume, day-of-week, quarterly, monthly benefit trend |
| Financial | 02_financial_analysis.png | Benefit, copay, OOP, usual customary distributions + boxplots; benefit by scenario/group |
| Categorical | 03_categorical_analysis.png + .json | Top 15 distributions: specialty, scenario, NDC, pharmacy type, group, insurance |
| Geographic | 04_geographic_analysis.png + .json | Top 20 states, patient vs pharmacy state heatmap, cross-state claim count |
| Correlation | 05_correlation_analysis.png + .json | Feature correlation matrix (lower triangle), top 30 correlated pairs |
| Hierarchical | 06_hierarchical_analysis.png + .json | Claims per patient, patients per HCP, HCPs per pharmacy, pharmacies per patient |
| Feature-Engineered | 07_feature_engineered_eda.png | Cap utilization ratio, early refill flags, excess payment, scenario not covered |
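To make the Data Quality row more concrete, here is a minimal sketch of a per-column missing-value and cardinality profile using only the standard library. The function name `profile_columns`, the set of tokens treated as missing, and the output layout are assumptions for illustration; the actual layout of 00_data_quality.json is not specified here.

```python
import csv
import io


def profile_columns(rows, missing_tokens=("", "NA", "NULL")):
    """rows: list of dicts. Returns {column: {"missing": n, "cardinality": k}}."""
    report = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(column, {"missing": 0, "values": set()})
            text = (value or "").strip()
            if text in missing_tokens:
                stats["missing"] += 1
            else:
                stats["values"].add(text)
    return {
        column: {"missing": stats["missing"], "cardinality": len(stats["values"])}
        for column, stats in report.items()
    }


# Hypothetical three-row extract, for illustration only.
sample = io.StringIO("patient_id,pharmacy_state\nP1,TX\nP2,\nP1,CA\n")
quality = profile_columns(list(csv.DictReader(sample)))
```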
Running EDA
# Standalone
python eda.py --data-path data/elaad_test_trelegy.csv --file-type csv --vendor-format elaad_apld
# Options
python eda.py \
--data-path data/elaad_test_trelegy.csv \
--file-type csv \
--vendor-format elaad_apld \
--results-dir results \
--skip-post-engineering # Skip feature-engineered EDA for speed
Vendor Format Handling
The Problem
Vendor files rarely match the idealised spec. A column named IQVIA_PATIENT_ID in the spec might appear as:
- PATIENT_ID (ELAAD format)
- MEMBER_ID (APLD format)
- PATIENTID (no underscore)
- PAT ID (space instead of underscore)
- Patient ID (mixed case)
- PATIENT_KEY (different suffix)
The Solution: Schema Discovery
The pipeline uses COLUMN_SYNONYMS in config.py, a dictionary where each internal column name maps to a list of possible raw names (20+ synonyms per column). When a file is loaded:
- Scan header: collect all raw column names
- Normalize: uppercase, strip whitespace, collapse underscores and spaces into a single space
- Match: for each internal column, try synonyms in order of preference
- Report: log what was mapped and what was missing
- Continue: pipeline runs with whatever columns are available
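The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in data_ingestion.py: the two-entry `COLUMN_SYNONYMS` excerpt and the helper names `normalize` and `discover_schema` are hypothetical.

```python
import re

# Hypothetical excerpt; the full dictionary lives in config.py.
COLUMN_SYNONYMS = {
    "patient_id": ["IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID"],
    "reject_code": ["REJECT_CODE", "REJECTION_CODE", "DENIAL_CD"],
}


def normalize(name):
    """Uppercase, trim, and collapse underscores/whitespace into one space."""
    return re.sub(r"[_\s]+", " ", name.strip().upper())


def discover_schema(raw_columns, synonyms=COLUMN_SYNONYMS):
    """Return (mapping of internal name -> raw name, list of missing internal names)."""
    by_normalized = {}
    for raw in raw_columns:
        by_normalized.setdefault(normalize(raw), raw)

    mapping, missing = {}, []
    for internal, candidates in synonyms.items():
        # Synonyms are tried in list order, so earlier entries take preference.
        match = next(
            (by_normalized[normalize(c)] for c in candidates
             if normalize(c) in by_normalized),
            None,
        )
        if match is None:
            missing.append(internal)
        else:
            mapping[internal] = match
    return mapping, missing


mapping, missing = discover_schema(["Pat ID ", "CLAIM_NUMBER"])
```

Note that normalization alone does not cover a variant such as PATIENTID (no separator at all); under this sketch such a name would need its own synonym entry.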
Example Schema Report
2025-04-05 12:00:00 [INFO] Vendor format: detected=generic_csv, requested=auto
2025-04-05 12:00:00 [INFO] Schema discovery: mapped 38 / 50 internal columns
2025-04-05 12:00:00 [INFO] [pharmacy] 6/7 present. Missing: ['pharmacy_subcategory']
2025-04-05 12:00:00 [INFO] [reject] 1/3 present. Missing: ['reject_description', 'reject_type']
2025-04-05 12:00:00 [WARNING] Missing columns (12): ['hcp_id', 'record_type', 'other_coverage', ...]
2025-04-05 12:00:00 [INFO] Feature engineering on 85,432 claims...
2025-04-05 12:00:00 [WARNING] Skipping 'pharmacy_mail_order_pct' - missing mail_order
2025-04-05 12:00:00 [WARNING] Skipping 'prescriber_specialty_valid' - missing prescriber_specialty
Vendor Format Profiles
| Profile | Description |
|---|---|
| auto | Scan file header, detect best match (default) |
| gsk_iqvia | Expect IQVIA-style column names |
| generic_csv | Expect generic lower-case names |
| cms_dmr | CMS Drug Monitoring Report format |
| elaad_apld | ELAAD/APLD format with MEMBER_ID, HCP_ID, etc. |
| unknown | No preconceptions, rely fully on synonym matching |
Adding a New Vendor Synonym
No code changes needed. Edit config.py → COLUMN_SYNONYMS:
"patient_id": [
    "IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID",
    "NEW_VENDOR_PATIENT_ID",  # ← add your vendor's name here
],
Input Data Schema
The pipeline accepts any of the following formats and auto-discovers columns:
| Format | Extension | Auto-detect? |
|---|---|---|
| CSV | .csv | Yes |
| Tab-separated | .txt, .tsv | Yes (scans first line for \t) |
| Gzipped tab | .txt.gz, .csv.gz | Yes |
| ZIP archive | .zip | Yes (extracts first CSV/TXT inside) |
| Excel | .xlsx, .xls | Yes (requires openpyxl/xlrd) |
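The tab check in the table can be illustrated with a small delimiter sniffer that also handles gzipped files. The helper name `sniff_delimiter` and the comma fallback are assumptions for this sketch; the real detection logic in data_ingestion.py may differ.

```python
import gzip
import os
import tempfile


def sniff_delimiter(path):
    """Return a tab if the first line contains one, otherwise a comma."""
    opener = gzip.open if path.lower().endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        first_line = handle.readline()
    return "\t" if "\t" in first_line else ","


# Write two tiny hypothetical files and sniff each one.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "claims.txt.gz")
    with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
        handle.write("PATIENT_ID\tNDC\nP1\t00173089314\n")

    csv_path = os.path.join(tmp, "claims.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write("PATIENT_ID,NDC\nP1,00173089314\n")

    gz_delimiter = sniff_delimiter(gz_path)
    csv_delimiter = sniff_delimiter(csv_path)
```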
Key column groups the pipeline looks for:
- Patient: IQVIA_PATIENT_ID / PATIENT_ID / MEMBER_ID → patient_id
- HCP: IQVIA_PRESCRIBER_ID / HCP_ID / PRESCRIBER_ID / DOCTOR_ID → prescriber_npi
- Claim: CLAIM_NUMBER / CLAIM_NUM / CLAIMID → claim_number
- Drug: NDC / DRUG_NDC / NATIONAL_DRUG_CODE → drug_ndc
- Financial: COPAY_AFTER_BENEFIT / COPAY_AFTER / OOP_COST → copay_after
- Pharmacy: PHARMACY_NABP_NUMBER / PHARMACY_ID / STORE_ID → pharmacy_nabp
- Insurance: PRIMARY_PAYER_BIN / PAYER_BIN / BIN → primary_payer_bin
- Reject: REJECT_CODE / REJECTION_CODE / DENIAL_CD → reject_code
See config.py::COLUMN_SYNONYMS for the full list of 100+ synonyms.
35 Business Rules
Original 23 Rules (v3)
| # | Rule | Condition | Fraud Signal |
|---|---|---|---|
| 1 | Early Refill | days_between_fills < 23 | Early Refill Abuse |
| 2 | Impossible Qty | quantity != 1 | Data Error / Fraud |
| 3 | Wrong Days Supply | days_supply != 30 | Data Error / Fraud |
| 4 | Govt Insurance | insurance_type == Government | Program Violation |
| 5 | Underage | patient_age < 18 | Program Violation |
| 6 | Duplicate | Same patient + date + pharmacy | Duplicate Billing |
| 7 | NDC Switch | patient_ndc_count > 1 | Strength Switching |
| 8 | Suspicious Specialty | Prescriber not in valid list | Prescriber Collusion |
| 9 | Multi-Program | unique_programs_per_patient > 1 | Card Stacking |
| 10 | Excessive Fills (90d) | patient_fill_count_90d > 4 | Stockpiling |
| 11 | High-Risk Reject | reject_code in {76, 88, 79} | Maximizer / DUR |
| 12 | Maximizer Cap | maximizer_reject == 1 | Benefit Exhaustion |
| 13 | Paper Submission | paper_submission == 1 | Submission Fraud |
| 14 | Plan Switch | plan_switch_flag == 1 | Plan Switching |
| 15 | Linked Claim | has_linked_claim == 1 | Reversal / Adjustment |
| 16 | HCP High Benefit | hcp_avg_benefit_per_patient > 500 | Prescriber-driven extraction |
| 17 | HCP One-Done Concentration | hcp_one_and_done_pct > 0.6 | HCP with hit-and-run patients |
| 18 | Pharmacy Fraud Risk | pharmacy_fraud_risk_score > 0.6 | Composite pharmacy ring score |
| 19 | Pharmacy HCP Concentration | pharmacy_hcp_concentration > 0.5 | Single HCP dominates pharmacy |
| 20 | Pharmacy One-Done | pharmacy_one_and_done_pct > 0.5 | Pharmacy with churn-and-burn |
| 21 | Short Active Burst | patient_active_duration <= 14 AND total_fills > 1 | Quick-fire multi-fill scheme |
| 22 | Cross-State | patient_state != pharmacy_state | Out-of-state fraud |
| 23 | New Patient Burst | days_since_first <= 7 AND total_fills > 1 | Same-week multiple fills |
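A table like this maps naturally onto a small rule registry. The sketch below encodes rules 1, 10 and 22 and skips any rule whose source columns are absent from the file; the `RULES` structure and the `apply_rules` helper are assumptions for illustration and do not reproduce fraud_detection_pipeline_v3.py.

```python
# Each entry: rule name -> (required columns, predicate over one claim dict).
RULES = {
    "early_refill": (
        ["days_between_fills"],
        # A first fill has no previous fill, so its gap may be None.
        lambda c: c["days_between_fills"] is not None and c["days_between_fills"] < 23,
    ),
    "excessive_fills_90d": (
        ["patient_fill_count_90d"],
        lambda c: c["patient_fill_count_90d"] > 4,
    ),
    "cross_state": (
        ["patient_state", "pharmacy_state"],
        lambda c: c["patient_state"] != c["pharmacy_state"],
    ),
}


def apply_rules(claims, available_columns, rules=RULES):
    """Evaluate only the rules whose required columns were all mapped."""
    available = set(available_columns)
    active = {
        name: check for name, (needs, check) in rules.items() if set(needs) <= available
    }
    results = []
    for claim in claims:
        fired = [name for name, check in active.items() if check(claim)]
        results.append({"fired": fired, "rule_flag": int(bool(fired))})
    return results


# Hypothetical claims; patient_fill_count_90d is absent, so rule 10 is skipped.
claims = [
    {"days_between_fills": 10, "patient_state": "TX", "pharmacy_state": "TX"},
    {"days_between_fills": 30, "patient_state": "TX", "pharmacy_state": "FL"},
    {"days_between_fills": None, "patient_state": "CA", "pharmacy_state": "CA"},
]
results = apply_rules(claims, ["days_between_fills", "patient_state", "pharmacy_state"])
```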
v4 Group-Aware Rules (NEW)
| # | Rule | Condition | Fraud Signal |
|---|---|---|---|
| 24 | Scenario Not Covered | scenario_not_covered_flag == 1 | Cash/Rejected under Group 8200 |
| 25 | Benefit Cap Exceeded | excess_payment_amount > 0 | Benefit > allowed cap for group |
| 26 | Invalid Period Benefit | invalid_period_benefit_flag == 1 | $500 cap used outside Jan-Mar 2024 |
| 27 | Annual Fill Limit | annual_fill_count > 12 | Exceeds 12 fills/year |
| 28 | Annual DS Limit | annual_days_supply_count > 360 | Exceeds 360 days supply/year |
| 29 | Non-Covered NDC | non_covered_ndc_flag == 1 | NDC not in covered list |
| 30 | Govt With Benefit | govt_claim_with_benefit_flag == 1 | Govt plan receiving benefit |
| 31 | Quantity Out of Range | quantity < 1 or quantity > 3 | Impossible quantity |
| 32 | Days Supply Out of Range | days_supply < 1 or days_supply > 90 | Invalid supply duration |
| 33 | Max Benefit Repeat | Patient+pharmacy hits cap ≥3 times | Organized cap-maximization ring |
| 34 | High Cap Utilization | cap_utilization_ratio > 1.0 | Paid more than 100% of cap |
| 35 | Group Benefit Mismatch | group_benefit_mismatch_flag == 1 | Any group+scenario+period mismatch |
Rules whose source columns are missing are silently skipped (count = 0).
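The derived quantities behind rules 25, 27 and 34 can be sketched as below. The applicable cap depends on group, scenario and period, which this README does not fully specify, so the cap is passed in by the caller; the function names `cap_features` and `annual_fill_counts` and the example amounts are hypothetical.

```python
from collections import Counter


def cap_features(benefit_amount, cap_amount):
    """Compute the excess payment and cap utilization for one claim."""
    excess = max(0.0, benefit_amount - cap_amount)
    ratio = benefit_amount / cap_amount if cap_amount > 0 else None
    return {"excess_payment_amount": excess, "cap_utilization_ratio": ratio}


def annual_fill_counts(claims):
    """claims: iterable of (patient_id, year). Returns a Counter per (patient, year)."""
    return Counter(claims)


# Illustrative values only: a 550.00 benefit against an assumed 500.00 cap.
features = cap_features(benefit_amount=550.0, cap_amount=500.0)

# Patient P1 has 13 fills in 2025, which exceeds the 12 fills/year limit.
counts = annual_fill_counts([("P1", 2025)] * 13 + [("P2", 2025)] * 3)
over_limit = sorted(patient for (patient, _year), n in counts.items() if n > 12)
```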
Model: Isolation Forest
from sklearn.ensemble import IsolationForest

IsolationForest(
    n_estimators=200,
    contamination=0.03,  # Adjustable via CLI (use 0.40 for synthetic test data)
    max_samples="auto",
    max_features=1.0,
    bootstrap=False,
    random_state=42,
    n_jobs=-1,
)
Training strategy: Train ONLY on claims where rule_flag == 0 (rule-clean). Score ALL claims.
Degraded mode: If zero rule-clean claims exist (e.g., synthetic data with 40% fraud), the model trains on ALL claims with contamination=min(orig×3, 0.5). This is a safety fallback; real production data is expected to contain rule-clean claims.
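The choice between the normal and degraded training modes can be sketched in plain Python as follows. The helper name `choose_training_plan` and its return structure are assumptions for illustration; only the selection logic and the min(orig×3, 0.5) adjustment come from the description above.

```python
def choose_training_plan(rule_flags, contamination=0.03):
    """Pick training rows and contamination from per-claim rule flags (0 or 1)."""
    clean_indices = [i for i, flag in enumerate(rule_flags) if flag == 0]
    if clean_indices:
        # Normal mode: fit on rule-clean claims only.
        return {
            "train_indices": clean_indices,
            "contamination": contamination,
            "degraded": False,
        }
    # Degraded mode: no rule-clean claims, so fit on everything
    # with a raised (but capped) contamination.
    return {
        "train_indices": list(range(len(rule_flags))),
        "contamination": min(contamination * 3, 0.5),
        "degraded": True,
    }


normal = choose_training_plan([0, 1, 0, 0])
degraded = choose_training_plan([1, 1, 1], contamination=0.40)
```

In either mode the fitted model would then score every claim, not just the training subset. The 0.5 ceiling matches the largest contamination value scikit-learn's IsolationForest accepts.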
Priority score:
priority_score = 0.50 * if_anomaly_score + 0.30 * rule_severity + 0.20 * rule_flag
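As a worked instance of the formula, the sketch below assumes all three inputs are already scaled to the range 0 to 1 (the scaling step itself is not described here) and rejects values outside it. The function name and the validation are additions for illustration.

```python
def priority_score(if_anomaly_score, rule_severity, rule_flag):
    """Weighted blend of anomaly score, rule severity, and the binary rule flag."""
    for name, value in [
        ("if_anomaly_score", if_anomaly_score),
        ("rule_severity", rule_severity),
        ("rule_flag", rule_flag),
    ]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.50 * if_anomaly_score + 0.30 * rule_severity + 0.20 * rule_flag


# Hypothetical claim: strong anomaly, moderate rule severity, at least one rule fired.
score = priority_score(if_anomaly_score=0.9, rule_severity=0.5, rule_flag=1)
```

Because the weights sum to 1.0, the result stays within 0 to 1 whenever the inputs do.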
Risk Tiers:
| Tier | Score | Action |
|---|---|---|
| Low | 0.0–0.3 | No action |
| Medium | 0.3–0.6 | Monitor |
| High | 0.6–0.8 | Investigate (investigation queue) |
| Critical | 0.8–1.0 | Immediate investigation + audit trail |
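The tier table leaves open which tier a score exactly on a boundary (0.3, 0.6, 0.8) falls into. The sketch below assumes each boundary belongs to the higher tier; that choice, along with the names `TIER_BOUNDS`, `TIER_NAMES` and `risk_tier`, is an assumption for illustration.

```python
import bisect

TIER_BOUNDS = [0.3, 0.6, 0.8]  # lower edges of Medium, High, Critical
TIER_NAMES = ["Low", "Medium", "High", "Critical"]


def risk_tier(score):
    """Map a priority score in [0, 1] to its risk tier (boundaries go to the higher tier)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return TIER_NAMES[bisect.bisect_right(TIER_BOUNDS, score)]


tiers = [risk_tier(s) for s in (0.1, 0.3, 0.75, 0.8, 1.0)]
```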
Outputs
EDA Outputs
| File | Type | Description |
|---|---|---|
| eda/00_data_quality.json | JSON | Missing values, cardinality, numeric stats, outlier counts |
| eda/01_temporal_analysis.png | PNG | Monthly volume, day-of-week, quarterly, benefit trends |
| eda/01_temporal_analysis.json | JSON | Monthly counts + benefit totals |
| eda/02_financial_analysis.png | PNG | Distribution + boxplot for benefit, copay, OOP, usual customary |
| eda/02b_benefit_by_scenario.png | PNG | Benefit amount boxplot by claim scenario |
| eda/02c_benefit_by_group.png | PNG | Benefit amount boxplot by group |
| eda/03_categorical_analysis.png | PNG | Top 15 bar charts: specialty, scenario, NDC, pharmacy type, group, insurance |
| eda/03_categorical_analysis.json | JSON | Full top-20 category counts |
| eda/04_geographic_analysis.png | PNG | Top 20 state distributions + cross-state heatmap |
| eda/04_geographic_analysis.json | JSON | State counts + cross-state claim count |
| eda/05_correlation_analysis.png | PNG | Lower-triangle correlation heatmap |
| eda/05_correlation_analysis.json | JSON | Top 30 correlated feature pairs |
| eda/06_hierarchical_analysis.png | PNG | Claims/patient, patients/HCP, HCPs/pharmacy, pharmacies/patient |
| eda/06_hierarchical_analysis.json | JSON | Mean/median/max/std for each hierarchy |
| eda/07_feature_engineered_eda.png | PNG | Cap utilization, early refill, excess payment, scenario not covered |
Core Outputs
| File | Type | Description |
|---|---|---|
| investigation_queue_top500.csv | CSV | Top 500 highest-priority claims for manual review |
| scored_claims_full.csv | CSV | All claims with scores + risk tiers |
| feature_importance_shap.csv | CSV | SHAP ranking of features |
| metrics.json | JSON | All evaluation metrics + hierarchical summary counts |
| schema_report.json | JSON | Column mapping audit trail |
Hierarchical Summaries
| File | Type | Description |
|---|---|---|
| transaction_level_summary.csv | CSV | Per-claim analytical view |
| hcp_level_summary.csv | CSV | Per-HCP investigative lens |
| pharmacy_level_summary.csv | CSV | Per-pharmacy investigative lens |
| patient_level_summary.csv | CSV | Per-patient behavioral view |
Visualizations
| File | Type | Description |
|---|---|---|
| 01_evaluation_metrics.png | PNG | Score distribution, ROC, PR, tier counts |
| 02_shap_summary.png | PNG | SHAP beeswarm (top 20 features) |
| 03_shap_bar_importance.png | PNG | SHAP bar chart |
| 04_rule_breakdown.png | PNG | Flagged claims by each rule |
| 05_risk_tier_distribution.png | PNG | Tier counts + fraud rates |
| 06_hcp_summary.png | PNG | HCP risk score distribution + benefit vs risk scatter |
| 07_pharmacy_summary.png | PNG | Pharmacy risk score distribution + fraud risk vs priority |
| 08_patient_summary.png | PNG | Patient risk score distribution + one-and-done rate |
Model Artifacts
| File | Type | Description |
|---|---|---|
| model/isolation_forest_model.pkl | PKL | Trained model |
| model/scaler.pkl | PKL | StandardScaler |
| model/encoder.pkl | PKL | OrdinalEncoder |
| model/feature_names.pkl | PKL | Feature name list |
Configuration: Drug-Agnostic
Edit config.py to configure for any GSK product:
PRODUCT_CONFIG = {
    "product_name": "Trelegy Ellipta",
    "days_supply_expected": 30,
    "quantity_expected": 1,
    "ndc_list": {
        "00173089314": {"strength": "100/62.5/25", "indication": "COPD/Asthma"},
        "00173088714": {"strength": "200/62.5/25", "indication": "Asthma"},
    },
    "valid_prescriber_specialties": ["Pulmonology", "Allergy/Immunology", ...],
    "suspicious_prescriber_specialties": ["Dermatology", "Orthopedics", ...],
    "early_refill_threshold_days": 23,
    "max_fills_90d": 4,
    # HCP thresholds
    "hcp_high_benefit_threshold": 500.0,
    "hcp_patient_concentration_threshold": 0.6,
    # Pharmacy thresholds
    "pharmacy_high_benefit_threshold": 450.0,
    "pharmacy_hcp_concentration_threshold": 0.5,
    "pharmacy_one_done_threshold": 0.5,
    # Patient thresholds
    "patient_gap_short_threshold": 15,
    "patient_gap_long_threshold": 60,
    "patient_max_active_duration_days": 180,
    # Group-aware limits
    "annual_max_fills_per_patient": 12,
    "annual_max_days_supply_per_patient": 360,
    "quantity_min": 1,
    "quantity_max": 3,
    "days_supply_min": 1,
    "days_supply_max": 90,
    ...
}
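When adapting the configuration to another product, a quick consistency check can catch values that contradict each other. The sketch below is one possible check, not part of config.py: the helper `check_product_config` is hypothetical, and it runs against an excerpt of the keys shown above.

```python
def check_product_config(config):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Expected values should fall inside their allowed min/max range.
    for expected, low, high in [
        ("quantity_expected", "quantity_min", "quantity_max"),
        ("days_supply_expected", "days_supply_min", "days_supply_max"),
    ]:
        if not config[low] <= config[expected] <= config[high]:
            problems.append(f"{expected} is outside [{low}, {high}]")

    # Share-type thresholds are fractions and should lie between 0 and 1.
    for key in (
        "hcp_patient_concentration_threshold",
        "pharmacy_hcp_concentration_threshold",
        "pharmacy_one_done_threshold",
    ):
        if not 0.0 <= config[key] <= 1.0:
            problems.append(f"{key} must be a fraction between 0 and 1")

    return problems


# Excerpt of the Trelegy Ellipta values listed above.
config_excerpt = {
    "quantity_expected": 1, "quantity_min": 1, "quantity_max": 3,
    "days_supply_expected": 30, "days_supply_min": 1, "days_supply_max": 90,
    "hcp_patient_concentration_threshold": 0.6,
    "pharmacy_hcp_concentration_threshold": 0.5,
    "pharmacy_one_done_threshold": 0.5,
}
problems = check_product_config(config_excerpt)

# A deliberately inconsistent variant: expected days supply above the maximum.
bad_problems = check_product_config({**config_excerpt, "days_supply_expected": 120})
```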
Tech Stack
- Python 3.9+
- pandas ≥ 2.0.0, numpy ≥ 1.24.0, scikit-learn ≥ 1.3.0
- shap ≥ 0.42.0, matplotlib ≥ 3.7.0, seaborn ≥ 0.12.0, joblib ≥ 1.3.0, pyarrow ≥ 12.0.0
License
Proprietary. GSK internal use.
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = 'Harsh2396/gsk-copay-fraud-detection'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.