GSK Copay Card Fraud Detection System — v4 Group-Aware

Product Focus: Trelegy Ellipta (configurable for Nucala / any GSK product)
Methodology: Hybrid Rules + Isolation Forest + SHAP Explainability + Hierarchical Summaries + Group-Aware Benefit Validation + EDA
Architecture: Drug-agnostic, ground-truth-optional, vendor-format-agnostic, production-ready
Analytical Levels: Transaction → Patient → HCP → Pharmacy
Group-Aware Validation: Group 8141 (Legacy) vs Group 8200 / 2025 benefit designs

Overview

This system detects fraudulent and suspicious copay card claims in GSK pharmaceutical transaction data using a 4-level hierarchical analytical framework:

Level	What It Detects	Key Features
Transaction	Per-claim anomalies	Gap between fills, quantity, days supply, benefit amount, OOP cost, NDC switch
Patient	Behavioral patterns	One-and-done patients, active duration, avg gap between fills, short/long gap %
HCP	Prescriber-driven fraud	Suspicious specialty, one-and-done %, patient concentration, avg benefit per patient
Pharmacy	Pharmacy-centric rings	Active/closed flag, HCP concentration, one-and-done %, avg benefit, fraud risk score
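
To make the roll-up from transactions to the higher levels concrete, the following is a minimal sketch in plain Python of how a one-and-done percentage per HCP could be derived from claim rows. The keys `patient_id` and `hcp_id` and the function name `hcp_one_and_done_pct` are illustrative assumptions, not the pipeline's actual implementation.

```python
from collections import Counter, defaultdict

def hcp_one_and_done_pct(claims):
    """Share of each HCP's patients who filled exactly once (illustrative).

    `claims` is an iterable of dicts with the hypothetical keys
    'patient_id' and 'hcp_id'; one dict per claim.
    """
    claims = list(claims)
    # Patient level: total fills per patient across all claims.
    fills_per_patient = Counter(c["patient_id"] for c in claims)
    # HCP level: the distinct patients seen by each prescriber.
    patients_per_hcp = defaultdict(set)
    for c in claims:
        patients_per_hcp[c["hcp_id"]].add(c["patient_id"])
    return {
        hcp: sum(fills_per_patient[p] == 1 for p in patients) / len(patients)
        for hcp, patients in patients_per_hcp.items()
    }

sample_claims = [
    {"patient_id": "P1", "hcp_id": "H1"},
    {"patient_id": "P1", "hcp_id": "H1"},
    {"patient_id": "P2", "hcp_id": "H1"},
    {"patient_id": "P3", "hcp_id": "H2"},
]
one_done = hcp_one_and_done_pct(sample_claims)
```

In this sample, H1 has one single-fill patient out of two and H2 has one out of one, so a threshold such as the 0.6 used by rule 17 would flag only H2.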

Under the hood, the system combines:

35 hard-coded business rules (23 original + 12 v4 group-aware rules)
Isolation Forest unsupervised anomaly detection trained on rule-clean data
SHAP explainability for every flagged claim
Hierarchical summary exports for investigative lens views
Exploratory Data Analysis (EDA) module for pre-modeling data profiling

The pipeline is vendor-format-agnostic, with built-in handling for ELAAD, APLD, IQVIA, CMS DMR, and generic CSV layouts. It auto-discovers column names via synonym mapping, degrades gracefully when columns are missing, and produces a schema report showing which columns it mapped and which it could not find.

Project Structure

gsk_copay_fraud/
├── config.py                          # Product config + column synonym mappings + FEATURE_DEPENDENCIES
├── data_ingestion.py                  # Schema discovery + vendor-agnostic ingestion
├── feature_engineering_v2.py          # 60+ features with graceful degradation
├── fraud_detection_pipeline_v3.py     # Full pipeline (35 rules + IF + SHAP + hierarchical summaries)
├── eda.py                             # NEW: Exploratory Data Analysis module
├── run_all_v3.py                      # CLI runner (now supports --run-eda)
├── generate_elaad_test_data.py        # Synthetic ELAAD-style test data generator with embedded fraud
├── requirements.txt                   # Dependencies
├── README.md                          # This file
├── data/                              # Place raw GSK data here
│   └── elaad_test_trelegy.csv
└── results/                           # All outputs generated here
    ├── eda/                           # NEW: EDA outputs (JSON + PNG)
    │   ├── 00_data_quality.json
    │   ├── 01_temporal_analysis.png
    │   ├── 02_financial_analysis.png
    │   ├── 03_categorical_analysis.png
    │   ├── 04_geographic_analysis.png
    │   ├── 05_correlation_analysis.png
    │   ├── 06_hierarchical_analysis.png
    │   └── 07_feature_engineered_eda.png
    ├── investigation_queue_top500.csv
    ├── scored_claims_full.csv
    ├── transaction_level_summary.csv
    ├── hcp_level_summary.csv
    ├── pharmacy_level_summary.csv
    ├── patient_level_summary.csv
    ├── feature_importance_shap.csv
    ├── metrics.json
    ├── schema_report.json
    ├── 01_evaluation_metrics.png
    ├── 02_shap_summary.png
    ├── 03_shap_bar_importance.png
    ├── 04_rule_breakdown.png
    ├── 05_risk_tier_distribution.png
    ├── 06_hcp_summary.png
    ├── 07_pharmacy_summary.png
    └── 08_patient_summary.png

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Run EDA (Pre-Modeling Data Profiling)

# Standalone EDA
python eda.py --data-path data/elaad_test_trelegy.csv --file-type csv --vendor-format elaad_apld

# EDA + full fraud detection pipeline
python run_all_v3.py --data-path data/elaad_test_trelegy.csv --run-eda

# EDA + pipeline, skipping the feature-engineered EDA for speed
python run_all_v3.py --data-path data/elaad_test_trelegy.csv --run-eda --skip-post-engineering-eda

3. Run Fraud Detection Pipeline

# Auto-detect everything (recommended)
python run_all_v3.py \
  --data-path data/vendor_file.txt.gz

# CSV file
python run_all_v3.py \
  --data-path data/gsk_copay_transactions.csv \
  --file-type csv

# Gzipped TXT (common GSK format)
python run_all_v3.py \
  --data-path data/GSK_COPAY_TRANSACTION_DAILY_20250725.TXT.GZ \
  --file-type txt.gz

# Force a specific vendor format (skips auto-discovery)
python run_all_v3.py \
  --data-path data/vendor_file.csv \
  --vendor-format gsk_iqvia \
  --contamination 0.03

# Adjust anomaly rate for unusual datasets
python run_all_v3.py \
  --data-path data/vendor_file.csv \
  --contamination 0.05

4. Generate & Test on Synthetic ELAAD Data

# Generate test data
python generate_elaad_test_data.py
# Creates data/elaad_test_trelegy.csv with embedded fraud patterns

# Run pipeline on synthetic data (high contamination needed because ~40% fraud)
python run_all_v3.py \
  --data-path data/elaad_test_trelegy.csv \
  --file-type csv \
  --contamination 0.40

EDA Module (eda.py)

The EDA module runs before the fraud detection pipeline to provide data quality assessment, trend analysis, and distribution profiling. It is critical for:

Identifying data quality issues before modeling
Understanding seasonal/temporal fraud patterns
Validating financial distributions against expectations
Detecting categorical imbalances (scenarios, specialties, groups)
Finding geographic anomalies
Establishing hierarchical relationship baselines

EDA Coverage

Analysis	Output	Description
Data Quality	`00_data_quality.json`	Missing values, cardinality, numeric ranges, outlier counts, cross-state summary
Temporal	`01_temporal_analysis.png + .json`	Monthly claim volume, day-of-week, quarterly, monthly benefit trend
Financial	`02_financial_analysis.png`	Benefit, copay, OOP, usual customary distributions + boxplots; benefit by scenario/group
Categorical	`03_categorical_analysis.png + .json`	Top 15 distributions: specialty, scenario, NDC, pharmacy type, group, insurance
Geographic	`04_geographic_analysis.png + .json`	Top 20 states, patient vs pharmacy state heatmap, cross-state claim count
Correlation	`05_correlation_analysis.png + .json`	Feature correlation matrix (lower triangle), top 30 correlated pairs
Hierarchical	`06_hierarchical_analysis.png + .json`	Claims per patient, patients per HCP, HCPs per pharmacy, pharmacies per patient
Feature-Engineered	`07_feature_engineered_eda.png`	Cap utilization ratio, early refill flags, excess payment, scenario not covered
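
As a rough illustration of the data quality step, the sketch below computes missing-value counts and cardinality per column using only the standard library. The function name `data_quality_profile`, the set of missing-value tokens, and the JSON shape are assumptions for illustration; the real `00_data_quality.json` may be structured differently.

```python
import json

def data_quality_profile(rows, missing_tokens=("", "NA", "NULL")):
    """Per-column missing count, missing %, and cardinality for dict rows."""
    columns = sorted({key for row in rows for key in row})
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # A value counts as present if it is not None and not a missing token.
        present = [v for v in values
                   if v is not None and str(v).strip() not in missing_tokens]
        n_missing = len(values) - len(present)
        profile[col] = {
            "missing": n_missing,
            "missing_pct": round(100 * n_missing / len(values), 2),
            "cardinality": len(set(present)),
        }
    return profile

rows = [
    {"patient_id": "P1", "patient_state": "NJ"},
    {"patient_id": "P2", "patient_state": ""},
    {"patient_id": "P1", "patient_state": None},
]
report = data_quality_profile(rows)
report_json = json.dumps(report, indent=2)  # assumed shape, for illustration only
```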

Running EDA

# Standalone
python eda.py --data-path data/elaad_test_trelegy.csv --file-type csv --vendor-format elaad_apld

# Options
python eda.py \
  --data-path data/elaad_test_trelegy.csv \
  --file-type csv \
  --vendor-format elaad_apld \
  --results-dir results \
  --skip-post-engineering  # Skip feature-engineered EDA for speed

Vendor Format Handling

The Problem

Vendor files rarely match the idealised spec. A column named IQVIA_PATIENT_ID in the spec might appear as:

PATIENT_ID (ELAAD format)
MEMBER_ID (APLD format)
PATIENTID (no underscore)
PAT ID (space instead of underscore)
Patient ID (mixed case)
PATIENT_KEY (different suffix)

The Solution: Schema Discovery

The pipeline uses COLUMN_SYNONYMS in config.py, a dictionary where each internal column name maps to an ordered list of possible raw names (20+ synonyms for some columns). When a file is loaded:

Scan header → collect all raw column names
Normalize → uppercase, strip leading/trailing whitespace, collapse underscores and runs of spaces into a single space
Match → for each internal column, try synonyms in order of preference
Report → log what was mapped and what was missing
Continue → pipeline runs with whatever columns are available
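
The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the `COLUMN_SYNONYMS` excerpt is abbreviated, and the helper names `normalize` and `discover_schema` are hypothetical rather than taken from data_ingestion.py.

```python
import re

# Abbreviated, hypothetical excerpt; the real dictionary lives in config.py.
COLUMN_SYNONYMS = {
    "patient_id": ["IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID"],
    "claim_number": ["CLAIM_NUMBER", "CLAIM_NUM", "CLAIMID"],
    "reject_code": ["REJECT_CODE", "REJECTION_CODE", "DENIAL_CD"],
}

def normalize(name):
    """Uppercase, trim, and collapse underscores/whitespace runs to one space."""
    return re.sub(r"[_\s]+", " ", name.strip().upper())

def discover_schema(raw_columns, synonyms=COLUMN_SYNONYMS):
    """Return (mapped, missing): internal name -> raw name, plus unmapped names."""
    by_normalized = {}
    for raw in raw_columns:
        by_normalized.setdefault(normalize(raw), raw)
    mapped, missing = {}, []
    for internal, candidates in synonyms.items():
        # Synonyms are tried in list order, so earlier entries win.
        match = next((by_normalized[normalize(c)] for c in candidates
                      if normalize(c) in by_normalized), None)
        if match is None:
            missing.append(internal)   # reported, but does not stop the run
        else:
            mapped[internal] = match
    return mapped, missing

mapped, missing = discover_schema(["Patient ID", " claim_num ", "FILL_DATE"])
```

Here `Patient ID` and ` claim_num ` both resolve through normalization, while `reject_code` ends up in `missing`. A variant such as `PATIENTID` would not match by normalization alone and would need its own synonym entry.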

Example Schema Report

2025-04-05 12:00:00 [INFO] Vendor format: detected=generic_csv, requested=auto
2025-04-05 12:00:00 [INFO] Schema discovery: mapped 38 / 50 internal columns
2025-04-05 12:00:00 [INFO] [pharmacy] 6/7 present. Missing: ['pharmacy_subcategory']
2025-04-05 12:00:00 [INFO] [reject] 1/3 present. Missing: ['reject_description', 'reject_type']
2025-04-05 12:00:00 [WARNING] Missing columns (12): ['hcp_id', 'record_type', 'other_coverage', ...]
2025-04-05 12:00:00 [INFO] Feature engineering on 85,432 claims...
2025-04-05 12:00:00 [WARNING] Skipping 'pharmacy_mail_order_pct' — missing mail_order
2025-04-05 12:00:00 [WARNING] Skipping 'prescriber_specialty_valid' — missing prescriber_specialty

Vendor Format Profiles

Profile	Description
`auto`	Scan file header, detect best match (default)
`gsk_iqvia`	Expect IQVIA-style column names
`generic_csv`	Expect generic lower-case names
`cms_dmr`	CMS Drug Monitoring Report format
`elaad_apld`	ELAAD/APLD format with MEMBER_ID, HCP_ID, etc.
`unknown`	No preconceptions, rely fully on synonym matching

Adding a New Vendor Synonym

No code changes needed. Edit config.py → COLUMN_SYNONYMS:

"patient_id": [
    "IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID",
    "NEW_VENDOR_PATIENT_ID",  # ← add your vendor's name here
],

Input Data Schema

The pipeline accepts any of the following formats and auto-discovers columns:

Format	Extension	Auto-detect?
CSV	`.csv`	Yes
Tab-separated	`.txt`, `.tsv`	Yes (scans first line for `\t`)
Gzipped tab	`.txt.gz`, `.csv.gz`	Yes
ZIP archive	`.zip`	Yes (extracts first CSV/TXT inside)
Excel	`.xlsx`, `.xls`	Yes (requires openpyxl/xlrd)
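
For the delimited-text formats in the table, the detection logic might look like the sketch below. It covers only plain and gzipped CSV/TSV (not ZIP or Excel), reads the whole file into memory for simplicity, and uses the hypothetical helper names `sniff_delimiter` and `read_rows`; the actual ingestion code may work differently.

```python
import csv
import gzip
import io

def sniff_delimiter(first_line):
    """Return a tab if the header line contains one, otherwise a comma."""
    return "\t" if "\t" in first_line else ","

def read_rows(path):
    """Read a plain or gzipped delimited text file into a list of dicts."""
    # Choose the opener from the file extension; both accept text mode.
    opener = gzip.open if str(path).lower().endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        text = handle.read()
    first_line = text.splitlines()[0] if text else ""
    reader = csv.DictReader(io.StringIO(text),
                            delimiter=sniff_delimiter(first_line))
    return list(reader)
```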

Key column groups the pipeline looks for:

Patient: IQVIA_PATIENT_ID / PATIENT_ID / MEMBER_ID → patient_id
HCP: IQVIA_PRESCRIBER_ID / HCP_ID / PRESCRIBER_ID / DOCTOR_ID → prescriber_npi
Claim: CLAIM_NUMBER / CLAIM_NUM / CLAIMID → claim_number
Drug: NDC / DRUG_NDC / NATIONAL_DRUG_CODE → drug_ndc
Financial: COPAY_AFTER_BENEFIT / COPAY_AFTER / OOP_COST → copay_after
Pharmacy: PHARMACY_NABP_NUMBER / PHARMACY_ID / STORE_ID → pharmacy_nabp
Insurance: PRIMARY_PAYER_BIN / PAYER_BIN / BIN → primary_payer_bin
Reject: REJECT_CODE / REJECTION_CODE / DENIAL_CD → reject_code

See config.py::COLUMN_SYNONYMS for the full list of 100+ synonyms.

35 Business Rules

Original 23 Rules (v3)

#	Rule	Condition	Fraud Signal
1	Early Refill	`days_between_fills < 23`	Early Refill Abuse
2	Impossible Qty	`quantity != 1`	Data Error / Fraud
3	Wrong Days Supply	`days_supply != 30`	Data Error / Fraud
4	Govt Insurance	`insurance_type == Government`	Program Violation
5	Underage	`patient_age < 18`	Program Violation
6	Duplicate	Same patient + date + pharmacy	Duplicate Billing
7	NDC Switch	`patient_ndc_count > 1`	Strength Switching
8	Suspicious Specialty	Prescriber not in valid list	Prescriber Collusion
9	Multi-Program	`unique_programs_per_patient > 1`	Card Stacking
10	Excessive Fills (90d)	`patient_fill_count_90d > 4`	Stockpiling
11	High-Risk Reject	`reject_code in {76, 88, 79}`	Maximizer / DUR
12	Maximizer Cap	`maximizer_reject == 1`	Benefit Exhaustion
13	Paper Submission	`paper_submission == 1`	Submission Fraud
14	Plan Switch	`plan_switch_flag == 1`	Plan Switching
15	Linked Claim	`has_linked_claim == 1`	Reversal / Adjustment
16	HCP High Benefit	`hcp_avg_benefit_per_patient > 500`	Prescriber-driven extraction
17	HCP One-Done Concentration	`hcp_one_and_done_pct > 0.6`	HCP with hit-and-run patients
18	Pharmacy Fraud Risk	`pharmacy_fraud_risk_score > 0.6`	Composite pharmacy ring score
19	Pharmacy HCP Concentration	`pharmacy_hcp_concentration > 0.5`	Single HCP dominates pharmacy
20	Pharmacy One-Done	`pharmacy_one_and_done_pct > 0.5`	Pharmacy with churn-and-burn
21	Short Active Burst	`patient_active_duration <= 14` AND `total_fills > 1`	Quick-fire multi-fill scheme
22	Cross-State	`patient_state != pharmacy_state`	Out-of-state fraud
23	New Patient Burst	`days_since_first <= 7` AND `total_fills > 1`	Same-week multiple fills

v4 Group-Aware Rules (NEW)

#	Rule	Condition	Fraud Signal
24	Scenario Not Covered	`scenario_not_covered_flag == 1`	Cash/Rejected under Group 8200
25	Benefit Cap Exceeded	`excess_payment_amount > 0`	Benefit > allowed cap for group
26	Invalid Period Benefit	`invalid_period_benefit_flag == 1`	$500 cap used outside Jan-Mar 2024
27	Annual Fill Limit	`annual_fill_count > 12`	Exceeds 12 fills/year
28	Annual DS Limit	`annual_days_supply_count > 360`	Exceeds 360 days supply/year
29	Non-Covered NDC	`non_covered_ndc_flag == 1`	NDC not in covered list
30	Govt With Benefit	`govt_claim_with_benefit_flag == 1`	Govt plan receiving benefit
31	Quantity Out of Range	`quantity < 1` or `quantity > 3`	Impossible quantity
32	Days Supply Out of Range	`days_supply < 1` or `days_supply > 90`	Invalid supply duration
33	Max Benefit Repeat	Patient+pharmacy hits cap ≥3 times	Organized cap-maximization ring
34	High Cap Utilization	`cap_utilization_ratio > 1.0`	Paid more than 100% of cap
35	Group Benefit Mismatch	`group_benefit_mismatch_flag == 1`	Any group+scenario+period mismatch
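
The two cap-related features behind rules 25 and 34 can be sketched as below. The definitions (ratio = benefit / cap, excess = benefit above cap) are a plausible reading of the feature names rather than confirmed formulas, and the cap value in the example is hypothetical; real caps depend on group, scenario, and period.

```python
def cap_features(benefit_amount, allowed_cap):
    """Return (cap_utilization_ratio, excess_payment_amount) for one claim."""
    if allowed_cap <= 0:
        raise ValueError("allowed_cap must be positive")
    ratio = benefit_amount / allowed_cap
    excess = max(0.0, benefit_amount - allowed_cap)
    return ratio, excess

# Hypothetical cap of 500.0 used purely for illustration.
ratio, excess = cap_features(benefit_amount=550.0, allowed_cap=500.0)
rule_25 = excess > 0     # Benefit Cap Exceeded
rule_34 = ratio > 1.0    # High Cap Utilization
```

Under these assumed definitions the two rules fire on exactly the same claims, so they would separate only if the pipeline computes the two features against different caps.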

Rules whose source columns are missing are silently skipped (count = 0).
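
One way to implement this skip-on-missing behaviour is to declare each rule's required columns next to its predicate, as in the sketch below. The `RULES` registry and `apply_rules` function are hypothetical, and the predicates assume non-null values.

```python
RULES = {
    # rule name: (required columns, predicate over one claim dict)
    "early_refill": (["days_between_fills"],
                     lambda c: c["days_between_fills"] < 23),
    "underage": (["patient_age"], lambda c: c["patient_age"] < 18),
    "cross_state": (["patient_state", "pharmacy_state"],
                    lambda c: c["patient_state"] != c["pharmacy_state"]),
}

def apply_rules(claims, available_columns, rules=RULES):
    """Count hits per rule; rules with missing source columns report 0."""
    available = set(available_columns)
    counts = {}
    for name, (required, predicate) in rules.items():
        if not set(required) <= available:
            counts[name] = 0  # skipped because a source column is absent
            continue
        counts[name] = sum(bool(predicate(c)) for c in claims)
    return counts

claims = [
    {"days_between_fills": 10, "patient_state": "NJ", "pharmacy_state": "NY"},
    {"days_between_fills": 30, "patient_state": "NJ", "pharmacy_state": "NJ"},
]
rule_counts = apply_rules(claims, available_columns=claims[0].keys())
```

Because `patient_age` is absent here, the underage rule reports a count of 0 without raising. A count of 0 is therefore ambiguous between "no hits" and "not evaluated", which is why the schema report is worth checking alongside the rule counts.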

Model: Isolation Forest

IsolationForest(
    n_estimators=200,
    contamination=0.03,        # Adjustable via CLI (use 0.40 for synthetic test data)
    max_samples="auto",
    max_features=1.0,
    bootstrap=False,
    random_state=42,
    n_jobs=-1,
)

Training strategy: Train ONLY on claims where rule_flag == 0 (rule-clean). Score ALL claims.

Degraded mode: If zero rule-clean claims exist (e.g., on the synthetic test data with ~40% fraud), the model trains on ALL claims with contamination=min(orig×3, 0.5), where orig is the configured contamination. This is a safety fallback; real production data is expected to contain rule-clean claims.
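
The selection logic for the training set can be sketched without any ML dependency. The function name `select_training_set` and the `rule_flag` key on each claim dict are assumptions for illustration; the 0.5 ceiling matches the upper bound scikit-learn accepts for a numeric `contamination`.

```python
def select_training_set(claims, contamination):
    """Pick training rows and contamination per the strategy described above.

    `claims` are dicts carrying a 'rule_flag' key (0 = rule-clean).
    Returns (training_rows, contamination_to_use, degraded_mode).
    """
    clean = [c for c in claims if c["rule_flag"] == 0]
    if clean:
        return clean, contamination, False
    # Degraded mode: no rule-clean claims, so train on everything.
    return list(claims), min(contamination * 3, 0.5), True

train, used_contamination, degraded = select_training_set(
    [{"rule_flag": 1}, {"rule_flag": 1}], contamination=0.40
)
```

With every claim flagged and a configured contamination of 0.40, the fallback trains on both rows and caps the contamination at 0.5.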

Priority score:

priority_score = 0.50 * if_anomaly_score + 0.30 * rule_severity + 0.20 * rule_flag
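
A small worked example of the formula follows. It assumes all three inputs are already scaled to [0, 1], which keeps the result in [0, 1] because the weights sum to 1; how the pipeline normalizes `if_anomaly_score` and `rule_severity` is not shown here.

```python
def priority_score(if_anomaly_score, rule_severity, rule_flag):
    """Weighted blend from the formula above; inputs assumed to lie in [0, 1]."""
    for name, value in (("if_anomaly_score", if_anomaly_score),
                        ("rule_severity", rule_severity),
                        ("rule_flag", rule_flag)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.50 * if_anomaly_score + 0.30 * rule_severity + 0.20 * rule_flag

# 0.50 * 0.8 + 0.30 * 0.5 + 0.20 * 1 = 0.40 + 0.15 + 0.20 = 0.75
score = priority_score(if_anomaly_score=0.8, rule_severity=0.5, rule_flag=1)
```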

Risk Tiers:

Tier	Score	Action
Low	0.0–0.3	No action
Medium	0.3–0.6	Monitor
High	0.6–0.8	Investigate (investigation queue)
Critical	0.8–1.0	Immediate investigation + audit trail
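
The tier boundaries in the table overlap at 0.3, 0.6, and 0.8, so any implementation has to pick a side. The sketch below assumes each boundary value belongs to the higher tier; this is an assumption, not confirmed behaviour of the pipeline, and `risk_tier` is a hypothetical name.

```python
def risk_tier(score):
    """Map a priority score in [0, 1] to a tier name.

    Assumption: each boundary value belongs to the higher tier,
    since the table does not say which side is inclusive.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score >= 0.8:
        return "Critical"
    if score >= 0.6:
        return "High"
    if score >= 0.3:
        return "Medium"
    return "Low"

tiers = [risk_tier(s) for s in (0.1, 0.3, 0.75, 0.8)]
```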

Outputs

EDA Outputs

File	Type	Description
`eda/00_data_quality.json`	JSON	Missing values, cardinality, numeric stats, outlier counts
`eda/01_temporal_analysis.png`	PNG	Monthly volume, day-of-week, quarterly, benefit trends
`eda/01_temporal_analysis.json`	JSON	Monthly counts + benefit totals
`eda/02_financial_analysis.png`	PNG	Distribution + boxplot for benefit, copay, OOP, usual customary
`eda/02b_benefit_by_scenario.png`	PNG	Benefit amount boxplot by claim scenario
`eda/02c_benefit_by_group.png`	PNG	Benefit amount boxplot by group
`eda/03_categorical_analysis.png`	PNG	Top 15 bar charts: specialty, scenario, NDC, pharmacy type, group, insurance
`eda/03_categorical_analysis.json`	JSON	Full top-20 category counts
`eda/04_geographic_analysis.png`	PNG	Top 20 state distributions + cross-state heatmap
`eda/04_geographic_analysis.json`	JSON	State counts + cross-state claim count
`eda/05_correlation_analysis.png`	PNG	Lower-triangle correlation heatmap
`eda/05_correlation_analysis.json`	JSON	Top 30 correlated feature pairs
`eda/06_hierarchical_analysis.png`	PNG	Claims/patient, patients/HCP, HCPs/pharmacy, pharmacies/patient
`eda/06_hierarchical_analysis.json`	JSON	Mean/median/max/std for each hierarchy
`eda/07_feature_engineered_eda.png`	PNG	Cap utilization, early refill, excess payment, scenario not covered

Core Outputs

File	Type	Description
`investigation_queue_top500.csv`	CSV	Top 500 highest-priority claims for manual review
`scored_claims_full.csv`	CSV	All claims with scores + risk tiers
`feature_importance_shap.csv`	CSV	SHAP ranking of features
`metrics.json`	JSON	All evaluation metrics + hierarchical summary counts
`schema_report.json`	JSON	Column mapping audit trail

Hierarchical Summaries

File	Type	Description
`transaction_level_summary.csv`	CSV	Per-claim analytical view
`hcp_level_summary.csv`	CSV	Per-HCP investigative lens
`pharmacy_level_summary.csv`	CSV	Per-pharmacy investigative lens
`patient_level_summary.csv`	CSV	Per-patient behavioral view

Visualizations

File	Type	Description
`01_evaluation_metrics.png`	PNG	Score distribution, ROC, PR, tier counts
`02_shap_summary.png`	PNG	SHAP beeswarm (top 20 features)
`03_shap_bar_importance.png`	PNG	SHAP bar chart
`04_rule_breakdown.png`	PNG	Flagged claims by each rule
`05_risk_tier_distribution.png`	PNG	Tier counts + fraud rates
`06_hcp_summary.png`	PNG	HCP risk score distribution + benefit vs risk scatter
`07_pharmacy_summary.png`	PNG	Pharmacy risk score distribution + fraud risk vs priority
`08_patient_summary.png`	PNG	Patient risk score distribution + one-and-done rate

Model Artifacts

File	Type	Description
`model/isolation_forest_model.pkl`	PKL	Trained model
`model/scaler.pkl`	PKL	StandardScaler
`model/encoder.pkl`	PKL	OrdinalEncoder
`model/feature_names.pkl`	PKL	Feature name list

Configuration: Drug-Agnostic

Edit config.py to configure for any GSK product:

PRODUCT_CONFIG = {
    "product_name": "Trelegy Ellipta",
    "days_supply_expected": 30,
    "quantity_expected": 1,
    "ndc_list": {
        "00173089314": {"strength": "100/62.5/25", "indication": "COPD/Asthma"},
        "00173088714": {"strength": "200/62.5/25", "indication": "Asthma"},
    },
    "valid_prescriber_specialties": ["Pulmonology", "Allergy/Immunology", ...],
    "suspicious_prescriber_specialties": ["Dermatology", "Orthopedics", ...],
    "early_refill_threshold_days": 23,
    "max_fills_90d": 4,
    # HCP thresholds
    "hcp_high_benefit_threshold": 500.0,
    "hcp_patient_concentration_threshold": 0.6,
    # Pharmacy thresholds
    "pharmacy_high_benefit_threshold": 450.0,
    "pharmacy_hcp_concentration_threshold": 0.5,
    "pharmacy_one_done_threshold": 0.5,
    # Patient thresholds
    "patient_gap_short_threshold": 15,
    "patient_gap_long_threshold": 60,
    "patient_max_active_duration_days": 180,
    # Group-aware limits
    "annual_max_fills_per_patient": 12,
    "annual_max_days_supply_per_patient": 360,
    "quantity_min": 1,
    "quantity_max": 3,
    "days_supply_min": 1,
    "days_supply_max": 90,
    ...
}
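
When adapting the config to another product, a quick consistency check can catch threshold typos before a run. The sketch below is an optional helper, not part of the documented codebase; the function name `validate_product_config` and the specific checks (for instance, that the early-refill threshold sits below the expected days supply) are assumptions.

```python
def validate_product_config(cfg):
    """Return a list of human-readable problems found in a product config."""
    problems = []
    if not cfg["quantity_min"] <= cfg["quantity_expected"] <= cfg["quantity_max"]:
        problems.append("quantity_expected outside [quantity_min, quantity_max]")
    if not (cfg["days_supply_min"] <= cfg["days_supply_expected"]
            <= cfg["days_supply_max"]):
        problems.append("days_supply_expected outside its min/max range")
    # Assumed relationship: an early refill happens before the supply runs out.
    if cfg["early_refill_threshold_days"] >= cfg["days_supply_expected"]:
        problems.append("early_refill_threshold_days should be below "
                        "days_supply_expected")
    if cfg["patient_gap_short_threshold"] >= cfg["patient_gap_long_threshold"]:
        problems.append("patient gap thresholds are not ordered")
    return problems

# Values copied from the Trelegy Ellipta example above.
example_cfg = {
    "quantity_expected": 1, "quantity_min": 1, "quantity_max": 3,
    "days_supply_expected": 30, "days_supply_min": 1, "days_supply_max": 90,
    "early_refill_threshold_days": 23,
    "patient_gap_short_threshold": 15, "patient_gap_long_threshold": 60,
}
issues = validate_product_config(example_cfg)
```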

Tech Stack

Python 3.9+
pandas ≥ 2.0.0, numpy ≥ 1.24.0, scikit-learn ≥ 1.3.0
shap ≥ 0.42.0, matplotlib ≥ 3.7.0, seaborn ≥ 0.12.0, joblib ≥ 1.3.0, pyarrow ≥ 12.0.0

License

Proprietary — GSK internal use.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'Harsh2396/gsk-copay-fraud-detection'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

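For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class. Note that the files listed under Model Artifacts are scikit-learn pickles (Isolation Forest, StandardScaler, OrdinalEncoder), not transformers checkpoints, so this snippet applies only if the repository also hosts a transformers-compatible model.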
