Credit Card Fraud Detection at Extreme Class Imbalance

Real anonymized transaction data with only 0.17% fraud, benchmarking Random Forest, XGBoost, and Isolation Forest — because at this imbalance level, which approach wins is a genuinely open question worth measuring.

6-8 hours end to end

·Machine Learning

Problem Statement

A payment processor must flag likely-fraudulent transactions within milliseconds of a card swipe, from a stream where genuine fraud is extraordinarily rare — under 0.2% of all transactions in this dataset. This is a fundamentally different regime from Project 1's moderate 8% imbalance: standard classification metrics like accuracy and even ROC-AUC become dangerously misleading here, since a model that predicts 'not fraud' for every single transaction already achieves 99.8% accuracy while catching zero fraud. Every design decision in this project — which metric to trust, which model family to use, where to set the decision threshold — has to be rebuilt around this extreme imbalance rather than reusing Project 1's moderate-imbalance playbook unchanged.

Dataset

Credit Card Fraud Detection (ULB)

284,807 real credit card transactions made by European cardholders over two days in September 2013, with only 492 confirmed frauds — a 0.173% positive rate. Due to genuine confidentiality requirements, the original transaction features were transformed via PCA into 28 anonymized components (V1 through V28), plus untransformed Time and Amount columns. This anonymization is itself a realistic production constraint: real fraud systems very often work with transformed or restricted features for privacy and security reasons, not raw, human-readable transaction details.
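
As a quick sanity check, the headline figures can be reproduced from the two published counts alone. This is only an illustrative sketch; the variable names are not part of the project's scripts.

```python
# Counts quoted in the dataset description above.
total_transactions = 284_807
fraud_cases = 492
legitimate_cases = total_transactions - fraud_cases

fraud_rate = fraud_cases / total_transactions
imbalance_ratio = legitimate_cases / fraud_cases
# Accuracy of a model that labels every transaction "not fraud".
always_legit_accuracy = legitimate_cases / total_transactions

print(f"Legitimate transactions: {legitimate_cases:,}")
print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Legitimate : fraud ratio: {imbalance_ratio:.1f} : 1")
print(f"Always-'not fraud' accuracy: {always_legit_accuracy:.3%}")
```

The ratio comes out just under 578 legitimate transactions per fraud case, which is the figure the imbalance-handling settings later in the project need to reflect.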

~144 MB, 284,807 transactions, 492 confirmed fraud cases · Machine Learning Group, Université Libre de Bruxelles (ULB), released publicly for research

Architecture Decisions

This project deliberately benchmarks three structurally different approaches rather than assuming gradient boosting automatically wins, because extreme imbalance is exactly the regime where that assumption can break down: a class-weighted Random Forest, an XGBoost model with scale_pos_weight tuned specifically for this imbalance ratio, and an Isolation Forest, which takes a completely different approach by learning what NORMAL transactions look like and flagging statistical outliers, using no fraud labels during training at all. The Isolation Forest is included specifically because, at this level of imbalance, a supervised model has only 492 positive examples to learn from across the entire dataset, while an anomaly-detection approach can, in principle, learn a rich model of normal behavior from the 284,315 legitimate transactions instead — a genuinely different bet on where the exploitable signal lies, worth measuring rather than dismissing.

Built On

•ML Module — Random Forest, the first benchmark, extended here with class weighting for extreme imbalance
•ML Module — Gradient Boosting (XGBoost), tuned specifically for this imbalance ratio via scale_pos_weight
•ML Module — Unsupervised Learning, extended here to anomaly detection via Isolation Forest, a genuinely different paradigm from supervised classification
•ML Module — Evaluation Metrics, extended here to precision-recall curves as the primary lens instead of ROC-AUC, which is measurably misleading at this imbalance level
•ML Module — Handling Real World Challenges, directly addressing the concept drift concern this project closes with

Step 1 — Measuring Exactly How Extreme This Imbalance Is

Before any model gets built, this step measures precisely why standard metrics fail here, rather than asserting it. A trivial baseline that predicts 'not fraud' for every transaction is computed directly, and its accuracy, ROC-AUC, and PR-AUC are measured. Accuracy comes out near 99.8% even though the baseline catches zero fraud, so accuracy alone cannot distinguish a useless model from a useful one. ROC-AUC does expose this constant-score baseline (it scores 0.5, chance level), but its false-positive rate is diluted by the huge legitimate class, so real models tend to cluster near the top of its range. PR-AUC, whose no-skill floor is the fraud rate itself, is therefore this project's primary metric from this point forward.

A Useless Model Can Still Score 99.8% Accuracy

Predicting 'not fraud' for every single transaction achieves extremely high accuracy purely because fraud is so rare — accuracy alone cannot tell a genuinely useful model apart from one that catches zero fraud.

01_measure_extreme_imbalance.py

python

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

df = pd.read_csv("./creditcard.csv")

print(f"Total transactions: {len(df):,}")
fraud_count = df["Class"].sum()
fraud_rate = df["Class"].mean()
print(f"Confirmed fraud cases: {fraud_count}")
print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Imbalance ratio (legitimate : fraud): {(1-fraud_rate)/fraud_rate:.0f} : 1\n")

# ─── THE TRIVIAL BASELINE: PREDICT "NOT FRAUD" FOR EVERYTHING ──────────
y_true = df["Class"].values
trivial_predictions = np.zeros_like(y_true)          # always predicts 0 (not fraud)
trivial_probabilities = np.zeros_like(y_true, dtype=float)   # always predicts 0% fraud probability

trivial_accuracy = accuracy_score(y_true, trivial_predictions)
trivial_roc_auc = roc_auc_score(y_true, trivial_probabilities)
trivial_pr_auc = average_precision_score(y_true, trivial_probabilities)

print("=== THE TRIVIAL 'ALWAYS PREDICT NOT-FRAUD' BASELINE ===\n")
print(f"Accuracy: {trivial_accuracy:.4%}   <- looks excellent, is completely useless")
print(f"ROC-AUC:  {trivial_roc_auc:.4f}   <- 0.5, chance level for a constant score")
print(f"PR-AUC:   {trivial_pr_auc:.4f}   <- equals the fraud rate, the no-skill floor")

print(f"""
=== WHY THIS PROJECT USES PR-AUC AS ITS PRIMARY METRIC ===

Accuracy is dominated entirely by the overwhelming majority class,
making it useless for judging fraud-catching ability directly. PR-AUC
focuses specifically on how well a model finds the rare positive
class among everything it flags, which is precisely the ability that
matters here. Every model in this project is compared using PR-AUC
first, with ROC-AUC reported only as a secondary reference.
""")

Gotchas

⚠average_precision_score is used here as PR-AUC. It is scikit-learn's standard summary of the precision-recall curve, computed as a recall-weighted mean of precisions (a step-wise sum rather than trapezoidal interpolation). It is not the same computation as roc_auc_score, and the two should never be used interchangeably when comparing models at extreme imbalance.
⚠roc_auc_score on an all-zero probability array returns 0.5: a constant score cannot rank any fraud case above any legitimate one, so it sits exactly at chance level. (ROC-AUC is undefined only when y_true contains a single class.) ROC-AUC's weakness at this imbalance is a different one: its false-positive rate is divided by the enormous legitimate class, so models with very different precision can all post values close to 1.0, which is what makes it a poor primary metric here.
⚠This 492-fraud, 284,807-transaction imbalance ratio (roughly 578:1) is dramatically more extreme than Project 1's roughly 12:1 credit default ratio. The SAME imbalance-handling techniques (naive oversampling especially) that were reasonable there can behave very differently and need careful re-evaluation at this scale.
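
The contrast between the two metrics can be illustrated on synthetic data, without creditcard.csv. The sketch below is an assumption-laden toy: the class sizes, the Gaussian score distributions, and all variable names are made up for illustration and say nothing about how the real models will score.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels at roughly this dataset's imbalance: 400 positives in 200,000.
n_neg, n_pos = 199_600, 400
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
prevalence = y.mean()

# (a) Constant scorer: the "always not fraud" baseline.
constant_scores = np.zeros(len(y))
const_roc = roc_auc_score(y, constant_scores)
const_ap = average_precision_score(y, constant_scores)

# (b) Mediocre scorer: positives score two standard deviations higher on average.
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(2.0, 1.0, n_pos)])
weak_roc = roc_auc_score(y, scores)
weak_ap = average_precision_score(y, scores)

print(f"Prevalence (no-skill PR-AUC floor): {prevalence:.4f}")
print(f"Constant scorer  ROC-AUC={const_roc:.4f}  PR-AUC={const_ap:.4f}")
print(f"Mediocre scorer  ROC-AUC={weak_roc:.4f}  PR-AUC={weak_ap:.4f}")
```

The constant scorer sits at the floor of both metrics (0.5 and the prevalence). The mediocre scorer's ROC-AUC lands at roughly 0.9, which looks strong, while its PR-AUC stays far lower because most of what it ranks highly is still legitimate traffic.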

Step 2 — Benchmarking Random Forest and XGBoost With Proper Imbalance Handling

Both supervised models are configured specifically for this extreme imbalance level: class_weight="balanced" for Random Forest and a scale_pos_weight for XGBoost computed from the training labels, both reflecting the roughly 578:1 ratio measured in Step 1 rather than a smaller, more moderate value that might suit a less extreme dataset. Both models are evaluated on a stratified split, so the tiny number of fraud cases is represented proportionally in both the training and validation sets; a non-stratified split risks a validation set with an unrepresentative number of fraud examples purely by chance.
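
To make the two weighting settings concrete, the sketch below computes what class_weight="balanced" resolves to and compares it with the negatives-to-positives ratio used for scale_pos_weight. The label vector is synthetic, built from the full dataset's class counts as a stand-in for the real training labels (an assumption; a stratified training split has nearly the same ratio).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the full dataset's class counts (stand-in for y_train).
y = np.concatenate([np.zeros(284_315, dtype=int), np.ones(492, dtype=int)])

# class_weight="balanced" resolves to n_samples / (n_classes * class_count).
weight_legit, weight_fraud = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

# XGBoost's scale_pos_weight convention: negatives / positives.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

print(f"Balanced weight, legitimate: {weight_legit:.4f}")
print(f"Balanced weight, fraud:      {weight_fraud:.2f}")
print(f"Fraud / legitimate weight:   {weight_fraud / weight_legit:.2f}")
print(f"scale_pos_weight:            {scale_pos_weight:.2f}")
```

The relative weight on fraud is the same number in both cases, so the two models are being told about the same imbalance. How each algorithm uses that weight internally differs, so identical weighting does not imply identical behavior.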

Stratified Splitting — Preserving the Rare Class in Every Split

With only 492 fraud cases total, a non-stratified random split risks an uneven, unlucky distribution of fraud between train and validation. Stratified splitting keeps the fraud rate essentially identical in both sets.
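
How much a plain random split can wobble is easy to measure on labels alone. The sketch below uses a synthetic label vector with this dataset's class counts and a placeholder feature column (both assumptions for illustration), then repeats the split over 20 seeds with and without stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.concatenate([np.zeros(284_315, dtype=int), np.ones(492, dtype=int)])
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature; only the labels matter here

plain_counts, stratified_counts = [], []
for seed in range(20):
    # Plain random split: the validation fraud count varies with the seed.
    _, _, _, y_val_plain = train_test_split(X, y, test_size=0.3, random_state=seed)
    # Stratified split: the validation fraud count is pinned to the class proportion.
    _, _, _, y_val_strat = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    plain_counts.append(int(y_val_plain.sum()))
    stratified_counts.append(int(y_val_strat.sum()))

print(f"Plain split, validation fraud count range: {min(plain_counts)} to {max(plain_counts)}")
print(f"Stratified split, validation fraud counts: {sorted(set(stratified_counts))}")
```

With so few positives, a swing of a couple of dozen fraud cases between seeds is enough to move PR-AUC noticeably, which is why the stratified variant is used throughout.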

02_supervised_benchmark.py

python

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score, precision_recall_curve
import xgboost as xgb
import pandas as pd
import numpy as np
import joblib

df = pd.read_csv("./creditcard.csv")

feature_columns = [col for col in df.columns if col not in ["Class", "Time"]]
X = df[feature_columns]
y = df["Class"]

# Stratified split -- ESSENTIAL here, preserving the ~0.173% fraud
# rate in both the training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42,
)

print(f"Training set fraud count: {y_train.sum()} out of {len(y_train):,} ({y_train.mean():.4%})")
print(f"Validation set fraud count: {y_val.sum()} out of {len(y_val):,} ({y_val.mean():.4%})\n")

results = {}

# ─── RANDOM FOREST, WEIGHTED FOR THIS EXACT IMBALANCE ──────────────────
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    class_weight="balanced",   # scikit-learn computes weights automatically from the true class ratio
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train, y_train)
rf_probabilities = rf.predict_proba(X_val)[:, 1]

results["Random Forest"] = {
    "pr_auc": average_precision_score(y_val, rf_probabilities),
    "roc_auc": roc_auc_score(y_val, rf_probabilities),
}

# ─── XGBOOST, scale_pos_weight SET EXACTLY FOR THIS IMBALANCE RATIO ────
imbalance_ratio = (y_train == 0).sum() / (y_train == 1).sum()
print(f"Computed scale_pos_weight for XGBoost: {imbalance_ratio:.1f}\n")

xgb_model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=imbalance_ratio,
    eval_metric="aucpr",   # metric reported on an eval_set, if one is passed; the training objective stays logistic loss
    random_state=42,
)
xgb_model.fit(X_train, y_train)
xgb_probabilities = xgb_model.predict_proba(X_val)[:, 1]

results["XGBoost"] = {
    "pr_auc": average_precision_score(y_val, xgb_probabilities),
    "roc_auc": roc_auc_score(y_val, xgb_probabilities),
}

print("=== SUPERVISED MODEL COMPARISON (PR-AUC IS THE PRIMARY METRIC) ===\n")
print(f"{'Model':>16} | {'PR-AUC':>10} | {'ROC-AUC':>10}")
print("-" * 42)
for name, metrics in results.items():
    print(f"{name:>16} | {metrics['pr_auc']:>10.4f} | {metrics['roc_auc']:>10.4f}")

# Persist both models for later steps
xgb_model.save_model("xgboost_fraud_model.json")
joblib.dump(rf, "random_forest_fraud_model.pkl")

Gotchas

⚠scale_pos_weight in XGBoost must be computed from the TRAINING set's actual class ratio, not the full dataset's ratio — computing it from data the model will also validate against would leak a small amount of information about the validation set's composition into a training-time hyperparameter.
⚠eval_metric="aucpr" changes which metric XGBoost reports on an eval_set (and uses for early stopping); it does not change the training objective, which remains the default logistic loss. Because the fit call above passes no eval_set, the setting has no effect on the trained model as written. To benefit from it, pass eval_set to fit and enable early stopping, so that training is monitored on PR-AUC, the metric this project treats as primary.
⚠Both ROC-AUC values will likely look deceptively high and close together for both models even if their PR-AUC values differ meaningfully — this is expected and is precisely Step 1's point: ROC-AUC compresses the visible difference between models at extreme imbalance, which is exactly why PR-AUC is reported as primary and ROC-AUC only as a secondary reference here.

Step 3 — Isolation Forest: A Genuinely Different Approach

Isolation Forest takes an entirely different bet: rather than learning to separate fraud from legitimate transactions using labels, it learns what NORMAL transactions look like using no fraud labels at all during training, then flags transactions that are statistically easy to isolate from the rest as anomalies. This step trains an Isolation Forest, converts its raw anomaly scores into a comparable probability-like ranking, and measures its PR-AUC using the exact same metric and validation set as Step 2's supervised models — a fair, direct comparison across a genuinely different modeling paradigm.

Isolation Forest — Learning Normal, Flagging Outliers

Unlike Random Forest and XGBoost, which learn directly from labeled fraud examples, Isolation Forest never sees fraud labels during training — it learns the shape of normal transactions and flags points that are easy to isolate as statistical outliers.

03_isolation_forest.py

python

from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score
import numpy as np

# Assumes X_train, X_val, y_train, y_val and the `results` dict from
# 02_supervised_benchmark.py are already in scope (same session or notebook)

# Isolation Forest is trained on the TRAINING SET's FEATURES ONLY --
# y_train is never passed to .fit(), by design, since this model
# learns a notion of "normal" rather than a supervised boundary
true_fraud_rate = y_train.mean()

isolation_forest = IsolationForest(
    n_estimators=200,
    contamination=true_fraud_rate,   # sets the cutoff used by predict(); score_samples() below does not depend on it
    random_state=42,
    n_jobs=-1,
)
isolation_forest.fit(X_train)   # NOTE: no y_train passed here at all

# score_samples returns HIGHER values for MORE NORMAL points and LOWER
# (more negative) values for MORE ANOMALOUS points -- the OPPOSITE
# direction of a fraud probability, so it must be negated below
raw_anomaly_scores = isolation_forest.score_samples(X_val)
fraud_likelihood_scores = -raw_anomaly_scores   # now HIGHER means MORE likely to be fraud, matching the other models' convention

isolation_forest_pr_auc = average_precision_score(y_val, fraud_likelihood_scores)
isolation_forest_roc_auc = roc_auc_score(y_val, fraud_likelihood_scores)

print("=== ISOLATION FOREST RESULTS ===\n")
print(f"PR-AUC:  {isolation_forest_pr_auc:.4f}")
print(f"ROC-AUC: {isolation_forest_roc_auc:.4f}\n")

print("=== FULL THREE-WAY COMPARISON ===\n")
print(f"{'Model':>18} | {'PR-AUC':>10} | {'ROC-AUC':>10} | {'Uses fraud labels?':>20}")
print("-" * 66)
print(f"{'Random Forest':>18} | {results['Random Forest']['pr_auc']:>10.4f} | "
      f"{results['Random Forest']['roc_auc']:>10.4f} | {'Yes':>20}")
print(f"{'XGBoost':>18} | {results['XGBoost']['pr_auc']:>10.4f} | "
      f"{results['XGBoost']['roc_auc']:>10.4f} | {'Yes':>20}")
print(f"{'Isolation Forest':>18} | {isolation_forest_pr_auc:>10.4f} | "
      f"{isolation_forest_roc_auc:>10.4f} | {'No':>20}")

print(f"""
=== THE HONEST CONCLUSION ===

On THIS dataset, the supervised models (which DO get to see the 492
real fraud examples during training) typically outperform Isolation
Forest on PR-AUC, since 492 labeled examples turns out to be enough
signal for a supervised model to exploit directly. Isolation Forest
remains genuinely valuable in a DIFFERENT real scenario this dataset
does not fully represent: detecting BRAND NEW fraud patterns that
look nothing like any past labeled example, since it never depended
on having seen fraud labeled as such in the first place -- a
capability plain supervised models structurally lack.
""")

Gotchas

⚠Isolation Forest's score_samples output convention is easy to get backwards — higher scores mean MORE normal, not more anomalous, which is the opposite of what average_precision_score and roc_auc_score expect (they expect higher scores to mean more likely to belong to the positive/fraud class). Negating the raw score, as done here, is required for a correct, meaningful PR-AUC calculation.
⚠The contamination parameter is Isolation Forest's estimate of what fraction of the data is anomalous. It only sets the cutoff used by predict() and decision_function(); score_samples(), which feeds the PR-AUC and ROC-AUC computed here, does not depend on it. Setting it to the true fraud rate is a reasonable, informed choice since that rate is known from the labels (even though the labels themselves are not used for training), but in a genuinely unlabeled real-world scenario it would need to be estimated or tuned rather than known exactly.
⚠Isolation Forest is trained here on X_train only, never on y_train. IsolationForest.fit does accept a y argument, but only for scikit-learn API consistency: it is ignored, so passing y_train would neither raise an error nor change the model. Leaving it out keeps the code honest about the fact that no labels take part in training.
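
Both the score direction and the role of contamination can be checked on a toy problem. The sketch below uses made-up 2-D data (a dense cluster plus a handful of distant points); the data, the helper name fit_and_score, and the two contamination values are illustrative assumptions, not part of the project's scripts.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic 2-D data: a dense "normal" cluster plus a few far-away outliers.
normal_points = rng.normal(0.0, 1.0, size=(2000, 2))
outlier_points = rng.uniform(6.0, 8.0, size=(20, 2))
X = np.vstack([normal_points, outlier_points])
is_outlier = np.concatenate([np.zeros(2000, dtype=bool), np.ones(20, dtype=bool)])

def fit_and_score(contamination):
    """Fit on features only and return the model with its raw scores."""
    model = IsolationForest(n_estimators=100, contamination=contamination, random_state=0)
    model.fit(X)   # unsupervised: no labels involved
    return model, model.score_samples(X)

model_low, scores_low = fit_and_score(0.001)
model_high, scores_high = fit_and_score(0.05)

# Direction check: outliers should receive LOWER raw scores than normal points.
print(f"Mean raw score, normal points: {scores_low[~is_outlier].mean():.3f}")
print(f"Mean raw score, outliers:      {scores_low[is_outlier].mean():.3f}")

# contamination moves the predict() cutoff but leaves score_samples() untouched.
print(f"Raw scores identical across contamination values: {np.allclose(scores_low, scores_high)}")
print(f"Points flagged at 0.001: {(model_low.predict(X) == -1).sum()}")
print(f"Points flagged at 0.05:  {(model_high.predict(X) == -1).sum()}")
```

Because ranking metrics such as PR-AUC only look at the ordering of the negated scores, tuning contamination matters for how many transactions get flagged by predict(), not for the comparison table in this step.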

Step 4 — Cost-Based Threshold Selection and Concept Drift

Following Project 1's exact cost-matrix principle, this step chooses a decision threshold for the winning model based on real, asymmetric costs: a missed fraud case costs the full transaction amount, while a false alarm costs a customer service interaction and potential customer friction. This step closes with a direct discussion of concept drift — a concern genuinely more urgent here than in either prior project, since fraud patterns actively and deliberately evolve as fraudsters adapt to detection systems, meaning this model's performance will degrade over time in a way Project 1's and Project 2's models are far less likely to.

04_cost_threshold_and_drift.py

python

import numpy as np

# Assumes y_val and xgb_probabilities from 02_supervised_benchmark.py are in scope.
# XGBoost's probabilities from Step 2 are used, assumed to be the winning model.

# Illustrative costs -- a real payment processor would use actual measured figures
AVG_FRAUD_TRANSACTION_AMOUNT = 8500    # average amount lost per undetected fraud
COST_OF_FALSE_ALARM = 150              # customer service cost + friction per unnecessary flag

best_threshold, lowest_total_cost = 0.5, float("inf")

for threshold in np.linspace(0.05, 0.90, 18):   # 0.05, 0.10, ..., 0.90
    predictions = (xgb_probabilities >= threshold).astype(int)

    false_negatives = ((predictions == 0) & (y_val == 1)).sum()   # missed fraud
    false_positives = ((predictions == 1) & (y_val == 0)).sum()   # false alarms

    total_cost = (false_negatives * AVG_FRAUD_TRANSACTION_AMOUNT) + (false_positives * COST_OF_FALSE_ALARM)

    if total_cost < lowest_total_cost:
        lowest_total_cost = total_cost
        best_threshold = threshold

print(f"Optimal threshold based on this cost matrix: {best_threshold:.2f}")
print(f"Estimated total cost at this threshold: ₹{lowest_total_cost:,.0f}\n")

final_predictions = (xgb_probabilities >= best_threshold).astype(int)
final_false_negatives = ((final_predictions == 0) & (y_val == 1)).sum()
final_false_positives = ((final_predictions == 1) & (y_val == 0)).sum()
total_actual_fraud = y_val.sum()

print(f"At this threshold: caught {total_actual_fraud - final_false_negatives} out of "
      f"{total_actual_fraud} actual fraud cases ({final_false_positives} false alarms)")

print(f"""
=== CONCEPT DRIFT: WHY THIS MODEL NEEDS MORE ACTIVE MONITORING THAN PROJECTS 1 AND 2 ===

Credit default risk factors (Project 1) and retail seasonal demand
patterns (Project 2) shift slowly, over months or years. Fraud is
fundamentally different: fraudsters ACTIVELY adapt their behavior in
direct response to detection systems, often within days or weeks of
a new pattern being caught and blocked. This means:

  1. This model's PR-AUC on NEW, incoming transactions should be
     tracked continuously in production, not checked once at
     deployment and assumed to remain valid.
  2. A measurable, sustained drop in precision or recall on live
     data is an early, actionable signal that fraud patterns have
     shifted and the model needs retraining on more recent data.
  3. Unlike Projects 1 and 2, where retraining on a fixed schedule
     (e.g. quarterly) is often adequate, a fraud detection system in
     production typically needs retraining triggered by MEASURED
     performance degradation, not merely a fixed calendar schedule.
""")
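
The sweep above prices every missed fraud at one average figure. A possible refinement, sketched here under assumptions, charges each missed fraud its own transaction amount instead. The function name total_cost and the five-row example are hypothetical; on the real data the amounts would presumably come from the Amount column of the validation features.

```python
import numpy as np

def total_cost(y_true, scores, amounts, threshold, false_alarm_cost):
    """Cost at one threshold: each missed fraud costs its own amount,
    each false alarm costs a flat handling fee."""
    flagged = scores >= threshold
    missed_fraud = (~flagged) & (y_true == 1)
    false_alarms = flagged & (y_true == 0)
    return float(amounts[missed_fraud].sum() + false_alarm_cost * false_alarms.sum())

# Tiny hand-made example (hypothetical values, not drawn from the dataset).
y_true = np.array([1, 1, 0, 0, 0])
scores = np.array([0.9, 0.2, 0.8, 0.1, 0.3])
amounts = np.array([20.0, 9000.0, 50.0, 60.0, 70.0])

candidate_thresholds = [0.15, 0.50, 0.95]
costs = {
    t: total_cost(y_true, scores, amounts, t, false_alarm_cost=150)
    for t in candidate_thresholds
}
best_threshold = min(costs, key=costs.get)

for t, cost in costs.items():
    print(f"threshold={t:.2f}  total cost={cost:,.0f}")
print(f"Cheapest threshold: {best_threshold:.2f}")
```

In this toy case the low-scoring fraud carries almost all of the money at risk, so the cheapest threshold is the lowest one even though it triggers two false alarms. A flat average would hide that trade-off.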

Gotchas

⚠The threshold sweep here uses xgb_probabilities from Step 2's validation set for illustration — a real production threshold decision would typically be validated on a separate, more recent holdout period specifically to check the threshold still performs well as time passes, directly anticipating the concept drift concern this step raises.
⚠AVG_FRAUD_TRANSACTION_AMOUNT is a single average figure used here for simplicity. A more precise real system would use the ACTUAL transaction amount for each specific case being scored, since missing a small ₹20 fraudulent charge and missing a ₹9,000 one do not cost the same. That per-transaction costing is an extension worth building on top of this project's single-average approach.
⚠Concept drift monitoring requires access to GROUND TRUTH fraud labels arriving after the fact (confirmed via chargebacks or customer reports) to measure real-world precision and recall over time — this creates an inherent lag between when a fraud pattern shifts and when monitoring can detect that shift, a genuine, unavoidable limitation of production fraud detection worth being explicit about rather than glossing over.
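
A degradation-triggered retraining rule of the kind described in this step could look roughly like the sketch below. Everything in it is hypothetical: the WindowMetrics class, the weekly windows, the 0.10 recall tolerance, and the two-window patience are illustrative choices, and the labels are assumed to be confirmed outcomes that arrive after the fact.

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    precision: float
    recall: float

def window_metrics(predicted, confirmed):
    """Precision and recall for one time window of labeled feedback.
    predicted / confirmed are equal-length sequences of 0/1 values."""
    pairs = list(zip(predicted, confirmed))
    tp = sum(1 for p, c in pairs if p == 1 and c == 1)
    fp = sum(1 for p, c in pairs if p == 1 and c == 0)
    fn = sum(1 for p, c in pairs if p == 0 and c == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return WindowMetrics(precision, recall)

def needs_retraining(recent, baseline, tolerance=0.10, patience=2):
    """True when recall has sat more than `tolerance` below the baseline
    for the last `patience` consecutive windows."""
    if len(recent) < patience:
        return False
    return all(m.recall < baseline.recall - tolerance for m in recent[-patience:])

# Baseline measured at deployment time (hypothetical numbers).
baseline = WindowMetrics(precision=0.90, recall=0.80)

confirmed = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]          # same ground truth each week, for simplicity
week1 = window_metrics([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], confirmed)
week2 = window_metrics([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], confirmed)
week3 = window_metrics([1, 1, 0, 0, 0, 1, 0, 0, 0, 0], confirmed)

for name, m in [("week1", week1), ("week2", week2), ("week3", week3)]:
    print(f"{name}: precision={m.precision:.2f} recall={m.recall:.2f}")
print("Retrain after week 2?", needs_retraining([week1, week2], baseline))
print("Retrain after week 3?", needs_retraining([week1, week2, week3], baseline))
```

Requiring several consecutive bad windows trades detection speed for fewer false triggers. Given the label lag noted above, the window length also has to be long enough for chargebacks to arrive before a window is scored.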