Loan Default Prediction with Explainable Credit Risk Scoring

Logistic regression, Random Forest, and LightGBM benchmarked on real multi-table loan application data, with SHAP explainability and a cost-based decision threshold: the two things a regulated lender needs beyond raw accuracy.

7-9 hours end to end

·Machine Learning

Problem Statement

A lender receives a loan application and must decide two things: whether to approve or reject it and, if approved, at what interest rate. Errors in either direction carry real but asymmetric costs: approving a loan that defaults loses principal, while rejecting an applicant who would have repaid loses interest income and a customer. Unlike most classification problems taught as pure accuracy exercises, credit lending is a regulated domain. In many jurisdictions a lender must be able to explain to a regulator, and often to the rejected applicant, WHY a specific decision was made, so a model that cannot be explained is hard to deploy no matter how accurate it is. This project builds a complete approve/reject scoring system that respects both constraints: real predictive accuracy AND genuine per-prediction explainability.

Dataset

Home Credit Default Risk

Real loan application data spanning roughly 300,000 applications, spread across multiple linked tables: the main application table (over 120 raw features covering income, employment, family status, and requested loan terms), a credit bureau history table (each applicant's prior loans at OTHER institutions), and a previous-application table (this same lender's past dealings with the applicant, if any). This mirrors exactly how real credit data lives in production — never as one clean table, but as several tables requiring genuine feature engineering to combine.

~2.5 GB across all tables, ~300,000 applications, ~8% real historical default rate
Source: Home Credit Group, released publicly for the Home Credit Default Risk Kaggle competition

Architecture Decisions

This project deliberately trains three models rather than jumping straight to the most sophisticated option, following the same evidence-before-conclusion discipline used throughout this course:

•Logistic regression: a plain baseline, interpretable by design and the traditional credit-scoring standard.
•Random Forest: a strong, robust default choice needing little tuning.
•LightGBM: chosen over XGBoost for its native handling of high-cardinality categorical features and its faster training on a dataset this size with many categorical columns.

Each model is evaluated on the identical held-out data so the actual lift from added model complexity is measured, not assumed. Note that the Step 3 benchmark restricts all three models to numeric columns, so LightGBM's categorical handling is not exercised in that comparison. SHAP is layered on top of the winning model because raw feature importance alone cannot answer the regulatory question that matters: why did THIS specific applicant get THIS specific decision.

Built On

•ML Module 12 — Logistic Regression, the interpretable baseline this project measures every other model against
•ML Module — Random Forest and ensemble methods, the second baseline
•ML Module — Gradient Boosting (XGBoost/LightGBM), the primary model this project builds and tunes
•ML Module — Model Explainability (SHAP and LIME), applied here to a genuine regulatory requirement, not an optional add-on
•ML Module — Evaluation Metrics, extended here with a real cost-matrix-based threshold decision instead of a default 0.5 cutoff

Step 1 — Exploring Real, Multi-Table Data

Before any feature engineering happens, the raw data itself needs an honest audit. Real credit data has real problems no toy dataset shows: missing values that are not random (an applicant with no prior bureau history is systematically different from one with an average history, not a random gap to fill blindly), a genuinely moderate class imbalance (roughly 8% actual defaults), and several tables that must be correctly joined without accidentally duplicating or losing applicants. This step measures every one of these directly before deciding how to handle each.

Three Linked Tables, One Target

The main application table holds the target (default or not). Bureau and previous-application tables must be aggregated per applicant before joining, since each applicant can have many historical records in either table.

01_explore_multitable_data.py

python

import pandas as pd
import numpy as np

app_train = pd.read_csv("./home_credit/application_train.csv")
bureau = pd.read_csv("./home_credit/bureau.csv")

print(f"Applications: {len(app_train):,}")
print(f"Bureau records: {len(bureau):,} (across {bureau['SK_ID_CURR'].nunique():,} unique applicants)")
print(f"Average bureau records per applicant with history: "
      f"{len(bureau) / bureau['SK_ID_CURR'].nunique():.1f}\n")

# ─── TARGET DISTRIBUTION -- CONFIRMING REAL, MODERATE IMBALANCE ────────
default_rate = app_train["TARGET"].mean()
print(f"Actual default rate: {default_rate:.2%}")
print(f"Class imbalance ratio (non-default : default): {(1-default_rate)/default_rate:.1f} : 1\n")

# ─── MISSING DATA -- NOT RANDOM, MEANINGFUL ─────────────────────────────
missing_pct = (app_train.isnull().sum() / len(app_train) * 100).sort_values(ascending=False)
print("=== TOP 10 COLUMNS BY MISSING PERCENTAGE ===")
print(missing_pct.head(10))

# Check whether missingness itself correlates with default -- a real,
# important signal, not just noise to drop or blindly impute away
has_bureau_history = app_train["SK_ID_CURR"].isin(bureau["SK_ID_CURR"])
default_rate_with_history = app_train.loc[has_bureau_history, "TARGET"].mean()
default_rate_without_history = app_train.loc[~has_bureau_history, "TARGET"].mean()

print(f"\n=== DOES MISSING BUREAU HISTORY CORRELATE WITH DEFAULT? ===\n")
print(f"Default rate WITH bureau history:    {default_rate_with_history:.2%}")
print(f"Default rate WITHOUT bureau history: {default_rate_without_history:.2%}")
print("""
If these two rates differ meaningfully, "no bureau history" is
itself a genuine, informative signal -- meaning a missing-value
FLAG feature (rather than only an imputed value) should be created
explicitly in Step 2, so the model can use the fact of missingness
itself, not just a filled-in guess at what the value might be.
""")

Gotchas

⚠Joining bureau or previous_application directly onto application_train without aggregating first will silently multiply rows — an applicant with 5 bureau records becomes 5 duplicate rows after a naive join, corrupting every downstream statistic including the class balance itself.
⚠A roughly 11:1 imbalance ratio (about 8% defaults) is real but moderate, and genuinely different from Project 3's extreme 580:1 fraud ratio later. The handling strategy (class weighting, not aggressive oversampling) should therefore be chosen to match THIS project's specific imbalance level, not copied from a different project's solution.
⚠Missingness that correlates with the target, exactly as measured above, must be preserved as a feature (a boolean flag) rather than erased by imputation — filling a missing value with a mean or median throws away real predictive signal the missingness itself carried.
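The row-multiplication gotcha is easy to reproduce on a toy example. The sketch below uses made-up data (three hypothetical applicants; only the column names SK_ID_CURR, SK_ID_BUREAU, and TARGET come from the dataset) to show how a naive join inflates both the row count and the apparent default rate, and how aggregating first avoids it. The `validate="one_to_one"` argument is an optional pandas safeguard that raises an error if either side of the merge contains duplicate keys.

```python
import pandas as pd

# Hypothetical toy data: 3 applicants, one of whom defaulted
applications = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [1, 0, 0],
})
# Applicant 1 has 5 bureau records, applicant 2 has 1, applicant 3 has none
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12, 13, 14, 20],
})

# Naive join: one output row per (applicant, bureau record) pair
naive = applications.merge(bureau, on="SK_ID_CURR", how="left")

# Safe join: collapse bureau to one row per applicant first
per_applicant = (
    bureau.groupby("SK_ID_CURR")
    .agg(bureau_loan_count=("SK_ID_BUREAU", "count"))
    .reset_index()
)
safe = applications.merge(
    per_applicant, on="SK_ID_CURR", how="left", validate="one_to_one",
)

print(f"Rows: original={len(applications)}, naive={len(naive)}, safe={len(safe)}")
print(f"Default rate: original={applications['TARGET'].mean():.3f}, "
      f"naive={naive['TARGET'].mean():.3f}, safe={safe['TARGET'].mean():.3f}")
```

In this toy case the defaulting applicant happens to have the most bureau records, so the naive join overstates the default rate; on real data the distortion can go in either direction.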

Step 2 — Feature Engineering Across Tables

This is where the real work of this project lives. Bureau and previous-application data must be aggregated into per-applicant summary features (count of prior loans, average loan amount, worst delinquency status, and similarly for previous applications with this lender) before being joined onto the main table. This step also creates explicit missing-value flags for the columns identified in Step 1 as having meaningful, non-random missingness, and engineers domain-informed ratio features (like debt-to-income) that are known, from real credit scoring practice, to carry more signal than either raw component alone.

Aggregation Turns Many Rows Into Meaningful Per-Applicant Features

Each applicant's scattered bureau history becomes a handful of summary statistics — count, average, worst case — that a model can actually use as input columns.

02_feature_engineering.py

python

import pandas as pd
import numpy as np

def aggregate_bureau_features(bureau_df: pd.DataFrame) -> pd.DataFrame:
    """Turn many bureau rows per applicant into one summary row per applicant."""
    # Row-level on-time indicator, so the per-applicant share becomes a
    # plain mean instead of a slow groupby.apply. Note the column name in
    # bureau.csv is CREDIT_DAY_OVERDUE (singular DAY).
    bureau_df = bureau_df.assign(is_on_time=bureau_df["CREDIT_DAY_OVERDUE"] <= 0)

    aggregated = bureau_df.groupby("SK_ID_CURR").agg(
        bureau_loan_count=("SK_ID_BUREAU", "count"),
        bureau_avg_credit_amount=("AMT_CREDIT_SUM", "mean"),
        bureau_max_overdue=("AMT_CREDIT_SUM_OVERDUE", "max"),
        bureau_active_loan_count=("CREDIT_ACTIVE", lambda x: (x == "Active").sum()),
        bureau_pct_on_time=("is_on_time", "mean"),
    ).reset_index()
    return aggregated

def engineer_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Domain-informed ratios known from real credit scoring practice to
    carry more signal than either raw component alone."""
    df = df.copy()
    df["debt_to_income_ratio"] = df["AMT_CREDIT"] / (df["AMT_INCOME_TOTAL"] + 1)
    df["annuity_to_income_ratio"] = df["AMT_ANNUITY"] / (df["AMT_INCOME_TOTAL"] + 1)
    df["credit_to_goods_ratio"] = df["AMT_CREDIT"] / (df["AMT_GOODS_PRICE"] + 1)

    # DAYS_EMPLOYED and DAYS_BIRTH are both negative day counts, so their
    # ratio is positive. 365243 is the dataset's placeholder for applicants
    # without current employment; treat it as missing, not as ~1,000 years.
    days_employed = df["DAYS_EMPLOYED"].replace(365243, np.nan)
    df["employed_days_ratio"] = days_employed / df["DAYS_BIRTH"]   # employment tenure relative to age
    return df

def create_missingness_flags(df: pd.DataFrame, columns_with_meaningful_missingness: list) -> pd.DataFrame:
    """Preserve missingness as its own signal, following Step 1's finding
    that missing bureau history correlates with default rate."""
    df = df.copy()
    for col in columns_with_meaningful_missingness:
        df[f"{col}_was_missing"] = df[col].isnull().astype(int)
    return df

app_train = pd.read_csv("./home_credit/application_train.csv")
bureau = pd.read_csv("./home_credit/bureau.csv")

bureau_features = aggregate_bureau_features(bureau)

app_train = app_train.merge(bureau_features, on="SK_ID_CURR", how="left")
app_train = engineer_ratio_features(app_train)

# Applicants with NO bureau history get NaN after the merge -- exactly
# the meaningful missingness Step 1 flagged, preserved here explicitly
app_train = create_missingness_flags(
    app_train, ["bureau_loan_count", "bureau_avg_credit_amount"],
)

# Fill remaining NaNs in the NEW bureau features with 0. For the two
# count columns 0 is literally true ("no history" means zero prior loans);
# for the average, max-overdue and on-time columns it is only a
# placeholder, which the *_was_missing flags let the model tell apart
bureau_feature_columns = ["bureau_loan_count", "bureau_avg_credit_amount",
                            "bureau_max_overdue", "bureau_active_loan_count",
                            "bureau_pct_on_time"]
app_train[bureau_feature_columns] = app_train[bureau_feature_columns].fillna(0)

print(f"Final feature count: {app_train.shape[1]}")
print("Sample engineered features:")
print(app_train[["debt_to_income_ratio", "bureau_loan_count",
                   "bureau_loan_count_was_missing"]].head())

Gotchas

⚠Filling bureau_loan_count with 0 for applicants with no history is correct here specifically BECAUSE zero genuinely means zero prior loans — this is different from filling a genuinely unknown value with a placeholder, which would misrepresent uncertainty as a real measurement.
⚠The +1 added to the denominator of each monetary ratio prevents division by zero for applicants with a zero reported income or goods price. It is a small, deliberate numerical safeguard, not a meaningful distortion, because these amounts are far larger than 1 for the vast majority of applicants.
⚠DAYS_EMPLOYED and DAYS_BIRTH in this real dataset are stored as NEGATIVE numbers (days before the application date), a well-documented quirk of this specific dataset. DAYS_EMPLOYED also contains the large positive placeholder 365243 for applicants without current employment, such as pensioners. Failing to account for the sign convention or the placeholder would silently produce nonsensical ratio values.
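The script in this step aggregates only the bureau table, while the step description also calls for per-applicant summaries of the previous-application table. A minimal sketch of the same pattern follows, run on made-up rows. The column names SK_ID_PREV, AMT_APPLICATION, and NAME_CONTRACT_STATUS (with the value "Refused") are assumed to match previous_application.csv, and the output feature names are hypothetical choices for illustration.

```python
import pandas as pd

def aggregate_previous_application_features(prev_df: pd.DataFrame) -> pd.DataFrame:
    """Collapse many previous applications per applicant into one summary row."""
    # Row-level indicator so the refusal share is a plain per-group mean
    prev_df = prev_df.assign(was_refused=prev_df["NAME_CONTRACT_STATUS"] == "Refused")
    return (
        prev_df.groupby("SK_ID_CURR")
        .agg(
            prev_app_count=("SK_ID_PREV", "count"),
            prev_avg_amount_requested=("AMT_APPLICATION", "mean"),
            prev_refused_share=("was_refused", "mean"),
        )
        .reset_index()
    )

# Hypothetical toy rows: applicant 1 applied twice (one refusal), applicant 2 once
toy_prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_PREV": [100, 101, 200],
    "AMT_APPLICATION": [10000.0, 30000.0, 5000.0],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved"],
})

prev_features = aggregate_previous_application_features(toy_prev)
print(prev_features)
```

The result can be left-joined onto the main table by SK_ID_CURR in the same way as the bureau features, with the same care about flagging applicants who have no previous applications at all.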

Step 3 — Benchmarking Three Models Honestly

Rather than assuming gradient boosting is automatically the right choice, this step trains and fairly compares logistic regression, Random Forest, and LightGBM on the identical engineered features and identical train/validation split, using AUC-ROC and precision-recall AUC as the comparison metrics given the real, moderate class imbalance measured in Step 1. Class weighting (rather than oversampling, which risks creating unrealistic duplicate patterns at this dataset's scale) is applied identically across all three models for a fair comparison.
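One reason to report both metrics is that their chance levels differ under imbalance. The sketch below uses hypothetical labels with an 8% positive rate and a deliberately uninformative model that gives every applicant the same score. It illustrates that ROC-AUC's chance level stays at 0.5 regardless of class balance, while the chance level of PR-AUC (computed here with average_precision_score, as in the benchmark script) equals the positive rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical labels: 1,000 applicants, 80 defaults (8% positive rate)
y_true = np.array([1] * 80 + [0] * 920)

# An uninformative model: every applicant receives the same score
constant_scores = np.full(len(y_true), 0.5)

roc_auc = roc_auc_score(y_true, constant_scores)
pr_auc = average_precision_score(y_true, constant_scores)

print(f"Positive rate:                     {y_true.mean():.3f}")
print(f"ROC-AUC of an uninformative model: {roc_auc:.3f}")
print(f"PR-AUC of an uninformative model:  {pr_auc:.3f}")
```

A PR-AUC that looks low in absolute terms should therefore be read against the default rate, not against 0.5.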

Three Models, One Fair Comparison

The same engineered features and the same validation split are used for all three models — isolating the model choice itself as the only variable, following the same controlled-comparison discipline used throughout this course.

03_benchmark_models.py

python

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb
import pandas as pd
import numpy as np

# Assume app_train is Step 2's fully engineered dataframe
feature_columns = [col for col in app_train.columns if col not in ["SK_ID_CURR", "TARGET"]]
X = app_train[feature_columns].select_dtypes(include=[np.number])   # numeric features only for this comparison
y = app_train["TARGET"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)

# scale_pos_weight / class_weight applied CONSISTENTLY across all three
# models, isolating model choice as the only real variable
imbalance_ratio = (y_train == 0).sum() / (y_train == 1).sum()

results = {}

# ─── BASELINE 1: LOGISTIC REGRESSION ───────────────────────────────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.fillna(0))
X_val_scaled = scaler.transform(X_val.fillna(0))

log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
log_reg_probs = log_reg.predict_proba(X_val_scaled)[:, 1]
results["Logistic Regression"] = {
    "auc": roc_auc_score(y_val, log_reg_probs),
    "pr_auc": average_precision_score(y_val, log_reg_probs),
}

# ─── BASELINE 2: RANDOM FOREST ──────────────────────────────────────────
rf = RandomForestClassifier(
    n_estimators=200, max_depth=10, class_weight="balanced",
    random_state=42, n_jobs=-1,
)
rf.fit(X_train.fillna(0), y_train)
rf_probs = rf.predict_proba(X_val.fillna(0))[:, 1]
results["Random Forest"] = {
    "auc": roc_auc_score(y_val, rf_probs),
    "pr_auc": average_precision_score(y_val, rf_probs),
}

# ─── PRIMARY MODEL: LIGHTGBM ────────────────────────────────────────────
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_val = lgb.Dataset(X_val, label=y_val, reference=lgb_train)

lgb_params = {
    "objective": "binary",
    "metric": "auc",
    "scale_pos_weight": imbalance_ratio,   # SAME imbalance handling principle as the other two models
    "num_leaves": 31,
    "learning_rate": 0.05,
    "verbose": -1,
}

lgb_model = lgb.train(
    lgb_params, lgb_train, valid_sets=[lgb_val],
    num_boost_round=500,
    callbacks=[lgb.early_stopping(stopping_rounds=30)],
)
lgb_probs = lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration)
results["LightGBM"] = {
    "auc": roc_auc_score(y_val, lgb_probs),
    "pr_auc": average_precision_score(y_val, lgb_probs),
}

print("=== HONEST THREE-MODEL COMPARISON ===\n")
print(f"{'Model':>22} | {'AUC-ROC':>10} | {'PR-AUC':>10}")
print("-" * 50)
for name, metrics in results.items():
    print(f"{name:>22} | {metrics['auc']:>10.4f} | {metrics['pr_auc']:>10.4f}")

lgb_model.save_model("lightgbm_credit_model.txt")
print("\nLightGBM saved as the primary model, based on measured lift over both baselines.")

Gotchas

⚠class_weight="balanced" (sklearn) and scale_pos_weight (LightGBM) implement the same underlying idea — penalizing misclassification of the minority class more heavily — but are configured differently per library; using the SAME conceptual approach across all three models, rather than a different imbalance-handling technique per model, is what keeps this comparison fair.
⚠Logistic regression requires scaled features (StandardScaler) to converge reliably and for its coefficients to be meaningfully comparable — tree-based models like Random Forest and LightGBM do not need this, since they split on raw feature value thresholds regardless of scale; applying scaling only where it's needed, not universally, is the correct practice.
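⚠early_stopping monitors validation AUC and halts training once it stops improving, preventing LightGBM from overfitting to the training set. num_boost_round=500 is a generous upper bound, not the number of rounds actually used, which is instead determined by lgb_model.best_iteration. Because the same validation split both selects the stopping round and reports the final score, LightGBM's validation metrics are slightly optimistic relative to the two baselines; a separate, untouched test split would remove that advantage.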
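The claim that class_weight="balanced" and scale_pos_weight express the same idea can be checked with a few lines of arithmetic. The sketch below uses hypothetical labels (920 repaid, 80 defaulted) and scikit-learn's documented "balanced" formula, n_samples / (n_classes * class_count), written out in plain NumPy. It shows that the relative weight of a default versus a repaid loan is the same under both settings.

```python
import numpy as np

# Hypothetical training labels: 920 repaid loans (0) and 80 defaults (1)
y_train = np.array([0] * 920 + [1] * 80)

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * count_per_class)
class_counts = np.bincount(y_train)
balanced_weights = len(y_train) / (len(class_counts) * class_counts)

# LightGBM's scale_pos_weight, computed as in the benchmark script
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

# How much more one default counts than one repaid loan under "balanced"
relative_positive_weight = balanced_weights[1] / balanced_weights[0]

print(f"balanced weights (class 0, class 1): {balanced_weights}")
print(f"relative weight of a default:        {relative_positive_weight:.2f}")
print(f"scale_pos_weight:                    {scale_pos_weight:.2f}")
```

The absolute weights differ (scikit-learn rescales both classes, LightGBM only the positive class), which can interact with regularization strength, so the two settings match in relative emphasis but are not guaranteed to be numerically identical in effect.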

Step 4 — SHAP Explainability and Cost-Based Thresholding

This step directly serves the regulatory requirement stated at the start of this project: SHAP values explain not just which features matter overall, but exactly why THIS specific applicant received THIS specific prediction, decomposing the prediction into each feature's individual contribution. Separately, the decision threshold is chosen using an actual cost matrix rather than the default 0.5 cutoff. Since a missed default costs far more than an unnecessarily rejected good applicant, the cost-minimizing threshold usually lies below 0.5, and this step computes it directly rather than assuming it.
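To make "decomposing the prediction" concrete: with TreeExplainer's default settings on a binary LightGBM model, the base value plus the sum of an applicant's SHAP values reproduces the model's raw log-odds output, and a sigmoid turns that into the predicted probability. The sketch below illustrates the arithmetic only. Every number in it is a hypothetical placeholder, not output from the real model, and the feature grouping is invented for illustration.

```python
import numpy as np

def sigmoid(x: float) -> float:
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values for one applicant, NOT taken from the trained model
base_value = -2.4   # average raw model output (log-odds) over the background data
contributions = {
    "debt_to_income_ratio": +0.9,   # pushes toward default
    "bureau_max_overdue": +0.6,     # pushes toward default
    "employed_days_ratio": -0.4,    # pushes toward repayment
    "all_other_features": +0.1,     # remaining features lumped together
}

# Additivity: base value + sum of contributions = raw log-odds prediction
raw_score = base_value + sum(contributions.values())
probability = sigmoid(raw_score)

print(f"Raw score (log-odds):          {raw_score:+.2f}")
print(f"Predicted default probability: {probability:.3f}")
```

Because the contributions add up in log-odds space, a SHAP value of +0.9 does not mean "adds 90 percentage points of risk"; its effect on the probability depends on where the other contributions have already placed the applicant.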

SHAP Explains One Specific Decision, Not Just Overall Importance

Global feature importance tells you what matters on average across all applicants. SHAP values for one applicant show exactly which of their specific attributes pushed the decision toward approval or rejection.

04_shap_and_thresholding.py

python

import shap
import lightgbm as lgb
import numpy as np
import pandas as pd

# Assume X_val and y_val are the validation split created in Step 3
model = lgb.Booster(model_file="lightgbm_credit_model.txt")
explainer = shap.TreeExplainer(model)

# ─── GLOBAL EXPLAINABILITY: WHAT MATTERS ON AVERAGE ────────────────────
shap_values = explainer.shap_values(X_val)
# Some shap versions return one array per class for binary LightGBM
# models; keep the positive (default) class in that case
if isinstance(shap_values, list):
    shap_values = shap_values[1]

mean_abs_shap = np.abs(shap_values).mean(axis=0)
feature_importance = pd.Series(mean_abs_shap, index=X_val.columns).sort_values(ascending=False)

print("=== TOP 10 FEATURES BY AVERAGE SHAP IMPORTANCE ===\n")
print(feature_importance.head(10))

# ─── LOCAL EXPLAINABILITY: WHY THIS ONE APPLICANT ──────────────────────
applicant_index = 42   # any specific row from the validation set
applicant_shap_values = shap_values[applicant_index]
applicant_features = X_val.iloc[applicant_index]

top_contributing_features = pd.Series(applicant_shap_values, index=X_val.columns) \
    .sort_values(key=abs, ascending=False).head(5)

print(f"\n=== WHY APPLICANT AT ROW {applicant_index} GOT THIS PREDICTION ===\n")
predicted_prob = model.predict(X_val.iloc[[applicant_index]])[0]
print(f"Predicted default probability: {predicted_prob:.3f}\n")
# Contributions are in the model's raw (log-odds) units, not probability points
print(f"{'Feature':>30} | {'Applicant value':>16} | {'SHAP contribution':>18}")
print("-" * 72)
for feature_name, shap_value in top_contributing_features.items():
    direction = "increases risk" if shap_value > 0 else "decreases risk"
    print(f"{feature_name:>30} | {applicant_features[feature_name]:>16.2f} | "
          f"{shap_value:>+18.4f} ({direction})")

# ─── COST-BASED THRESHOLD, NOT THE DEFAULT 0.5 ─────────────────────────
print("\n=== CHOOSING A THRESHOLD FROM A REAL COST MATRIX ===\n")

# Illustrative costs -- a real lender would use actual measured figures
COST_OF_MISSED_DEFAULT = 50000    # principal lost when a bad loan is approved
COST_OF_UNNECESSARY_REJECTION = 3000   # lost interest income from a good applicant rejected

thresholds_to_test = np.arange(0.1, 0.6, 0.05)
best_threshold, lowest_total_cost = None, float("inf")

lgb_probs = model.predict(X_val)

for threshold in thresholds_to_test:
    predictions = (lgb_probs >= threshold).astype(int)

    false_negatives = ((predictions == 0) & (y_val == 1)).sum()   # missed defaults: approved, but defaulted
    false_positives = ((predictions == 1) & (y_val == 0)).sum()   # unnecessary rejections: rejected, but would have repaid

    total_cost = (false_negatives * COST_OF_MISSED_DEFAULT) + (false_positives * COST_OF_UNNECESSARY_REJECTION)

    print(f"Threshold {threshold:.2f}: {false_negatives} missed defaults, "
          f"{false_positives} unnecessary rejections, total cost = ₹{total_cost:,}")

    if total_cost < lowest_total_cost:
        lowest_total_cost = total_cost
        best_threshold = threshold

print(f"\nOptimal threshold based on this cost matrix: {best_threshold:.2f}")
print("(Compare against the naive default of 0.5 -- the real optimal is")
print("typically LOWER, since missed defaults cost far more per instance")
print("than unnecessary rejections in this illustrative cost structure.)")

Gotchas

⚠shap.TreeExplainer is used specifically because it is exact and fast for tree-based models like LightGBM. A model-agnostic explainer (like KernelSHAP) would also work but far more slowly, so a tree-specific explainer is generally the better choice when the underlying model is tree-based. With its default settings, the contributions it reports for this binary model are expressed in log-odds, not in probability points.
⚠The cost figures used here are illustrative placeholders — a real deployment would use this lender's actual measured average loss per default and actual measured average profit per approved good loan, sourced from real historical financial data, not assumed round numbers.
⚠Optimizing the threshold purely for total cost can create a model that behaves very differently across different applicant subgroups — a genuinely responsible deployment would also check this threshold's fairness impact across protected demographic groups before finalizing it, a real regulatory and ethical consideration beyond this project's core scope but essential in practice.
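The grid search in the script has a closed-form counterpart that is useful as a sanity check. If the model's output p were a well-calibrated default probability, approving would cost p × (cost of a missed default) in expectation and rejecting would cost (1 − p) × (cost of an unnecessary rejection), so the two are equal at p = C_FP / (C_FP + C_FN). The sketch below computes that break-even point for the same illustrative cost figures; the function name is a hypothetical helper, not part of the project code.

```python
COST_OF_MISSED_DEFAULT = 50_000          # illustrative figure from the script above
COST_OF_UNNECESSARY_REJECTION = 3_000    # illustrative figure from the script above

def break_even_threshold(cost_false_negative: float, cost_false_positive: float) -> float:
    """Probability above which rejecting has lower expected cost than approving.

    Assumes p is a calibrated default probability:
      expected cost of approving = p * cost_false_negative
      expected cost of rejecting = (1 - p) * cost_false_positive
    Setting the two equal and solving for p gives the value returned here.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

threshold = break_even_threshold(COST_OF_MISSED_DEFAULT, COST_OF_UNNECESSARY_REJECTION)

expected_cost_approve = threshold * COST_OF_MISSED_DEFAULT
expected_cost_reject = (1 - threshold) * COST_OF_UNNECESSARY_REJECTION

print(f"Break-even threshold on calibrated probabilities: {threshold:.4f}")
print(f"Expected cost at break-even: approve={expected_cost_approve:.2f}, "
      f"reject={expected_cost_reject:.2f}")
```

This value applies only to calibrated probabilities. A model trained with scale_pos_weight produces scores that are shifted upward relative to true default rates, so the empirical grid search on the model's own outputs remains the practical method; if the scores were recalibrated first, the grid would need to extend well below 0.1 to contain this break-even point.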

Step 5 — Serving Predictions With Explanations Included

Following the same FastAPI load-once-at-startup pattern used throughout this course's deployment work, this endpoint returns not just a prediction, but the SHAP-based explanation alongside it — since, as established from the start, an unexplained decision is not a deployable one in this domain. The response also applies Step 4's cost-derived threshold rather than a default 0.5 cutoff.

05_serve_credit_model.py

python

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import lightgbm as lgb
import shap
import numpy as np

app = FastAPI(title="Credit Risk Scoring API")

model = None
explainer = None
FEATURE_ORDER: list[str] = []   # training column order, filled in at startup
DECISION_THRESHOLD = 0.28   # from Step 4's cost-based analysis, not the naive default of 0.5

@app.on_event("startup")
def load_model():
    global model, explainer, FEATURE_ORDER
    model = lgb.Booster(model_file="lightgbm_credit_model.txt")
    explainer = shap.TreeExplainer(model)
    # The saved LightGBM model file records the training feature names in order
    FEATURE_ORDER = model.feature_name()
    print("Credit risk model and SHAP explainer loaded.")

class LoanApplication(BaseModel):
    features: dict[str, float]   # feature_name -> value, matching training's exact feature set

@app.post("/score")
def score_application(application: LoanApplication):
    # Reject incomplete requests explicitly instead of failing with a KeyError
    missing = [name for name in FEATURE_ORDER if name not in application.features]
    if missing:
        raise HTTPException(status_code=422, detail=f"Missing features: {missing}")

    feature_array = np.array([[application.features[name] for name in FEATURE_ORDER]])

    default_probability = float(model.predict(feature_array)[0])
    decision = "reject" if default_probability >= DECISION_THRESHOLD else "approve"

    shap_values = explainer.shap_values(feature_array)
    if isinstance(shap_values, list):   # some shap versions return one array per class
        shap_values = shap_values[1]
    shap_values = shap_values[0]        # single-row request -> 1-D contributions
    top_factors = sorted(
        zip(FEATURE_ORDER, shap_values), key=lambda x: abs(x[1]), reverse=True,
    )[:5]

    return {
        "default_probability": round(default_probability, 4),
        "decision": decision,
        "threshold_used": DECISION_THRESHOLD,
        "top_contributing_factors": [
            {
                "feature": name,
                "contribution": round(float(value), 4),
                "direction": "increases_risk" if value > 0 else "decreases_risk",
            }
            for name, value in top_factors
        ],
    }

# Run with: uvicorn 05_serve_credit_model:app --host 0.0.0.0 --port 8000

Gotchas

⚠FEATURE_ORDER must exactly match the column order the model was trained on. It should come from the training run itself, either read back from the saved LightGBM model file (Booster.feature_name() returns the stored names in order) or saved once during training (e.g. as a JSON list) and loaded at startup alongside the model. It must never be reconstructed by guessing, since a mismatched order would silently produce meaningless predictions.
⚠Returning top_contributing_factors in every single response is a deliberate, non-negotiable design choice for this specific domain — unlike Project 1's crop disease API, which returns explanations only as an optional confidence flag, a credit decision without an explanation is not a legally complete response in most regulated jurisdictions.
⚠DECISION_THRESHOLD is hardcoded here as the value computed in Step 4, but a real production system would recompute this periodically as the cost matrix itself changes (interest rates, loss rates), rather than treating it as a permanent constant.
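The JSON-list approach mentioned in the first gotcha can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the file name feature_order.json, and the three example features are all hypothetical, and a temporary directory stands in for wherever the training artifacts are really stored.

```python
import json
import tempfile
from pathlib import Path

def save_feature_order(feature_names: list[str], path: Path) -> None:
    """Write the training column order next to the model artifact."""
    path.write_text(json.dumps(feature_names))

def load_feature_order(path: Path) -> list[str]:
    """Read the training column order back at API startup."""
    return json.loads(path.read_text())

def to_model_row(features: dict[str, float], feature_order: list[str]) -> list[float]:
    """Arrange a request's features in training order, failing loudly on gaps."""
    missing = [name for name in feature_order if name not in features]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return [features[name] for name in feature_order]

# At training time (hypothetical column list, e.g. list(X_train.columns))
training_columns = ["debt_to_income_ratio", "bureau_loan_count", "employed_days_ratio"]

with tempfile.TemporaryDirectory() as tmp_dir:
    order_path = Path(tmp_dir) / "feature_order.json"
    save_feature_order(training_columns, order_path)
    feature_order = load_feature_order(order_path)

# At request time the client may send features in any order
request_features = {
    "bureau_loan_count": 3.0,
    "employed_days_ratio": 0.2,
    "debt_to_income_ratio": 4.5,
}
row = to_model_row(request_features, feature_order)
print(row)
```

Keeping the order in a file that is versioned together with the model means a retrained model with added or reordered columns cannot be paired with a stale feature list by accident.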