Retail Demand Forecasting with Time-Aware Feature Engineering

Real daily sales across 1,115 stores, forecasted with XGBoost and engineered temporal features — proving gradient boosting with good features beats a naive sequence model here.

6-8 hours end to end

·Machine Learning

Problem Statement

A retail chain needs to forecast next week's daily sales for every store, to drive inventory ordering and staff scheduling decisions made a week in advance. The cost of error is asymmetric and real: underforecasting means empty shelves and lost sales, overforecasting means wasted inventory, spoiled perishables, and excess staffing cost. This project builds a forecasting system using XGBoost with engineered time-based features, deliberately choosing this approach over a recurrent neural network to demonstrate a genuine, common industry finding: for tabular time series at this scale, explicit feature engineering combined with gradient boosting routinely outperforms a much more expensive sequence model, because the temporal structure can be encoded directly as features rather than left for an RNN to discover implicitly from far less data than deep learning typically needs.

Dataset

Rossmann Store Sales

Real daily sales records for 1,115 Rossmann drug stores across roughly 2.5 years, including holidays, running promotions, competitor distance and opening dates, and genuine real-world data gaps (stores closed for refurbishment, missing competitor data for some stores). This is the actual dataset used in a real Kaggle forecasting competition, with all the messiness of genuine retail reporting.

~1.2 million daily store-level records, 1,115 stores, ~41 MBRossmann GmbH, released publicly for the Rossmann Store Sales Kaggle competition

Architecture Decisions

The central decision this project makes explicit, and measures rather than assumes, is feature engineering over model sophistication: lag features (sales 7 and 14 days ago), rolling averages (trailing 7-day and 30-day mean sales), and explicit calendar features (day of week, month, whether a holiday or promotion is active) are engineered directly, then fed into XGBoost — a model with no inherent sense of time at all. This works specifically because the engineered features carry the temporal information a sequence model would otherwise need to learn on its own, and XGBoost's proven strength (Project 1's benchmark) at exploiting exactly this kind of structured, tabular signal makes it the stronger choice here despite having no architectural awareness of sequence order. Time-based (not random) train/validation splitting is used throughout, since evaluating a forecasting model on randomly shuffled data would let it see future information during training, a genuine and common mistake this project deliberately avoids and explains.

Built On

•ML Module — Gradient Boosting (XGBoost), the primary model, benchmarked against a naive baseline here
•ML Module — Feature Engineering, the central technique this entire project depends on
•ML Module — Cross-Validation, extended here with a time-aware splitting strategy instead of random k-fold
•ML Module — Evaluation Metrics, using a business-weighted error metric instead of plain RMSE
•DL Module 21-25 (for comparison) — RNNs and sequence models, the alternative approach this project explicitly benchmarks against and explains why it is not the better choice here

Step 1 — Exploring Real Sales Data and Its Gaps

Retail sales data has real structure worth understanding before any model gets built: strong weekly seasonality (weekends differ sharply from weekdays), holiday effects that can spike or suppress sales depending on the specific holiday and store type, and genuine data gaps where stores were closed for refurbishment, which must be identified and handled rather than treated as ordinary zero-sales days.

Real Sales Data Has Structure a Model Must Be Given, Not Left to Discover Alone

Strong weekly and holiday patterns are visible directly in the raw data. Feature engineering in Step 2 encodes these patterns explicitly, rather than expecting XGBoost to infer temporal structure it has no inherent mechanism for.

01_explore_sales_data.py

python

1import pandas as pd
2import numpy as np
3
4sales = pd.read_csv("./rossmann/train.csv", parse_dates=["Date"])
5stores = pd.read_csv("./rossmann/store.csv")
6
7print(f"Total records: {len(sales):,}")
8print(f"Date range: {sales['Date'].min()} to {sales['Date'].max()}")
9print(f"Unique stores: {sales['Store'].nunique()}\n")
10
11# ─── DISTINGUISHING REAL CLOSURES FROM LEGITIMATE ZERO SALES ───────────
12zero_sales_open = sales[(sales["Sales"] == 0) & (sales["Open"] == 1)]
13closed_days = sales[sales["Open"] == 0]
14
15print(f"Days marked OPEN with zero sales (likely a data quality issue): {len(zero_sales_open):,}")
16print(f"Days marked CLOSED (legitimate, e.g. refurbishment or Sundays): {len(closed_days):,}")
17print(f"Percentage of all records that are closed days: {len(closed_days)/len(sales):.1%}\n")
18
19# ─── WEEKLY SEASONALITY, MEASURED DIRECTLY ─────────────────────────────
20sales["DayOfWeek_name"] = sales["Date"].dt.day_name()
21avg_sales_by_day = sales[sales["Open"] == 1].groupby("DayOfWeek_name")["Sales"].mean()
22avg_sales_by_day = avg_sales_by_day.reindex(
23    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
24)
25print("=== AVERAGE SALES BY DAY OF WEEK (open days only) ===")
26print(avg_sales_by_day.round(0))
27
28# ─── HOLIDAY EFFECT, MEASURED DIRECTLY ─────────────────────────────────
29avg_sales_holiday = sales[sales["StateHoliday"] != "0"]["Sales"].mean()
30avg_sales_normal = sales[sales["StateHoliday"] == "0"]["Sales"].mean()
31print(f"\nAverage sales on state holidays: {avg_sales_holiday:.0f}")
32print(f"Average sales on normal days:    {avg_sales_normal:.0f}")
33print("""
34This confirms real, measurable weekly and holiday patterns exist in
35the raw data -- Step 2's feature engineering exists specifically to
36give XGBoost direct access to these patterns as input columns,
37rather than hoping the model somehow infers them from a raw date
38column alone, which XGBoost has no special ability to do.
39""")

Gotchas

⚠Days marked Open=1 with zero sales are a genuine data quality issue, distinct from days marked Open=0 (legitimate closures) — these should be investigated and likely excluded from training, since they represent a data recording error, not a real business pattern to learn from.
⚠A raw date column alone provides essentially no useful signal to XGBoost — the model has no built-in understanding that a date represents a day of the week or proximity to a holiday; every one of these patterns must be extracted into explicit, separate feature columns for the model to use at all.
⚠Store closures for refurbishment can last days or weeks and should be excluded from training entirely for that store during that period, rather than being averaged into rolling features in Step 2, which would otherwise corrupt lag and rolling-average calculations with artificial zeros.

Step 2 — Engineering Time-Aware Features

This step creates the exact features that let a fundamentally non-sequential model like XGBoost handle a genuinely sequential problem well: lag features (what were sales exactly 7 and 14 days ago, capturing weekly rhythm), rolling averages (trailing 7-day and 30-day means, capturing recent trend), and explicit calendar encodings. Every lag and rolling feature is computed strictly using only past data relative to each row, since using any future information here would make the resulting model look better than it actually could ever perform in real, live use.

Lag and Rolling Features — Only Looking Backward

A lag-7 feature and a rolling-7-day average are both computed using only sales from before the current day being predicted. Any leakage of future data into these features would make the model artificially, and falsely, look accurate.

02_temporal_feature_engineering.py

python

1import pandas as pd
2import numpy as np
3
4def create_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
5    df = df.copy()
6    df["day_of_week"] = df["Date"].dt.dayofweek
7    df["month"] = df["Date"].dt.month
8    df["day_of_month"] = df["Date"].dt.day
9    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
10    df["is_promo"] = df["Promo"]
11    df["is_state_holiday"] = (df["StateHoliday"] != "0").astype(int)
12    return df
13
14def create_lag_and_rolling_features(df: pd.DataFrame) -> pd.DataFrame:
15    """CRITICAL: sort by store and date first, and use only PAST data
16    for every lag and rolling feature -- never future data relative to
17    the row being predicted."""
18    df = df.sort_values(["Store", "Date"]).copy()
19
20    # groupby("Store") ensures lag/rolling features never leak across
21    # DIFFERENT stores' sales histories into each other
22    df["sales_lag_7"] = df.groupby("Store")["Sales"].shift(7)
23    df["sales_lag_14"] = df.groupby("Store")["Sales"].shift(14)
24
25    # .shift(1) before the rolling window ensures TODAY's sales are
26    # never included in its own rolling average -- a subtle, common bug
27    df["sales_rolling_7day_avg"] = (
28        df.groupby("Store")["Sales"].shift(1).rolling(window=7).mean().reset_index(level=0, drop=True)
29    )
30    df["sales_rolling_30day_avg"] = (
31        df.groupby("Store")["Sales"].shift(1).rolling(window=30).mean().reset_index(level=0, drop=True)
32    )
33
34    return df
35
36sales = pd.read_csv("./rossmann/train.csv", parse_dates=["Date"])
37
38# Exclude closed days entirely, following Step 1's finding -- these are
39# not legitimate zero-sales observations to train or forecast on
40sales = sales[sales["Open"] == 1].copy()
41
42sales = create_calendar_features(sales)
43sales = create_lag_and_rolling_features(sales)
44
45# Rows at the very start of each store's history will have NaN lag/rolling
46# features (no 7 days of prior history yet) -- these must be dropped,
47# not filled, since filling them would fabricate history that never existed
48sales = sales.dropna(subset=["sales_lag_7", "sales_lag_14", "sales_rolling_7day_avg"])
49
50print(f"Final training rows after dropping insufficient-history rows: {len(sales):,}")
51print(f"\nSample engineered features:")
52print(sales[["Store", "Date", "Sales", "sales_lag_7", "sales_rolling_7day_avg", "is_weekend"]].head())

Gotchas

⚠groupby("Store") before shifting or computing rolling windows is essential — without it, a lag feature for the first few days of Store B's history would incorrectly pull in Store A's most recent sales values, since the data is sorted by store then date but pandas' shift/rolling would otherwise operate across the whole dataframe as if it were one continuous sequence.
⚠The extra .shift(1) before .rolling(window=7) is a subtle but critical detail — without it, the rolling average for a given day would include that SAME day's own sales value, meaning the model would partially be given the answer it's supposed to predict, a genuine, easy-to-miss form of data leakage.
⚠Dropping rows with insufficient history (rather than filling them with 0 or an average) is the correct choice here — filling would fabricate a plausible-looking but entirely fictional sales history for the first two weeks of every store's data, which would corrupt training on exactly those rows.

Step 3 — Time-Aware Validation and Training XGBoost

The validation split here must respect time order: the model trains on earlier dates and validates on strictly later dates, never a random shuffle. This directly mirrors how the model will actually be used in production — forecasting the future from the past — and a random split would let the model validate on dates that come chronologically before some of its training data, silently inflating the reported accuracy in a way that would not hold up in real deployment.

03_time_aware_training.py

python

1import xgboost as xgb
2import numpy as np
3import pandas as pd
4
5feature_columns = [
6    "day_of_week", "month", "day_of_month", "is_weekend", "is_promo",
7    "is_state_holiday", "sales_lag_7", "sales_lag_14",
8    "sales_rolling_7day_avg", "sales_rolling_30day_avg",
9]
10
11# TIME-AWARE SPLIT: train on the first 80% of dates chronologically,
12# validate on the LAST 20% -- never a random shuffle, which would
13# let the model "see" the future during training
14split_date = sales["Date"].quantile(0.8)   # the date marking the 80th percentile chronologically
15train_data = sales[sales["Date"] <= split_date]
16val_data = sales[sales["Date"] > split_date]
17
18print(f"Training on dates up to {split_date.date()}: {len(train_data):,} rows")
19print(f"Validating on dates after {split_date.date()}: {len(val_data):,} rows")
20
21X_train, y_train = train_data[feature_columns], train_data["Sales"]
22X_val, y_val = val_data[feature_columns], val_data["Sales"]
23
24xgb_model = xgb.XGBRegressor(
25    n_estimators=500,
26    max_depth=6,
27    learning_rate=0.05,
28    subsample=0.8,
29    colsample_bytree=0.8,
30    early_stopping_rounds=30,
31    eval_metric="rmse",
32)
33
34xgb_model.fit(
35    X_train, y_train,
36    eval_set=[(X_val, y_val)],
37    verbose=False,
38)
39
40predictions = xgb_model.predict(X_val)
41
42# ─── A NAIVE BASELINE, FOR AN HONEST COMPARISON ────────────────────────
43# "Predict last week's same-day sales" -- the simplest reasonable
44# forecast a store manager might use without any model at all
45naive_predictions = val_data["sales_lag_7"].values
46
47def weighted_mape(actual, predicted, weights):
48    """Weighted Mean Absolute Percentage Error -- a 10% error on a
49    high-volume store matters more to the business than the same
50    percentage error on a low-volume store, which plain MAPE ignores."""
51    errors = np.abs((actual - predicted) / actual)
52    return np.average(errors, weights=weights) * 100
53
54xgb_wmape = weighted_mape(y_val.values, predictions, weights=y_val.values)
55naive_wmape = weighted_mape(y_val.values, naive_predictions, weights=y_val.values)
56
57print(f"\n=== HONEST COMPARISON AGAINST A NAIVE BASELINE ===\n")
58print(f"Naive baseline (same day last week) weighted MAPE: {naive_wmape:.2f}%")
59print(f"XGBoost weighted MAPE:                              {xgb_wmape:.2f}%")
60print(f"Improvement over naive baseline: {naive_wmape - xgb_wmape:.2f} percentage points")
61
62xgb_model.save_model("xgboost_demand_model.json")

Gotchas

⚠A random train/validation split, rather than this lesson's chronological split, would let the model validate on some dates that come BEFORE some of its own training dates — meaning the model could effectively be evaluated on data it has already indirectly seen the future consequences of, producing an optimistic, misleading validation score that would not hold up in real forecasting use.
⚠Always compare against a genuinely naive baseline (like 'same day last week', used here) before trusting a sophisticated model's reported error — if XGBoost cannot meaningfully beat this trivial baseline, the added complexity is not earning its keep, exactly the same standard of proof this course applies to every architectural or modeling choice.
⚠Weighted MAPE, weighting each store's error by its actual sales volume, is used here specifically because a retailer cares more about accurately forecasting a busy flagship store than an equally-percentage-wrong forecast on a tiny store — plain unweighted MAPE would treat both errors as equally important, which does not reflect the real business cost.

Step 4 — Serving Weekly Forecasts

The final endpoint accepts a store ID and the most recent sales history, computes the same lag and rolling features used during training, and returns a forecast — following the same load-once, preprocessing-must-match-training discipline established throughout every project in this course.

04_serve_forecast.py

python

1from fastapi import FastAPI
2from pydantic import BaseModel
3import xgboost as xgb
4import numpy as np
5
6app = FastAPI(title="Retail Demand Forecasting API")
7
8model = None
9
10@app.on_event("startup")
11def load_model():
12    global model
13    model = xgb.XGBRegressor()
14    model.load_model("xgboost_demand_model.json")
15    print("Demand forecasting model loaded.")
16
17class ForecastRequest(BaseModel):
18    day_of_week: int
19    month: int
20    day_of_month: int
21    is_weekend: int
22    is_promo: int
23    is_state_holiday: int
24    sales_last_7_days: list[float]    # most recent 7 daily sales figures, oldest to newest
25    sales_last_30_days: list[float]   # most recent 30 daily sales figures, oldest to newest
26
27@app.post("/forecast")
28def forecast_sales(request: ForecastRequest):
29    # Compute the SAME lag and rolling features used during training,
30    # from the recent sales history the caller provides
31    sales_lag_7 = request.sales_last_7_days[0]    # 7 days before the day being forecasted
32    sales_lag_14 = request.sales_last_30_days[-14] if len(request.sales_last_30_days) >= 14 else request.sales_last_30_days[0]
33    sales_rolling_7day_avg = float(np.mean(request.sales_last_7_days))
34    sales_rolling_30day_avg = float(np.mean(request.sales_last_30_days))
35
36    feature_vector = np.array([[
37        request.day_of_week, request.month, request.day_of_month,
38        request.is_weekend, request.is_promo, request.is_state_holiday,
39        sales_lag_7, sales_lag_14, sales_rolling_7day_avg, sales_rolling_30day_avg,
40    ]])
41
42    predicted_sales = float(model.predict(feature_vector)[0])
43
44    return {
45        "forecasted_sales": round(predicted_sales, 2),
46        "based_on_rolling_7day_avg": round(sales_rolling_7day_avg, 2),
47        "based_on_rolling_30day_avg": round(sales_rolling_30day_avg, 2),
48    }
49
50# Run with: uvicorn 04_serve_forecast:app --host 0.0.0.0 --port 8000

Gotchas

⚠The feature computation inside this endpoint must exactly mirror Step 2's training-time feature engineering, including the same column order the model was trained on — any drift here would silently produce meaningless forecasts, exactly the same preprocessing-consistency warning flagged in every prior project's serving step.
⚠This endpoint requires the CALLER to supply recent sales history rather than looking it up from a database directly — a real production version would instead query a live sales database internally for the requesting store's recent history, removing the burden from the API consumer and reducing the risk of the caller supplying stale or incorrect history.
⚠A single day's forecast, as built here, is the foundation for a full week's forecast — a real deployment would call this repeatedly, feeding each day's OWN forecasted output back in as part of the next day's rolling average input, a genuine multi-step forecasting extension worth building on top of this single-step foundation.