🌿
vision · intermediate

Crop Disease Detection from Leaf Images

Transfer learning on a real 54,000-image agricultural dataset, deployed as a lightweight, quantized API a farmer's phone can actually call.

6-8 hours end to end · Deep Learning

Problem Statement

A farmer photographs a diseased leaf using a basic smartphone camera, often in a rural area with unreliable network access. They need an immediate answer: which crop is this, and which disease (if any) does it show? A wrong answer, or a slow one, has real cost: a delayed or incorrect diagnosis can mean a lost harvest. This project builds the kind of system FarmWise would need in production: a model small enough to run cheaply, accurate enough to trust, and served through a real API that returns a decision in milliseconds, rather than a research notebook that reports a single accuracy number and stops there.

Dataset

PlantVillage

54,305 labeled color images of healthy and diseased crop leaves across 14 crop species and 38 total classes (crop-disease combinations, including healthy leaves as their own class per crop). This is a widely used benchmark in agricultural ML research, not a synthetic or toy dataset. Images were collected under controlled but realistic conditions, with real variation in lighting, leaf angle, and disease severity.

Architecture Decisions

Training a CNN from scratch to an accuracy worth deploying would need far more data and compute than 54,000 images and a 16GB local machine can provide, and it would reinvent work that a pretrained model already contains. The industry-standard decision here is transfer learning (Module 19): start from MobileNetV2, pretrained on ImageNet's 1.4 million general images, and fine-tune only its final layers on PlantVillage's 38 leaf-disease classes. MobileNetV2 is chosen over a larger network like ResNet-50 because its depthwise-separable convolutions make it dramatically smaller and faster at inference time. That is the deciding factor for a model meant to eventually run on modest hardware, and it leaves a model small enough that Module 33's quantization can shrink it further with minimal accuracy cost. This mirrors the real industry pattern: almost no production computer vision team trains a CNN from random weights. They fine-tune an existing backbone and spend their engineering effort on data quality, evaluation, and serving instead.
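
To make the "dramatically smaller" claim concrete, here is a minimal arithmetic sketch comparing weight counts for a standard convolution and a depthwise-separable one. The layer shape (128 channels in and out, 3x3 kernel) is a hypothetical example chosen for illustration, not a specific MobileNetV2 layer, and biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, for illustration only
c_in, c_out, k = 128, 128, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {standard:,} weights")
print(f"depthwise-separable: {separable:,} weights")
print(f"reduction: {standard / separable:.1f}x")
```

For this shape the separable version needs roughly an eighth of the weights. The saving grows with the number of output channels, which is why the technique matters most in the wide later layers of a network.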

Built On

  • Module 16-18: CNN architecture and ResNet, the conceptual foundation for understanding any pretrained backbone
  • Module 19: Transfer Learning, the exact technique this project's entire training approach is built on
  • Module 12: Regularization (dropout, data augmentation) to prevent overfitting on a still-imbalanced real dataset
  • Module 33: Quantization, applied to the final trained model for lightweight deployment
  • Module 37: ONNX export and FastAPI serving, the exact deployment pattern this project ends with

Step 1: Exploring the Real Data Before Touching a Model

Before any model gets built, the data itself needs to be understood, as in any real ML project. This is the same discipline Module 12 applied to detecting overfitting before assuming a fix was needed. PlantVillage's 38 classes are not evenly distributed: some crop-disease combinations have several thousand images, others have only a few hundred. Class imbalance of this kind is a common real-world problem that a toy dataset rarely surfaces, and it determines which evaluation metric matters later. Raw accuracy can look deceptively high on an imbalanced dataset simply by favoring the majority classes, since a model that always predicts the most common class can still score well overall while being useless on every other class. This is why the project tracks per-class precision and recall from the start, not just one number at the end.
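
The point about accuracy can be checked with a few lines of plain Python. The sketch below uses a hypothetical three-class label set (a 90/7/3 split, invented for illustration) and a "model" that always predicts the most common class.

```python
from collections import Counter

# Hypothetical test labels for a toy 3-class problem: 90 / 7 / 3 split
true_labels = ["common"] * 90 + ["uncommon"] * 7 + ["rare"] * 3

# A "model" that ignores its input and always predicts the most frequent class
majority_class = Counter(true_labels).most_common(1)[0][0]
predictions = [majority_class] * len(true_labels)

accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / len(true_labels)


def recall_for(cls: str) -> float:
    """Fraction of the true examples of `cls` that were predicted as `cls`."""
    total = sum(1 for t in true_labels if t == cls)
    hits = sum(1 for p, t in zip(predictions, true_labels) if t == cls and p == cls)
    return hits / total


per_class_recall = {cls: recall_for(cls) for cls in sorted(set(true_labels))}
macro_recall = sum(per_class_recall.values()) / len(per_class_recall)

print(f"Overall accuracy: {accuracy:.0%}")
for cls, recall in per_class_recall.items():
    print(f"  recall[{cls}] = {recall:.2f}")
print(f"Macro-averaged recall: {macro_recall:.2f}")
```

Overall accuracy comes out at 90% even though recall is zero on two of the three classes. The macro-averaged recall, which weights every class equally, exposes the failure immediately.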

Class Imbalance Across PlantVillage's 38 Categories

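In the counts shown here, the largest class has roughly 18 times more images than the smallest; the script in this step reports the exact ratio for your copy of the dataset. A model trained naively on this distribution will be pulled toward performing well on common classes at the expense of rare ones.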

[Figure: Image Count Per Class (Real, Uneven Distribution). Tomato: healthy ~5,300; Corn: rust ~2,400; Grape: blight ~1,100; Potato: late blight ~450; Cherry: powdery mildew ~290. 18x imbalance ratio. Accuracy alone will hide failure on rare classes.]
01_explore_dataset.py
```python
import os

DATA_DIR = "./plantvillage/color"   # standard PlantVillage folder structure: one folder per class

# Count images per class -- the FIRST thing any real project should do,
# before writing a single line of model code
class_counts = {}
for class_name in sorted(os.listdir(DATA_DIR)):
    class_path = os.path.join(DATA_DIR, class_name)
    if os.path.isdir(class_path):
        class_counts[class_name] = len(os.listdir(class_path))

total_images = sum(class_counts.values())
print(f"Total images: {total_images:,}")
print(f"Total classes: {len(class_counts)}\n")

sorted_counts = sorted(class_counts.items(), key=lambda x: x[1])
print("=== SMALLEST 5 CLASSES (highest risk of poor recall) ===")
for name, count in sorted_counts[:5]:
    print(f"  {name}: {count} images")

print("\n=== LARGEST 5 CLASSES ===")
for name, count in sorted_counts[-5:]:
    print(f"  {name}: {count} images")

imbalance_ratio = sorted_counts[-1][1] / sorted_counts[0][1]
print(f"\nImbalance ratio (largest class / smallest class): {imbalance_ratio:.1f}x")

# Quantify how many classes fall meaningfully below the AVERAGE class size --
# these are the specific classes Step 4's evaluation will need to watch closely
average_size = total_images / len(class_counts)
below_average = [name for name, count in class_counts.items() if count < average_size * 0.5]
print(f"Average images per class: {average_size:.0f}")
print(f"Classes with fewer than half the average count: {len(below_average)}")
for name in below_average:
    print(f"  - {name}: {class_counts[name]} images")

print("""
This imbalance directly determines two decisions made in the next
steps: (1) accuracy alone will be a misleading metric, so per-class
precision/recall is tracked from the first evaluation onward
(Step 4), and (2) data augmentation (Step 2) is applied to help the
model generalize better on these specific below-average classes,
which have the least room to learn from repeated exposure to the
same limited set of images.
""")
```

Gotchas

  • ⚠ Never skip this step and jump straight to training. An imbalance ratio of even 5-10x, common in real datasets, can silently produce a model that looks accurate overall but fails badly on the crop-disease combinations that matter most, often the rarer, more severe diseases.
  • ⚠ PlantVillage's images were collected under relatively controlled conditions (consistent backgrounds, decent lighting). A real deployed system will see messier phone photos with cluttered backgrounds and inconsistent lighting. That domain gap is worth flagging honestly rather than assuming lab-quality data generalizes perfectly to field conditions.
  • ⚠ The below_average list computed here is not a list of classes to discard. Every one of them still needs to be classified correctly in production. It is a watchlist for Step 4's evaluation, telling us in advance which classes deserve the closest scrutiny.
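
Since the watchlist is meant to be consulted again in Step 4, one option is to write it to disk next to the raw counts. The sketch below is one possible way to do that; the file name `class_watchlist.json`, the helper names, and the example counts are all hypothetical, not part of the original scripts.

```python
import json
import os
import tempfile


def build_watchlist(class_counts: dict, fraction_of_average: float = 0.5) -> list:
    """Classes whose image count is below a fraction of the average class size."""
    average = sum(class_counts.values()) / len(class_counts)
    return sorted(
        name for name, count in class_counts.items()
        if count < average * fraction_of_average
    )


def save_watchlist(class_counts: dict, path: str) -> dict:
    """Persist the counts and the derived watchlist so a later step can reload them."""
    payload = {"class_counts": class_counts, "watchlist": build_watchlist(class_counts)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, sort_keys=True)
    return payload


# Hypothetical counts, for illustration only
example_counts = {"class_a": 5000, "class_b": 2000, "class_c": 400, "class_d": 300}
example_path = os.path.join(tempfile.gettempdir(), "class_watchlist.json")
saved = save_watchlist(example_counts, example_path)

with open(example_path, encoding="utf-8") as f:
    loaded = json.load(f)
print("Watchlist:", loaded["watchlist"])
```

Saving the list means the later comparison against the worst-performing classes uses exactly what was measured before training, rather than a recomputed or remembered version.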

Step 2: Building a Real Data Pipeline With Augmentation

The 54,000 images take roughly 2.7GB on disk, but that is their compressed size. Decoded into 224x224 float32 tensors (about 0.6 MB each), the full dataset would need over 30 GB, far more than a 16GB machine can hold. The correct approach, as Module 20's batching pattern established for sequence data, is a DataLoader that streams images from disk in batches and never holds the full dataset in memory. Data augmentation is applied during training to address two problems measured in Step 1. It artificially expands the effective diversity of the below-average classes identified there, and it makes the model robust to the kind of real-world variation (rotation, lighting, slight blur) a farmer's phone photo will contain, which PlantVillage's relatively clean, controlled images alone would not teach it. Augmentation is applied ONLY to the training split. Validation and test data must reflect the real, unaltered distribution the model will face, following the same train-versus-evaluation separation Module 12 established for dropout and regularization generally.

The Data Pipeline β€” Streamed, Split, and Augmented Correctly

Images are split into train, validation, and test sets before any augmentation is applied. Only the training split receives augmentation; validation and test stay untouched, since they must represent real, unaltered input.

[Figure: Split First, Then Augment Only the Training Side. 54,305 images on disk, 70/15/15 split. Train (70%): flip, rotate, color jitter. Validation (15%): no augmentation applied. Test (15%): untouched, held out until Step 4, no augmentation. Augmenting validation/test data would make evaluation optimistic and misleading.]
02_data_pipeline.py
```python
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

IMAGE_SIZE = 224   # MobileNetV2's expected input size

# Training transforms: augmentation applied ONLY to training data,
# never to validation/test data, which must reflect real, unaltered
# input the model will actually see at inference time
train_transforms = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=20),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # simulates real lighting variation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),   # ImageNet stats, required to match MobileNetV2's pretraining
])

eval_transforms = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

full_dataset = datasets.ImageFolder("./plantvillage/color", transform=train_transforms)

# 70/15/15 train/validation/test split -- validation for tuning
# decisions during training, test held out completely until Step 4's
# final evaluation, never touched before then
n_total = len(full_dataset)
n_train = int(0.7 * n_total)
n_val = int(0.15 * n_total)
n_test = n_total - n_train - n_val

train_dataset, val_dataset, test_dataset = random_split(
    full_dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),
)

# random_split shares the SAME underlying dataset object across all three
# splits -- so validation and test must get their OWN dataset instance with
# eval_transforms, rather than overriding a shared .dataset.transform, which
# would incorrectly affect the training split too
val_dataset.dataset = datasets.ImageFolder("./plantvillage/color", transform=eval_transforms)
test_dataset.dataset = datasets.ImageFolder("./plantvillage/color", transform=eval_transforms)

BATCH_SIZE = 32   # chosen to comfortably fit in memory on a 16GB machine at 224x224 resolution

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)

print(f"Train: {len(train_dataset):,} images | Val: {len(val_dataset):,} | Test: {len(test_dataset):,}")
print(f"Classes: {len(full_dataset.classes)}")
print(f"Batches per training epoch: {len(train_loader)}")
```

Gotchas

  • ⚠ The code above avoids a real, easy-to-make mistake. random_split's three subsets all reference the SAME underlying ImageFolder object by default, so simply setting val_dataset.dataset.transform would silently change the training split's transform too. Creating separate ImageFolder instances for validation and test, as shown above, avoids this trap.
  • ⚠ The ImageNet normalization statistics (mean/std values) are not arbitrary. They must match what MobileNetV2 was originally pretrained with, since Module 19's transfer learning assumes the pretrained weights expect inputs in this specific statistical range. Using different normalization would silently degrade the usefulness of the pretrained features.
  • ⚠ num_workers=2 uses background processes to load images while the model trains on the previous batch, overlapping disk I/O with computation. Setting this too high on a machine with limited CPU cores can slow things down rather than help, so it is worth tuning per machine rather than treating as a fixed constant.
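
One further consideration given Step 1's imbalance: random_split draws indices uniformly, so it does not guarantee that the rarest classes are proportionally represented in the validation and test splits. A stratified split is one alternative. The sketch below is not what the pipeline above does; it is a plain-Python illustration that splits each class separately, with a hypothetical helper name and toy labels.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split indices into train/val/test so each class keeps roughly the same proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(fractions[0] * len(indices))
        n_val = int(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])   # remainder goes to test
    return train, val, test


# Toy labels with a strong imbalance (1000 / 100 / 20), for illustration only
labels = [0] * 1000 + [1] * 100 + [2] * 20
train_idx, val_idx, test_idx = stratified_split(labels)

for name, split in (("train", train_idx), ("val", val_idx), ("test", test_idx)):
    rare = sum(1 for i in split if labels[i] == 2)
    print(f"{name}: {len(split)} samples, {rare} from the rarest class")
```

With an ImageFolder dataset, the per-image labels are available as its `targets` attribute, and the resulting index lists could be wrapped with `torch.utils.data.Subset` in place of random_split.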

Step 4: Evaluating Honestly, Including Where It Fails

The held-out test set, never touched until this exact moment, is what determines whether this model is genuinely ready to discuss deploying. Given Step 1's measured class imbalance, overall accuracy alone is reported here only as one number among several β€” per-class precision, recall, and F1 are what actually reveal whether the model is reliable across all 38 classes or only on the well-represented ones. This step specifically checks whether the below-average classes flagged in Step 1 turned out to be exactly where the model performs worst, closing the loop between a decision made from raw data statistics and a measured outcome from a trained model, the same evidence-then-conclusion discipline this course has applied since Module 9.

Accuracy Alone Hides What Per-Class Evaluation Reveals

A single overall accuracy number can look strong while specific classes, especially the rare ones flagged in Step 1, perform far worse. Per-class metrics are what a real deployment decision actually needs.
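
To show what the per-class numbers measure, here is a minimal from-scratch sketch of precision, recall, and F1 per class. It is a simplified stand-in for scikit-learn's classification_report, using hypothetical labels for a two-class toy case.

```python
def per_class_metrics(true_labels, predicted_labels):
    """Precision, recall, and F1 for every class, computed from raw label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)   # correctly predicted as cls
        fp = sum(1 for t, p in pairs if t != cls and p == cls)   # wrongly predicted as cls
        fn = sum(1 for t, p in pairs if t == cls and p != cls)   # cls examples that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical toy labels: 8 healthy leaves, 2 blighted ones, one blight missed
y_true = ["healthy"] * 8 + ["blight"] * 2
y_pred = ["healthy"] * 8 + ["healthy", "blight"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
results = per_class_metrics(y_true, y_pred)

print(f"Overall accuracy: {accuracy:.0%}")
for cls, m in results.items():
    print(f"  {cls}: precision={m['precision']:.2f} recall={m['recall']:.2f} f1={m['f1']:.2f}")
```

Accuracy is 90% here, yet recall on the rare class is only 0.50: half of the diseased leaves were labeled healthy, which is the failure a farmer would care about most.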

[Figure: One Number vs the Full Picture. Overall accuracy: 94% (looks strong at a glance, hides which classes are weak). Per-class F1, worst 3: Cherry powdery mildew 0.61; Potato late blight 0.68; Grape black rot 0.72 (matches Step 1's below-average list). The rarest classes from Step 1 are exactly where this trained model is weakest: confirmed, not assumed.]
04_evaluate_model.py
```python
import torch
from sklearn.metrics import classification_report
import numpy as np

# model, device, test_loader, and full_dataset are assumed to be defined
# by the earlier steps of this project
model.load_state_dict(torch.load("best_model.pt", weights_only=True))
model.eval()

all_predictions = []
all_labels = []

with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        predictions = model(images).argmax(dim=1).cpu()
        all_predictions.extend(predictions.numpy())
        all_labels.extend(labels.numpy())

class_names = full_dataset.classes
# Passing labels explicitly keeps the report aligned with class_names even if
# some class happens to be absent from both the test labels and the predictions
label_ids = list(range(len(class_names)))

print("=== FULL PER-CLASS EVALUATION ON THE HELD-OUT TEST SET ===\n")
print(classification_report(all_labels, all_predictions, labels=label_ids,
                            target_names=class_names, zero_division=0))

# Identify the worst-performing classes specifically, not just the average
report_dict = classification_report(
    all_labels, all_predictions, labels=label_ids, target_names=class_names,
    output_dict=True, zero_division=0,
)
per_class_f1 = {name: report_dict[name]["f1-score"] for name in class_names}
worst_5 = sorted(per_class_f1.items(), key=lambda x: x[1])[:5]

print("=== 5 WORST-PERFORMING CLASSES (where this model needs the most caution) ===")
for name, f1 in worst_5:
    print(f"  {name}: F1 = {f1:.3f}")

# Directly check: do the worst classes here match Step 1's below-average list?
print("""
=== CLOSING THE LOOP WITH STEP 1 ===

Compare the 5 classes above against Step 1's below_average list,
computed purely from raw image counts before any model existed.
If they substantially overlap, this confirms the class imbalance
measured at the very start of this project directly predicted
where the trained model would struggle -- turning a raw data
statistic into a validated, measured outcome, not just a guess.
""")

overall_accuracy = (np.array(all_predictions) == np.array(all_labels)).mean()
print(f"For reference, overall accuracy alone: {overall_accuracy:.2%}")
print("Notice how much less this single number reveals compared to the per-class breakdown above.")
```

Gotchas

  • ⚠ zero_division=0 is set explicitly because a class that receives no predictions has an undefined precision (0 divided by 0). scikit-learn's default would still report 0 but emit a warning for each such case. Stating the intended value explicitly keeps the evaluation output clean and the behavior deliberate, a small but real detail production evaluation code needs to get right.
  • ⚠ The 5 worst-performing classes often include the classes flagged as below-average in Step 1. When they do, that confirms with measured evidence that the imbalance identified at the very start of this project predicted where the trained model would be weakest.
  • ⚠ This evaluation was run exactly once, on data the model and every training decision never saw. Repeatedly checking test set performance and adjusting the model based on it would leak information and produce an overly optimistic final number, defeating the purpose of a genuinely held-out test set.
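
The comparison between the worst-performing classes and Step 1's watchlist can also be computed rather than eyeballed. A minimal sketch follows; the helper name and the class names in the example are hypothetical placeholders, not measured results.

```python
def watchlist_overlap(worst_classes, watchlist):
    """Summarize how many of the worst-performing classes were flagged in advance."""
    worst = set(worst_classes)
    watch = set(watchlist)
    shared = sorted(worst & watch)
    return {
        "shared": shared,
        "worst_not_on_watchlist": sorted(worst - watch),
        "fraction_of_worst_on_watchlist": len(shared) / len(worst) if worst else 0.0,
    }


# Hypothetical inputs, for illustration only
example_worst = ["cherry_powdery_mildew", "potato_late_blight", "grape_black_rot",
                 "corn_rust", "apple_scab"]
example_watchlist = ["cherry_powdery_mildew", "potato_late_blight", "peach_healthy"]

summary = watchlist_overlap(example_worst, example_watchlist)
print(f"Flagged in advance: {summary['shared']}")
print(f"Weak but not flagged: {summary['worst_not_on_watchlist']}")
print(f"Share of worst classes on the watchlist: {summary['fraction_of_worst_on_watchlist']:.0%}")
```

Classes that are weak but were not on the watchlist are worth a separate look: their problem is probably something other than image count, such as visual similarity to another disease.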

Step 5: Quantizing and Exporting for Real Deployment

A model intended to run on modest hardware, exactly the kind of constraint a real farmer-facing product has, benefits directly from Module 33's quantization technique. The trained model is exported to ONNX using Module 37 Lesson 2's exact verified export pattern, then quantized to INT8 precision, shrinking its size substantially. Following Module 37's standing rule of never trusting an export or optimization without direct verification, this step re-runs Step 4's exact per-class evaluation on the quantized model specifically, confirming the size reduction did not come at the cost of exactly the weak classes already identified as needing the most caution.
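
The arithmetic behind INT8 quantization can be sketched in a few lines. This is a simplified symmetric, per-tensor scheme written for illustration, with made-up weight values. ONNX Runtime's actual implementation differs in detail (for example, it also handles zero points and operator-specific cases).

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weight values, for illustration only
weights = [0.82, -0.41, 0.05, -1.27, 0.33]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"scale = {scale:.4f}")
print(f"int8 values = {q_weights}")
print(f"max round-trip error = {max_error:.6f}")
print(f"storage: {len(weights) * 4} bytes as float32 vs {len(weights)} bytes as int8")
```

Each weight drops from 4 bytes to 1, which is where the roughly 4x size reduction comes from. The cost is a rounding error of at most half the scale per weight, which is why the accuracy check that follows is necessary rather than optional.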

Quantization β€” Measured Size Reduction, Measured Accuracy Cost

The full-precision ONNX export and its INT8-quantized version are compared directly on both file size and per-class accuracy, following the same verify-before-trusting discipline used throughout this course.

[Figure: Full-Precision vs Quantized (Both Measured, Not Assumed). crop_disease_model.onnx: ~13.5 MB, full 32-bit precision, baseline test accuracy. ..._quantized.onnx: ~3.6 MB, INT8, roughly 4x smaller, re-run Step 4's evaluation here too. Deploy only after confirming per-class accuracy holds, not just overall size and speed.]
05_quantize_and_export.py
```python
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType
import onnxruntime
import numpy as np
from sklearn.metrics import classification_report
import os

# model, test_loader, all_labels, class_names, and worst_5 are assumed
# to carry over from Step 4
model.eval()
sample_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model.cpu(), sample_input, "crop_disease_model.onnx",
    input_names=["image"], output_names=["class_logits"],
    dynamic_axes={"image": {0: "batch_size"}, "class_logits": {0: "batch_size"}},
)
print("Exported full-precision model to crop_disease_model.onnx")

# Quantize to INT8, following Module 33's exact technique, applied here
# to a real ONNX file rather than a hand-built weight tensor.
# Worth checking on your setup: some ONNX Runtime builds reportedly cannot
# run dynamically quantized Conv layers with signed INT8 weights; if loading
# the quantized model fails, QuantType.QUInt8 is a commonly suggested alternative.
quantize_dynamic(
    model_input="crop_disease_model.onnx",
    model_output="crop_disease_model_quantized.onnx",
    weight_type=QuantType.QInt8,
)
print("Quantized model saved to crop_disease_model_quantized.onnx")

original_size = os.path.getsize("crop_disease_model.onnx") / (1024 * 1024)
quantized_size = os.path.getsize("crop_disease_model_quantized.onnx") / (1024 * 1024)
print(f"\nOriginal size: {original_size:.1f} MB")
print(f"Quantized size: {quantized_size:.1f} MB")
print(f"Size reduction: {(1 - quantized_size/original_size):.0%}")

# ─── VERIFYING THE QUANTIZED MODEL WITH STEP 4's EXACT EVALUATION ──────
print("\n=== RE-RUNNING STEP 4's FULL EVALUATION ON THE QUANTIZED MODEL ===\n")

quantized_session = onnxruntime.InferenceSession("crop_disease_model_quantized.onnx")

# test_loader uses shuffle=False, so predictions come out in the same order
# as Step 4's all_labels and can be compared against it directly
quantized_predictions = []
for images, _ in test_loader:
    logits = quantized_session.run(["class_logits"], {"image": images.numpy()})[0]
    predictions = np.argmax(logits, axis=1)
    quantized_predictions.extend(predictions.tolist())

label_ids = list(range(len(class_names)))
print(classification_report(all_labels, quantized_predictions, labels=label_ids,
                            target_names=class_names, zero_division=0))

quantized_report = classification_report(
    all_labels, quantized_predictions, labels=label_ids, target_names=class_names,
    output_dict=True, zero_division=0,
)
print("=== COMPARING THE SAME 5 WORST-PERFORMING CLASSES FROM STEP 4 ===\n")
for name, original_f1 in worst_5:
    quantized_f1 = quantized_report[name]["f1-score"]
    change = quantized_f1 - original_f1
    print(f"  {name}: original F1 = {original_f1:.3f} -> quantized F1 = {quantized_f1:.3f} "
          f"({change:+.3f})")

print("""
Only after confirming these specific weak classes did not degrade
meaningfully is this quantized model actually ready to serve --
a smaller, faster model that quietly got worse on exactly the
classes needing the most caution would not be a good tradeoff.
""")
```

Gotchas

  • ⚠ Dynamic quantization stores weights as INT8 ahead of time and computes the quantization parameters for activations on the fly at runtime, a reasonable starting point for CPU deployment. Static quantization, which ONNX Runtime's documentation generally recommends for CNNs, requires a calibration dataset and additional care. It is beyond this project's scope but worth knowing about for further optimization.
  • ⚠ Re-checking the exact same worst_5 classes from Step 4, rather than computing a fresh worst-5 list from the quantized model, is deliberate. It measures what happened to the SPECIFIC classes already flagged as needing caution, rather than potentially masking a real regression behind a newly different set of weak classes.
  • ⚠ The size reduction and accuracy cost should always be measured together, never assumed independently. A smaller model that has quietly lost meaningful accuracy on the weak classes identified in Step 4 is not a good tradeoff for a system where a wrong diagnosis has real cost.
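
One way to turn "did not degrade meaningfully" into an explicit check is a small gate that fails when any watched class loses more than a chosen amount of F1. The sketch below assumes a tolerance of 0.02, which is an arbitrary illustrative value, and uses hypothetical F1 scores rather than measured ones.

```python
def find_regressions(original_f1: dict, quantized_f1: dict, max_drop: float = 0.02) -> dict:
    """Return the classes whose F1 fell by more than max_drop after quantization."""
    regressions = {}
    for name, before in original_f1.items():
        after = quantized_f1[name]
        if before - after > max_drop:
            regressions[name] = round(after - before, 4)
    return regressions


# Hypothetical F1 scores for three watched classes, for illustration only
f1_before = {"class_a": 0.61, "class_b": 0.68, "class_c": 0.72}
f1_after = {"class_a": 0.60, "class_b": 0.63, "class_c": 0.72}

regressions = find_regressions(f1_before, f1_after)
if regressions:
    print("Do not deploy yet. Regressed classes:", regressions)
else:
    print("No watched class regressed beyond the tolerance.")
```

The right tolerance is a product decision. A stricter value makes sense for classes where a missed diagnosis is costly.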

Step 6: Serving Predictions Through a Real API

Following Module 37 Lesson 3's exact load-once-at-startup pattern, the quantized ONNX model is wrapped in a FastAPI server accepting an uploaded image and returning the predicted crop-disease class with a confidence score, ready to be called from a mobile app or any other real client. The confidence-based review flag introduced here is a direct, practical consequence of Step 4's honest evaluation: since specific classes are already known to be weaker, a low-confidence prediction is exactly the signal that should route a diagnosis to human review rather than fully automated action, closing the loop between measured model weaknesses and a real product safeguard.

End-to-End Deployment Pipeline

A farmer's photo travels through preprocessing that must exactly match training-time preprocessing, through the quantized model, to a final decision that includes a built-in safeguard for low-confidence predictions.

[Figure: From Phone Photo to Decision. Uploaded photo, then resize + normalize (SAME as training, Step 2), then quantized model (loaded once), then class + confidence. Confidence < 0.6? Flag for human review (directly from Step 4's known weak classes). Preprocessing mismatch between training and serving is a common, hard-to-diagnose production bug.]
06_serve_predictions.py
```python
from fastapi import FastAPI, UploadFile, File
from PIL import Image
import numpy as np
import onnxruntime
import io

app = FastAPI(title="Crop Disease Detection API")

onnx_session = None
CLASS_NAMES: list[str] = []   # populated at startup, must exactly match training's class ordering

@app.on_event("startup")
def load_model():
    global onnx_session, CLASS_NAMES
    onnx_session = onnxruntime.InferenceSession("crop_disease_model_quantized.onnx")
    # class_names.txt is written once during training (from full_dataset.classes)
    # and shipped alongside the model file, so serving never has to guess the order
    with open("class_names.txt") as f:
        CLASS_NAMES = [line.strip() for line in f]
    print(f"Model loaded, {len(CLASS_NAMES)} classes ready.")

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_image(image_bytes: bytes) -> np.ndarray:
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    # Bilinear resampling is requested explicitly: it is torchvision Resize's default,
    # while Pillow's own resize default is bicubic, which would not match Step 2
    image = image.resize((224, 224), Image.BILINEAR)
    array = np.array(image).astype(np.float32) / 255.0
    array = (array - IMAGENET_MEAN) / IMAGENET_STD   # SAME normalization as training, Step 2
    array = array.transpose(2, 0, 1)   # HWC to CHW
    return np.expand_dims(array, axis=0).astype(np.float32)

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    image_bytes = await file.read()
    input_array = preprocess_image(image_bytes)

    logits = onnx_session.run(["class_logits"], {"image": input_array})[0]   # shape (1, num_classes)
    # numerically stable softmax over the class axis, Module 6 Lesson 4's convention
    exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
    probabilities = exp_logits / exp_logits.sum(axis=1, keepdims=True)
    predicted_index = int(np.argmax(probabilities[0]))
    confidence = float(probabilities[0][predicted_index])

    return {
        "predicted_class": CLASS_NAMES[predicted_index],
        "confidence": round(confidence, 4),
        "flag_for_review": confidence < 0.6,   # low-confidence predictions flagged, per Step 4's honest evaluation
    }

# Run with: uvicorn 06_serve_predictions:app --host 0.0.0.0 --port 8000
```
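
The startup hook above reads class_names.txt, but the file has to be written at training time. A minimal sketch of that write step and a matching read is below. The helper names are hypothetical, the example class names are illustrative placeholders, and a temporary directory stands in for wherever the model artifacts are actually stored.

```python
import os
import tempfile


def write_class_names(class_names, path):
    """Write one class name per line, preserving the training-time order."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(class_names) + "\n")


def read_class_names(path):
    """Read the class names back in the same order, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Illustrative placeholder names; in training this would be full_dataset.classes
training_classes = ["Apple___healthy", "Corn___rust", "Tomato___healthy"]
names_path = os.path.join(tempfile.gettempdir(), "class_names.txt")

write_class_names(training_classes, names_path)
serving_classes = read_class_names(names_path)

# Index i at serving time must mean the same class as index i at training time
assert serving_classes == training_classes
print(f"{len(serving_classes)} class names round-tripped in order")
```

Running a round-trip check like this once at export time is cheap insurance against the silent index mismatch described in the gotchas.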

Gotchas

  • ⚠ class_names.txt must be written out once during training (from full_dataset.classes, which ImageFolder sorts alphabetically) and shipped alongside the model file. Reconstructing this order any other way at serving time risks a silent, hard-to-detect mismatch where the model produces a valid-looking but wrong class name for every prediction.
  • ⚠ The preprocessing function must match Step 2's evaluation-time transforms exactly (same resize, same normalization). Any inconsistency between training-time and serving-time preprocessing is one of the most common, hard-to-diagnose sources of a model performing worse in production than its evaluation numbers suggested.
  • ⚠ flag_for_review at a 0.6 confidence threshold is a starting point, not a universal rule. The right threshold should be chosen based on the real cost of a wrong diagnosis versus the cost of an unnecessary human review, a genuine product decision beyond pure model accuracy. It should be tuned on the validation set's confidence distribution rather than picked arbitrarily, so that the test set stays untouched as Step 4 requires.
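
Choosing the review threshold can be framed as a small sweep over held-out predictions (ideally from the validation split, so the final test number stays untouched). The sketch below uses hypothetical confidences and correctness flags; the function name and the candidate thresholds are illustrative choices.

```python
def sweep_thresholds(confidences, correct, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """For each threshold, report how much traffic is flagged and how accurate the rest is."""
    rows = []
    total = len(confidences)
    for threshold in thresholds:
        # Predictions at or above the threshold are accepted automatically
        accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
        flagged_fraction = 1 - len(accepted) / total
        accepted_accuracy = sum(accepted) / len(accepted) if accepted else None
        rows.append({
            "threshold": threshold,
            "flagged_fraction": flagged_fraction,
            "accepted_accuracy": accepted_accuracy,
        })
    return rows


# Hypothetical held-out predictions: top-class confidence and whether it was right
val_confidences = [0.95, 0.91, 0.88, 0.74, 0.66, 0.58, 0.52, 0.45]
val_correct = [True, True, True, True, False, True, False, False]

rows = sweep_thresholds(val_confidences, val_correct)
for row in rows:
    accuracy = row["accepted_accuracy"]
    accuracy_text = f"{accuracy:.0%}" if accuracy is not None else "n/a"
    print(f"threshold {row['threshold']:.1f}: "
          f"{row['flagged_fraction']:.0%} flagged, accepted accuracy {accuracy_text}")
```

Raising the threshold sends more photos to human review but makes the automatically accepted answers more reliable. The table makes that tradeoff visible so the product decision can be made with numbers.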