Environmental Sound Classification for Smart Monitoring

Real urban audio, converted to spectrograms and classified with the exact CNN techniques from Module 16-19 — proving vision and audio share the same underlying tool.

5-7 hours end to end

·Deep Learning

Problem Statement

A smart building or public safety monitoring system has a live microphone feed and needs to recognize specific sound events in real time — glass breaking, a car horn, a drill, a siren — to trigger the correct alert without a human listening around the clock. This is a genuine, deployed use case in smart-city and security systems, and it raises a question worth answering directly rather than assuming: can the exact convolutional techniques built for images (Module 14-19) work on sound at all? This project proves the answer is yes, by converting raw audio into an image-like representation first, then reusing this course's existing CNN toolkit almost unchanged.

Dataset

UrbanSound8K

8,732 labeled short audio clips (up to 4 seconds each) of urban sound events across 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. This is a standard, widely used benchmark in real audio machine learning research, collected from real field recordings with genuine background noise, not studio-clean samples.

~5.6 GB, 8,732 audio clips, 10 classes

UrbanSound8K (Salamon, Jacoby, Bello, 2014), publicly available for research use

Architecture Decisions

The key architectural decision in this project is not a new model at all; it is a data representation choice. Raw audio is a 1-dimensional waveform, but converting it into a Mel-spectrogram, a 2-dimensional image showing how sound energy is distributed across frequency and time, turns an audio classification problem into an image classification problem. Once that conversion happens, this project reuses a small CNN built directly from Module 16-17's proven conv-pool-conv-pool pattern. UrbanSound8K's 8,732 clips are a modest enough dataset that a compact, purpose-built CNN trained from scratch is both sufficient and faster to iterate on than fine-tuning a large pretrained vision backbone. This is a deliberate contrast with Project 1's transfer-learning decision, made because the two datasets differ in size and because spectrograms differ in visual structure from natural photographs.

Built On

•Module 14 — The Convolution Operation, applied here to frequency-time patterns instead of pixels
•Module 16-17 — CNN architecture patterns (conv-pool blocks), reused directly for the spectrogram classifier
•Module 12 — Regularization, applied to prevent overfitting on this modest-sized dataset
•Module 33 — Quantization, applied to the trained model exactly as in Project 1
•Module 37 — FastAPI serving, extended here to accept audio file uploads instead of images

Step 1 — Exploring Real Audio Data

Audio data carries its own real, practical complications that a toy dataset would never surface: clips vary in duration (some under 1 second, most close to 4), sample rate can differ between recordings, and some clips are mono while others are stereo. Every one of these inconsistencies must be resolved before a single spectrogram gets computed, exactly the kind of unglamorous but essential data auditing step Project 1's Step 1 established as the correct starting point for any real project.

Real Audio Clips Are Not Uniform

Duration, sample rate, and channel count vary across real recordings. Every clip must be standardized to the same format before a spectrogram can be computed consistently.

01_explore_audio_data.py

python

import pandas as pd
import librosa

metadata = pd.read_csv("./UrbanSound8K/metadata/UrbanSound8K.csv")

print(f"Total clips: {len(metadata)}")
print(f"Classes: {sorted(metadata['class'].unique())}\n")

print("=== CLASS DISTRIBUTION ===")
print(metadata["class"].value_counts())

# Audit real inconsistencies across a sample of clips -- exactly the
# unglamorous but essential step before any spectrogram gets computed
sample_files = metadata.sample(10, random_state=42)
print("\n=== AUDITING RAW AUDIO FORMAT ACROSS A SAMPLE ===\n")
print(f"{'File':>20} | {'Duration (s)':>12} | {'Sample rate':>12} | {'Channels':>9}")
print("-" * 65)

for _, row in sample_files.iterrows():
    file_path = f"./UrbanSound8K/audio/fold{row['fold']}/{row['slice_file_name']}"
    # sr=None keeps the original sample rate; mono=False keeps the original channels
    waveform, sample_rate = librosa.load(file_path, sr=None, mono=False)
    duration = librosa.get_duration(y=waveform, sr=sample_rate)
    n_channels = 1 if waveform.ndim == 1 else waveform.shape[0]
    print(f"{row['slice_file_name'][:20]:>20} | {duration:>12.2f} | {sample_rate:>12} | {n_channels:>9}")

print("""
=== THE DECISION THIS FORCES ===

Every clip needs to be resampled to the SAME sample rate,
converted to mono, and either trimmed or padded to the SAME
duration before a spectrogram can be computed consistently across
the whole dataset -- this standardization is Step 2's first job,
applied identically to every clip as it is loaded.
""")

Gotchas

⚠librosa.load with sr=None preserves the file's original sample rate for auditing purposes — in Step 2's actual training pipeline, every file will be explicitly resampled to one fixed rate, since training on mixed sample rates would make spectrograms inconsistent in a way that has nothing to do with the actual sound being classified.
⚠Some UrbanSound8K classes (like gun_shot) have noticeably fewer clips than others (like dog_bark or children_playing) — a smaller-scale version of Project 1's exact class imbalance concern, meaning per-class evaluation matters here too, not just overall accuracy.
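
To make the first gotcha concrete, here is a minimal sketch of how the same 4-second clip would yield differently sized spectrograms at three common sample rates. It assumes Step 2's hop_length of 512 and a centered STFT, for which the frame count is 1 + n_samples // hop_length; the helper name n_time_frames is hypothetical.

```python
# Hypothetical helper: number of time frames a centered STFT produces
# for a clip of n_samples (assumption: frames = 1 + n_samples // hop_length).
def n_time_frames(n_samples: int, hop_length: int = 512) -> int:
    return 1 + n_samples // hop_length

CLIP_SECONDS = 4
for sample_rate in (22050, 44100, 48000):
    n_samples = sample_rate * CLIP_SECONDS
    print(f"{sample_rate:>6} Hz -> {n_samples:>7} samples -> "
          f"{n_time_frames(n_samples)} time frames")
```

Without resampling, the spectrogram width would depend on the recording device rather than on the sound, which is why Step 2 fixes one target rate before computing anything.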

Step 2 — Converting Audio to Spectrograms

A Mel-spectrogram represents a sound clip as a 2D image: one axis is time, the other is frequency (on the Mel scale, which approximates how humans perceive pitch differences), and the pixel intensity at each point shows how much energy exists at that frequency at that moment. This conversion is what makes Module 14's convolution operation directly applicable: a drill's characteristic high-frequency buzz and a dog bark's short, broadband burst look visually distinct as spectrogram patterns, and a CNN's learned filters (Module 14 Lesson 2) can detect these visual patterns exactly as they would detect an edge or a texture in a photograph.
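
As a rough illustration of what "on the Mel scale" means, the sketch below uses the widely used HTK-style conversion, m = 2595 * log10(1 + f / 700). This formula is an assumption about the exact Mel variant; libraries offer more than one, so check which one your spectrogram transform applies. The function names are hypothetical.

```python
import math

def hz_to_mel(frequency_hz: float) -> float:
    """HTK-style Mel conversion (one common variant, assumed here)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 1000 Hz step covers fewer and fewer mels as frequency rises,
# mirroring how pitch differences are harder to hear at high frequencies.
for low_hz in (0, 1000, 4000, 8000):
    step_in_mels = hz_to_mel(low_hz + 1000) - hz_to_mel(low_hz)
    print(f"{low_hz:>5} -> {low_hz + 1000:>5} Hz spans {step_in_mels:7.1f} mels")
```

Because the Mel bands are spaced evenly in mels, a 64-band spectrogram spends more of its rows on low frequencies than a linear-frequency spectrogram would.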

Audio Becomes an Image — Frequency vs Time

A Mel-spectrogram turns a 1-dimensional waveform into a 2-dimensional image. Different sound events produce visually distinct patterns, which is exactly what lets a CNN classify them the same way it classifies photographs.

02_audio_to_spectrogram.py

python

import torch
import torchaudio
import torchaudio.transforms as T
from torch.utils.data import Dataset
import pandas as pd

TARGET_SAMPLE_RATE = 22050
TARGET_DURATION_SECONDS = 4
TARGET_LENGTH = TARGET_SAMPLE_RATE * TARGET_DURATION_SECONDS

class UrbanSoundSpectrogramDataset(Dataset):
    """Converts each raw audio clip into a standardized Mel-spectrogram,
    resolving Step 1's exact inconsistencies (sample rate, duration,
    channel count) before the spectrogram is ever computed."""

    def __init__(self, metadata_df, audio_dir, transform_to_spectrogram=True):
        self.metadata = metadata_df.reset_index(drop=True)
        self.audio_dir = audio_dir
        self.mel_spectrogram = T.MelSpectrogram(
            sample_rate=TARGET_SAMPLE_RATE, n_mels=64, n_fft=1024, hop_length=512,
        )
        self.amplitude_to_db = T.AmplitudeToDB()
        self.transform_to_spectrogram = transform_to_spectrogram

    def __len__(self):
        return len(self.metadata)

    def _standardize_audio(self, waveform, original_sample_rate):
        # Convert to mono: average across channels if stereo
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)

        # Resample to the fixed target rate if it differs
        if original_sample_rate != TARGET_SAMPLE_RATE:
            resampler = T.Resample(original_sample_rate, TARGET_SAMPLE_RATE)
            waveform = resampler(waveform)

        # Pad short clips with silence, or trim long ones -- fixed length
        # is required for every spectrogram to have the same final shape
        current_length = waveform.shape[1]
        if current_length < TARGET_LENGTH:
            padding = TARGET_LENGTH - current_length
            waveform = torch.nn.functional.pad(waveform, (0, padding))
        else:
            waveform = waveform[:, :TARGET_LENGTH]

        return waveform

    def __getitem__(self, index):
        row = self.metadata.iloc[index]
        file_path = f"{self.audio_dir}/fold{row['fold']}/{row['slice_file_name']}"
        label = int(row["classID"])   # plain Python int, collated into a LongTensor

        waveform, original_sample_rate = torchaudio.load(file_path)
        waveform = self._standardize_audio(waveform, original_sample_rate)

        if self.transform_to_spectrogram:
            spectrogram = self.mel_spectrogram(waveform)
            spectrogram = self.amplitude_to_db(spectrogram)   # log scale, matching human loudness perception
            return spectrogram, label

        return waveform, label

metadata = pd.read_csv("./UrbanSound8K/metadata/UrbanSound8K.csv")
dataset = UrbanSoundSpectrogramDataset(metadata, "./UrbanSound8K/audio")

sample_spectrogram, sample_label = dataset[0]
print(f"Spectrogram shape: {tuple(sample_spectrogram.shape)}  (channels, n_mels, time_frames)")
print(f"Label: {sample_label}")
print("""
Every clip, regardless of its ORIGINAL duration or sample rate, now
produces a spectrogram of this exact same shape -- ready to be fed
into a CNN exactly as a batch of standardized images would be.
""")

Gotchas

⚠n_mels=64 controls the spectrogram's frequency resolution (its height as an image) — this is a genuine architectural choice, not an arbitrary default; more mel bands capture finer frequency detail at the cost of a larger input for the CNN to process.
⚠AmplitudeToDB converts the spectrogram to a logarithmic (decibel) scale specifically because human hearing perceives loudness logarithmically, not linearly — skipping this step would leave the spectrogram dominated by a few very loud moments, making quieter but still meaningful sound patterns much harder for a CNN to learn to detect.
⚠Padding short clips with silence rather than looping or stretching the audio is a deliberate choice — looping could create artificial, repeating patterns not present in the original real sound event, which the model might learn to rely on incorrectly.
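
The effect of the decibel conversion can be sketched with plain arithmetic. The example below assumes the standard power-to-decibel formula, 10 * log10(power), with a small floor to avoid taking the log of zero; the power values and the helper name power_to_db are hypothetical, chosen only to show the compression.

```python
import math

def power_to_db(power: float, amin: float = 1e-10) -> float:
    # Clamp to a small floor so silence (power 0) does not produce -inf
    return 10.0 * math.log10(max(power, amin))

# Hypothetical spectrogram power values for three moments in a clip
powers = {"quiet hum": 1e-4, "moderate": 1e-2, "loud burst": 1.0}
for name, power in powers.items():
    print(f"{name:>10}: power = {power:<7g} -> {power_to_db(power):6.1f} dB")
```

On the linear scale the loud burst is 10,000 times larger than the quiet hum; on the decibel scale the two are 40 dB apart, so quieter structure stays visible to the CNN instead of being dwarfed.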

Step 3 — Training a CNN on Spectrograms

With a modest dataset of under 9,000 clips, a compact CNN built directly from Module 16-17's conv-pool-conv-pool pattern, trained from scratch, is the right choice here — a deliberate contrast with Project 1's transfer learning decision. Spectrograms are visually quite different from natural photographs (structured frequency bands rather than photographic textures), so a large ImageNet-pretrained backbone's learned features are less directly transferable here than they were for real leaf photographs in Project 1, and a small dataset this size is enough to train a compact, purpose-built CNN well without overfitting, provided dropout (Module 12) is applied.

A Compact CNN, Built From Module 16-17's Proven Pattern

Two conv-pool blocks extract increasingly abstract patterns from the spectrogram, followed by a small dense classifier — the same architecture shape used throughout this course's CNN modules, applied here to audio.
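
To give "compact" a number, the sketch below counts the trainable parameters of the layers used in this step's SpectrogramCNN (two 3x3 convolutions with 16 and 32 filters, then a linear layer on 32 * 16 * 43 features). The helper names are hypothetical; the formulas are the usual weights-plus-biases counts.

```python
def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    # One kernel_size x kernel_size filter per (input, output) channel pair,
    # plus one bias per output channel
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def linear_params(in_features: int, out_features: int) -> int:
    # Full weight matrix plus one bias per output
    return in_features * out_features + out_features

conv1 = conv2d_params(1, 16, 3)
conv2 = conv2d_params(16, 32, 3)
classifier = linear_params(32 * 16 * 43, 10)
total = conv1 + conv2 + classifier

print(f"conv_block1: {conv1:>8,}")
print(f"conv_block2: {conv2:>8,}")
print(f"classifier:  {classifier:>8,}")
print(f"total:       {total:>8,}")
```

Almost all of the parameters sit in the final linear layer, which is one reason dropout is applied right before it.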

03_train_spectrogram_cnn.py

python

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset

# Assumes UrbanSoundSpectrogramDataset and metadata from Step 2 are already in scope

class SpectrogramCNN(nn.Module):
    """Module 16-17's exact conv-pool-conv-pool pattern, applied to
    spectrograms instead of photographs."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.conv_block1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.conv_block2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.dropout = nn.Dropout(0.3)   # Module 12's regularization, needed on this modest-sized dataset
        # 64 mel bands x 173 time frames (4 sec at this hop_length),
        # halved twice by pooling -> 16 x 43 per channel
        self.classifier = nn.Linear(32 * 16 * 43, n_classes)

    def forward(self, x):
        x = self.conv_block1(x)
        x = self.conv_block2(x)
        x = x.view(x.size(0), -1)   # flatten to (batch, 32 * 16 * 43)
        x = self.dropout(x)
        return self.classifier(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

full_dataset = UrbanSoundSpectrogramDataset(metadata, "./UrbanSound8K/audio")

# Split by UrbanSound8K's predefined folds rather than at random: clips cut
# from the same original recording share a fold, so a random split would leak
# closely related audio from training into validation and test
folds = full_dataset.metadata["fold"]
train_dataset = Subset(full_dataset, folds.index[folds <= 8].tolist())
val_dataset = Subset(full_dataset, folds.index[folds == 9].tolist())
test_dataset = Subset(full_dataset, folds.index[folds == 10].tolist())

BATCH_SIZE = 16   # a conservative batch size that fits comfortably in memory
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

model = SpectrogramCNN(n_classes=10).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

best_val_accuracy = 0.0
for epoch in range(15):
    model.train()
    epoch_loss = 0.0
    for spectrograms, labels in train_loader:
        spectrograms, labels = spectrograms.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(spectrograms), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for spectrograms, labels in val_loader:
            spectrograms, labels = spectrograms.to(device), labels.to(device)
            predictions = model(spectrograms).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    val_accuracy = correct / total

    print(f"Epoch {epoch+1}/15: train_loss = {epoch_loss/len(train_loader):.4f}, "
          f"val_accuracy = {val_accuracy:.2%}")

    # Keep only the checkpoint with the best validation accuracy so far
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        torch.save(model.state_dict(), "best_audio_model.pt")

print(f"\nBest validation accuracy: {best_val_accuracy:.2%}")

Gotchas

⚠The classifier layer's input size (32 * 16 * 43) depends exactly on the spectrogram's height and width after two rounds of 2x2 pooling — this must be recomputed by hand, following Module 14 Lesson 3's output-size formula, whenever n_mels, hop_length, or the number of pooling layers changes; a mismatch here raises a clear shape error rather than silently producing wrong results.
⚠Dropout at 0.3 is applied specifically because this dataset (under 9,000 clips) is small enough that a CNN can memorize training examples rather than learning generalizable spectrogram patterns — Project 1's much larger 54,000-image dataset relied on transfer learning and augmentation instead, a genuinely different regularization strategy suited to a genuinely different data scale.
⚠This model, being small and trained from scratch, is expected to train much faster than Project 1's transfer learning approach — typically under 30 minutes on CPU for this dataset size.
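
Rather than recomputing the classifier's input size by hand each time, the arithmetic can be sketched as a small helper. This assumes the standard output-size formula, floor((size + 2 * padding - kernel) / stride) + 1, and the block layout used in this step (a 3x3 convolution with padding 1 followed by a 2x2 max pool with stride 2); the function names are hypothetical.

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    # Standard output-size formula for convolution and pooling layers
    return (size + 2 * padding - kernel_size) // stride + 1

def flattened_features(n_mels: int, n_frames: int, out_channels: int = 32, n_blocks: int = 2) -> int:
    height, width = n_mels, n_frames
    for _ in range(n_blocks):
        # 3x3 convolution with padding=1 leaves height and width unchanged
        height = conv_output_size(height, 3, padding=1)
        width = conv_output_size(width, 3, padding=1)
        # 2x2 max pool with stride 2 halves each side, rounding down
        height = conv_output_size(height, 2, stride=2)
        width = conv_output_size(width, 2, stride=2)
    return out_channels * height * width

print(flattened_features(n_mels=64, n_frames=173))    # this lesson's settings
print(flattened_features(n_mels=128, n_frames=173))   # if n_mels were doubled
```

Note that the odd width of 173 is floored to 86 by the first pool and then to 43 by the second, which is where the 43 in 32 * 16 * 43 comes from.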

Step 4 — Evaluating Per-Class, Same Discipline as Project 1

Exactly as Project 1's Step 4 established, the held-out test set is evaluated with per-class precision and recall, not overall accuracy alone. This matters especially here: a gun_shot misclassified as a dog_bark and a drilling sound misclassified as a jackhammer are meaningfully different kinds of mistakes for a real safety monitoring system (the first misses a safety-critical event, the second merely swaps two construction noises), and only per-class evaluation reveals which specific confusions the model is actually making.

04_evaluate_audio_model.py

python

import torch
from sklearn.metrics import classification_report, confusion_matrix

# Assumes model, device, and test_loader from Step 3 are already in scope

CLASS_NAMES = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]

model.load_state_dict(torch.load("best_audio_model.pt", weights_only=True))
model.eval()

all_predictions = []
all_labels = []

with torch.no_grad():
    for spectrograms, labels in test_loader:
        spectrograms = spectrograms.to(device)
        predictions = model(spectrograms).argmax(dim=1).cpu()
        all_predictions.extend(predictions.numpy())
        all_labels.extend(labels.numpy())

print(classification_report(all_labels, all_predictions, target_names=CLASS_NAMES, zero_division=0))

# The confusion matrix reveals WHICH classes get confused with each
# other, not just how often each class is right or wrong overall --
# critical for a safety system where the TYPE of confusion matters
conf_matrix = confusion_matrix(all_labels, all_predictions)
print("\n=== TOP CONFUSIONS (where the model most often mistakes one class for another) ===\n")

# Collect every off-diagonal cell (true class i predicted as class j) that occurred
confusions = []
for i in range(len(CLASS_NAMES)):
    for j in range(len(CLASS_NAMES)):
        if i != j and conf_matrix[i][j] > 0:
            confusions.append((CLASS_NAMES[i], CLASS_NAMES[j], conf_matrix[i][j]))

confusions.sort(key=lambda x: x[2], reverse=True)
for true_class, predicted_class, count in confusions[:5]:
    print(f"  True: {true_class:>18} -> Predicted: {predicted_class:<18} ({count} times)")

print("""
For a safety monitoring system, a gun_shot confused with a dog_bark
or a siren confused with a car_horn are meaningfully different
failure types than a drilling/jackhammer mix-up -- this breakdown
is what a real deployment decision needs, not a single accuracy
percentage.
""")

Gotchas

⚠The confusion matrix loop above only reports confusions that actually occurred at least once — with 10 classes there are 90 possible off-diagonal confusion pairs, and most real models will only populate a handful of them meaningfully, which is itself informative about which sounds are acoustically similar.
⚠gun_shot is typically one of the rarer classes in UrbanSound8K — following Project 1's exact lesson, this is precisely the class deserving the most scrutiny in a safety-relevant deployment, not the one to deprioritize because it has fewer training examples.
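
A toy example shows why overall accuracy alone can hide exactly this problem. The labels below are hypothetical, not real model output: a classifier that never predicts the rare class still scores well overall. The helper name per_class_recall is also hypothetical.

```python
from collections import Counter

def per_class_recall(true_labels, predicted_labels):
    # Recall per class: correct predictions divided by true examples of that class
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical, deliberately imbalanced test set
true_labels = ["dog_bark"] * 95 + ["gun_shot"] * 5
predicted_labels = ["dog_bark"] * 100   # the rare class is never predicted

accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
recall = per_class_recall(true_labels, predicted_labels)

print(f"overall accuracy: {accuracy:.0%}")
for label, value in recall.items():
    print(f"recall[{label}] = {value:.0%}")
```

The single accuracy figure looks healthy while every gun_shot is missed, which is the failure a safety deployment can least afford.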

Step 5 — Quantizing and Serving Audio Predictions

Following the identical Module 33 and Module 37 patterns from Project 1: export to ONNX, quantize to INT8, verify per-class accuracy holds on the quantized version, then serve through FastAPI. The only genuinely new piece here is that the API endpoint accepts an audio file upload and must run the exact same standardization and spectrogram conversion from Step 2 before the model ever sees the input, since training-serving preprocessing consistency (flagged as a critical gotcha in Project 1) applies with equal force here.
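
The "verify per-class accuracy holds" step can be sketched as a comparison of per-class recall before and after quantization. Everything below is illustrative: the function names, the 0.02 tolerance, and the toy predictions are assumptions, and in practice the two prediction lists would come from running the float model and the quantized model over the same test set.

```python
def recall_by_class(true_labels, predicted_labels, n_classes):
    """Per-class recall; classes with no true examples map to None."""
    support = [0] * n_classes
    hits = [0] * n_classes
    for true, predicted in zip(true_labels, predicted_labels):
        support[true] += 1
        if true == predicted:
            hits[true] += 1
    return [hits[c] / support[c] if support[c] else None for c in range(n_classes)]

def recall_regressions(true_labels, float_preds, quantized_preds, class_names, tolerance=0.02):
    """List the classes whose recall dropped by more than `tolerance` after quantization."""
    before = recall_by_class(true_labels, float_preds, len(class_names))
    after = recall_by_class(true_labels, quantized_preds, len(class_names))
    return [
        (name, b, a)
        for name, b, a in zip(class_names, before, after)
        if b is not None and b - a > tolerance
    ]

# Toy data: three classes, four test clips each (hypothetical predictions)
class_names = ["car_horn", "gun_shot", "siren"]
true_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
float_preds = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
quantized_preds = [0, 0, 0, 0, 1, 1, 0, 0, 2, 2, 2, 2]

regressions = recall_regressions(true_labels, float_preds, quantized_preds, class_names)
for name, b, a in regressions:
    print(f"{name}: recall {b:.0%} -> {a:.0%} after quantization")
```

Checking each class separately matters because a quantized model can keep its overall accuracy while losing recall on one rare class.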

05_serve_audio_predictions.py

python

from fastapi import FastAPI, UploadFile, File
import torchaudio
import torch
import onnxruntime
import numpy as np
import io

app = FastAPI(title="Environmental Sound Classification API")

onnx_session = None
CLASS_NAMES = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]

TARGET_SAMPLE_RATE = 22050
TARGET_LENGTH = TARGET_SAMPLE_RATE * 4

@app.on_event("startup")
def load_model():
    global onnx_session
    onnx_session = onnxruntime.InferenceSession("audio_model_quantized.onnx")
    print("Audio model loaded, ready to classify.")

def preprocess_audio(audio_bytes: bytes) -> np.ndarray:
    waveform, original_sample_rate = torchaudio.load(io.BytesIO(audio_bytes))

    # EXACTLY Step 2's standardization -- mono, resampled, fixed length
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    if original_sample_rate != TARGET_SAMPLE_RATE:
        resampler = torchaudio.transforms.Resample(original_sample_rate, TARGET_SAMPLE_RATE)
        waveform = resampler(waveform)

    current_length = waveform.shape[1]
    if current_length < TARGET_LENGTH:
        waveform = torch.nn.functional.pad(waveform, (0, TARGET_LENGTH - current_length))
    else:
        waveform = waveform[:, :TARGET_LENGTH]

    # Same Mel-spectrogram parameters as Step 2's dataset class
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=TARGET_SAMPLE_RATE, n_mels=64, n_fft=1024, hop_length=512,
    )(waveform)
    spectrogram_db = torchaudio.transforms.AmplitudeToDB()(mel_spectrogram)

    return spectrogram_db.unsqueeze(0).numpy().astype(np.float32)   # add batch dimension

@app.post("/classify")
async def classify_sound(file: UploadFile = File(...)):
    audio_bytes = await file.read()
    input_array = preprocess_audio(audio_bytes)

    # Input and output names must match the names used when the model was exported to ONNX
    logits = onnx_session.run(["class_logits"], {"spectrogram": input_array})[0]

    # Numerically stable softmax over the class axis
    exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
    probabilities = exp_logits / exp_logits.sum(axis=1, keepdims=True)
    predicted_index = int(np.argmax(probabilities[0]))
    confidence = float(probabilities[0][predicted_index])

    return {
        "predicted_class": CLASS_NAMES[predicted_index],
        "confidence": round(confidence, 4),
        "flag_for_review": confidence < 0.6,
    }

# Run with: uvicorn 05_serve_audio_predictions:app --host 0.0.0.0 --port 8000

Gotchas

⚠This endpoint's preprocessing function is a near-exact copy of Step 2's UrbanSoundSpectrogramDataset._standardize_audio method and the spectrogram conversion — deliberately kept in lockstep, since any drift between training-time and serving-time audio processing would reproduce Project 1's exact preprocessing-mismatch warning, just for audio instead of images.
⚠A real production version of this pipeline would extract this shared preprocessing logic into one common module imported by both the training script and the serving script, rather than maintaining two separate copies as this lesson does for clarity — duplicated preprocessing code is a genuine, common source of exactly the drift this gotcha warns about.
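
One possible shape for that shared module is sketched below, using only the standard library so the idea is visible without the audio stack. The names AudioPreprocessConfig, pad_or_trim, and CONFIG are hypothetical; the values mirror the constants used in Step 2 and in the serving script.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AudioPreprocessConfig:
    """Single source of truth for every preprocessing parameter."""
    sample_rate: int = 22050
    clip_seconds: int = 4
    n_mels: int = 64
    n_fft: int = 1024
    hop_length: int = 512

    @property
    def target_length(self) -> int:
        # Number of samples every clip is padded or trimmed to
        return self.sample_rate * self.clip_seconds

def pad_or_trim(samples, target_length):
    """Zero-pad short clips at the end or cut long ones, as Step 2 does."""
    if len(samples) < target_length:
        return list(samples) + [0.0] * (target_length - len(samples))
    return list(samples[:target_length])

# Both the training script and the serving script would import this one object
CONFIG = AudioPreprocessConfig()

print(CONFIG.target_length)
print(pad_or_trim([0.5, -0.5], 4))
```

Because the dataclass is frozen, neither script can change a parameter locally by accident; a change has to be made in the shared module, where it affects training and serving together.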