About the Audio Classifier

How this deep learning model understands environmental sounds


How It Works

This system uses a convolutional neural network with a ResNet-inspired (Residual Network) architecture to analyze environmental sounds. The model was trained on the 2,000 labeled audio clips of the ESC-50 dataset to recognize 50 different categories of sounds.

Model Architecture

ResNet-inspired CNN

  • Input: 128-band mel spectrograms
  • Layers:
    1. Initial conv (7x7, stride 2) + maxpool
    2. 4 residual blocks (64→128→256→512 channels)
    3. Global average pooling + dropout
  • Output: 50-class probabilities
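
The sketch below is a minimal PyTorch rendition of an architecture like the one listed above; the block counts, channel widths, and names (`ResidualBlock`, `AudioResNet`) follow the bullets but are assumptions, not the deployed model.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection (identity or 1x1 projection)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Project the input when its shape no longer matches the block output.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

class AudioResNet(nn.Module):
    """ResNet-style CNN over (batch, 1, n_mels, time) mel spectrograms."""
    def __init__(self, num_classes=50):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        self.layer1 = ResidualBlock(64, 64)
        self.layer2 = ResidualBlock(64, 128, stride=2)
        self.layer3 = ResidualBlock(128, 256, stride=2)
        self.layer4 = ResidualBlock(256, 512, stride=2)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.pool(x).flatten(1)
        return self.fc(self.dropout(x))
```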

Trained on the ESC-50 dataset with label smoothing and mixup augmentation.
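
Label smoothing and mixup are typically combined in a training step roughly as follows; the `alpha` and `smoothing` values here are illustrative assumptions, not the values used for this model.

```python
import torch
import torch.nn.functional as F

def mixup_step(model, specs, labels, alpha=0.2, smoothing=0.1):
    """One training step with mixup and label smoothing (hyperparameters are illustrative)."""
    # Mixup: blend each spectrogram with a randomly permuted partner.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(specs.size(0))
    mixed = lam * specs + (1 - lam) * specs[perm]

    logits = model(mixed)
    # Label-smoothed cross-entropy, weighted between the two mixed targets.
    loss = lam * F.cross_entropy(logits, labels, label_smoothing=smoothing) \
         + (1 - lam) * F.cross_entropy(logits, labels[perm], label_smoothing=smoothing)
    return loss
```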

Technical Stack

  • Model: PyTorch CNN with residual blocks
  • Backend: Modal for serverless GPU inference
  • Frontend: Next.js with TypeScript and Tailwind CSS
  • Audio Processing: TorchAudio mel spectrograms (see the sketch below)

The entire pipeline, from audio upload to visualization, runs across the browser and the cloud.
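
For the audio-processing step, a clip can be turned into the 128-band log-mel input with TorchAudio along these lines; the `n_fft`, `hop_length`, and sample-rate values are assumptions.

```python
import torch
import torchaudio

def to_log_mel(path: str, sample_rate: int = 44100, n_mels: int = 128) -> torch.Tensor:
    """Load a clip and convert it to a log-scaled 128-band mel spectrogram."""
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)            # mix down to mono
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=n_mels
    )(waveform)
    log_mel = torchaudio.transforms.AmplitudeToDB()(mel)      # compress dynamic range
    return log_mel.unsqueeze(0)                               # (batch, 1, n_mels, time)
```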

Understanding the Visualizations

Input Spectrogram

A mel spectrogram showing frequency (vertical axis) over time (horizontal axis). The color intensity represents amplitude (louder sounds are brighter). This is the initial audio representation the model processes.
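
Assuming the displayed values are min-max scaled into the shared -1 to 1 color range described at the end of this page, the preparation step might look like:

```python
import torch

def scale_for_display(spec: torch.Tensor) -> torch.Tensor:
    """Min-max scale a log-mel spectrogram into [-1, 1] for the shared color scale."""
    lo, hi = spec.min(), spec.max()
    return 2 * (spec - lo) / (hi - lo + 1e-8) - 1
```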

Top Predictions

The model's top 3 classifications with confidence percentages. Each prediction includes an emoji representing the sound category. The highest confidence result is highlighted with a primary badge.
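
One way to derive such a top-3 list from the model's raw logits; the `class_names` list here is a hypothetical input holding the 50 ESC-50 category names.

```python
import torch
import torch.nn.functional as F

def top_predictions(logits: torch.Tensor, class_names: list[str], k: int = 3):
    """Convert raw logits into the k most likely classes with confidence percentages."""
    probs = F.softmax(logits.squeeze(0), dim=-1)
    values, indices = probs.topk(k)
    return [(class_names[idx], round(p * 100, 1))
            for p, idx in zip(values.tolist(), indices.tolist())]
```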

Convolutional Layers

Shows the main processing layers (conv1, layer1-layer4) and their internal blocks. Each layer transforms the input, with early layers detecting simple patterns and deeper layers identifying complex sound structures.
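
Intermediate feature maps like these can be captured with PyTorch forward hooks; the layer names below match the model sketch above rather than the deployed code.

```python
import torch

def capture_feature_maps(model, spec):
    """Run one forward pass and record the output of each main layer via forward hooks."""
    feature_maps, handles = {}, []

    def save(name):
        def hook(module, inputs, output):
            feature_maps[name] = output.detach()
        return hook

    for name in ["conv1", "layer1", "layer2", "layer3", "layer4"]:
        handles.append(getattr(model, name).register_forward_hook(save(name)))

    with torch.no_grad():
        model(spec)
    for h in handles:
        h.remove()
    return feature_maps
```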

Audio Waveform

A downsampled version of your original audio waveform showing amplitude variations over time. The duration and sample rate are displayed to provide context about the audio characteristics.
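
A simple way to downsample the waveform for display is to keep one peak value per chunk of samples; the target point count here is an assumption.

```python
import torch

def downsample_waveform(waveform: torch.Tensor, num_points: int = 1000) -> list[float]:
    """Reduce a mono waveform to about num_points peak values for plotting."""
    samples = waveform.flatten()
    chunk = max(1, samples.numel() // num_points)
    trimmed = samples[: chunk * (samples.numel() // chunk)]
    # Keep the maximum absolute amplitude of each chunk so peaks stay visible.
    peaks = trimmed.view(-1, chunk).abs().max(dim=1).values
    return peaks.tolist()
```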

All visualizations include a color scale showing the value range from -1 to 1.