About the Audio Classifier
How this deep learning model understands environmental sounds
How It Works
This system uses a ResNet-inspired (Residual Network) convolutional neural network to analyze environmental sounds. The model was trained on thousands of labeled audio samples to recognize 50 different categories of sound.
Model Architecture
ResNet-inspired CNN
- Input: 128-band mel spectrograms
- Layers:
  - Initial conv (7x7, stride 2) + maxpool
  - 4 residual blocks (64→128→256→512 channels)
  - Global average pooling + dropout
- Output: 50-class probabilities
Trained on the ESC-50 dataset with label smoothing and mixup augmentation.
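As a concrete reference, here is a minimal PyTorch sketch of an architecture matching the description above. The exact block implementation, dropout rate, and other hyperparameters are assumptions, not the trained model's actual configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection (basic ResNet block)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Project the input when the spatial size or channel count changes.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

class AudioResNet(nn.Module):
    """ResNet-style CNN over 128-band mel spectrograms, 50 output classes."""
    def __init__(self, num_classes=50):
        super().__init__()
        # Initial 7x7 conv (stride 2) + max pool, as described above.
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # Four residual blocks: 64 -> 128 -> 256 -> 512 channels.
        self.layer1 = ResidualBlock(64, 64)
        self.layer2 = ResidualBlock(64, 128, stride=2)
        self.layer3 = ResidualBlock(128, 256, stride=2)
        self.layer4 = ResidualBlock(256, 512, stride=2)
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.dropout = nn.Dropout(0.5)        # dropout rate is an assumption
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                     # x: (batch, 1, 128 mels, time)
        x = self.conv1(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.pool(x).flatten(1)
        return self.fc(self.dropout(x))       # raw logits over 50 classes
```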
Technical Stack
- Model: PyTorch CNN with residual blocks
- Backend: Modal for serverless GPU inference
- Frontend: Next.js with TypeScript and Tailwind CSS
- Audio Processing: TorchAudio Mel spectrograms
The entire pipeline, from audio upload to visualization, spans the browser and the cloud: audio is captured and the results are rendered in the browser, while preprocessing and inference run on serverless GPUs.
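For illustration, a hypothetical Modal deployment of the inference step might look like the sketch below. The app name, container image, GPU type, and function signature are assumptions, not the project's actual deployment code.

```python
import modal

# Container image with the inference dependencies (package list is an assumption).
image = modal.Image.debian_slim().pip_install("torch", "torchaudio")

app = modal.App("audio-classifier", image=image)  # app name is hypothetical

@app.function(gpu="T4")  # GPU type is an assumption
def classify(audio_bytes: bytes) -> dict:
    """Run inference on raw audio bytes and return the top predictions."""
    import io
    import torchaudio

    waveform, sample_rate = torchaudio.load(io.BytesIO(audio_bytes))
    # ... convert to a mel spectrogram and run the trained model here ...
    # (model loading omitted; see the architecture sketch above)
    return {"predictions": []}

@app.local_entrypoint()
def main(path: str):
    # Send a local audio file to the serverless GPU function.
    with open(path, "rb") as f:
        print(classify.remote(f.read()))
```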
Understanding the Visualizations
Input Spectrogram
A mel spectrogram showing frequency (vertical axis) over time (horizontal axis). The color intensity represents amplitude (louder sounds are brighter). This is the initial audio representation the model processes.
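The sketch below shows one way to produce such a 128-band mel spectrogram with TorchAudio; the FFT window size and hop length are assumptions.

```python
import torch
import torchaudio

def audio_to_mel(path: str) -> torch.Tensor:
    """Load an audio file and convert it to a log-scaled 128-band mel spectrogram."""
    waveform, sample_rate = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,        # assumption: FFT window size
        hop_length=512,    # assumption: hop length
        n_mels=128,        # 128 mel bands, matching the model input
    )(waveform)

    # Convert power to decibels so quiet and loud sounds are both visible.
    return torchaudio.transforms.AmplitudeToDB()(mel)   # shape: (1, 128, time)
```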
Top Predictions
The model's top 3 classifications with confidence percentages. Each prediction includes an emoji representing the sound category. The highest confidence result is highlighted with a primary badge.
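A minimal sketch of how such a top-3 list can be derived from the model's output logits follows; the class-name list and the emoji mapping are placeholders.

```python
import torch
import torch.nn.functional as F

def top_predictions(logits: torch.Tensor, class_names: list[str], k: int = 3):
    """Turn raw logits into the top-k (label, confidence %) pairs shown in the UI."""
    probs = F.softmax(logits, dim=-1)           # convert logits to probabilities
    confidences, indices = torch.topk(probs, k) # highest-confidence classes first
    return [
        (class_names[idx.item()], round(conf.item() * 100, 1))
        for conf, idx in zip(confidences, indices)
    ]

# Example with the model sketched earlier (names are placeholders):
# logits = model(mel_spectrogram.unsqueeze(0))[0]
# print(top_predictions(logits, ESC50_CLASSES))
```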
Convolutional Layers
Shows the main processing layers (conv1, layer1-layer4) and their internal blocks. Each layer transforms the input, with early layers detecting simple patterns and deeper layers identifying complex sound structures.
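One common way to capture these intermediate feature maps for visualization is to register forward hooks on the layers of interest; the sketch below assumes the layer names from the architecture sketch above.

```python
import torch

def capture_feature_maps(model: torch.nn.Module, mel: torch.Tensor) -> dict:
    """Run one forward pass and record the output of each main layer."""
    feature_maps, handles = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            feature_maps[name] = output.detach()
        return hook

    # Layer names follow the architecture sketch above (conv1, layer1-layer4).
    for name in ["conv1", "layer1", "layer2", "layer3", "layer4"]:
        handles.append(getattr(model, name).register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(mel.unsqueeze(0))    # (1, 1, 128, time) -> forward pass only

    for h in handles:
        h.remove()                 # clean up the hooks after the pass
    return feature_maps
```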
Audio Waveform
A downsampled version of your original audio waveform showing amplitude variations over time. The duration and sample rate are displayed to provide context about the audio characteristics.
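A simple way to downsample a waveform for display is to keep one representative sample per chunk, as in the sketch below; the target point count is an assumption.

```python
import torch

def downsample_waveform(waveform: torch.Tensor, num_points: int = 1024) -> torch.Tensor:
    """Reduce a mono waveform to a fixed number of points for plotting.

    Each output point is the sample with the largest magnitude in its chunk,
    which preserves the visible amplitude envelope.
    """
    samples = waveform.flatten()
    chunk = max(1, samples.numel() // num_points)
    trimmed = samples[: chunk * num_points].reshape(-1, chunk)
    # Keep the signed peak of each chunk (largest absolute value).
    idx = trimmed.abs().argmax(dim=1, keepdim=True)
    return trimmed.gather(1, idx).squeeze(1)
```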
All visualizations include a color scale showing values in the range -1 to 1.
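One way to put different tensors on that shared scale is a simple min-max rescaling to [-1, 1], for example:

```python
import torch

def to_color_scale(x: torch.Tensor) -> torch.Tensor:
    """Rescale any tensor to [-1, 1] so every visualization shares one color scale."""
    x = x - x.min()
    span = x.max().clamp_min(1e-8)   # avoid division by zero for constant inputs
    return 2.0 * (x / span) - 1.0
```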