Ecoacoustics · Machine Learning

Anuran Audio Classification and Segmentation



Transformer-based multi-label classification on real-world passive acoustic monitoring data — optimized for class imbalance, overlapping calls and noisy soundscapes.

PyTorch Transformers AST ResNet Torchaudio MelSpectrogram Multi-label F1 UMAP Observable Visualizations


This project investigates whether an Audio Spectrogram Transformer (AST) can improve frog species recognition and call segmentation in Neotropical soundscapes. I focus on reliability under long-tailed class distributions and evaluate whether model outputs recover ecologically plausible temporal calling patterns (diel activity).

Passive Acoustic Monitoring (PAM) can capture biodiversity at scale, but the data volume makes manual expert review impractical. In species-rich environments, recordings contain overlapping calls, strong background noise (rain, insects), and a long tail of rare species. The goal of this project is to build a better machine listening approach that remains useful under these real-world constraints — especially for rare species where conservation value is highest. This work is inspired by and builds upon the AnuraSet dataset introduced by Cañas et al. (2023), which provides a comprehensive benchmark for Neotropical anuran call identification. Their work highlighted the challenges of multi-label classification in passive acoustic monitoring contexts, particularly with class imbalance and overlapping vocalizations — challenges this project directly addresses through transformer-based architectures.


Research Questions

  • RQ1: Does AST outperform a ResNet baseline under long-tailed class imbalance, especially for rare species?

  • RQ2: What structure does AST learn in embedding space, and are confusions acoustically interpretable?

  • RQ3: Do predictions recover coherent diel calling patterns across sites (ecological plausibility)?

Tools / Approaches

  • AST: self-attention models longer-range time–frequency dependencies and subtle call structure.

  • ResNet baseline: establishes a strong CNN reference point for fair comparison.

  • Per-class views: imbalance curves + F1 distributions expose rare-class behavior (not hidden by macro averages).

  • Embeddings + confusions: validate learned representation and failure modes (trust + debugging).

  • Temporal plots: translate predictions into behavior signals (useful for monitoring workflows).


Audio is converted into time–frequency representations and evaluated with species-level metrics and model
behavior analysis. The workflow emphasizes reproducible evaluation and transparent diagnostics:

  • Data: AnuraSet-style passive recordings with strong/weak labels and overlapping calls.

  • Preprocessing: spectrogram transforms; consistent windowing for multi-label segments.

  • Modeling: AST vs ResNet baseline; multi-label decisioning for overlapping species.

  • Evaluation: per-class F1, frequency sensitivity, embedding geometry, confusion structure.

  • Behavior: time-of-day heatmaps/curves to validate ecological plausibility.
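The multi-label decisioning step above can be sketched in a few lines. This is a minimal illustration, not the project's exact code; the function name `decide_multilabel` and the 4-species example are hypothetical. The key point is that each species gets an independent sigmoid score, so overlapping calls can all be flagged in the same clip (unlike softmax classification, which forces one winner):

```python
import numpy as np

def decide_multilabel(logits, threshold=0.5):
    """Turn per-class logits into independent binary decisions.

    Each species is scored on its own via a sigmoid, so several
    species can be marked present in the same segment.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (probs >= threshold).astype(int), probs

# Hypothetical 4-species segment in which two species call at once.
logits = np.array([2.0, -1.5, 0.8, -3.0])
labels, probs = decide_multilabel(logits, threshold=0.5)
print(labels)  # → [1 0 1 0]
```

In practice the threshold can be tuned per class, which is where the calibration work listed under future developments comes in.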



Training and Validation

All audio clips were resampled to 16 kHz (the sampling rate AST expects) and converted to a single channel for consistency. Each recording was then trimmed or zero-padded to a fixed length of 3 seconds so the model always received the same input size.
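The trim/pad step can be sketched with plain NumPy (the project itself uses Torchaudio transforms; `SR`, `CLIP_SECONDS`, and `to_fixed_length` are illustrative names, not the actual code):

```python
import numpy as np

SR = 16_000                     # target sample rate (Hz)
CLIP_SECONDS = 3
CLIP_SAMPLES = SR * CLIP_SECONDS

def to_fixed_length(mono: np.ndarray) -> np.ndarray:
    """Trim or zero-pad a mono waveform to exactly 3 s at 16 kHz."""
    if len(mono) >= CLIP_SAMPLES:
        return mono[:CLIP_SAMPLES]
    return np.pad(mono, (0, CLIP_SAMPLES - len(mono)))

# Stereo → mono by channel averaging, then fix the length.
stereo = np.random.randn(2, 2 * SR)        # hypothetical 2 s recording
clip = to_fixed_length(stereo.mean(axis=0))
print(clip.shape)  # → (48000,)
```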

The dataset was split into training and validation sets using a standard 85/15 split. The model was trained with a batch size of 16 and a learning rate of 0.0001. Training was allowed to run for up to 30 epochs, with early stopping on validation F1 score to avoid overfitting.
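The early-stopping criterion described above can be sketched as a small tracker (a minimal sketch; the patience value of 3 and the class name `EarlyStopper` are assumptions, and the F1 history below is invented for illustration):

```python
class EarlyStopper:
    """Stop training when validation micro-F1 stops improving."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = -float("inf")
        self.bad_epochs = 0

    def step(self, val_f1):
        """Return True when training should stop; checkpoint on improvement."""
        if val_f1 > self.best:
            self.best, self.bad_epochs = val_f1, 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=3)
f1_history = [0.80, 0.85, 0.86, 0.86, 0.85, 0.84, 0.83]  # hypothetical
stopped_at = None
for epoch, f1 in enumerate(f1_history, start=1):
    if stopper.step(f1):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)  # → 6 0.86
```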

The best checkpoint came from epoch 28 with a validation micro-F1 of 0.9638: pooled over all species and predictions, the model balances false positives and false negatives well. Micro-F1 is dominated by common species, however, so this figure is not directly comparable to the roughly 38% macro-F1 (which weights every species equally) that the original AnuraSet paper reports for its ResNet-152 CNN baseline.
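A toy long-tailed example makes the micro/macro gap concrete (a sketch with invented labels, not project data; `micro_macro_f1` is an illustrative re-implementation of what `sklearn.metrics.f1_score` computes with `average="micro"` / `average="macro"`):

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """Micro-F1 pools TP/FP/FN over all classes (dominated by common
    species); macro-F1 averages per-class F1 (rare species count equally)."""
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    micro = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
    return micro, per_class.mean()

# Toy case: class 0 is common and easy; class 1 is rare and always missed.
y_true = np.array([[1, 0]] * 9 + [[0, 1]])
y_pred = np.array([[1, 0]] * 10)
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 2), round(macro, 2))  # → 0.9 0.47
```

The model never detects the rare class, yet micro-F1 stays high; this is why the per-species views below matter.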



Results and Visualizations

The visualizations below answer the research questions in sequence: performance under imbalance, representation/error structure, and time-of-day behavior patterns derived from predictions. The layout is intentionally minimal to keep attention on signal, failure modes, and ecological interpretability.

a. Model Performance and Class Imbalance

First, I verify that AST improves over a ResNet baseline in the regimes that matter most: rare species and noisy conditions. These views highlight how performance shifts across the species spectrum.

1.1 Class Imbalance vs Performance

Per-species performance plotted against class frequency. AST maintains higher performance for rare species than a CNN baseline, suggesting better generalization under limited samples — critical for biodiversity datasets. 




1.2 Per-Species F1 Distribution

The F1 strip plot shows how performance is spread across species. AST tends to compress the long tail of low-performing species upward, pulling many “hard” species into a usable prediction range. This indicates that the Transformer captures richer spectral–temporal structure than the ResNet baseline.



  

b. Learned Representation and Error Structure

To understand why AST works better, I inspect embedding geometry and confusion patterns. Meaningful structure here suggests the model learned consistent acoustic signatures rather than memorizing labels. 

2.1 Species Embedding Map (Voronoi)

Voronoi regions expose AST’s latent organization. Species with similar calls fall into neighboring cells, while acoustically distinct ones occupy separate regions — a useful sanity check on representation quality. 



2.2 Confusion Matrix

This matrix shows which species are mistaken for which. AST's confusions concentrate among species with similar call structures instead of scattering randomly, which makes failure modes interpretable and supports trust for ecological monitoring use cases.



2.3 UMAP Projection of AST Embeddings

This UMAP projection visualizes the global shape of the AST embedding space. I see reasonably compact clusters for many taxa, with smooth transitions where calls are legitimately similar. Compared to the ResNet baseline, AST’s space looks more structured, suggesting the model has discovered consistent species-level signatures across sites and noise conditions.
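The projection step is just "embedding matrix in, 2-D point cloud out". The write-up uses umap-learn (`umap.UMAP(n_components=2).fit_transform(embeddings)`); the sketch below substitutes PCA via SVD as a dependency-free stand-in, with invented random embeddings in place of real AST outputs:

```python
import numpy as np

def project_2d(embeddings):
    """Center and project embeddings to 2-D along the top two principal
    components. PCA stands in here for UMAP, which the project uses."""
    X = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 768))   # 200 hypothetical AST clip embeddings
xy = project_2d(emb)
print(xy.shape)  # → (200, 2)
```

Unlike PCA, UMAP preserves local neighborhood structure nonlinearly, which is why the real plot shows tighter species clusters than a linear projection would.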
 



 

c. Temporal Activity Patterns and Segmentation Behavior

Finally, I translate predictions into time-of-day behavior. A useful ecoacoustic model should recover coherent diel rhythms, enabling downstream monitoring, comparisons across sites, and potential alerting workflows. 

3.1 Multi-Site Activity Heatmap with Filters

This combined view shows activity across sites and species, controlled by interactive filters. For a chosen site–species pair, the heatmap reveals bursts of calling activity over time. The patterns I see are coherent and stable rather than noisy, which suggests that AST-based segmentation is capturing genuine ecological signal that could be used for monitoring or automatic alerting.


 



3.2 Species Time-of-Day Heatmap

This heatmap focuses on a single species at a time to show how its calling intensity changes over the 24-hour cycle. As I scroll through different taxa, clear diel signatures appear: strongly nocturnal species, dawn–dusk specialists, and more evenly active species. The fact that these patterns emerge directly from model predictions indicates that AST has internalized meaningful temporal structure.
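The diel aggregation behind this heatmap is a simple binning of detection timestamps (a minimal sketch; `diel_profile` and the detection hours are illustrative, not project data):

```python
import numpy as np

def diel_profile(detection_hours):
    """Bin detection timestamps (hour-of-day, 0-24) into a normalized
    24-bin activity curve, one row of the species heatmap."""
    counts, _ = np.histogram(detection_hours, bins=24, range=(0, 24))
    total = counts.sum()
    return counts / total if total else counts.astype(float)

# Hypothetical nocturnal species: detections cluster around 20:00-02:00.
hours = np.array([20.5, 21.0, 21.2, 22.8, 23.5, 0.3, 1.1, 1.9, 13.0])
profile = diel_profile(hours)
```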



 

3.3 Time-of-Day Line Plot

The time-of-day line plot smooths those heatmap patterns into a simple curve, making peaks and troughs obvious at a glance—for example, sharp nocturnal peaks or subdued mid-day activity. What I find important is that these curves rarely look flat or random: they line up with expected ecological rhythms, which is a strong signal that the AST model is learning more than just label noise.
 



Gradio Setup

I have set up a demo where users can upload or record frog audio, and the app runs the trained AST model on it using a sliding window over time. For each window, the model predicts the probability of every species and then groups high-probability windows into continuous segments. The interface shows three things at once:

  • a waveform with colored regions where species are detected,

  • a log-mel spectrogram for the same audio, and

  • probability curves over time for the top predicted species.

Sliders let the user adjust window length, hop size, detection threshold, and how many species to display. This turns the model from a research experiment into a practical tool for exploring real recordings and checking when and which frogs are calling.
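The windows-to-segments step can be sketched as follows (a minimal sketch, not the demo's actual code; `windows_to_segments` and the probabilities are illustrative, and the hop/window/threshold values correspond to the sliders described above):

```python
import numpy as np

def windows_to_segments(probs, hop_s, win_s, threshold=0.5):
    """Group consecutive above-threshold windows into (onset, offset)
    segments in seconds. probs holds one species' per-window scores,
    with windows hop_s apart and each win_s long."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * hop_s                        # segment opens
        elif p < threshold and start is not None:
            segments.append((start, (i - 1) * hop_s + win_s))
            start = None                             # segment closes
    if start is not None:                            # still open at the end
        segments.append((start, (len(probs) - 1) * hop_s + win_s))
    return segments

probs = np.array([0.1, 0.7, 0.9, 0.8, 0.2, 0.6, 0.4])
segs = windows_to_segments(probs, hop_s=1.0, win_s=3.0)
print(segs)  # → [(1.0, 6.0), (5.0, 8.0)]
```

Raising the threshold slider shortens or drops segments; a larger hop coarsens onset/offset resolution, which is the main trade-off the demo exposes.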




GitHub


Observable Notebook  



Future Developments and Afterthoughts

The current work demonstrates strong diagnostic signal: better rare-class behavior, interpretable confusions, and coherent diel patterns. The next step is moving from analysis to a more operational ecoacoustics pipeline. 

  • Calibration: threshold tuning and reliability curves for multi-label decisioning.
  • Segmentation: event-level detection (onset/offset) and evaluation against strong labels.
  • Robustness: augmentation and domain shift tests across sites, seasons, and noise regimes.
  • Efficiency: streaming inference + batching for scalable PAM deployments.
  • Monitoring: automated dashboards for activity trends and anomaly detection.


 

References

Cañas, J. S., Toro-Gómez, M. P., Moreira Sugai, L. S., Benítez Restrepo, H. D., Rudas, J., Posso Bautista, B., et al. (2023). “A Dataset for Benchmarking Neotropical Anuran Calls Identification in Passive Acoustic Monitoring.” Scientific Data, 10 (771). https://doi.org/10.1038/s41597-023-02666-2