HEAR 2021 NeurIPS Challenge

We show test scores on the open tasks. Evaluation was run using heareval 2021.0.3. Secret task evaluations are forthcoming. We currently show one early entrant's submission and two HEAR baselines: torchcrepe and wav2vec2. Task order is randomized upon page load.

Downstream evaluation on each task involves two steps: (a) computing audio embeddings and (b) learning a shallow fully-connected predictor on those embeddings. The downstream predictor was selected to optimize the task's objective on the validation set. Early stopping with a patience of 20 was used. Validation scores were computed every 3 training epochs (every 10 epochs for DCASE).
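The validation schedule and early stopping described above can be sketched as follows. This is a minimal illustration, not the heareval implementation: the function name, the two callbacks, and `max_epochs` are hypothetical, and the sketch assumes that patience is counted in validation checks rather than in epochs, which the description does not specify.

```python
from typing import Callable, Tuple


def fit_with_early_stopping(
    train_one_epoch: Callable[[int], None],
    validation_score: Callable[[int], float],
    max_epochs: int = 500,       # assumed upper bound, not from the source
    validate_every: int = 3,     # 10 for DCASE
    patience: int = 20,          # assumed to count validation checks
) -> Tuple[int, float]:
    """Return (best_epoch, best_score); higher scores are better."""
    best_epoch, best_score = 0, float("-inf")
    checks_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        if epoch % validate_every != 0:
            continue  # validate only on every `validate_every`-th epoch
        score = validation_score(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # no improvement for `patience` consecutive checks
    return best_epoch, best_score
```

For a task where lower is better, such as segment error rate, the score would be negated before being passed in.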

Before the creation of the leaderboard, several models and tasks were used to refine the grid and discard unfavorable hyperparameter choices. Randomized grid search was used for model selection, and the same 8 random grid points were used for all models. Model selection varied the number of hidden layers (1 or 2), the learning rate (3.2e-3 to 1e-4), and the weight initialization (Xavier uniform or normal).
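As a rough sketch of how the same 8 random grid points can be reused for every model, the grid can be enumerated and sampled with a fixed seed. The intermediate learning-rate values, the seed, and all names here are assumptions for illustration; only the endpoints of the learning-rate range, the layer counts, and the two initializations come from the description above.

```python
import itertools
import random

GRID = {
    "hidden_layers": [1, 2],
    # Only the endpoints 3.2e-3 and 1e-4 are stated; the middle values are assumed.
    "lr": [3.2e-3, 1e-3, 3.2e-4, 1e-4],
    "initialization": ["xavier_uniform", "xavier_normal"],
}


def sample_grid_points(grid, n_points=8, seed=0):
    """Draw `n_points` distinct grid points; a fixed seed gives the same draw every time."""
    keys = list(grid)
    all_points = [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[key] for key in keys))
    ]
    return random.Random(seed).sample(all_points, n_points)


# The same list of configurations would be evaluated for every embedding model.
shared_points = sample_grid_points(GRID)
```

Sampling without replacement from the enumerated grid guarantees that the 8 configurations are distinct.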

For more details, questions, training logs, etc., don't hesitate to post on the discussion board or email us.

DCASE 2016 Task 2
HEAR 2021 partitions
Team Name            Submission      Event Onset FMS  Segment Error Rate
MARL + Soundsensing                  0.864            0.153
HEAR                 torchcrepe      0.629            0.338
HEAR                 naive wav2vec2  0.805            0.160

Adapted from DCASE 2016, Task 2 office sound event detection. Our evaluation uses different splits, so the numbers cannot be directly compared to previously published results.

Postprocessing: Segments were postprocessed using 250 ms median filtering. At each validation step, a minimum event duration of 125 ms or 250 ms was chosen to maximize the onset-only event-based F-measure (with 200 ms tolerance). Scores were computed using sed_eval.
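The two postprocessing steps can be illustrated with a small pure-Python sketch for a single event class. It assumes frame-wise binary predictions at a hypothetical hop of 50 ms, so a 250 ms median filter spans 5 frames; the actual frame rate, edge handling, and function names used in the evaluation are not given here and are assumptions. Scoring with sed_eval is not shown.

```python
from statistics import median


def median_filter(frames, width):
    """Sliding median over an odd `width`; edges are padded by repeating the end values."""
    half = width // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(frames))]


def frames_to_events(frames, hop_ms, min_duration_ms):
    """Turn binary frames into (onset_ms, offset_ms) events, dropping short ones."""
    events, start = [], None
    for i, active in enumerate(list(frames) + [0]):  # sentinel closes a trailing event
        if active and start is None:
            start = i
        elif not active and start is not None:
            if (i - start) * hop_ms >= min_duration_ms:
                events.append((start * hop_ms, i * hop_ms))
            start = None
    return events


HOP_MS = 50  # assumed frame hop
raw = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0]
smoothed = median_filter(raw, width=250 // HOP_MS)
events = frames_to_events(smoothed, HOP_MS, min_duration_ms=125)
```

Choosing between the 125 ms and 250 ms minimum durations would then amount to running `frames_to_events` with each value and keeping the one with the higher validation F-measure.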

Google Speech Commands
Team Name            Submission      Top-1 Accuracy (5 hrs)  Top-1 Accuracy (Full)
HEAR                 naive wav2vec2  0.835                   0.881
MARL + Soundsensing                  0.651                   0.740
HEAR                 torchcrepe      0.161                   0.207

Speech command classification. Either a 5-hour subset or the full audio was used for training and validation.

NSynth Pitch
Team Name            Submission      Pitch Acc (5 hrs)  Chroma Acc (5 hrs)  Pitch Acc (50 hrs)  Chroma Acc (50 hrs)
HEAR                 torchcrepe      0.860              0.926               0.900               0.956
MARL + Soundsensing                  0.562              0.598               0.748               0.794
HEAR                 naive wav2vec2  0.420              0.456               0.666               0.714

Pitch and chroma classification of NSynth sounds. Either 5 hours or 50 hours of audio were used for training and validation.
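To make the difference between the two metrics concrete, here is a minimal sketch that assumes pitch labels are MIDI note numbers and that chroma accuracy counts a prediction as correct whenever it lands on the right pitch class, ignoring octave errors. The function names and example values are illustrative, not taken from heareval.

```python
def pitch_accuracy(predicted, target):
    """Fraction of predictions that match the target pitch exactly."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def chroma_accuracy(predicted, target):
    """Fraction of predictions with the right pitch class (MIDI number modulo 12)."""
    return sum(p % 12 == t % 12 for p, t in zip(predicted, target)) / len(target)


# Illustrative MIDI numbers: one exact hit, two octave errors, one wrong note.
predicted = [60, 72, 61, 47]
target = [60, 60, 62, 59]
```

Under this definition chroma accuracy can never be lower than pitch accuracy, which is consistent with every row of the table.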