HEAR 2021 NeurIPS Challenge
We show test scores on the open tasks. Evaluation was run using heareval 2021.0.3. Secret task evaluations are forthcoming. We currently show one early submission and two HEAR baselines: torchcrepe and wav2vec2. Task order is randomized upon page load.
Downstream evaluation on each task involves two steps: (a) computing audio embeddings and (b) training a shallow fully connected predictor on them. The downstream predictor was selected to maximize the task's objective on the validation set. Early stopping with a patience of 20 was used. Validation scores were computed every 3 training epochs (except for DCASE, where they were computed every 10 epochs).
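The training schedule described above can be sketched as a short loop. This is a minimal illustration, not the heareval implementation: the function and parameter names are hypothetical, higher validation scores are assumed to be better, and patience is assumed to count validation checks without improvement (the text does not say whether it counts checks or epochs).

```python
from typing import Callable, Tuple


def fit_with_early_stopping(
    train_one_epoch: Callable[[int], None],
    validation_score: Callable[[int], float],
    max_epochs: int = 500,
    check_every: int = 3,
    patience: int = 20,
) -> Tuple[int, float]:
    """Train until the validation score stops improving.

    Returns (best_epoch, best_score). Assumes higher scores are better and
    that patience counts validation checks (an assumption, see lead-in).
    """
    best_epoch, best_score = -1, float("-inf")
    checks_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        # Validate only every `check_every` epochs (3 here, 10 for DCASE).
        if epoch % check_every != 0:
            continue
        score = validation_score(epoch)
        if score > best_score:
            best_epoch, best_score = epoch, score
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break
    return best_epoch, best_score
```

For DCASE, the same sketch would be called with `check_every=10`.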
Before the leaderboard was created, several models and tasks were used to refine the grid and discard unfavorable hyperparameter choices. Randomized grid search was used for model selection, and the same 8 random grid points were used for all models. Model selection varied the number of hidden layers (1 or 2), the learning rate (3.2e-3 to 1e-4), and the weight initialization (Xavier uniform or normal).
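As a rough illustration of this selection procedure, the sketch below draws 8 points from a small grid with a fixed seed, so that every model is evaluated on the same points. The intermediate learning-rate values, the seed, and all names are assumptions made for the example; only the endpoints of the learning-rate range, the layer counts, and the Xavier initializers come from the text above.

```python
import itertools
import random

# Hypothetical grid: the source gives only the learning-rate endpoints
# (3.2e-3 and 1e-4); the two intermediate values are assumed.
GRID = {
    "hidden_layers": [1, 2],
    "initialization": ["xavier_uniform", "xavier_normal"],
    "lr": [3.2e-3, 1e-3, 3.2e-4, 1e-4],
}


def sample_grid_points(grid, n_points=8, seed=0):
    """Draw n_points distinct grid points; a fixed seed gives the same
    points for every model."""
    keys = sorted(grid)
    all_points = [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[k] for k in keys))
    ]
    rng = random.Random(seed)
    return rng.sample(all_points, n_points)


def select_model(grid_points, validation_score):
    """Return the grid point with the highest validation score."""
    return max(grid_points, key=validation_score)
```

Here `validation_score` stands in for training a predictor with the given hyperparameters and returning its score on the validation set.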
|DCASE 2016 Task 2|
|HEAR 2021 partitions|
|Team Name||Submission||Event Onset FMS||Segment Error Rate|
|MARL + Soundsensing||0.864||0.153|
Adapted from DCASE 2016, Task 2 office sound event detection. Our evaluation uses different splits, so the numbers cannot be directly compared to previously published results.
Postprocessing: Segments were postprocessed using 250 ms median filtering. At each validation step, a minimum event duration of 125 or 250 ms was chosen to maximize the onset-only event-based F-measure (with 200 ms tolerance). Scores were computed using sed_eval.
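The postprocessing steps can be sketched on a sequence of binary frame predictions for a single event class. This is an illustrative sketch, not the heareval code: the 50 ms hop size in the usage note, the edge padding, and the function names are assumptions, and the scoring with sed_eval is left out.

```python
from statistics import median
from typing import List, Tuple


def median_filter(frames: List[int], window: int) -> List[int]:
    """Median-filter binary frame predictions with an odd window length.

    Edges are padded by repeating the first and last frame (an assumption).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    if not frames:
        return []
    half = window // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [int(median(padded[i:i + window])) for i in range(len(frames))]


def frames_to_events(
    frames: List[int], hop_ms: float, min_duration_ms: float
) -> List[Tuple[float, float]]:
    """Convert active frames to (onset_ms, offset_ms) events and drop
    events shorter than min_duration_ms."""
    events = []
    start = None
    # A trailing inactive sentinel frame closes an event that runs to the end.
    for i, active in enumerate(list(frames) + [0]):
        if active and start is None:
            start = i
        elif not active and start is not None:
            onset, offset = start * hop_ms, i * hop_ms
            if offset - onset >= min_duration_ms:
                events.append((onset, offset))
            start = None
    return events
```

With an assumed hop of 50 ms, a 250 ms median filter corresponds to `window=5`. The minimum duration (125 or 250 ms) would then be picked per validation step by scoring the resulting events for each candidate and keeping the better one.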
|Google Speech Commands|
|Team Name||Submission||Top-1 Accuracy (5h)||Top-1 Accuracy (full)|
|MARL + Soundsensing||0.651||0.740|
Speech command classification. Either a 5h subset or the full audio was used for training and validation.
|Team Name||Submission||Pitch Acc (5h)||Chroma Acc (5h)||Pitch Acc (50h)||Chroma Acc (50h)|
|MARL + Soundsensing||0.562||0.598||0.748||0.794|
Pitch and chroma classification of NSynth sounds. Either 5h or 50h of audio was used for training and validation.
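To illustrate how the two scores differ, the sketch below assumes that pitches are integer MIDI note numbers and that chroma accuracy counts a prediction as correct when it matches the target's pitch class, ignoring octave errors. Both the label encoding and the function names are assumptions made for the example.

```python
from typing import Sequence


def pitch_accuracy(predicted: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of predictions that match the target pitch exactly."""
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / len(target)


def chroma_accuracy(predicted: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of predictions in the target's pitch class (pitch mod 12),
    so octave errors still count as correct."""
    hits = sum(p % 12 == t % 12 for p, t in zip(predicted, target))
    return hits / len(target)
```

Under these assumptions, chroma accuracy is never lower than pitch accuracy, which is consistent with the scores in the table above.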