HEAR 2021 NeurIPS Challenge
Results

These are the final evaluation scores for HEAR 2021, also available as a CSV download. Downstream evaluation on each task involves two steps: a) computing audio embeddings and b) learning a shallow fully-connected predictor on top of them. The downstream predictor's hyperparameters were chosen to optimize the task's objective on the validation set.
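Concretely, the two downstream steps can be sketched as follows. This is a minimal illustration under assumptions, not the actual HEAR evaluation harness: `embed` is a hypothetical placeholder for a team's submitted embedding model, the audio and labels are synthetic, and the real harness additionally searches over predictor hyperparameters on the validation split.

```python
# Minimal sketch of the two-step downstream evaluation.
# `embed` is a hypothetical stand-in for a submitted embedding model;
# the real HEAR submissions expose their own embedding API.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def embed(audio: np.ndarray) -> np.ndarray:
    # Placeholder embedding: mean and std over 8-sample frames.
    frames = audio.reshape(-1, 8)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Step a) compute embeddings for train and validation clips.
X_train = np.stack([embed(rng.normal(size=160)) for _ in range(64)])
y_train = rng.integers(0, 4, size=64)
X_val = np.stack([embed(rng.normal(size=160)) for _ in range(16)])
y_val = rng.integers(0, 4, size=16)

# Step b) learn a shallow fully-connected predictor on the frozen
# embeddings; in HEAR its hyperparameters are tuned on validation.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```

The embeddings stay frozen throughout: only the shallow predictor is trained per task, which is what makes the scores below a measure of the embeddings themselves.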

We used a V100 GPU for each open task and an A100 GPU for each secret task. We do not publish a score for a submission on a task if either downstream step exceeded 20 GPU-hours or the submission exceeded GPU memory.

For more information about the teams and their submissions, please see the team information page. Recordings of the team presentations at NeurIPS 2021 are available for viewing.




Speech Commands

Speech Commands Full

Team Name Submission Accuracy
ID56-SSL kwmlp 0.978
NTU-GURA fusion_wav2vec2 0.969
NTU-GURA fusion_cat_xwc_time 0.968
NTU-GURA fusion_cat_xwc 0.968
NTU-GURA fusion_hubert_xlarge 0.957
NTU-GURA hubert_xlarge 0.954
NTU-GURA avg_hubert_wav2vec2 0.952
Logitech AI serab_byols 0.948
NTU-GURA cat_xwc 0.943
NTU-GURA cat_hubert_wav2vec2 0.929
NTU-GURA cat_wc 0.919
NTU-GURA avg_xwc 0.881
HEAR wav2vec2 0.879
NTU-GURA avg_hubert_crepe 0.823
MARL + Soundsensing openl3_hear 0.763
RedRice efficient_latent 0.676
CP-JKU base 0.639
CP-JKU base2level 0.639
CP-JKU base2levelmel 0.639
Stellenbosch LSL audio_dbert 0.630
CVSSP panns_hear 0.618
ibkuroyagi hearline 0.580
ID56-SSL audiomlp 0.535
UDONS embed 0.531
AMAAI my_model 0.523
AMAAI Lab wav2vec2_ddsp 0.490
Soundsensing yamnet_hear 0.410
Descript/MARL wav2clip_hear 0.347
HEAR torchcrepe 0.211

Speech Commands 5h

Team Name Submission Accuracy
ID56-SSL kwmlp 0.976
NTU-GURA fusion_cat_xwc 0.961
NTU-GURA fusion_wav2vec2 0.957
NTU-GURA hubert_xlarge 0.953
NTU-GURA fusion_cat_xwc_time 0.951
NTU-GURA fusion_hubert_xlarge 0.947
NTU-GURA cat_hubert_wav2vec2 0.936
NTU-GURA cat_xwc 0.927
Logitech AI serab_byols 0.914
NTU-GURA avg_hubert_wav2vec2 0.907
NTU-GURA cat_wc 0.885
HEAR wav2vec2 0.838
NTU-GURA avg_xwc 0.822
AMAAI my_model 0.744
NTU-GURA avg_hubert_crepe 0.737
CP-JKU base 0.681
CP-JKU base2level 0.681
CP-JKU base2levelmel 0.681
MARL + Soundsensing openl3_hear 0.680
RedRice efficient_latent 0.573
Stellenbosch LSL audio_dbert 0.566
CVSSP panns_hear 0.560
AMAAI Lab wav2vec2_ddsp 0.487
UDONS embed 0.479
ibkuroyagi hearline 0.478
ID56-SSL audiomlp 0.392
Descript/MARL wav2clip_hear 0.316
Soundsensing yamnet_hear 0.289
HEAR torchcrepe 0.180

NSynth Pitch

NSynth Pitch 5h

Team Name Submission Pitch Acc Chroma Acc
NTU-GURA avg_hubert_crepe 0.878 0.936
HEAR torchcrepe 0.870 0.926
NTU-GURA cat_wc 0.868 0.930
NTU-GURA cat_xwc 0.866 0.930
NTU-GURA avg_xwc 0.862 0.926
NTU-GURA fusion_cat_xwc_time 0.854 0.900
NTU-GURA fusion_cat_xwc 0.846 0.892
UDONS embed 0.692 0.772
MARL + Soundsensing openl3_hear 0.560 0.590
Stellenbosch LSL audio_dbert 0.524 0.600
ID56-SSL kwmlp 0.440 0.480
HEAR wav2vec2 0.402 0.438
Logitech AI serab_byols 0.396 0.414
ibkuroyagi hearline 0.394 0.464
ID56-SSL audiomlp 0.386 0.452
NTU-GURA fusion_hubert_xlarge 0.382 0.410
NTU-GURA fusion_wav2vec2 0.330 0.354
CP-JKU base 0.256 0.290
CP-JKU base2level 0.256 0.290
CP-JKU base2levelmel 0.256 0.290
Descript/MARL wav2clip_hear 0.230 0.254
NTU-GURA avg_hubert_wav2vec2 0.198 0.220
NTU-GURA cat_hubert_wav2vec2 0.198 0.230
NTU-GURA hubert_xlarge 0.184 0.204
AMAAI my_model 0.182 0.202
AMAAI Lab wav2vec2_ddsp 0.172 0.190
RedRice efficient_latent 0.168 0.184
Soundsensing yamnet_hear 0.158 0.190
CVSSP panns_hear 0.148 0.162

NSynth Pitch 50h

Team Name Submission Pitch Acc Chroma Acc
HEAR torchcrepe 0.900 0.957
NTU-GURA cat_wc 0.899 0.955
NTU-GURA cat_xwc 0.897 0.955
NTU-GURA avg_hubert_crepe 0.897 0.956
NTU-GURA avg_xwc 0.896 0.955
NTU-GURA fusion_cat_xwc_time 0.891 0.950
NTU-GURA fusion_cat_xwc 0.885 0.945
UDONS embed 0.795 0.863
Stellenbosch LSL audio_dbert 0.737 0.816
MARL + Soundsensing openl3_hear 0.731 0.773
Logitech AI serab_byols 0.712 0.742
NTU-GURA fusion_hubert_xlarge 0.688 0.729
HEAR wav2vec2 0.653 0.706
ID56-SSL audiomlp 0.608 0.668
NTU-GURA fusion_wav2vec2 0.606 0.646
ID56-SSL kwmlp 0.605 0.648
ibkuroyagi hearline 0.589 0.656
CP-JKU base 0.541 0.566
CP-JKU base2level 0.541 0.566
CP-JKU base2levelmel 0.541 0.566
Descript/MARL wav2clip_hear 0.439 0.464
NTU-GURA hubert_xlarge 0.429 0.458
NTU-GURA cat_hubert_wav2vec2 0.428 0.460
NTU-GURA avg_hubert_wav2vec2 0.415 0.444
RedRice efficient_latent 0.391 0.418
Soundsensing yamnet_hear 0.321 0.345
AMAAI Lab wav2vec2_ddsp 0.303 0.332
CVSSP panns_hear 0.301 0.323
AMAAI my_model 0.176 0.210

DCASE 2016 Task 2

Team Name Submission Event Onset FMS Segment Error Rate (lower is better)
CP-JKU base2levelmel 0.925 0.099
CP-JKU base2level 0.913 0.102
MARL + Soundsensing openl3_hear 0.833 0.174
NTU-GURA fusion_cat_xwc 0.826 0.145
NTU-GURA fusion_cat_xwc_time 0.826 0.145
NTU-GURA fusion_hubert_xlarge 0.826 0.150
NTU-GURA fusion_wav2vec2 0.798 0.163
RedRice efficient_latent 0.790 0.231
CP-JKU base 0.788 0.193
NTU-GURA cat_xwc 0.681 0.287
UDONS embed 0.668 0.310
HEAR wav2vec2 0.663 0.266
Logitech AI serab_byols 0.642 0.382
NTU-GURA avg_xwc 0.624 0.333
NTU-GURA avg_hubert_crepe 0.610 0.337
NTU-GURA cat_wc 0.585 0.392
NTU-GURA hubert_xlarge 0.584 0.394
ID56-SSL kwmlp 0.518 0.527
AMAAI my_model 0.510 0.463
AMAAI Lab wav2vec2_ddsp 0.507 0.502
HEAR torchcrepe 0.504 0.493
NTU-GURA cat_hubert_wav2vec2 0.452 0.574
NTU-GURA avg_hubert_wav2vec2 0.417 0.582
Stellenbosch LSL audio_dbert 0.246 0.747
ID56-SSL audiomlp 0.052 0.965
Soundsensing yamnet_hear 0.008 1.003
CVSSP panns_hear 0.000 1.000
Descript/MARL wav2clip_hear 0.000 1.000

Beehive States

Team Name Submission AUC-ROC Accuracy
UDONS embed 0.878 0.684
Descript/MARL wav2clip_hear 0.770 0.684
ID56-SSL kwmlp 0.760 0.552
Stellenbosch LSL audio_dbert 0.697 0.582
MARL + Soundsensing openl3_hear 0.604 0.498
HEAR torchcrepe 0.593 0.535
Logitech AI serab_byols 0.549 0.524
ID56-SSL audiomlp 0.535 0.500
RedRice efficient_latent 0.533 0.568
Soundsensing yamnet_hear 0.466 0.479
CVSSP panns_hear 0.446 0.429

Beijing Opera Percussion

Team Name Submission Accuracy
MARL + Soundsensing openl3_hear 0.975
NTU-GURA fusion_cat_xwc 0.966
CP-JKU base 0.966
CP-JKU base2level 0.966
CP-JKU base2levelmel 0.966
NTU-GURA fusion_cat_xwc_time 0.962
RedRice efficient_latent 0.953
Logitech AI serab_byols 0.953
NTU-GURA fusion_hubert_xlarge 0.949
NTU-GURA fusion_wav2vec2 0.945
NTU-GURA avg_xwc 0.945
NTU-GURA hubert_xlarge 0.945
NTU-GURA avg_hubert_wav2vec2 0.941
Soundsensing yamnet_hear 0.941
Descript/MARL wav2clip_hear 0.936
NTU-GURA cat_xwc 0.936
NTU-GURA cat_hubert_wav2vec2 0.936
NTU-GURA avg_hubert_crepe 0.932
HEAR torchcrepe 0.928
ibkuroyagi hearline 0.928
UDONS embed 0.928
NTU-GURA cat_wc 0.920
Stellenbosch LSL audio_dbert 0.919
CVSSP panns_hear 0.911
ID56-SSL kwmlp 0.911
HEAR wav2vec2 0.907
AMAAI my_model 0.826
ID56-SSL audiomlp 0.728
AMAAI Lab wav2vec2_ddsp 0.635

CREMA-D

Due to a preprocessing bug, the scores for this task have been retracted; corrected scores will be published.

ESC-50

Team Name Submission Accuracy
CP-JKU base 0.947
CP-JKU base2level 0.947
CP-JKU base2levelmel 0.947
RedRice efficient_latent 0.935
CVSSP panns_hear 0.909
Soundsensing yamnet_hear 0.838
Logitech AI serab_byols 0.805
Descript/MARL wav2clip_hear 0.759
MARL + Soundsensing openl3_hear 0.751
NTU-GURA fusion_hubert_xlarge 0.743
NTU-GURA fusion_cat_xwc 0.734
NTU-GURA fusion_wav2vec2 0.695
NTU-GURA fusion_cat_xwc_time 0.653
NTU-GURA hubert_xlarge 0.603
NTU-GURA avg_hubert_wav2vec2 0.587
NTU-GURA cat_hubert_wav2vec2 0.586
HEAR wav2vec2 0.561
Stellenbosch LSL audio_dbert 0.532
NTU-GURA cat_xwc 0.511
AMAAI my_model 0.511
NTU-GURA avg_xwc 0.450
NTU-GURA avg_hubert_crepe 0.437
ibkuroyagi hearline 0.424
UDONS embed 0.401
ID56-SSL kwmlp 0.367
NTU-GURA cat_wc 0.343
HEAR torchcrepe 0.300
AMAAI Lab wav2vec2_ddsp 0.280
ID56-SSL audiomlp 0.161

FSD50k

Team Name Submission mAP d'
CP-JKU base 0.640 2.643
RedRice efficient_latent 0.607 2.538
CP-JKU base2levelmel 0.558 2.312
CP-JKU base2level 0.537 2.292
Logitech AI serab_byols 0.509 2.218
MARL + Soundsensing openl3_hear 0.447 2.117
NTU-GURA fusion_cat_xwc 0.420 2.052
NTU-GURA fusion_hubert_xlarge 0.413 2.079
NTU-GURA fusion_wav2vec2 0.403 2.017
NTU-GURA fusion_cat_xwc_time 0.374 1.838
Descript/MARL wav2clip_hear 0.362 1.980
NTU-GURA cat_hubert_wav2vec2 0.323 1.857
NTU-GURA avg_hubert_wav2vec2 0.318 1.868
NTU-GURA cat_xwc 0.314 1.847
NTU-GURA hubert_xlarge 0.314 1.858
AMAAI Lab wav2vec2_ddsp 0.268 1.820
NTU-GURA avg_xwc 0.264 1.684
Stellenbosch LSL audio_dbert 0.263 1.647
NTU-GURA avg_hubert_crepe 0.253 1.683
NTU-GURA cat_wc 0.209 1.479
AMAAI my_model 0.198 1.666
ID56-SSL kwmlp 0.187 1.511
ibkuroyagi hearline 0.178 1.399
HEAR torchcrepe 0.159 1.456
HEAR wav2vec2 0.116 0.775
ID56-SSL audiomlp 0.086 0.981

Gunshot Triangulation

Team Name Submission Accuracy
NTU-GURA fusion_wav2vec2 0.967
MARL + Soundsensing openl3_hear 0.949
CP-JKU base 0.940
CP-JKU base2level 0.940
CP-JKU base2levelmel 0.940
NTU-GURA fusion_cat_xwc 0.935
ID56-SSL kwmlp 0.932
NTU-GURA hubert_xlarge 0.932
NTU-GURA fusion_hubert_xlarge 0.929
Descript/MARL wav2clip_hear 0.929
NTU-GURA fusion_cat_xwc_time 0.905
NTU-GURA cat_xwc 0.881
RedRice efficient_latent 0.878
NTU-GURA cat_hubert_wav2vec2 0.869
UDONS embed 0.866
HEAR torchcrepe 0.863
NTU-GURA avg_xwc 0.857
AMAAI my_model 0.857
Logitech AI serab_byols 0.857
HEAR wav2vec2 0.848
NTU-GURA avg_hubert_crepe 0.845
NTU-GURA cat_wc 0.833
Stellenbosch LSL audio_dbert 0.810
NTU-GURA avg_hubert_wav2vec2 0.810
CVSSP panns_hear 0.798
ID56-SSL audiomlp 0.786
Soundsensing yamnet_hear 0.732
AMAAI Lab wav2vec2_ddsp 0.583
ibkuroyagi hearline 0.518

GTZAN Genre

Due to a preprocessing bug, the scores for this task have been retracted; corrected scores will be published.

GTZAN Music Speech

Due to a preprocessing bug, the scores for this task have been retracted; corrected scores will be published.

LibriCount

Team Name Submission Accuracy
Logitech AI serab_byols 0.785
NTU-GURA fusion_cat_xwc 0.697
HEAR wav2vec2 0.692
NTU-GURA fusion_hubert_xlarge 0.683
CP-JKU base 0.660
CP-JKU base2level 0.660
CP-JKU base2levelmel 0.660
NTU-GURA fusion_cat_xwc_time 0.659
NTU-GURA fusion_wav2vec2 0.653
Soundsensing yamnet_hear 0.653
CVSSP panns_hear 0.652
RedRice efficient_latent 0.651
NTU-GURA hubert_xlarge 0.646
MARL + Soundsensing openl3_hear 0.641
NTU-GURA cat_xwc 0.639
NTU-GURA avg_xwc 0.631
NTU-GURA cat_hubert_wav2vec2 0.627
NTU-GURA avg_hubert_wav2vec2 0.623
NTU-GURA avg_hubert_crepe 0.622
Stellenbosch LSL audio_dbert 0.584
NTU-GURA cat_wc 0.569
Descript/MARL wav2clip_hear 0.528
AMAAI Lab wav2vec2_ddsp 0.510
HEAR torchcrepe 0.499
ibkuroyagi hearline 0.498
UDONS embed 0.488
ID56-SSL kwmlp 0.451
AMAAI my_model 0.432
ID56-SSL audiomlp 0.366

MAESTRO 5h

Team Name Submission Note Onset FMS Note Onset w/ Offset FMS
NTU-GURA cat_xwc 0.4691 0.1552
NTU-GURA cat_wc 0.4626 0.1496
NTU-GURA avg_hubert_crepe 0.4624 0.1481
NTU-GURA avg_xwc 0.4599 0.1486
NTU-GURA fusion_cat_xwc 0.4413 0.1606
NTU-GURA fusion_cat_xwc_time 0.4413 0.1606
HEAR torchcrepe 0.4011 0.1537
UDONS embed 0.2386 0.0756
NTU-GURA fusion_hubert_xlarge 0.1657 0.0535
NTU-GURA fusion_wav2vec2 0.1110 0.0337
ID56-SSL kwmlp 0.0648 0.0210
HEAR wav2vec2 0.0329 0.0101
ID56-SSL audiomlp 0.0237 0.0095
MARL + Soundsensing openl3_hear 0.0165 0.0023
Logitech AI serab_byols 0.0079 0.0011
NTU-GURA hubert_xlarge 0.0068 0.0018
NTU-GURA avg_hubert_wav2vec2 0.0044 0.0013
NTU-GURA cat_hubert_wav2vec2 0.0035 0.0007
AMAAI my_model 0.0026 0.0006
ibkuroyagi hearline 0.0016 0.0005
AMAAI Lab wav2vec2_ddsp 0.0003 0.0001
RedRice efficient_latent 0.0002 0.0000
Stellenbosch LSL audio_dbert 0.0002 0.0000
CVSSP panns_hear 0.0000 0.0000
Descript/MARL wav2clip_hear 0.0000 0.0000
Soundsensing yamnet_hear 0.0000 0.0000

Mridangam Stroke and Tonic

Mridangam Stroke

Team Name Submission Accuracy
NTU-GURA fusion_cat_xwc_time 0.975
NTU-GURA fusion_hubert_xlarge 0.974
Logitech AI serab_byols 0.973
NTU-GURA fusion_cat_xwc 0.972
ID56-SSL kwmlp 0.969
MARL + Soundsensing openl3_hear 0.967
CP-JKU base 0.965
CP-JKU base2level 0.965
CP-JKU base2levelmel 0.965
NTU-GURA fusion_wav2vec2 0.962
NTU-GURA hubert_xlarge 0.953
NTU-GURA cat_hubert_wav2vec2 0.951
RedRice efficient_latent 0.949
Descript/MARL wav2clip_hear 0.947
NTU-GURA avg_hubert_wav2vec2 0.946
HEAR wav2vec2 0.943
CVSSP panns_hear 0.939
NTU-GURA cat_xwc 0.938
NTU-GURA avg_hubert_crepe 0.917
NTU-GURA avg_xwc 0.914
ibkuroyagi hearline 0.909
HEAR torchcrepe 0.898
NTU-GURA cat_wc 0.898
ID56-SSL audiomlp 0.893
UDONS embed 0.880
Stellenbosch LSL audio_dbert 0.835
AMAAI my_model 0.759
AMAAI Lab wav2vec2_ddsp 0.653

Mridangam Tonic

Team Name Submission Accuracy
ID56-SSL kwmlp 0.942
MARL + Soundsensing openl3_hear 0.937
Logitech AI serab_byols 0.928
NTU-GURA fusion_cat_xwc_time 0.924
NTU-GURA fusion_cat_xwc 0.923
NTU-GURA fusion_hubert_xlarge 0.909
NTU-GURA cat_xwc 0.859
NTU-GURA hubert_xlarge 0.850
RedRice efficient_latent 0.843
NTU-GURA fusion_wav2vec2 0.838
NTU-GURA avg_xwc 0.837
NTU-GURA avg_hubert_crepe 0.831
Descript/MARL wav2clip_hear 0.829
HEAR wav2vec2 0.828
CVSSP panns_hear 0.824
HEAR torchcrepe 0.824
NTU-GURA cat_wc 0.823
CP-JKU base 0.819
CP-JKU base2level 0.819
CP-JKU base2levelmel 0.819
NTU-GURA cat_hubert_wav2vec2 0.802
NTU-GURA avg_hubert_wav2vec2 0.799
ibkuroyagi hearline 0.783
ID56-SSL audiomlp 0.778
UDONS embed 0.730
Stellenbosch LSL audio_dbert 0.685
AMAAI my_model 0.415
AMAAI Lab wav2vec2_ddsp 0.357

Vocal Imitations

Team Name Submission mAP Accuracy
NTU-GURA fusion_cat_xwc_time 0.215 0.196
NTU-GURA fusion_cat_xwc 0.197 0.185
NTU-GURA fusion_hubert_xlarge 0.185 0.172
CP-JKU base 0.182 0.162
CP-JKU base2level 0.182 0.162
CP-JKU base2levelmel 0.182 0.162
NTU-GURA fusion_wav2vec2 0.174 0.160
NTU-GURA cat_hubert_wav2vec2 0.166 0.157
NTU-GURA avg_hubert_wav2vec2 0.165 0.153
Stellenbosch LSL audio_dbert 0.161 0.145
Logitech AI serab_byols 0.160 0.148
NTU-GURA hubert_xlarge 0.154 0.142
RedRice efficient_latent 0.138 0.134
CVSSP panns_hear 0.127 0.116
NTU-GURA cat_xwc 0.111 0.109
Soundsensing yamnet_hear 0.085 0.080
NTU-GURA avg_xwc 0.084 0.082
Descript/MARL wav2clip_hear 0.083 0.073
HEAR wav2vec2 0.080 0.068
NTU-GURA avg_hubert_crepe 0.079 0.079
MARL + Soundsensing openl3_hear 0.078 0.060
NTU-GURA cat_wc 0.076 0.076
AMAAI Lab wav2vec2_ddsp 0.072 0.052
UDONS embed 0.068 0.062
AMAAI my_model 0.064 0.038
ID56-SSL kwmlp 0.056 0.049
HEAR torchcrepe 0.051 0.050
ibkuroyagi hearline 0.040 0.038
ID56-SSL audiomlp 0.038 0.036

Vox Lingua Top 10

Team Name Submission Accuracy
NTU-GURA fusion_cat_xwc 0.720
NTU-GURA fusion_hubert_xlarge 0.714
NTU-GURA cat_hubert_wav2vec2 0.706
NTU-GURA fusion_wav2vec2 0.706
NTU-GURA avg_hubert_wav2vec2 0.690
NTU-GURA hubert_xlarge 0.637
NTU-GURA fusion_cat_xwc_time 0.629
HEAR wav2vec2 0.493
NTU-GURA cat_xwc 0.460
Logitech AI serab_byols 0.458
MARL + Soundsensing openl3_hear 0.331
NTU-GURA avg_xwc 0.321
NTU-GURA cat_wc 0.310
NTU-GURA avg_hubert_crepe 0.293
CP-JKU base2levelmel 0.259
CP-JKU base 0.259
CP-JKU base2level 0.259
RedRice efficient_latent 0.255
Stellenbosch LSL audio_dbert 0.246
CVSSP panns_hear 0.244
UDONS embed 0.224
Soundsensing yamnet_hear 0.202
Descript/MARL wav2clip_hear 0.192
ID56-SSL kwmlp 0.181
ibkuroyagi hearline 0.180
AMAAI my_model 0.169
HEAR torchcrepe 0.142
ID56-SSL audiomlp 0.129
AMAAI Lab wav2vec2_ddsp 0.106