HEAR 2021 NeurIPS Challenge
Results
These are the final evaluation scores for HEAR 2021 [CSV].
Downstream evaluation on each task involves two steps:
(a) computing audio embeddings and (b) learning a shallow
fully-connected predictor on those embeddings. The downstream
predictor was chosen to optimize the task's objective, measured
on the validation set.
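The selection step can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` class, the hyperparameters `hidden_layers` and `learning_rate`, and all scores are hypothetical and do not come from the HEAR evaluation code. It shows the general idea of picking a predictor by validation score alone and only then reading off its test score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One trained downstream predictor (hypothetical fields and values)."""
    hidden_layers: int
    learning_rate: float
    validation_score: float
    test_score: float


def select_predictor(candidates):
    """Return the candidate with the best validation score.

    The test score is never consulted during selection; it is only
    reported for the candidate that was chosen.
    """
    if not candidates:
        raise ValueError("need at least one candidate predictor")
    return max(candidates, key=lambda c: c.validation_score)


# Made-up scores, assuming a higher-is-better objective such as accuracy.
candidates = [
    Candidate(1, 1e-3, validation_score=0.91, test_score=0.90),
    Candidate(2, 1e-3, validation_score=0.94, test_score=0.92),
    Candidate(2, 1e-4, validation_score=0.93, test_score=0.95),
]
best = select_predictor(candidates)
print(best.hidden_layers, best.learning_rate, best.test_score)  # prints: 2 0.001 0.92
```

Note that the third candidate has the highest test score but is not selected, because only the validation score drives the choice.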
We used a V100 GPU for each open task and an A100
GPU for each secret task. We do not publish a score
for a submission on any task where either downstream step
exceeded 20 GPU-hours, or where the submission exceeded
GPU memory.
(An older leaderboard, available here, had incorrect scores for all TFDS datasets. Those scores were retracted and have been corrected in the current leaderboard.)
You can read about the evaluation in detail in our upcoming
journal article, to appear in the
PMLR Issue on NeurIPS 2021 Competitions.
Speech Commands
Speech Commands
Team Name | Submission | Accuracy
ID56-SSL | kwmlp | 0.978
NTU-GURA | fusion_wav2vec2 | 0.969
NTU-GURA | fusion_cat_xwc_time | 0.968
NTU-GURA | fusion_cat_xwc | 0.968
NTU-GURA | fusion_hubert_xlarge | 0.957
NTU-GURA | hubert_xlarge | 0.954
NTU-GURA | avg_hubert_wav2vec2 | 0.952
Logitech AI | serab_byols | 0.948
NTU-GURA | cat_xwc | 0.943
NTU-GURA | cat_hubert_wav2vec2 | 0.929
NTU-GURA | cat_wc | 0.919
NTU-GURA | avg_xwc | 0.881
HEAR | wav2vec2 | 0.879
NTU-GURA | avg_hubert_crepe | 0.823
MARL + Soundsensing | openl3_hear | 0.763
RedRice | efficient_latent | 0.676
CP-JKU | base | 0.639
CP-JKU | base2level | 0.639
CP-JKU | base2levelmel | 0.639
Stellenbosch LSL | audio_dbert | 0.630
CVSSP | panns_hear | 0.618
ibkuroyagi | hearline | 0.580
ID56-SSL | audiomlp | 0.535
UDONS | embed | 0.531
AMAAI | my_model | 0.523
AMAAI Lab | wav2vec2_ddsp | 0.490
Soundsensing | yamnet_hear | 0.410
Descript/MARL | wav2clip_hear | 0.347
HEAR | torchcrepe | 0.211
Speech Commands 5h
Team Name | Submission | Accuracy
ID56-SSL | kwmlp | 0.976
NTU-GURA | fusion_cat_xwc | 0.961
NTU-GURA | fusion_wav2vec2 | 0.957
NTU-GURA | hubert_xlarge | 0.953
NTU-GURA | fusion_cat_xwc_time | 0.951
NTU-GURA | fusion_hubert_xlarge | 0.947
NTU-GURA | cat_hubert_wav2vec2 | 0.936
NTU-GURA | cat_xwc | 0.927
Logitech AI | serab_byols | 0.914
NTU-GURA | avg_hubert_wav2vec2 | 0.907
NTU-GURA | cat_wc | 0.885
HEAR | wav2vec2 | 0.838
NTU-GURA | avg_xwc | 0.822
AMAAI | my_model | 0.744
NTU-GURA | avg_hubert_crepe | 0.737
CP-JKU | base | 0.681
CP-JKU | base2level | 0.681
CP-JKU | base2levelmel | 0.681
MARL + Soundsensing | openl3_hear | 0.680
RedRice | efficient_latent | 0.573
Stellenbosch LSL | audio_dbert | 0.566
CVSSP | panns_hear | 0.560
AMAAI Lab | wav2vec2_ddsp | 0.487
UDONS | embed | 0.479
ibkuroyagi | hearline | 0.478
ID56-SSL | audiomlp | 0.392
Descript/MARL | wav2clip_hear | 0.316
Soundsensing | yamnet_hear | 0.289
HEAR | torchcrepe | 0.180
NSynth Pitch
NSynth Pitch 5h
Team Name | Submission | Pitch Acc | Chroma Acc
NTU-GURA | avg_hubert_crepe | 0.878 | 0.936
HEAR | torchcrepe | 0.870 | 0.926
NTU-GURA | cat_wc | 0.868 | 0.930
NTU-GURA | cat_xwc | 0.866 | 0.930
NTU-GURA | avg_xwc | 0.862 | 0.926
NTU-GURA | fusion_cat_xwc_time | 0.854 | 0.900
NTU-GURA | fusion_cat_xwc | 0.846 | 0.892
UDONS | embed | 0.692 | 0.772
MARL + Soundsensing | openl3_hear | 0.560 | 0.590
Stellenbosch LSL | audio_dbert | 0.524 | 0.600
ID56-SSL | kwmlp | 0.440 | 0.480
HEAR | wav2vec2 | 0.402 | 0.438
Logitech AI | serab_byols | 0.396 | 0.414
ibkuroyagi | hearline | 0.394 | 0.464
ID56-SSL | audiomlp | 0.386 | 0.452
NTU-GURA | fusion_hubert_xlarge | 0.382 | 0.410
NTU-GURA | fusion_wav2vec2 | 0.330 | 0.354
CP-JKU | base | 0.256 | 0.290
CP-JKU | base2level | 0.256 | 0.290
CP-JKU | base2levelmel | 0.256 | 0.290
Descript/MARL | wav2clip_hear | 0.230 | 0.254
NTU-GURA | avg_hubert_wav2vec2 | 0.198 | 0.220
NTU-GURA | cat_hubert_wav2vec2 | 0.198 | 0.230
NTU-GURA | hubert_xlarge | 0.184 | 0.204
AMAAI | my_model | 0.182 | 0.202
AMAAI Lab | wav2vec2_ddsp | 0.172 | 0.190
RedRice | efficient_latent | 0.168 | 0.184
Soundsensing | yamnet_hear | 0.158 | 0.190
CVSSP | panns_hear | 0.148 | 0.162
NSynth Pitch 50h
Team Name | Submission | Pitch Acc | Chroma Acc
HEAR | torchcrepe | 0.900 | 0.957
NTU-GURA | cat_wc | 0.899 | 0.955
NTU-GURA | cat_xwc | 0.897 | 0.955
NTU-GURA | avg_hubert_crepe | 0.897 | 0.956
NTU-GURA | avg_xwc | 0.896 | 0.955
NTU-GURA | fusion_cat_xwc_time | 0.891 | 0.950
NTU-GURA | fusion_cat_xwc | 0.885 | 0.945
UDONS | embed | 0.795 | 0.863
Stellenbosch LSL | audio_dbert | 0.737 | 0.816
MARL + Soundsensing | openl3_hear | 0.731 | 0.773
Logitech AI | serab_byols | 0.712 | 0.742
NTU-GURA | fusion_hubert_xlarge | 0.688 | 0.729
HEAR | wav2vec2 | 0.653 | 0.706
ID56-SSL | audiomlp | 0.608 | 0.668
NTU-GURA | fusion_wav2vec2 | 0.606 | 0.646
ID56-SSL | kwmlp | 0.605 | 0.648
ibkuroyagi | hearline | 0.589 | 0.656
CP-JKU | base | 0.541 | 0.566
CP-JKU | base2level | 0.541 | 0.566
CP-JKU | base2levelmel | 0.541 | 0.566
Descript/MARL | wav2clip_hear | 0.439 | 0.464
NTU-GURA | hubert_xlarge | 0.429 | 0.458
NTU-GURA | cat_hubert_wav2vec2 | 0.428 | 0.460
NTU-GURA | avg_hubert_wav2vec2 | 0.415 | 0.444
RedRice | efficient_latent | 0.391 | 0.418
Soundsensing | yamnet_hear | 0.321 | 0.345
AMAAI Lab | wav2vec2_ddsp | 0.303 | 0.332
CVSSP | panns_hear | 0.301 | 0.323
AMAAI | my_model | 0.176 | 0.210
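The NSynth Pitch tables report both Pitch Acc and Chroma Acc, and the chroma figure is never lower in any row. The sketch below shows one common way such a pair of metrics is computed. It rests on two assumptions that this page does not state: that pitches are MIDI note numbers, and that chroma accuracy counts a prediction as correct when its pitch class (note number modulo 12) matches the target, so octave errors are forgiven. The function names and the example values are hypothetical.

```python
def pitch_accuracy(predicted, target):
    """Fraction of predictions that match the target note exactly."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / len(target)


def chroma_accuracy(predicted, target):
    """Fraction of predictions whose pitch class matches the target.

    Assumes MIDI note numbers, so notes an octave apart differ by 12
    and share the same value modulo 12.
    """
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p % 12 == t % 12 for p, t in zip(predicted, target))
    return hits / len(target)


# Made-up example: the second prediction is one octave too high,
# the third is off by a semitone.
target = [60, 62, 64, 65]
predicted = [60, 74, 63, 65]
print(pitch_accuracy(predicted, target))   # prints: 0.5
print(chroma_accuracy(predicted, target))  # prints: 0.75
```

Under this definition every exact match is also a chroma match, so chroma accuracy can never fall below pitch accuracy.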
DCASE 2016 Task 2
Team Name | Submission | Event Onset FMS | Segment Error Rate
CP-JKU | base2levelmel | 0.925 | 0.099
CP-JKU | base2level | 0.913 | 0.102
MARL + Soundsensing | openl3_hear | 0.833 | 0.174
NTU-GURA | fusion_cat_xwc | 0.826 | 0.145
NTU-GURA | fusion_cat_xwc_time | 0.826 | 0.145
NTU-GURA | fusion_hubert_xlarge | 0.826 | 0.150
NTU-GURA | fusion_wav2vec2 | 0.798 | 0.163
RedRice | efficient_latent | 0.790 | 0.231
CP-JKU | base | 0.788 | 0.193
NTU-GURA | cat_xwc | 0.681 | 0.287
UDONS | embed | 0.668 | 0.310
HEAR | wav2vec2 | 0.663 | 0.266
Logitech AI | serab_byols | 0.642 | 0.382
NTU-GURA | avg_xwc | 0.624 | 0.333
NTU-GURA | avg_hubert_crepe | 0.610 | 0.337
NTU-GURA | cat_wc | 0.585 | 0.392
NTU-GURA | hubert_xlarge | 0.584 | 0.394
ID56-SSL | kwmlp | 0.518 | 0.527
AMAAI | my_model | 0.510 | 0.463
AMAAI Lab | wav2vec2_ddsp | 0.507 | 0.502
HEAR | torchcrepe | 0.504 | 0.493
NTU-GURA | cat_hubert_wav2vec2 | 0.452 | 0.574
NTU-GURA | avg_hubert_wav2vec2 | 0.417 | 0.582
Stellenbosch LSL | audio_dbert | 0.246 | 0.747
ID56-SSL | audiomlp | 0.052 | 0.965
Soundsensing | yamnet_hear | 0.008 | 1.003
CVSSP | panns_hear | 0.000 | 1.000
Descript/MARL | wav2clip_hear | 0.000 | 1.000
Beehive States
Team Name | Submission | AUCROC | Accuracy
UDONS | embed | 0.878 | 0.684
Descript/MARL | wav2clip_hear | 0.770 | 0.684
ID56-SSL | kwmlp | 0.760 | 0.552
Stellenbosch LSL | audio_dbert | 0.697 | 0.582
MARL + Soundsensing | openl3_hear | 0.604 | 0.498
HEAR | torchcrepe | 0.593 | 0.535
Logitech AI | serab_byols | 0.549 | 0.524
ID56-SSL | audiomlp | 0.535 | 0.500
RedRice | efficient_latent | 0.533 | 0.568
Soundsensing | yamnet_hear | 0.466 | 0.479
CVSSP | panns_hear | 0.446 | 0.429
Beijing Opera Percussion
Team Name | Submission | Accuracy
MARL + Soundsensing | openl3_hear | 0.975
NTU-GURA | fusion_cat_xwc | 0.966
CP-JKU | base | 0.966
CP-JKU | base2level | 0.966
CP-JKU | base2levelmel | 0.966
NTU-GURA | fusion_cat_xwc_time | 0.962
RedRice | efficient_latent | 0.953
Logitech AI | serab_byols | 0.953
NTU-GURA | fusion_hubert_xlarge | 0.949
NTU-GURA | fusion_wav2vec2 | 0.945
NTU-GURA | avg_xwc | 0.945
NTU-GURA | hubert_xlarge | 0.945
NTU-GURA | avg_hubert_wav2vec2 | 0.941
Soundsensing | yamnet_hear | 0.941
Descript/MARL | wav2clip_hear | 0.936
NTU-GURA | cat_xwc | 0.936
NTU-GURA | cat_hubert_wav2vec2 | 0.936
NTU-GURA | avg_hubert_crepe | 0.932
HEAR | torchcrepe | 0.928
ibkuroyagi | hearline | 0.928
UDONS | embed | 0.928
NTU-GURA | cat_wc | 0.920
Stellenbosch LSL | audio_dbert | 0.919
CVSSP | panns_hear | 0.911
ID56-SSL | kwmlp | 0.911
HEAR | wav2vec2 | 0.907
AMAAI | my_model | 0.826
ID56-SSL | audiomlp | 0.728
AMAAI Lab | wav2vec2_ddsp | 0.635
CREMA-D
Team Name | Submission | Accuracy
Logitech AI | serab_byols | 0.657
RedRice | efficient_latent | 0.575
CVSSP | panns_hear | 0.555
HEAR | wav2vec2 | 0.656
Descript/MARL | wav2clip_hear | 0.512
MARL + Soundsensing | openl3_hear | 0.550
ID56-SSL | kwmlp | 0.424
HEAR | torchcrepe | 0.383
NTU-GURA | avg_hubert_crepe | 0.540
NTU-GURA | fusion_cat_xwc_time | 0.743
NTU-GURA | cat_xwc | 0.639
NTU-GURA | avg_xwc | 0.547
NTU-GURA | cat_wc | 0.460
AMAAI | my_model | 0.391
Soundsensing | yamnet_hear | 0.453
NTU-GURA | fusion_cat_xwc | 0.747
NTU-GURA | avg_hubert_wav2vec2 | 0.699
NTU-GURA | cat_hubert_wav2vec2 | 0.698
CP-JKU | base | 0.610
CP-JKU | base2level | 0.610
CP-JKU | base2levelmel | 0.610
NTU-GURA | hubert_xlarge | 0.690
NTU-GURA | fusion_wav2vec2 | 0.692
UDONS | embed | 0.441
ID56-SSL | audiomlp | 0.420
NTU-GURA | fusion_hubert_xlarge | 0.752
ibkuroyagi | hearline | 0.480
Stellenbosch LSL | audio_dbert | 0.522
ESC-50
Team Name | Submission | Accuracy
CP-JKU | base | 0.947
CP-JKU | base2level | 0.947
CP-JKU | base2levelmel | 0.947
RedRice | efficient_latent | 0.935
CVSSP | panns_hear | 0.909
Soundsensing | yamnet_hear | 0.838
Logitech AI | serab_byols | 0.805
Descript/MARL | wav2clip_hear | 0.759
MARL + Soundsensing | openl3_hear | 0.751
NTU-GURA | fusion_hubert_xlarge | 0.743
NTU-GURA | fusion_cat_xwc | 0.734
NTU-GURA | fusion_wav2vec2 | 0.695
NTU-GURA | fusion_cat_xwc_time | 0.653
NTU-GURA | hubert_xlarge | 0.603
NTU-GURA | avg_hubert_wav2vec2 | 0.587
NTU-GURA | cat_hubert_wav2vec2 | 0.586
HEAR | wav2vec2 | 0.561
Stellenbosch LSL | audio_dbert | 0.532
NTU-GURA | cat_xwc | 0.511
AMAAI | my_model | 0.511
NTU-GURA | avg_xwc | 0.450
NTU-GURA | avg_hubert_crepe | 0.437
ibkuroyagi | hearline | 0.424
UDONS | embed | 0.401
ID56-SSL | kwmlp | 0.367
NTU-GURA | cat_wc | 0.343
HEAR | torchcrepe | 0.300
AMAAI Lab | wav2vec2_ddsp | 0.280
ID56-SSL | audiomlp | 0.161
FSD50k
Team Name | Submission | mAP | d'
CP-JKU | base | 0.641 | 2.652
CP-JKU | base2levelmel | 0.641 | 2.650
CP-JKU | base2level | 0.641 | 2.650
RedRice | efficient_latent | 0.607 | 2.538
Logitech AI | serab_byols | 0.509 | 2.218
MARL + Soundsensing | openl3_hear | 0.447 | 2.117
NTU-GURA | fusion_cat_xwc | 0.420 | 2.052
NTU-GURA | fusion_hubert_xlarge | 0.413 | 2.079
NTU-GURA | fusion_wav2vec2 | 0.403 | 2.017
NTU-GURA | fusion_cat_xwc_time | 0.374 | 1.838
Descript/MARL | wav2clip_hear | 0.362 | 1.980
HEAR | wav2vec2 | 0.342 | 1.924
NTU-GURA | cat_hubert_wav2vec2 | 0.323 | 1.857
NTU-GURA | avg_hubert_wav2vec2 | 0.318 | 1.868
NTU-GURA | cat_xwc | 0.314 | 1.847
NTU-GURA | hubert_xlarge | 0.314 | 1.858
AMAAI Lab | wav2vec2_ddsp | 0.268 | 1.820
NTU-GURA | avg_xwc | 0.264 | 1.684
Stellenbosch LSL | audio_dbert | 0.263 | 1.647
NTU-GURA | avg_hubert_crepe | 0.253 | 1.683
NTU-GURA | cat_wc | 0.234 | 1.574
AMAAI | my_model | 0.198 | 1.666
ID56-SSL | kwmlp | 0.187 | 1.511
ibkuroyagi | hearline | 0.178 | 1.399
HEAR | torchcrepe | 0.159 | 1.456
ID56-SSL | audiomlp | 0.086 | 0.981
Gunshot Triangulation
Team Name | Submission | Accuracy
NTU-GURA | fusion_wav2vec2 | 0.967
MARL + Soundsensing | openl3_hear | 0.949
CP-JKU | base | 0.940
CP-JKU | base2level | 0.940
CP-JKU | base2levelmel | 0.940
NTU-GURA | fusion_cat_xwc | 0.935
ID56-SSL | kwmlp | 0.932
NTU-GURA | hubert_xlarge | 0.932
NTU-GURA | fusion_hubert_xlarge | 0.929
Descript/MARL | wav2clip_hear | 0.929
NTU-GURA | fusion_cat_xwc_time | 0.905
NTU-GURA | cat_xwc | 0.881
RedRice | efficient_latent | 0.878
NTU-GURA | cat_hubert_wav2vec2 | 0.869
UDONS | embed | 0.866
HEAR | torchcrepe | 0.863
NTU-GURA | avg_xwc | 0.857
AMAAI | my_model | 0.857
Logitech AI | serab_byols | 0.857
HEAR | wav2vec2 | 0.848
NTU-GURA | avg_hubert_crepe | 0.845
NTU-GURA | cat_wc | 0.833
Stellenbosch LSL | audio_dbert | 0.810
NTU-GURA | avg_hubert_wav2vec2 | 0.810
CVSSP | panns_hear | 0.798
ID56-SSL | audiomlp | 0.786
Soundsensing | yamnet_hear | 0.732
AMAAI Lab | wav2vec2_ddsp | 0.583
ibkuroyagi | hearline | 0.518
GTZAN Genre
Team Name | Submission | Accuracy
MARL + Soundsensing | openl3_hear | 0.879
RedRice | efficient_latent | 0.878
Logitech AI | serab_byols | 0.837
CVSSP | panns_hear | 0.860
NTU-GURA | cat_wc | 0.681
NTU-GURA | avg_xwc | 0.706
NTU-GURA | avg_hubert_crepe | 0.698
NTU-GURA | cat_xwc | 0.722
HEAR | torchcrepe | 0.645
Descript/MARL | wav2clip_hear | 0.748
NTU-GURA | fusion_cat_xwc_time | 0.760
HEAR | wav2vec2 | 0.780
NTU-GURA | fusion_cat_xwc | 0.805
ID56-SSL | kwmlp | 0.554
Soundsensing | yamnet_hear | 0.847
CP-JKU | base | 0.883
CP-JKU | base2level | 0.883
CP-JKU | base2levelmel | 0.883
AMAAI | my_model | 0.605
NTU-GURA | cat_hubert_wav2vec2 | 0.734
NTU-GURA | avg_hubert_wav2vec2 | 0.746
NTU-GURA | hubert_xlarge | 0.735
ibkuroyagi | hearline | 0.654
UDONS | embed | 0.681
NTU-GURA | fusion_wav2vec2 | 0.793
NTU-GURA | fusion_hubert_xlarge | 0.796
Stellenbosch LSL | audio_dbert | 0.674
ID56-SSL | audiomlp | 0.408
GTZAN Music Speech
Team Name | Submission | Accuracy
Logitech AI | serab_byols | 0.938
CVSSP | panns_hear | 0.992
RedRice | efficient_latent | 0.968
Soundsensing | yamnet_hear | 0.969
NTU-GURA | avg_xwc | 0.937
MARL + Soundsensing | openl3_hear | 0.969
NTU-GURA | fusion_cat_xwc_time | 0.944
HEAR | torchcrepe | 0.929
NTU-GURA | avg_hubert_crepe | 0.946
Descript/MARL | wav2clip_hear | 0.946
NTU-GURA | cat_xwc | 0.961
NTU-GURA | cat_wc | 0.938
NTU-GURA | fusion_cat_xwc | 0.928
HEAR | wav2vec2 | 0.946
AMAAI | my_model | 0.907
ID56-SSL | kwmlp | 0.889
NTU-GURA | fusion_wav2vec2 | 0.953
NTU-GURA | cat_hubert_wav2vec2 | 0.936
CP-JKU | base | 0.977
CP-JKU | base2level | 0.977
CP-JKU | base2levelmel | 0.977
NTU-GURA | hubert_xlarge | 0.913
NTU-GURA | avg_hubert_wav2vec2 | 0.928
ID56-SSL | audiomlp | 0.774
NTU-GURA | fusion_hubert_xlarge | 0.936
Stellenbosch LSL | audio_dbert | 0.969
UDONS | embed | 0.899
ibkuroyagi | hearline | 0.931
LibriCount
Team Name | Submission | Accuracy
Logitech AI | serab_byols | 0.785
NTU-GURA | fusion_cat_xwc | 0.697
HEAR | wav2vec2 | 0.692
NTU-GURA | fusion_hubert_xlarge | 0.683
CP-JKU | base | 0.660
CP-JKU | base2level | 0.660
CP-JKU | base2levelmel | 0.660
NTU-GURA | fusion_cat_xwc_time | 0.659
NTU-GURA | fusion_wav2vec2 | 0.653
Soundsensing | yamnet_hear | 0.653
CVSSP | panns_hear | 0.652
RedRice | efficient_latent | 0.651
NTU-GURA | hubert_xlarge | 0.646
MARL + Soundsensing | openl3_hear | 0.641
NTU-GURA | cat_xwc | 0.639
NTU-GURA | avg_xwc | 0.631
NTU-GURA | cat_hubert_wav2vec2 | 0.627
NTU-GURA | avg_hubert_wav2vec2 | 0.623
NTU-GURA | avg_hubert_crepe | 0.622
Stellenbosch LSL | audio_dbert | 0.584
NTU-GURA | cat_wc | 0.569
Descript/MARL | wav2clip_hear | 0.528
AMAAI Lab | wav2vec2_ddsp | 0.510
HEAR | torchcrepe | 0.499
ibkuroyagi | hearline | 0.498
UDONS | embed | 0.488
ID56-SSL | kwmlp | 0.451
AMAAI | my_model | 0.432
ID56-SSL | audiomlp | 0.366
MAESTRO 5h
Team Name | Submission | Note Onset FMS | Note Onset w/ Offset FMS
NTU-GURA | cat_xwc | 0.4691 | 0.1552
NTU-GURA | cat_wc | 0.4626 | 0.1496
NTU-GURA | avg_hubert_crepe | 0.4624 | 0.1481
NTU-GURA | avg_xwc | 0.4599 | 0.1486
NTU-GURA | fusion_cat_xwc | 0.4413 | 0.1606
NTU-GURA | fusion_cat_xwc_time | 0.4413 | 0.1606
HEAR | torchcrepe | 0.4011 | 0.1537
UDONS | embed | 0.2386 | 0.0756
NTU-GURA | fusion_hubert_xlarge | 0.1657 | 0.0535
NTU-GURA | fusion_wav2vec2 | 0.1110 | 0.0337
ID56-SSL | kwmlp | 0.0648 | 0.0210
HEAR | wav2vec2 | 0.0329 | 0.0101
ID56-SSL | audiomlp | 0.0237 | 0.0095
MARL + Soundsensing | openl3_hear | 0.0165 | 0.0023
Logitech AI | serab_byols | 0.0079 | 0.0011
NTU-GURA | hubert_xlarge | 0.0068 | 0.0018
NTU-GURA | avg_hubert_wav2vec2 | 0.0044 | 0.0013
NTU-GURA | cat_hubert_wav2vec2 | 0.0035 | 0.0007
AMAAI | my_model | 0.0026 | 0.0006
ibkuroyagi | hearline | 0.0016 | 0.0005
AMAAI Lab | wav2vec2_ddsp | 0.0003 | 0.0001
RedRice | efficient_latent | 0.0002 | 0.0000
Stellenbosch LSL | audio_dbert | 0.0002 | 0.0000
CVSSP | panns_hear | 0.0000 | 0.0000
Descript/MARL | wav2clip_hear | 0.0000 | 0.0000
Soundsensing | yamnet_hear | 0.0000 | 0.0000
Mridangam Stroke and Tonic
Mridangam Stroke
Team Name | Submission | Accuracy
NTU-GURA | fusion_cat_xwc_time | 0.975
NTU-GURA | fusion_hubert_xlarge | 0.974
Logitech AI | serab_byols | 0.973
NTU-GURA | fusion_cat_xwc | 0.972
ID56-SSL | kwmlp | 0.969
MARL + Soundsensing | openl3_hear | 0.967
CP-JKU | base | 0.965
CP-JKU | base2level | 0.965
CP-JKU | base2levelmel | 0.965
NTU-GURA | fusion_wav2vec2 | 0.962
NTU-GURA | hubert_xlarge | 0.953
NTU-GURA | cat_hubert_wav2vec2 | 0.951
RedRice | efficient_latent | 0.949
Descript/MARL | wav2clip_hear | 0.947
NTU-GURA | avg_hubert_wav2vec2 | 0.946
HEAR | wav2vec2 | 0.943
CVSSP | panns_hear | 0.939
NTU-GURA | cat_xwc | 0.938
NTU-GURA | avg_hubert_crepe | 0.917
NTU-GURA | avg_xwc | 0.914
ibkuroyagi | hearline | 0.909
HEAR | torchcrepe | 0.898
NTU-GURA | cat_wc | 0.898
ID56-SSL | audiomlp | 0.893
UDONS | embed | 0.880
Stellenbosch LSL | audio_dbert | 0.835
AMAAI | my_model | 0.759
AMAAI Lab | wav2vec2_ddsp | 0.653
Mridangam Tonic
Team Name | Submission | Accuracy
ID56-SSL | kwmlp | 0.942
MARL + Soundsensing | openl3_hear | 0.937
Logitech AI | serab_byols | 0.928
NTU-GURA | fusion_cat_xwc_time | 0.924
NTU-GURA | fusion_cat_xwc | 0.923
NTU-GURA | fusion_hubert_xlarge | 0.909
NTU-GURA | cat_xwc | 0.859
NTU-GURA | hubert_xlarge | 0.850
RedRice | efficient_latent | 0.843
NTU-GURA | fusion_wav2vec2 | 0.838
NTU-GURA | avg_xwc | 0.837
NTU-GURA | avg_hubert_crepe | 0.831
Descript/MARL | wav2clip_hear | 0.829
HEAR | wav2vec2 | 0.828
CVSSP | panns_hear | 0.824
HEAR | torchcrepe | 0.824
NTU-GURA | cat_wc | 0.823
CP-JKU | base | 0.819
CP-JKU | base2level | 0.819
CP-JKU | base2levelmel | 0.819
NTU-GURA | cat_hubert_wav2vec2 | 0.802
NTU-GURA | avg_hubert_wav2vec2 | 0.799
ibkuroyagi | hearline | 0.783
ID56-SSL | audiomlp | 0.778
UDONS | embed | 0.730
Stellenbosch LSL | audio_dbert | 0.685
AMAAI | my_model | 0.415
AMAAI Lab | wav2vec2_ddsp | 0.357
Vocal Imitations
Team Name | Submission | mAP | Accuracy
NTU-GURA | fusion_cat_xwc_time | 0.215 | 0.196
NTU-GURA | fusion_cat_xwc | 0.197 | 0.185
NTU-GURA | fusion_hubert_xlarge | 0.185 | 0.172
CP-JKU | base | 0.182 | 0.162
CP-JKU | base2level | 0.182 | 0.162
CP-JKU | base2levelmel | 0.182 | 0.162
NTU-GURA | fusion_wav2vec2 | 0.174 | 0.160
NTU-GURA | cat_hubert_wav2vec2 | 0.166 | 0.157
NTU-GURA | avg_hubert_wav2vec2 | 0.165 | 0.153
Stellenbosch LSL | audio_dbert | 0.161 | 0.145
Logitech AI | serab_byols | 0.160 | 0.148
NTU-GURA | hubert_xlarge | 0.154 | 0.142
RedRice | efficient_latent | 0.138 | 0.134
CVSSP | panns_hear | 0.127 | 0.116
NTU-GURA | cat_xwc | 0.111 | 0.109
Soundsensing | yamnet_hear | 0.085 | 0.080
NTU-GURA | avg_xwc | 0.084 | 0.082
Descript/MARL | wav2clip_hear | 0.083 | 0.073
HEAR | wav2vec2 | 0.080 | 0.068
NTU-GURA | avg_hubert_crepe | 0.079 | 0.079
MARL + Soundsensing | openl3_hear | 0.078 | 0.060
NTU-GURA | cat_wc | 0.076 | 0.076
AMAAI Lab | wav2vec2_ddsp | 0.072 | 0.052
UDONS | embed | 0.068 | 0.062
AMAAI | my_model | 0.064 | 0.038
ID56-SSL | kwmlp | 0.056 | 0.049
HEAR | torchcrepe | 0.051 | 0.050
ibkuroyagi | hearline | 0.040 | 0.038
ID56-SSL | audiomlp | 0.038 | 0.036
Vox Lingua Top 10
Team Name | Submission | Accuracy
NTU-GURA | fusion_cat_xwc | 0.720
NTU-GURA | fusion_hubert_xlarge | 0.714
NTU-GURA | cat_hubert_wav2vec2 | 0.706
NTU-GURA | fusion_wav2vec2 | 0.706
NTU-GURA | avg_hubert_wav2vec2 | 0.690
NTU-GURA | hubert_xlarge | 0.637
NTU-GURA | fusion_cat_xwc_time | 0.629
HEAR | wav2vec2 | 0.493
NTU-GURA | cat_xwc | 0.460
Logitech AI | serab_byols | 0.458
MARL + Soundsensing | openl3_hear | 0.331
NTU-GURA | avg_xwc | 0.321
NTU-GURA | cat_wc | 0.310
NTU-GURA | avg_hubert_crepe | 0.293
CP-JKU | base2levelmel | 0.259
CP-JKU | base | 0.259
CP-JKU | base2level | 0.259
RedRice | efficient_latent | 0.255
Stellenbosch LSL | audio_dbert | 0.246
CVSSP | panns_hear | 0.244
UDONS | embed | 0.224
Soundsensing | yamnet_hear | 0.202
Descript/MARL | wav2clip_hear | 0.192
ID56-SSL | kwmlp | 0.181
ibkuroyagi | hearline | 0.180
AMAAI | my_model | 0.169
HEAR | torchcrepe | 0.142
ID56-SSL | audiomlp | 0.129
AMAAI Lab | wav2vec2_ddsp | 0.106