HEAR 2021 Datasets


These are the evaluation tasks for the HEAR (Holistic Evaluation of Audio Representations) 2021 NeurIPS challenge.

The datasets have different open licenses. Please see LICENSE.txt for each individual dataset's license.

All datasets were normalized to a common human-readable format using hearpreprocess. Until 2022-04-01, datasets will be mirrored at data.neuralaudio.ai. Longer term, our Zenodo mirror has all audio tasks, but only at a 48000 Hz sampling rate. For other sampling rates (16000, 22050, 32000, 44100), please download files (requester pays) from Google Storage gs://hear2021-archive/tasks/ or AWS s3://hear2021-archive/tasks/ (with the CLI flag --request-payer requester).
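
The following is a minimal sketch of how the requester-pays download from the AWS mirror could be scripted. It only assembles an `aws s3 sync` command and prints it. The helper name `build_aws_sync_command` and the destination directory are illustrative, and actually running the command assumes the AWS CLI is installed and configured with credentials that can be billed. The AWS CLI spells the option `--request-payer requester`.

```python
import shlex

ARCHIVE_URL = "s3://hear2021-archive/tasks/"  # bucket path given in this README


def build_aws_sync_command(dest_dir, source=ARCHIVE_URL):
    """Return the argument list for a requester-pays `aws s3 sync`."""
    return ["aws", "s3", "sync", source, dest_dir, "--request-payer", "requester"]


if __name__ == "__main__":
    # "hear2021-tasks" is a hypothetical local destination directory.
    cmd = build_aws_sync_command("hear2021-tasks")
    # Print for inspection; to execute, pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Syncing the whole prefix fetches every task, so it is usually worth listing the bucket first and narrowing `source` to the prefix you need.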

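Open Tasks: Speech Commands, NSynth Pitch, DCASE 2016 Task 2

Secret Tasks: Beehive States, Beijing Opera Percussion, CREMA-D, ESC-50, FSD50K, Gunshot Triangulation, GTZAN Genre, GTZAN Music Speech, LibriCount, MAESTRO 5h, Mridingham Stroke and Tonic, Vocal Imitations, Vox Lingua Top 10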

Speech Commands

Classification of known spoken commands, with additional categories for silence and unknown commands. This task was described in "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition". As per the literature, we measure accuracy. We also provide scores for a 5-hour subset.

NSynth Pitch

NSynth Pitch is a HEAR 2021 open task and is a multiclass classification problem. The goal of this task is to classify instrumental sounds from the NSynth Dataset into one of 88 pitches. Results for this task are measured by pitch accuracy as well as chroma accuracy. The chroma accuracy metric only considers the pitch class and disregards octave errors.
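
The difference between the two metrics can be illustrated with a small sketch. It assumes pitches are represented as integer MIDI note numbers, so the pitch class is the note number modulo 12. The function names and example values are illustrative and are not taken from the HEAR evaluation code.

```python
def pitch_accuracy(predicted, target):
    """Fraction of clips whose predicted pitch equals the target exactly."""
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)


def chroma_accuracy(predicted, target):
    """Fraction of clips whose predicted pitch class (pitch mod 12) matches."""
    correct = sum(p % 12 == t % 12 for p, t in zip(predicted, target))
    return correct / len(target)


target = [60, 62, 64, 65]
predicted = [60, 74, 64, 66]  # 74 is an octave above 62; 66 is a semitone above 65

print(pitch_accuracy(predicted, target))   # 0.5
print(chroma_accuracy(predicted, target))  # 0.75
```

The octave error counts against pitch accuracy but not chroma accuracy, while the semitone error counts against both.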

For HEAR 2021 we created two versions of this dataset: a 5-hour version and a 50-hour version.

DCASE 2016 Task 2

Adapted from DCASE 2016, Task 2 office sound event detection. Our evaluation uses different splits, so the numbers cannot be directly compared to previously published results.

Postprocessing: Segments were postprocessed using 250 ms median filtering. At each validation step, a minimum event duration of 125 or 250 ms was chosen to maximize onset-only event-based F-measure (with 200 ms tolerance). Scores were computed using sed_eval.
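
A rough sketch of this kind of postprocessing for a single event class is given below. It is not the HEAR implementation: it assumes binary per-frame predictions, a hypothetical hop of 50 ms between frames (so a 250 ms median filter spans 5 frames), and zero padding at the edges. Choosing between the 125 ms and 250 ms minimum durations by validation score, and scoring with sed_eval, are left out.

```python
from statistics import median


def median_filter(frames, width):
    """Sliding median over a 0/1 sequence; `width` must be odd. Edges are zero-padded."""
    half = width // 2
    padded = [0] * half + list(frames) + [0] * half
    return [int(median(padded[i:i + width])) for i in range(len(frames))]


def frames_to_events(frames, hop_ms, min_duration_ms):
    """Turn runs of 1s into (onset_ms, offset_ms) events, dropping short ones."""
    events, start = [], None
    for i, active in enumerate(list(frames) + [0]):  # trailing 0 closes a final run
        if active and start is None:
            start = i
        elif not active and start is not None:
            onset, offset = start * hop_ms, i * hop_ms
            if offset - onset >= min_duration_ms:
                events.append((onset, offset))
            start = None
    return events


HOP_MS = 50  # assumed frame hop, not stated in this README
raw = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
smoothed = median_filter(raw, width=250 // HOP_MS)
print(frames_to_events(smoothed, HOP_MS, min_duration_ms=250))
```

In this example the isolated active frame is removed by the median filter, and the remaining 300 ms run survives the minimum-duration check.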

Beehive States

This is a binary classification task using audio recordings of two beehives. Each beehive is in one of two states: a queenless state, where for some reason the queen is missing, or a normal state. There are 930 clips in this dataset, most of which are 10 minutes long. (Nolasco et al. 2019)

Beijing Opera Percussion

This is a novel audio classification task developed using the Beijing Opera Percussion Instrument Dataset. The Beijing Opera uses six main percussion instruments that can be classified into four main categories: Bangu, Naobo, Daluo, and Xiaoluo. There are 236 audio clips. Scores are averaged over 5 folds.


CREMA-D

CREMA-D is a dataset for emotion recognition. The original dataset contains audiovisual data of actors reciting sentences with one of six different emotions (Anger, Disgust, Fear, Happy, Neutral and Sad). For HEAR 2021 we only use the audio recordings. As per the literature, we use 5-fold cross validation. There are 7438 clips.


ESC-50

This is a multiclass classification task on environmental sounds. The ESC-50 dataset is a collection of 2000 environmental sounds organized into 50 classes. Scores are averaged over 5 folds. (The folds are predefined in the original dataset.)


FSD50K

FSD50K is a multilabel task. The dataset contains a large collection of human-labeled sound events from Freesound. Each sound event is labeled using 200 classes drawn from the AudioSet Ontology.

All other scene embedding tasks used a fixed audio clip length. However, for FSD50K scene embedding, we did not alter the audio clip length. Each clip is between 0 and 30 seconds long.

Gunshot Triangulation

The gunshot triangulation task is a novel resource multiclass classification task that uses the dataset "Gunshots recorded in an open field using iPod Touch devices". This dataset consists of 22 shots from 7 different firearms, for a total of 88 audio clips. Each shot is recorded using four different iPod Touches located at different distances from the shooter. The goal of this task is to classify audio by the iPod Touch that recorded it, i.e. to classify the location of the microphone.

The dataset was split into 7 folds such that each firearm belongs to exactly one fold. Results are averaged over the folds.
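
A split of this kind can be sketched as follows. The example assumes standard cross-validation in which each firearm's fold is held out once for testing. The clip identifiers, firearm labels, and function names are hypothetical and do not come from the dataset or from hearpreprocess.

```python
from collections import defaultdict


def leave_one_firearm_out(clips):
    """Yield (held_out_firearm, train_ids, test_ids), one fold per firearm.

    `clips` is a list of (clip_id, firearm) pairs.
    """
    by_firearm = defaultdict(list)
    for clip_id, firearm in clips:
        by_firearm[firearm].append(clip_id)
    for held_out in sorted(by_firearm):
        test = by_firearm[held_out]
        train = [c for f in sorted(by_firearm) if f != held_out for c in by_firearm[f]]
        yield held_out, train, test


def mean_over_folds(fold_scores):
    """Average a per-fold metric into the reported score."""
    return sum(fold_scores) / len(fold_scores)


# Hypothetical toy data: 3 firearms, 2 clips each.
clips = [("c1", "firearm_A"), ("c2", "firearm_A"),
         ("c3", "firearm_B"), ("c4", "firearm_B"),
         ("c5", "firearm_C"), ("c6", "firearm_C")]
for firearm, train, test in leave_one_firearm_out(clips):
    print(firearm, train, test)
```

Grouping by firearm means no recording of a given firearm appears in both the training and test portions of a fold, so a model cannot score well by recognizing the firearm rather than the microphone location.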


GTZAN Genre

This is a multiclass classification task. The GTZAN Genre Collection is a dataset of 1000 audio tracks (each 30 seconds in duration) that are categorized into ten genres (100 tracks per genre). As per the literature, scores are averaged over 10 folds.

GTZAN Music Speech

GTZAN Music Speech is a binary classification task. The goal is to distinguish between music and speech. The dataset consists of 120 tracks (each 30 seconds in duration) and each class (music/speech) has 60 examples. As per the literature, scores are averaged over 10 folds.


LibriCount

LibriCount is a multiclass speaker count identification task. We created this task using LibriCount, a dataset for speaker count estimation. The dataset contains audio of a simulated cocktail-party environment with 0 to 10 speakers. The goal of this task is to classify how many speakers are present in each recording. Scores are averaged over 5 folds.


MAESTRO 5h

This is a music transcription task adapted from MAESTRO. For HEAR 2021 we created a subsampled version of this task that includes 5 hours of training and validation audio, using 120-second clips. A shallow transcription model was trained on timestamp-based embeddings provided by participant models.

We use note onset FMS and note onset with offset FMS for evaluation, as per the original MAESTRO paper (Hawthorne et al., 2019) and the preceding Onsets and Frames paper (Hawthorne et al., 2018).

Note onset measures the ability of the model to estimate note onsets within a 50 ms tolerance and ignores offsets. Note onset with offset additionally requires the note duration to be within 20% of the ground truth or within 50 ms, whichever is greater.
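
The two matching criteria can be sketched per note as below. This is an illustration under stated assumptions, not the scoring code used for HEAR: notes are (onset, offset, pitch) tuples in seconds, a match also requires equal pitch, and the duration rule is applied as a tolerance on the offset of max(20% of the reference duration, 50 ms), which is the common convention in transcription evaluation. Full scoring additionally pairs reference and estimated notes one-to-one before computing the F-measure.

```python
ONSET_TOLERANCE = 0.05       # seconds (50 ms)
OFFSET_RATIO = 0.2           # 20% of the reference note duration
OFFSET_MIN_TOLERANCE = 0.05  # seconds (50 ms)


def onset_matches(ref, est):
    """True if pitches agree and the onsets are within 50 ms of each other."""
    return ref[2] == est[2] and abs(ref[0] - est[0]) <= ONSET_TOLERANCE


def onset_offset_matches(ref, est):
    """Onset match plus offset within max(20% of reference duration, 50 ms)."""
    offset_tolerance = max(OFFSET_RATIO * (ref[1] - ref[0]), OFFSET_MIN_TOLERANCE)
    return onset_matches(ref, est) and abs(ref[1] - est[1]) <= offset_tolerance


ref = (1.0, 2.0, 60)          # 1-second reference note, so offset tolerance is 200 ms
good = (1.03, 2.15, 60)       # onset 30 ms late, offset 150 ms late
late_offset = (1.03, 2.3, 60)  # onset fine, offset 300 ms late

print(onset_matches(ref, good), onset_offset_matches(ref, good))
print(onset_matches(ref, late_offset), onset_offset_matches(ref, late_offset))
```

For short notes the 50 ms floor dominates: a 100 ms reference note has a 20% tolerance of only 20 ms, so the 50 ms tolerance applies instead.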

Mridingham Stroke and Tonic

We used the Mridangam Stroke Dataset for two distinct multiclass classification tasks: Stroke classification and Tonic classification. The Mridangam is a pitched percussion instrument used in Carnatic music, a sub-genre of Indian classical music. The dataset comprises 10 different strokes played on Mridangams with 6 different tonics. For each of the two tasks (stroke and tonic), scores are averaged over 5 folds.

Vocal Imitations

This is a multiclass classification task where the goal is to match a vocal imitation of a sound with the original sound being imitated. The dataset contains 5601 vocal imitations of 302 reference sounds, organized by the AudioSet ontology. Given a vocal sound, the classification task is to retrieve the original audio it was imitating. Scores are averaged over 3 folds.

Vox Lingua Top 10

This is a novel multiclass classification task derived from the VoxLingua107 dataset. The goal of the task is to identify the spoken language in an audio file. For HEAR 2021 we selected the top 10 most frequent languages from the development set, which resulted in just over 5 hours of audio over 972 audio clips. Scores are averaged over 5 folds.