
HEAR 2021 Rules
Summary
In the spirit of shared exchange, all participants must submit an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. A primary goal of this challenge is to encourage the development of freely-available general-purpose audio representation models. If you have any questions or concerns about the rules, we are happy to help: please email us.
However, if you use HEAR 2021 datasets and do not follow the rules, please cite us and indicate that your work is not compliant with the HEAR 2021 rules.
To validate that your entry follows the common API prior to submission, please run your entry through the HEAR Validator.
We also provide a baseline using the HEAR API as an example.
Freely-available:
- You must release your code as a pip3-installable package under an Apache-2.0 or compatible (BSD, MIT, etc) license.
- Your model weights must be released under a Creative Commons Attribution 4.0 International License, or compatible license.
- You are welcome to use whatever training data you like, provided you adhere to all other competition rules, and:
  - Any existing data marked as test may not be used for training.
Easy-to-use:
- Your code must be written in Python >= 3.6 and use PyTorch >= 1.7 or TensorFlow >= 2.0.
- Your model must be able to return embeddings (either in GPU or CPU memory) for up to 20 minutes of audio without exceeding 16GB of GPU memory. This memory constraint includes both model weights and embedding size.
Common format:
- Your code must follow a common API, described in detail in the section below.
- Your model must accept audio of arbitrary length under 20 minutes, as a tensor.
- Your model must work with audio at one of the following four sample rates: [16000Hz, 22050Hz, 44100Hz, 48000Hz].
- Your model must expose which sample rate it expects as input as a class attribute (see API details below); however, it is not expected to resample audio internally.
- To avoid costly in-model resampling, we will a priori resample audio to all four sample rates for all tasks. (We will use ffmpeg, which is robust, cross-platform, and has good format support, as the main command-line tool for resampling, but with high-quality resampling from sox.)
- Your API must be able to produce two kinds of embeddings (described below):
  - Timestamp-based embeddings: return time-centered embeddings at regular intervals. You may select the time interval (hop-size) between adjacent embeddings, but we suggest that it is <= 50ms to handle an onset tolerance of 50ms for music transcription evaluation.
  - Scene embeddings: return a single embedding for an entire audio clip.
Sharing:
- We will provide you with a dev-kit including the data for the open tasks, and a script for performing evaluation.
- This dev-kit will also include a baseline embedding model in a standardized API (see below).
Common API
Your submission must implement the following API:
load_model(model_file_path: Str) -> Model
- model_file_path: Load model checkpoint from this file path.
- Returns: Model, a TensorFlow or PyTorch Module object.
Model
A Model (PyTorch or TensorFlow 2.x) class instance must have the following attributes:
- sample_rate: Audio sample rate that your model expects. Must be one of [16000, 22050, 44100, 48000].
- scene_embedding_size: int: The dimensionality of the embedding returned from get_scene_embeddings.
- timestamp_embedding_size: int: The dimensionality of the embedding returned from get_timestamp_embeddings. If you wish, this can be identical to scene_embedding_size.
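For concreteness, here is a minimal, hypothetical PyTorch sketch of a Model exposing the required attributes, together with a matching load_model. The class name, the trivial linear frame projection, and the embedding sizes are illustrative assumptions only; they are not part of the API and not the official baseline.

    import torch

    class MyHearModel(torch.nn.Module):
        # Required attributes from the common API.
        sample_rate = 16000            # must be one of [16000, 22050, 44100, 48000]
        scene_embedding_size = 512
        timestamp_embedding_size = 512

        def __init__(self):
            super().__init__()
            # Illustrative front end: project 1024-sample frames to the embedding size.
            self.projection = torch.nn.Linear(1024, self.timestamp_embedding_size)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (n_sounds, n_frames, 1024) -> (n_sounds, n_frames, timestamp_embedding_size)
            return self.projection(frames)

    def load_model(model_file_path: str = "") -> MyHearModel:
        model = MyHearModel()
        if model_file_path:
            model.load_state_dict(torch.load(model_file_path, map_location="cpu"))
        return model.eval()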
get_timestamp_embeddings(
audio: Tensor,
model: Model,
) -> Tuple[Tensor, Tensor]
This function must return embeddings at regular intervals centered
at timestamps. The model must also return the corresponding timestamps,
in milliseconds. You are free to select the time interval between
adjacent embeddings (hop-size). We suggest that it is <= 50ms
to handle a temporal tolerance of 50ms
for music transcription
tasks. You are welcome to extend the API with an optional hop-size,
however please note that in HEAR 2021 we will use your default value
for all evaluation tasks.
- audio: n_sounds x n_samples of mono audio in the range [-1, 1]. All sounds in a batch will be padded/trimmed to the same length.
- model: Loaded Model.
- Returns:
  - embedding: A float32 Tensor with shape (n_sounds, n_timestamps, model.timestamp_embedding_size).
  - timestamps: A float32 Tensor with shape (n_sounds, n_timestamps). Centered timestamps in milliseconds corresponding to each embedding in the output.
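Continuing the hypothetical sketch above, get_timestamp_embeddings could be implemented by framing the (center-padded) audio at a fixed hop and running the frames through the model. The 25ms hop and the 1024-sample frame size are illustrative choices; the hop is simply kept under the suggested 50ms.

    from typing import Tuple

    import torch

    HOP_MS = 25.0        # hop size between adjacent embeddings, kept <= 50ms
    FRAME_SIZE = 1024    # matches the projection input size in the sketch above

    def get_timestamp_embeddings(
        audio: torch.Tensor, model: MyHearModel
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        hop = int(model.sample_rate * HOP_MS / 1000)
        # Pad so that each frame is centered on its timestamp.
        padded = torch.nn.functional.pad(audio, (FRAME_SIZE // 2, FRAME_SIZE // 2))
        # (n_sounds, n_timestamps, FRAME_SIZE)
        frames = padded.unfold(dimension=1, size=FRAME_SIZE, step=hop)
        with torch.no_grad():
            embeddings = model(frames.float())
        # Frame centers in milliseconds, identical for every sound in the batch.
        n_timestamps = frames.shape[1]
        timestamps = torch.arange(n_timestamps, dtype=torch.float32) * hop * 1000.0 / model.sample_rate
        timestamps = timestamps.unsqueeze(0).repeat(audio.shape[0], 1)
        return embeddings, timestamps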
get_scene_embeddings(
audio: Tensor,
model: Model,
) -> Tensor
This function returns a single embedding for each audio clip.
It will be called to produce embeddings used for evaluation
tasks such as classification that look at an entire audio clip.
Participants are free to implement summarization of the temporal
aspects of audio into a single embedding in whatever way they wish.
A baseline approach would be to take the mean of all timestamp
embeddings returned from get_timestamp_embeddings
.
- audio: n_sounds x n_samples of mono audio in the range [-1, 1]. All sounds in a batch will be padded/trimmed to the same length.
- model: Loaded Model.
- Returns:
  - embedding: A float32 Tensor with shape (n_sounds, model.scene_embedding_size).
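Continuing the same sketch, the baseline mean-over-time summarization mentioned above could look like this:

    import torch

    def get_scene_embeddings(audio: torch.Tensor, model: MyHearModel) -> torch.Tensor:
        # Baseline summarization: average the timestamp embeddings over time.
        embeddings, _ = get_timestamp_embeddings(audio, model)
        return embeddings.mean(dim=1)  # (n_sounds, model.scene_embedding_size)

This mean-pooling choice is why scene_embedding_size equals timestamp_embedding_size in the sketch; a real submission may summarize the timestamp embeddings however it wishes.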
A note about pip installable packages. The organizers of the HEAR 2021 challenge feel strongly about general purpose models that are easy to access and easy to use. As such, we have fairly strict requirements for a pip installable package. We realize that this may pose a challenge to some entrants. If this criterion poses an issue for you, the HEAR team would be glad to help. Please reach out to us by e-mail.
Code must be hosted in a publicly facing GitHub repository. We will clone your repository and install it using a local source tree pip install, i.e., python3 -m pip install <repo-path>. Your package does not need to be uploaded to PyPI. Your model weights must also be available to download at a publicly accessible URL.
Please include a README on your GitHub repository that contains additional important information for running your submission, including the CUDA and cuDNN versions.
Make sure that your submission follows the common API specified above. To help with this, we have developed a tool for participants that validates a submission against the API: hear-validator.
If you have any questions or are concerned about hosting your submission publicly, please do not hesitate to contact competition organizers.