ASMD: Audio-Score Meta Dataset

Installation

I suggest cloning this repo and using it with Python >= 3.6. If you need to use it in multiple projects (or folders), just clone the code once and fix the install_dir in datasets.json, so that you keep only one copy of the huge datasets.

The following describes how to install the dependencies needed to use the dataset API. I suggest using poetry to manage different Python versions and virtual environments with an efficient dependency resolver.

During the installation, the provided ground-truths will be extracted; however, you can recreate them from scratch to tweak parameters. The next section explains how to achieve this.

The easy way

  1. pip install asmd
  2. Install wget if you want the SMD dataset (the next release will remove this dependency)
  3. Run python -m asmd.install and follow the steps

The hard way (if you want to contribute)

Once you have cloned the repo follow these steps:

Install poetry, pyenv and python

  1. Install wget if you want the SMD dataset (the next release will remove this dependency)
  2. Install python 3
  3. Install poetry
  4. Install pyenv and fix your .bashrc (optional)
  5. pyenv install 3.6.9 (optional, recommended python >= 3.6.9)
  6. poetry new myproject
  7. cd myproject
  8. pyenv local 3.6.9 (optional, recommended python >= 3.6.9)

Setup new project or testing

  1. git clone https://gitlab.di.unimi.it/federicosimonetta/asmd/
  2. poetry add asmd/
  3. Execute poetry run python -m asmd.install; alternatively run poetry shell and then python -m asmd.install
  4. Follow the steps

Now you can start developing in the parent directory (myproject) and you can use from asmd import audioscoredataset as asd.

Use poetry to manage packages of your project.

Alternative way

  1. Clone the project.
  2. To build the modules in place, run poetry run python setup.py build_ext --inplace
  3. Create an ad-hoc directory for testing anywhere:
    1. Copy the original pyproject.toml there
    2. Install the needed dependencies with poetry update
    3. Export the PYTHONPATH environment variable, e.g. export PYTHONPATH="path/to/asmd"

You’re now ready to use ASMD without downloading it from PyPI.

To create the python package, just run poetry build.

Reproduce from scratch

To recreate the ground-truth in our format you have to convert the annotations using the script generate_ground_truth.py.

N.B. You should have ``wget`` installed in your system, otherwise SMD dataset can’t be downloaded.

You can run the script with python 3. You can also skip the already existing datasets by using the --blacklist and --whitelist arguments. If you do this, their ground truth will not be added to the final archive; thus, remember to back up the previous archive and to merge the two.

Generate misaligned data

If you want, you can generate misaligned data using the --train and --misalign options of generate_ground_truth.py. It will run alignment_stats.py, which collects data about the datasets with real non-aligned scores and saves the stats in the _alignment_stats.pkl file in the ASMD module directory. Then, it runs generate_ground_truth.py using the collected statistics: it generates misaligned data by using the same deviation distribution as the available non-aligned data.

Note that misaligned data should be annotated with 2 in the ground_truth value of the dataset group description (see ASMD: Audio-Score Meta Dataset), otherwise no misaligned value will be added to the misaligned field. Moreover, the dataset group data should have precise_alignment or broad_alignment filled by the annotation conversion step, otherwise errors can be raised during the misalignment procedure.

For more info, see python -m asmd.generate_ground_truth -h.

A usual pipeline is:

  1. Generate music score data and all ground-truths except the artificial ones: python -m asmd.generate_ground_truth --normal
  2. Train a statistical model (you can skip this): python -m asmd.generate_ground_truth --train
  3. Generate misalignment using the trained model (trains it if not available): python -m asmd.generate_ground_truth --misalign

Usage

datasets.json

The root element is a dictionary with fields:

  1. author: string containing the name of the author
  2. year: int containing the year
  3. install_dir: string containing the install directory
  4. datasets: list of dataset objects
  5. decompress_path: the path where files are decompressed

Definitions

Each dataset is described by a JSON definition file. Each dataset has the following fields:

  1. ensemble: true if the dataset contains multiple instruments, false otherwise

  2. groups: list of strings representing the groups contained in this dataset; the default group all must always be present

  3. instruments: the list of the instruments contained in the dataset

  4. sources:

    1. format: the format of the audio recordings of the single source-separated tracks
  5. recording:

    1. format: the format of the audio recordings of the mixed tracks
  6. ground_truth: N.B. each ground_truth has an ``int`` value, indicating ``0`` -> false, ``1`` -> true (manual or mechanical - Disklavier - annotation), ``2`` -> true (automatic annotation with state-of-the-art algorithms)

    1. [group-name] : a dictionary representing the ground-truth contained by each dataset group

      1. misaligned: if artificially misaligned scores are provided
      2. score: if original scores are provided
      3. broad_alignment: if broad_alignment scores are provided
      4. precise_alignment: if precisely aligned scores are provided
      5. velocities: if velocities are provided
      6. f0: if f0 values are provided
      7. sustain: if sustain values are provided
      8. soft: if soft-pedal values are provided
      9. sostenuto: if sostenuto-pedal values are provided
  7. songs: the list of songs in the dataset

    1. composer: the composer family name
    2. instruments: list of instruments in the song
    3. recording: dictionary
      1. path: a list of paths to be mixed for reconstructing the full track (usually only one)
    4. sources: dictionary
      1. path: a list of paths to the single instrument tracks in the same order as instruments
    5. ground_truth: list of paths to the ground_truth json files. One ground_truth path per instrument is always provided. The order of the ground_truth path is the same of sources and of the instruments. Note that some ground_truth paths can be identical (as in PHENICX for indicating that violin1 and violin2 are playing exactly the same thing).
    6. groups: list of strings representing a group of the dataset. The group all must always be there; any other string is possible and should be exposed in the groups field at dataset-level
  8. install: where information for the installation process are stored

    1. url: the url to download the dataset including the protocol
    2. post-process: a list of shell commands to be executed to prepare the dataset; they can be lists themselves to allow the use of references to the installation directory with the syntax &install_dir: every occurrence of &install_dir will be replaced with the value of install_dir in datasets.json; final slash doesn’t matter
    3. unpack: true if the url needs to be unpacked (untar, unzip, …)
    4. login: true if a login is needed - not used anymore, but maybe useful in the future

In general, I maintained the following principles:

  1. if a list of files is provided where you would logically expect one file, you should ‘sum’ the files in the list, whatever this means according to that type of file; this typically happens in the ground_truth files or in the recordings where only the single sources are available.
  2. all the fields can have the value ‘unknown’ to indicate that the information is not available in that dataset; if you treat ‘unknown’ as meaning unavailable, everything will be fine; however, in some cases it can mean that the data are available but not documented.
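As an illustration only (every name, path and URL below is hypothetical), a definition using the fields above can be built from Python and dumped to JSON:

import json

# Hypothetical dataset definition; real definitions are shipped in the
# definitions/ directory of the ASMD package.
definition = {
    "name": "my_dataset",
    "ensemble": False,
    "instruments": ["piano"],
    "groups": ["all"],
    "recording": {"format": "wav"},
    "sources": {"format": "unknown"},
    "ground_truth": {
        "all": {
            "score": 1, "misaligned": 2, "precise_alignment": 1,
            "broad_alignment": 0, "velocities": 1, "f0": 0,
            "sustain": 1, "soft": 1, "sostenuto": 1
        }
    },
    "songs": [
        {
            "composer": "Mozart",
            "instruments": ["piano"],
            "recording": {"path": ["my_dataset/song1.wav"]},
            "sources": "unknown",
            "ground_truth": ["my_dataset/song1-1.json.gz"],
            "groups": ["all"]
        }
    ],
    "install": {
        "url": "https://example.org/my_dataset.tar.gz",
        "unpack": True,
        "login": False,
        "post-process": "unknown"
    }
}

with open("my_dataset.json", "w") as outfile:
    json.dump(definition, outfile, indent=4)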

Ground-truth json format

The ground_truth is contained in JSON files indexed in each definition file. Each ground-truth file contains only one instrument in a dictionary with the following structure:

  1. score:
    1. onsets: onsets in seconds; if BPM is not available, timings are computed using 60 BPM
    2. offsets: offsets in seconds; if BPM is not available, timings are computed using 60 BPM
    3. pitches: list of midi pitches in onset ascending order and range [0-127]
    4. notes: list of note names in onsets ascending order
    5. velocities: list of velocities in onsets ascending order and range [0-127]
    6. beats: list of times in which there was a beat in the original score; use this to reconstruct instant BPM
  2. misaligned:
    1. onsets: onsets in seconds
    2. offsets: offsets in seconds
    3. pitches: list of midi pitches in onset ascending order and range [0-127]
    4. notes: list of note names in onsets ascending order
    5. velocities: list of velocities in onsets ascending order and range [0-127]
  3. precise_alignment:
    1. onsets: onsets in seconds
    2. offsets: offsets in seconds
    3. pitches: list of midi pitches in onset ascending order and range [0-127]
    4. notes: list of note names in onsets ascending order
    5. velocities: list of velocities in onsets ascending order and range [0-127]
  4. broad_alignment: alignment which does not consider the asynchronies between simultaneous notes
    1. onsets: onsets in seconds
    2. offsets: offsets in seconds
    3. pitches: list of midi pitches in onset ascending order and range [0-127]
    4. notes: list of note names in onsets ascending order
    5. velocities: list of velocities in onsets ascending order and range [0-127]
  5. missing: list of boolean values indicating which notes are missing in the score (i.e. notes that you can consider as being played but not in the score); use this value to mask the performance/score
  6. extra: list of boolean values indicating which notes are extra in the score (i.e. notes that you can consider as not being played but in the score); use this value to mask the performance/score
  7. f0: list of f0 frequencies, frame by frame; duration of each frame should be 46 ms with 10 ms of hop.
  8. sustain:
    1. values: list of sustain changes; each value is a number between 0 and 127, where values < 63 mean sustain OFF and values >= 63 mean sustain ON, but intermediate values can be used (e.g. for half-pedaling).
    2. times: list of floats representing the time of each sustain change in seconds.
  9. soft:
    1. values: list of soft-pedal changes; each value is a number between 0 and 127, where values < 63 mean soft pedal OFF and values >= 63 mean soft pedal ON, but intermediate values can be used (e.g. for half-pedaling).
    2. times: list of floats representing the time of each soft pedal change in seconds.
  10. sostenuto:
    1. values: list of sostenuto-pedal changes; each value is a number between 0 and 127, where values < 63 mean sostenuto pedal OFF and values >= 63 mean sostenuto pedal ON, but intermediate values can be used (e.g. for half-pedaling).
    2. times: list of floats representing the time of each sostenuto pedal change in seconds.
  11. instrument: General Midi program number associated with this instrument, starting from 0. 128 indicates a drum kit (should be synthesized on channel 8 with a program number of your choice, usually 0). 255 indicates no instrument specified.

Note that json ground_truth files have extension .json.gz, indicating that they are compressed using the gzip Python module. Thus, you need to decompress them:

import gzip
import json

ground_truth = json.load(gzip.open('ground_truth.json.gz', 'rt'))

print(ground_truth)
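Once loaded, the annotation levels described above are plain lists; a minimal sketch of reading a few fields (assuming the file actually provides precise_alignment and sustain data):

import gzip
import json

with gzip.open('ground_truth.json.gz', 'rt') as f:
    gt = json.load(f)

# onsets, offsets, pitches, etc. are parallel lists, one entry per note
onsets = gt['precise_alignment']['onsets']
pitches = gt['precise_alignment']['pitches']

# sustain: parallel values/times lists; values >= 63 mean pedal ON
pedal_on_times = [t for v, t in zip(gt['sustain']['values'],
                                    gt['sustain']['times']) if v >= 63]

print(len(onsets), 'notes,', len(pedal_on_times), 'sustain-on events')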

Adding new datasets

In order to add new datasets, you have to create the corresponding definition in a JSON file. The definitions can be in any directory, but you have to provide this path to the API and to the installation script (you will be asked for it, so you can’t get it wrong).

The dataset files, instead, should be in the installation directory, and the paths in the definition should not include the installation directory.

If you also want to add the new dataset to the installation procedure, you should:

  1. Provide a conversion function for the ground truth
  2. Add the conversion function with all parameters to the JSON definition (section install>conversion)
  3. Rerun the install.py and convert_gt.py scripts

Adding new definitions

The most important thing is that one ground-truth file is provided for each instrument.

If you want to add datasets to the installation procedure, taking advantage of the artificial misalignment, add the paths to the files (ground-truth, audio, etc.), even if they do not exist yet, because convert_gt.py relies on those paths to create the files. It is important to provide an index starting with - at the end of the path (see the other definitions as examples), so that it is possible to distinguish among multiple instruments (for instance, PHENICX provides one ground-truth file for all the violins of a song, even if there are 4 different violins). The index allows convert_gt to better handle different files and to pick the wanted ground-truth.

It is mandatory to provide a url, a name and so on. Also, provide a composer and an instrument list. Please do not use new words for instruments that already exist (for instance, do not use saxophone if sax already exists in other datasets).

Provide a conversion function

Docs available at Utilities to convert from ground-truth

The conversion function takes as input the name of the file in the original dataset. You can also use the bundled conversion functions (see docs).

  1. use deepcopy(gt) to create the output ground truth.
  2. use decorator @convert to provide the input file extensions and parameters

You should consider three possible cases for creating the conversion function:

  1. there is a bijective relationship between the instruments and the ground_truth files you have, that is, you already have one annotation file per instrument and you just need to convert each of them (1-to-1 relationship)
  2. in your dataset, all the instruments are inside just one ground-truth file (n-to-1 relationship)
  3. just one ground-truth file is provided that is replicated for multiple instruments (one ground-truth for all the violins, as if they were a single instrument, 1-to-n relationship)

Here is a brief description of how your conversion function should work to tackle these three different situations:

  1. In the 1st case, you can just output a list with only one dictionary.
  2. In the 2nd case, you can output a list with all the dictionaries inside it, in the same order as the ground-truth file paths you added to datasets.json. The script will repeatedly convert them and each time it will pick a different element of the list.
  3. In the 3rd case, you can still output a single-element list.

If you want to output a list with only one dict, you can also output the dict itself. The decorator will take care of handling file names and of putting the output dict inside a list.

Finally, you can also use multiple conversion functions if your ground-truth is split among multiple files, but note that the final ground-truth is produced as the sum of all the elements of all the dictionaries created.
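As a minimal sketch of a custom conversion function (the input format and field layout here are hypothetical; prefer the bundled functions whenever they fit):

from copy import deepcopy
import csv

from asmd.convert_from_file import convert, prototype_gt

@convert(['.txt'])
def from_my_txt(input_fn):
    """Convert a hypothetical 'pitch onset offset' text file (1-to-1 case)."""
    out = deepcopy(prototype_gt)
    with open(input_fn) as f:
        for row in csv.reader(f, delimiter=' '):
            pitch, onset, offset = int(row[0]), float(row[1]), float(row[2])
            out['precise_alignment']['pitches'].append(pitch)
            out['precise_alignment']['onsets'].append(onset)
            out['precise_alignment']['offsets'].append(offset)
    # returning the dict itself is fine: the decorator wraps it in a list
    return out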

Add your function to the JSON definition

In the JSON definitions, you should declare the functions that should be used for converting the ground-truth and their parameters. The section where you can do this is in install>conversion.

Here, you should put a list like the following:

[
    [
        "module1.function1", {
            "argument1_name": argument1_value,
            "argument2_name": argument2_value
        }
    ],
    [
        "module2.function2", {
            "argument1_name": argument1_value,
            "argument2_name": argument2_value
        }
    ]
]

Note that you have to provide the name of the function, which will be evaluated with the eval Python function. Also, you can use any function in any module, including the bundled functions - in this case, use just the function name without the module.
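For instance, a hypothetical entry using the bundled from_midi converter described in the next section could look like the following (written here as a Python literal mirroring the JSON; the argument names must match the converter’s signature):

# goes under install > conversion in the JSON definition
conversion = [
    ["from_midi", {"alignment": "precise_alignment", "velocities": True}]
]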

Utilities to convert from ground-truth

asmd.convert_from_file._sort_alignment(alignment, data)[source]

Sort data in alignment (in-place)

asmd.convert_from_file._sort_lists(*lists)[source]

Sort multiple lists in-place with reference to the first one

asmd.convert_from_file._sort_pedal(data)[source]

Sort pedal for data (in-place)

asmd.convert_from_file.change_ext(input_fn, new_ext, no_dot=False, remove_player=False)[source]

Return the input path input_fn with new_ext as extension and the part after the last ‘-’ removed. If no_dot is True, it will not add a dot before the extension; otherwise it will add one if not present. remove_player can be used to remove the name of the player from the last part of the file name: use this for the traditional_flute dataset; it will remove the last part after ‘_’.

asmd.convert_from_file.convert(exts, no_dot=True, remove_player=False)[source]

This function is designed to be used as a decorator for functions which convert from a given filetype to our JSON format.

Example of usage:

>>> @convert(['.myext'], no_dot=True, remove_player=False)
... def function_which_converts(...):
...     pass
Parameters:
  • exts (*) – the possible extensions of the ground-truths to be converted, e.g. ['.mid', '.midi']. You can also use this parameter to remove exceeding parts at the end of the filename (see from_bach10_mat and from_bach10_f0 source code)
  • no_dot (*) – if True, don’t add a dot before the extension; if False, add it if not present; this is useful if you are using the extension to remove other parts of the file name (see exts).
  • remove_player (*) – if True, remove the name of the player in the last part of the file name: use this for the traditional_flute dataset; it will remove the part after the last ‘_’.
asmd.convert_from_file.from_bach10_f0(nmat_fn, sources=range(0, 4))[source]

Open a Matlab mat file nmat_fn in the MIREX format (Bach10) for frame evaluation and convert it to our ground_truth representation. This fills: f0. sources is an iterable containing the indices of the sources to be considered, where the first source is 0. Returns a list of dictionaries, one per source.

asmd.convert_from_file.from_bach10_mat(mat_fn, sources=range(0, 4))[source]

Open a Matlab mat file mat_fn in the MIREX format (Bach10) and convert it to our ground_truth representation. This fills: precise_alignment, pitches. sources is an iterable containing the indices of the sources to be considered, where the first source is 0. Returns a list of dictionaries, one per source.

asmd.convert_from_file.from_midi(midi_fn, alignment='precise_alignment', pitches=True, velocities=True, merge=True, beats=False)[source]

Open a midi file midi_fn and convert it to our ground_truth representation. This fills velocities, pitches, beats, sustain, soft, sostenuto and alignment (default: precise_alignment). Returns a list containing a dictionary. alignment can also be None or False, in that case no alignment is filled. If merge is True, the returned list will contain a dictionary for each track. Beats are filled according to tempo changes.

This function is decorated with 3 different sets of parameters:

  • from_midi is the decorated version with remove_player=False
  • from_midi_remove_player is the decorated version with remove_player=True
  • from_midi_asap is the decorated version which accepts the extension ‘.score.mid’, used in the script to import scores from ASAP

N.B. To allow having some annotation for subgroups of a dataset, this function returns None when it cannot find the specified midi file; in this way, that file is not taken into account while merging the various annotations (e.g. asap group inside Maestro dataset)

asmd.convert_from_file.from_midi_asap(midi_fn, alignment='precise_alignment', pitches=True, velocities=True, merge=True, beats=False)

Open a midi file midi_fn and convert it to our ground_truth representation. This fills velocities, pitches, beats, sustain, soft, sostenuto and alignment (default: precise_alignment). Returns a list containing a dictionary. alignment can also be None or False, in that case no alignment is filled. If merge is True, the returned list will contain a dictionary for each track. Beats are filled according to tempo changes.

This function is decorated with 3 different sets of parameters:

  • from_midi is the decorated version with remove_player=False
  • from_midi_remove_player is the decorated version with remove_player=True
  • from_midi_asap is the decorated version which accepts the extension ‘.score.mid’, used in the script to import scores from ASAP

N.B. To allow having some annotation for subgroups of a dataset, this function returns None when it cannot find the specified midi file; in this way, that file is not taken into account while merging the various annotations (e.g. asap group inside Maestro dataset)

asmd.convert_from_file.from_midi_remove_player(midi_fn, alignment='precise_alignment', pitches=True, velocities=True, merge=True, beats=False)

Open a midi file midi_fn and convert it to our ground_truth representation. This fills velocities, pitches, beats, sustain, soft, sostenuto and alignment (default: precise_alignment). Returns a list containing a dictionary. alignment can also be None or False, in that case no alignment is filled. If merge is True, the returned list will contain a dictionary for each track. Beats are filled according to tempo changes.

This function is decorated with 3 different sets of parameters:

  • from_midi is the decorated version with remove_player=False
  • from_midi_remove_player is the decorated version with remove_player=True
  • from_midi_asap is the decorated version which accepts the extension ‘.score.mid’, used in the script to import scores from ASAP

N.B. To allow having some annotation for subgroups of a dataset, this function returns None when it cannot find the specified midi file; in this way, that file is not taken into account while merging the various annotations (e.g. asap group inside Maestro dataset)

asmd.convert_from_file.from_musicnet_csv(csv_fn, sr=44100.0)[source]

Open a csv file csv_fn and convert it to our ground_truth representation. This fills: broad_alignment, score, pitches. This returns a list containing only one dict. sr is the sample rate of the audio files (the MusicNet csv contains the sample number as onset and offset of each note) and it should be a float.

N.B. MusicNet contains wav files with a sample rate of 44100 Hz. N.B. The lowest pitch in MusicNet is 21, so we assume that they count pitches starting from 0 as in the midi.org standard. N.B. Score times are provided with BPM 60 for all the scores.

asmd.convert_from_file.from_phenicx_txt(txt_fn)[source]

Open a txt file txt_fn in the PHENICX format and convert it to our ground_truth representation. This fills: broad_alignment.

asmd.convert_from_file.from_sonic_visualizer(gt_fn, alignment='precise_alignment')[source]

Takes a filename of a sonic visualizer output file exported as ‘csv’ and fills the ‘alignment’ specified

asmd.convert_from_file.prototype_gt = {'broad_alignment': {'notes': [], 'offsets': [], 'onsets': [], 'pitches': [], 'velocities': []}, 'extra': [], 'f0': [], 'instrument': 255, 'misaligned': {'notes': [], 'offsets': [], 'onsets': [], 'pitches': [], 'velocities': []}, 'missing': [], 'precise_alignment': {'notes': [], 'offsets': [], 'onsets': [], 'pitches': [], 'velocities': []}, 'score': {'beats': [], 'notes': [], 'offsets': [], 'onsets': [], 'pitches': [], 'velocities': []}, 'soft': {'times': [], 'values': []}, 'sostenuto': {'times': [], 'values': []}, 'sustain': {'times': [], 'values': []}}

The dictionary prototype for containing the ground_truth. Use:

>>> from copy import deepcopy
... from asmd.convert_from_file import prototype_gt
... prototype_gt = deepcopy(prototype_gt)
>>> prototype_gt
{
    "precise_alignment": {
        "onsets": [],
        "offsets": [],
        "pitches": [],
        "notes": [],
        "velocities": []
    },
    "misaligned": {
        "onsets": [],
        "offsets": [],
        "pitches": [],
        "notes": [],
        "velocities": []
    },
    "score": {
        "onsets": [],
        "offsets": [],
        "pitches": [],
        "notes": [],
        "velocities": [],
        "beats": []
    },
    "broad_alignment": {
        "onsets": [],
        "offsets": [],
        "pitches": [],
        "notes": [],
        "velocities": []
    },
    "f0": [],
    "soft": {
        "values": [],
        "times": []
    },
    "sostenuto": {
        "values": [],
        "times": []
    },
    "sustain": {
        "values": [],
        "times": []
    },
    "instrument": 255,
}

Note: pitches, velocities, sustain, sostenuto, soft, and (if available) instrument must be in range [0, 128)

Python API

Intro

This project also provides an API for filtering the datasets according to some specified prerequisites and for getting the data in a convenient format.

Python

Import audioscoredataset and create a Dataset object, giving the path of the datasets.json file in this directory as an argument to the constructor. Then, you can use the filter method to filter data according to your needs (you can also re-filter them later without reloading datasets.json).

You will find a field named paths in your Dataset instance containing the correct paths to the files you are requesting.

Moreover, the method get_item returns an array of audio values and a structured_array representing the ground_truth as loaded from the json file.

Example:

from asmd import asmd
from asmd.dataset_utils import get_score_mat

d = asmd.Dataset()
# d = asmd.Dataset(paths=['path_to_my_definitions', 'path_to_default_definitions'])
d.filter(instrument='piano', ensemble=False, composer='Mozart', ground_truth=['precise_alignment'])

audio_array, sources_array, ground_truth_array = d.get_item(1)

audio_array = d.get_mix(2)
source_array = d.get_source(2)
ground_truth_list = d.get_gts(2)

mat = get_score_mat(d, 2, score_type=['precise_alignment'])

Note that you can inherit from asmd.Dataset and torch.utils.data.Dataset to create a PyTorch-compatible dataset which loads audio files only when they are accessed. You just need to implement the __getitem__ method.
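A minimal sketch of such a subclass (what __getitem__ returns depends on your task; torch is assumed to be installed and the songs to provide precise_alignment):

import torch
from asmd import asmd


class TorchAsmdDataset(asmd.Dataset, torch.utils.data.Dataset):
    """PyTorch-compatible dataset: audio is loaded only on access."""

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        mix, sr = self.get_mix(idx)    # audio is loaded lazily here
        gts = self.get_gts(idx)        # ground-truth dicts, one per source
        pitches = gts[0]['precise_alignment']['pitches']
        return torch.from_numpy(mix), torch.tensor(pitches)

Such an object can then be wrapped in a torch.utils.data.DataLoader as usual.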

Documentation

asmd.asmd.load_definitions(path)[source]

Given a path to a directory, returns a list of dictionaries containing the definitions found in that directory (not recursive search)

class asmd.asmd.Dataset(paths=['default_path'], metadataset_path=['default_path'])[source]
__init__(definitions=['/home/docs/checkouts/readthedocs.org/user_builds/asmd/checkouts/latest/asmd/definitions/'], metadataset_path='/home/docs/checkouts/readthedocs.org/user_builds/asmd/checkouts/latest/asmd/datasets.json', empty=False)[source]

Load the dataset description and populate the paths

This object has a fundamental field named paths, which is a list; each entry contains another list of 3 values representing the paths to, respectively: the mixed recording, the single-source audio files, and the ground-truth files (one per source)

Parameters:
  • definitions (*) – paths where json dataset definitions are stored; if empty, the default definitions are used
  • metadataset_path (*) – the path where the generic information about where this dataset is installed is stored
  • empty (*) – if True, no definition is loaded
Returns:

instance of the class

Return type:

* AudioScoreDataset

__weakref__

list of weak references to the object (if defined)

get_audio(idx, sources=None)[source]

Get the mixed audio of certain sources or of the mix

Parameters:
  • idx (int) – The index of the song to retrieve.
  • sources (list or None) – A list containing the indices of sources to be mixed and returned. If None, no sources will be mixed and the global mix will be returned.
Returns:

  • numpy.ndarray – A (n x 1) array which represents the mixed audio.
  • int – The sampling rate of the audio array

get_audio_data(idx)[source]

Returns audio data of a specific item without loading the full audio.

N.B. see essentia.standard.MetadataReader!

Returns:
  • list of tuples – each tuple refers to a source and contains the following:
  • int – duration in seconds
  • int – bitrate (kb/s)
  • int – sample rate
  • int – number of channels
get_beats(idx)[source]

Get a list of beat position in seconds, to be used together with the score data.

Parameters:idx (int) – The index of the song to retrieve.
Returns:each row contains beat positions of each ground truth
Return type:numpy.ndarray
get_gts(idx)[source]

Return the ground-truth of the wanted item

Parameters:idx (int) – the index of the wanted item
Returns:list of dictionary representing the ground truth of each single source
Return type:list
get_gts_paths(idx) → List[str][source]

Return paths to the ground-truth files, one for each source

Returns list of string

get_initial_bpm(idx) → Optional[float][source]

Return the initial bpm of the first source if score alignment type is available at index idx, otherwise returns None

get_item(idx)[source]

Returns the mixed audio, sources and ground truths of the specified item.

Parameters:idx (int) – the index of the wanted item
Returns:
  • numpy.ndarray – audio of the mixed sources
  • list – a list of numpy.ndarray representing the audio of each source
  • list – list of dictionary representing the ground truth of each single source
get_missing_extra_notes(idx, kind: str) → List[numpy.ndarray][source]

Returns the missing or extra notes of a song. For each source, an array of boolean values is returned. If you want the missing/extra notes for the whole song, use dataset_utils.get_score_mat

kind can be ‘extra’ or ‘missing’

get_mix(idx, sr=None)[source]

Returns the audio array of the mixed song

Parameters:
  • idx (int) – the index of the wanted item
  • sr (int or None) – the sampling rate at which the audio will be returned (if needed, a resampling is performed). If None, no resampling is performed
Returns:

  • mix (numpy.ndarray) – the audio waveform of the mixed song
  • int – The sampling rate of the audio array

get_mix_paths(idx) → List[str][source]

Return paths to the mixed recording if available

Returns list of string (usually only one)

get_pianoroll(idx, score_type=['misaligned'], resolution=0.25, onsets=False, velocity=True)[source]

Create pianoroll from list of pitches, onsets and offsets (in this order).

Parameters:
  • idx (int) – The index of the song to retrieve.
  • score_type (list of str) – The key to retrieve the list of notes from the ground_truths. see chose_score_type for explanation
  • resolution (float) – The duration of each column (in seconds)
  • onsets (bool) – If True, the value -1 is put on each onset
  • velocity (bool) – if True, the value of each note is its velocity (except the first frame, if onsets is used)
Returns:

A (128 x n) array where rows represent pitches and columns are time instants sampled with resolution provided as argument.

Return type:

numpy.ndarray

Note

In the midi.org standard, pitches start counting from 0; however, sometimes pitches are counted from 1. Depending on the dataset that you are using, verify how pitches are counted. In the ASMD default ground-truths, pitches use 0-based indexing.

In case your dataset does not start counting pitches from 0, you should correct the output of this function.

get_score_duration(idx)[source]

Returns the duration of the most aligned score available for a specific item

get_songs()[source]

Returns a list of dict, each representing a song

get_source(idx)[source]

Returns the sources at the specified index

Parameters:idx (int) – the index of the wanted item
Returns:
  • list – a list of numpy.ndarray representing the audio of each source
  • int – The sampling rate of the audio array
get_sources_paths(idx) → List[str][source]

Return paths to single-sources audio recordings, one for each audio

Returns list of string

idx_chunk_to_whole(name, idx)[source]

Given a dataset name and an idx or a list of idx relative to the input dataset, returns the idx relative to this whole dataset.

Use this method if you need, for instance the index of a song for which you have the index in a single dataset.

parallel(func, *args, **kwargs)[source]

Applies a function to all items in paths in parallel using joblib.Parallel.

You can pass any argument to joblib.Parallel by using keyword arguments.

Parameters:func (callable) –

the function that will be called; it must accept two arguments that are the index of the song and the dataset. Then, it can accept all args and kwargs that are passed to this function:

>>> def myfunc(i, dataset, pinco, pal=5):
...     # do not use `filter` and `chunks` here
...     print(pinco, pal)
...     print(dataset.paths[i])
>>> marco, etto = 4, 5
>>> d = Dataset().filter(datasets=['Bach10'])
>>> d.parallel(myfunc, marco, n_jobs=8, pal=etto)

filter and chunks shouldn’t be used.

Returns:The list of objects returned by each func
Return type:list
asmd.dataset_utils._check_consistency(dataset, fix=False)[source]

Checks that if a dataset is included, then at least one of its songs is included, and that if a dataset is excluded, then all of its songs are excluded.

If fix is True, it fixes the dataset inclusion; otherwise, a RuntimeError is raised.

asmd.dataset_utils._compare_dataset(compare_func, dataset1, dataset2, **kwargs)[source]

Returns a new dataset where each song and dataset are included only if compare_func is True for each corresponding couple of songs and datasets

asmd.dataset_utils.choice(dataset, p=[0.6, 0.2, 0.2], random_state=None)[source]

Returns N non-overlapping datasets randomly sampled from dataset, where N is len(p); each song belongs to a dataset according to the probability distribution p. Note that p is always normalized so that it sums to 1.

random_state is an int or a np.random.RandomState object.
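A hedged sketch of a train/validation/test split built with this function:

from asmd import asmd
from asmd.dataset_utils import choice

d = asmd.Dataset()
# three non-overlapping views of the collection, ~60/20/20 of the songs
train, validation, test = choice(d, p=[0.6, 0.2, 0.2], random_state=1992)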

asmd.dataset_utils.chose_score_type(score_type, gts)[source]

Return the proper score type according to the following rules

Parameters:
  • score_type (list of str) – The key to retrieve the list of notes from the ground_truths. If multiple keys are provided, only one is retrieved by using the following criteria: if precise_alignment is in the list of keys and in the ground truth, use that; otherwise, if broad_alignment is in the list of keys and in the ground truth, use that; otherwise, if misaligned is in the list of keys and in the ground truth, use that; otherwise, use score.
  • gts (list of dict) – The list of ground truths from which you want to choose a score_type
asmd.dataset_utils.complement(dataset, **kwargs)[source]

Takes one dataset and returns a new dataset representing the complement of the input

This function calls filter to populate the paths and returns them with all the sources. However, you can pass any argument to filter, e.g. the sources argument

asmd.dataset_utils.filter(dataset, instruments=[], ensemble=None, mixed=True, sources=False, all=False, composer='', datasets=[], groups=[], ground_truth=[], copy=False)[source]

Filter the paths of the songs which satisfy the filters described by the keyword arguments. If this dataset was already filtered, only the paths that are already included are filtered further.

For advanced usage:

So that a dataset can be filtered, it must have the following keys:

  • songs
  • name
  • included

All the attributes are checked at the song level, except for:

  • ensemble: this is checked at the dataset-level (i.e. each dataset can be for ensemble or not) This may change in future releases
  • ground_truth: this is checked at group level (i.e. each subgroup can have different annotations)

Similarly, each song must have the key included and optionally the other keys that you want to filter, as described by the arguments of this function.

Parameters:
  • instruments (list of str) – a list of strings representing the instruments that you want to select (exact match with song)
  • ensemble (bool) – if True, load songs which are composed for an ensemble of instruments. If None, the ensemble field will not be checked and both types will be selected (default None)
  • mixed (bool) – if True, return the mixed track for ensemble songs (default True)
  • sources (bool) – if True, return the source tracks for ensemble recordings which provide them (default False)
  • all (bool) – only valid if sources is True : if True , all sources (audio and ground-truth) are returned, if False, only the first target instrument is returned. Default False.
  • composer (string) – the surname of the composer to filter
  • groups (list of strings) – a list of strings containing the name of the groups that you want to retrieve with a logic ‘AND’ among them. If empty, all groups are used. Example of groups are: ‘train’, ‘validation’, ‘test’. The available groups depend on the dataset. Only Maestro dataset supported for now.
  • datasets (list of strings) – a list of strings containing the name of the datasets to be used. If empty, all datasets are used. See License for the list of default datasets. The matching is case insensitive.
  • ground_truth (dict[str, int]) – a dictionary (string, int) representing the type of ground-truths needed (logical AND among list elements). Each entry has the form needed_ground_truth_type as key and level_of_truth as value, where needed_ground_truth_type is the key of the ground_truth dictionary and level_of_truth is an int ranging from 0 to 2 (0->False, 1->True (manual annotation), 2->True(automatic annotation)). If only part of a dataset contains a certain ground-truth type, you should use the group attribute to only select those songs.
  • copy (bool) – If True, a new Dataset object is returned, and the calling one is left untouched
Returns:

  • The input dataset as modified (d = Dataset().filter(…))
  • If copy is True, return a new Dataset object.
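A hedged sketch using the keyword arguments documented above (the dataset and group names come from the default definitions; whether a given dataset actually provides a certain annotation should be checked in its definition):

from asmd import asmd
from asmd.dataset_utils import filter

d = asmd.Dataset()
# new Dataset containing only Maestro songs of the 'train' group that
# provide manually annotated precise alignment
d_train = filter(d,
                 datasets=['Maestro'],
                 groups=['train'],
                 ground_truth={'precise_alignment': 1},
                 copy=True)
print(len(d_train.paths))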

asmd.dataset_utils.get_pedaling_mat(dataset, idx, frame_based=False, winlen=0.046, hop=0.01)[source]

Get data about pedaling

Parameters:
  • idx (int) – The index of the song to retrieve.
  • frame_based (bool) – If True, the output will contain one row per frame, otherwise one row per control changes event. Frames are deduced from winlen and hop.
  • winlen (float) – The duration of a frame in seconds; only used if frame_based is True.
  • hop (float) – The amount of hop-size in seconds; only used if frame_based is True.
Returns:

list of 2D arrays, each listing all the control change events in a track. Rows represent control changes or frames (according to the frame_based option) while columns represent (time, sustain value, sostenuto value, soft value).

If frame_based is used, time is the central time of the frame and frames are computed using the most aligned score available for this item.

If frame_based is False, value -1 is used for pedaling type not affected in a certain control change (i.e. a control change affects one type of pedaling, so the other two will have value -1).

The output is sorted by time.

Return type:

list[np.ndarray]

asmd.dataset_utils.get_score_mat(dataset, idx, score_type=['misaligned'], return_notes='')[source]

Get the score of a certain song, with times of score_type

Parameters:
  • idx (int) – The index of the song to retrieve.
  • score_type (list of str) – The key to retrieve the list of notes from the ground_truths. see chose_score_type for explanation
  • return_notes (str) – 'missing', 'extra' or 'both'; the notes that will be returned together with the score; see asmd.asmd.Dataset.get_missing_extra_notes for more info
Returns:

  • numpy.ndarray – A (n x 6) array where columns represent pitches, onsets (seconds), offsets (seconds), velocities, MIDI program instrument and number of the instrument. Ordered by onsets. If some information is not available, value -255 is used. The array is sorted by onset, pitch and offset (in this order)
  • numpy.ndarray – A boolean array with True if the note is missing or extra (depending on return_notes); only if return_notes is not None
  • numpy.ndarray – Another boolean array with True if the note is missing or extra (depending on return_notes); only if return_notes == 'both'

asmd.dataset_utils.intersect(*datasets, **kwargs)[source]

Takes datasets and returns a new dataset representing their intersection. The datasets must have the same order of datasets and songs (e.g. two datasets initialized in the same way and only filtered)

This function calls filter to populate the paths and returns them with all the sources. However, you can pass any argument to filter, e.g. the sources argument

asmd.dataset_utils.union(*datasets, **kwargs)[source]

Takes datasets and returns a new dataset representing their union. The datasets must have the same order of datasets and songs (e.g. two datasets initialized in the same way and only filtered)

This function calls filter to populate the paths and returns them with all the sources. However, you can pass any argument to filter, e.g. the sources argument
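A hedged sketch of these set operations on two views of the same collection (both initialized in the same way and only filtered, as required above):

from asmd import asmd
from asmd.dataset_utils import filter, intersect, union

d1 = filter(asmd.Dataset(), instruments=['piano'], copy=True)
d2 = filter(asmd.Dataset(), composer='Mozart', copy=True)

both = intersect(d1, d2)    # songs present in both views
either = union(d1, d2)      # songs present in at least one view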

General Utilities

asmd.utils.f0_to_midi_pitch(f0)[source]

Return a midi pitch (in 0-127) given a frequency value in Hz

asmd.utils.frame2time(frame: int, hop_size=3072, win_len=4096) → float[source]

Takes a frame index (int) and returns the corresponding central sample. The output will use the same unit of measure as hop_size and win_len (e.g. samples or seconds). Indices start from 0.

Returns a float!

asmd.utils.mat2midipath(mat, path)[source]

Writes a midi file from a mat like asmd:

pitch, start (sec), end (sec), velocity

If mat is empty, just do nothing.

asmd.utils.mat_stretch(mat, target)[source]

Changes times of mat in-place so that it has the same average BPM and initial time as target.

Returns mat changed in-place.

asmd.utils.midi_pitch_to_f0(midi_pitch)[source]

Return a frequency given a midi pitch (in 0-127)

asmd.utils.midipath2mat(path)[source]

Open a midi file with one instrument track and construct a mat like asmd:

pitch, start (sec), end (sec), velocity

Rows are sorted by onset, pitch and offset (in this order)

asmd.utils.nframes(dur, hop_size=3072, win_len=4096) → float[source]

Compute the number of frames given a total duration, the hop size and the window length. The output unit of measure will be the same as the inputs’ unit of measure (e.g. samples or seconds).

N.B. This returns a float!

asmd.utils.open_audio(audio_fn: Union[str, pathlib.Path]) → Tuple[numpy.ndarray, int][source]

Open the audio file in audio_fn and return a numpy array containing it, one row per channel (only mono is supported for now), together with the original sample rate

asmd.utils.time2frame(time, hop_size=3072, win_len=4096) → int[source]

Takes a time position and outputs the best frame representing it. The input must use the same unit of measure for time, hop_size, and win_len (e.g. samples or seconds). Indices start from 0.

Returns an int!
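A small usage sketch of these helpers (here hop_size and win_len are expressed in samples, so inputs and outputs are in samples too):

from asmd.utils import frame2time, nframes, time2frame

center = frame2time(10, hop_size=3072, win_len=4096)      # central sample of frame 10
frame = time2frame(center, hop_size=3072, win_len=4096)   # back to the frame index
n = nframes(44100 * 60, hop_size=3072, win_len=4096)      # frames in 60 s of audio at 44.1 kHz
print(center, frame, n)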

Utilities for statistical analysis

Scientific notes

Artificial misalignment

This dataset tries to overcome the problem of needing manual alignment of scores to audio for training models which exploit audio and scores at the same time. The underlying idea is that we have many scores and a lot of audio, and users of trained models could easily take advantage of such multimodality (the ability of the model to exploit both scores and audio). The main problem is the annotation stage: we have quite a lot of aligned data, but we miss the corresponding scores, and when we have the scores, we almost always miss the aligned performance.

The approach used is to statistically analyze the available manual annotations and to reproduce them. Indeed, by misaligned data I mean data which try to reproduce the statistical features of the difference between scores and aligned data.

New description

You can evaluate the various approaches by running python -m asmd.alignment_stats. The script uses Eita Nakamura’s method to match notes between the score and the performance and collects statistics only on the matched notes; it then computes the distance between the misaligned onset/offset sequences and the real score ones, considering only the matching notes and using the L1 error between them. The evaluation uses vienna_corpus, traditional_flute, MusicNet, Bach10 and the asap group from the Maestro dataset, for a total of 875 scores, split into train and test sets with a 70-30 proportion, resulting in 641 songs for training and 234 songs for testing.

However, since Eita’s method takes a long time on some scores, I removed the scores for which it did not finish within 20 seconds; this resulted in a total of 347 songs for training and ~143 songs for testing (~54% and ~61% of the total number of songs with an available score).

Both compared methods are based on the random choice of a standard deviation and a mean for the whole song according to the collected distributions of standard deviations and means. Statistics are collected for onset differences and duration ratios between performance and score. After the estimation of the new onsets and offsets, onsets are sorted and offsets are made lower than the next onset with the same pitch.

The two methods differ in how the standardized misalignment is computed/generated:

  • the old method randomly chooses it according to the collected distribution
  • the new method uses an HMM with Gaussian mixture emissions instead of a simple distribution

Moreover, the misaligned data are computed with models trained on the stretched scores, so that the training data consist of scores at the same average BPM as the performance; the misaligned data, then, consist of times at that average BPM.

The following table summarizes the results of the comparison (mean ± standard deviation of the L1 error):

       Onsets         Offsets
HMM    18.6 ± 49.7    20.7 ± 50.6
Hist   7.43 ± 15.5    8.95 ± 15.5

Misaligned data are finally created by training the Histogram method on all 875 scores (~481 when considering only the songs where Eita’s method takes less than 20 s). Misaligned data are more similar to a new performance than to a symbolic score; for most MIR applications, however, misaligned data are enough for both training and evaluation.

BPM for score alignment

Previously, the BPM was always forced to 20, so that, if the BPM was not available, note durations could still be expressed in seconds.

Since version 0.5, the BPM is simply set to 60 if not available; however, the positions of beats are always provided, so that the user can reconstruct the instant BPM. The function get_initial_bpm from the Python API also provides a way to retrieve the initial instant BPM from the score.

An easy way to get an approximate BPM is to stretch the score to the duration of the corresponding performance. This can also be done for the beats and, consequently, for the instant BPM. For instance, let T_0 and T_1 be the initial and ending times of the performance, and let t_0 and t_1 be the initial and ending times of the score. Then, the stretched times of the score at the average performance BPM are given by:

t_new = (t_old - t_0) * (T_1 - T_0) / (t_1 - t_0) + T_0

where t_old is an original time instant in the score and t_new is the new time instant after the stretching. Applying this formula to the beat times can help you to compute the new instant BPM while keeping the same average BPM as the performance. This functionality is provided by asmd.utils.mat_stretch for onsets and offsets, but not for beats yet.
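A minimal sketch of this stretching applied to a list of score times (the variable names are hypothetical):

def stretch_times(score_times, t_0, t_1, T_0, T_1):
    """Map score times in [t_0, t_1] onto performance times in [T_0, T_1]."""
    return [(t - t_0) * (T_1 - T_0) / (t_1 - t_0) + T_0 for t in score_times]

# e.g. stretching the score beats to the performance duration, in order to
# estimate the instant BPM at the performance's average BPM:
# stretched_beats = stretch_times(beats, t_0, t_1, T_0, T_1)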

License

Ground-truth annotations

All ground-truth annotations we used are originally released under Creative Commons licenses. We release our adaptations under Creative Commons.

They are retrieved starting from the following projects:

Dataset                        Name used in the default definitions
Bach10                         Bach10
Maestro                        Maestro
MusicNet                       MusicNet
PHENICX - Anechoic             PHENICX
Saarland Music Dataset (SMD)   SMD
Traditional flute dataset      traditional_flute
TRIOS dataset                  TRIOS (removed: link is dead)
Vienna Corpus                  vienna_corpus

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Code

All the code is released under MIT license:

Copyright 2020 Federico Simonetta https://federicosimonetta.eu.org

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

This paper describes an open-source Python framework for handling datasets for music processing tasks, built with the aim of improving the reproducibility of research projects in music computing and assessing the generalization abilities of machine learning models. The framework enables the automatic download and installation of several commonly used datasets for multimodal music processing. Specifically, we provide a Python API to access the datasets through Boolean set operations based on particular attributes, such as intersections and unions of composers, instruments, and so on. The framework is designed to ease the inclusion of new datasets and the respective ground-truth annotations so that one can build, convert, and extend one’s own collection as well as distribute it by means of a compliant format to take advantage of the API. All code and ground-truth are released under suitable open licenses.

For a gentle introduction, see our paper [1]

TODO

  1. add automatic matching of songs among multiple datasets based on metadata (and maybe audio ID?)
  2. change the filter function so that each level of filtering takes a keyword and a value and filters that keyword at that level
  3. describe datasets provided by default
  4. generic description of the framework
  5. improve “adding_datasets” with a full example
  6. add section “examples”
  7. move wget to curl
  8. support Windows systems

Cite us

[1] Simonetta, Federico ; Ntalampiras, Stavros ; Avanzini, Federico: ASMD: an automatic framework for compiling multimodal datasets. In: Proceedings of the 17th Sound and Music Computing Conference. Torino, 2020 arXiv:2003.01958

Federico Simonetta