Paper dossier

The GigaMIDI Dataset with Features for Expressive Music Performance Detection

Detail view | Similarity handoff

Review source metadata, abstract, authors, topics, and local similarity context before moving into explanation and ranking views.

Paper year

2025

Citations

2

Authors

6

Topic labels

3

Source readout

Source and corpus status

Venue

Transactions of the International Society for Music Information Retrieval

Source slug

tismir

Corpus placement

Core corpus

Similarity rows

6

Ranking readout

Where this paper lands in the current run

Run shadow-generalization-product-candidate-ranking-v1 | Top 50 surfaced

This block uses the same resolved ranking run as Recommended. Ranks here are materialized paper_scores ranks; live Emerging may be reordered by the bounded ML scorer. Family rank is global within each family, but rank is only shown when this paper lands inside the surfaced top 50.

Families present

3

Top 50

3

Run label

shadow-generalization-product-candidate-ranking-v1

Snapshot

source-snapshot-shadow-generalization-v1-20260521

Scope: family global | run rank-83787b91ef

Emerging

In top 50 at rank 18

0.475

Emerging: embedding slice fit vs included-corpus centroid (title+abstract), plus citation velocity and topic growth; not universal relevance. Bridge signal not used here.

Signals: semantic=0.8500, citation_velocity=0.1200, topic_growth=0.8160, diversity_penalty=0.0000

Why this surfaced | 3 used | 1 penalty | 1 not computed

  • Embedding slice fit (corpus centroid): high; used in final ranking (contribution to score: 0.1700)
  • Recent attention: low; used in final ranking (contribution to score: 0.0600)
  • Topic momentum: high; used in final ranking (contribution to score: 0.2448)
  • Cross-cluster signal: not computed for this run
  • Similarity penalty: reduces score when non-zero (contribution to score: 0.0000)
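
The per-signal contributions listed above appear to add up to the displayed Emerging score of 0.475. The following is a minimal sketch of that check, assuming the score is a plain sum of the listed contributions; the readout does not document the scoring formula, and the names `EMERGING_CONTRIBUTIONS` and `reconstruct_score` are hypothetical.

```python
# Contributions copied from the Emerging readout.
# Assumption: the family score is the plain sum of these contributions.
EMERGING_CONTRIBUTIONS = {
    "embedding_slice_fit": 0.1700,
    "recent_attention": 0.0600,
    "topic_momentum": 0.2448,
    "similarity_penalty": 0.0000,
}


def reconstruct_score(contributions):
    """Sum per-signal contributions into a single family score."""
    return sum(contributions.values())


emerging_score = reconstruct_score(EMERGING_CONTRIBUTIONS)
print(round(emerging_score, 3))  # 0.475, matching the displayed score
```

If the sum did not match the displayed score, that would suggest a signal missing from the readout or a non-additive scorer.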

Bridge

In top 50 at rank 17

0.572

Multi-topic paper in active topics; no cluster_version on this run so bridge_score was not computed.

Signals: citation_velocity=0.1200, topic_growth=0.8160, diversity_penalty=0.0000

Why this surfaced | 2 used | 1 penalty | 2 not computed

  • Semantic match: not computed for this run
  • Recent attention: low; used in final ranking (contribution to score: 0.0420)
  • Topic momentum: high; used in final ranking (contribution to score: 0.5304)
  • Cross-cluster signal: not computed for this run
  • Topic breadth penalty: reduces score when non-zero (contribution to score: 0.0000)

Under-cited

In top 50 at rank 47

0.497

Low-cite candidate pool (see docs/candidate-pool-low-cite.md v0): core corpus, recency floor, citation ceiling, title+abstract gate; popularity penalty among pool members only. Semantic and bridge not yet modeled.

Signals: citation_velocity=0.1200, topic_growth=0.8160, diversity_penalty=0.4421

Why this surfaced | 2 used | 1 penalty | 2 not computed

  • Semantic match: not computed for this run
  • Recent attention: low; used in final ranking (contribution to score: 0.0360)
  • Topic momentum: high; used in final ranking (contribution to score: 0.5712)
  • Cross-cluster signal: not computed for this run
  • Pool popularity penalty: reduces score when non-zero (contribution to score: -0.1105)
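
All three families report the same raw signals (citation_velocity=0.1200, topic_growth=0.8160) but different contributions, which suggests each family applies its own weights. The sketch below backs out those implied weights by dividing each contribution by its raw signal. It assumes a linear weight-times-signal model, which the readout does not confirm, and all names in it are hypothetical.

```python
# Raw signals and contributions copied from the three family readouts.
# Assumption: contribution = weight * raw signal (a linear model).
RAW = {"citation_velocity": 0.1200, "topic_growth": 0.8160}

CONTRIBUTIONS = {
    "Emerging": {"citation_velocity": 0.0600, "topic_growth": 0.2448},
    "Bridge": {"citation_velocity": 0.0420, "topic_growth": 0.5304},
    "Under-cited": {"citation_velocity": 0.0360, "topic_growth": 0.5712},
}


def implied_weights(parts, raw):
    """Back out per-signal weights, rounded to two decimals."""
    return {name: round(value / raw[name], 2) for name, value in parts.items()}


weights_by_family = {
    family: implied_weights(parts, RAW) for family, parts in CONTRIBUTIONS.items()
}
for family, weights in weights_by_family.items():
    print(family, weights)

# Under-cited also subtracts a pool popularity penalty.
penalty_signal = 0.4421         # diversity_penalty from the Signals line
penalty_contribution = -0.1105  # from the penalty row
implied_penalty_weight = round(-penalty_contribution / penalty_signal, 2)

under_cited_score = sum(CONTRIBUTIONS["Under-cited"].values()) + penalty_contribution
print(implied_penalty_weight, round(under_cited_score, 3))
```

Under this reading, topic momentum carries an implied weight of 0.30 in Emerging, 0.65 in Bridge, and 0.70 in Under-cited. That is consistent with topic momentum dominating the Bridge and Under-cited scores on this run, where the semantic signal is not computed.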

Abstract

The Musical Instrument Digital Interface (MIDI), introduced in 1983, revolutionized music production by allowing computers and instruments to communicate efficiently. MIDI files encode musical instructions compactly, facilitating convenient music sharing. They benefit music information retrieval (MIR), aiding in research on music understanding, computational musicology, and generative music. The GigaMIDI dataset contains over 1.4 million unique MIDI files, encompassing 1.8 billion MIDI note events and over 5.3 million MIDI tracks. GigaMIDI is currently the largest collection of symbolic music in MIDI format available for research purposes under fair dealing. Distinguishing between non‑expressive and expressive MIDI tracks is challenging, as MIDI files do not inherently make this distinction. To address this issue, we introduce a set of innovative heuristics for detecting expressive music performance. These include the distinctive note velocity ratio (DNVR) heuristic, which analyzes MIDI note velocity; the distinctive note onset deviation ratio (DNODR) heuristic, which examines deviations in note onset times; and the note onset median metric level (NOMML) heuristic, which evaluates onset positions relative to metric levels. Our evaluation demonstrates these heuristics effectively differentiate between non‑expressive and expressive MIDI tracks. Furthermore, after evaluation, we create the most substantial expressive MIDI dataset, employing our heuristic NOMML. This curated iteration of GigaMIDI encompasses expressively performed instrument tracks detected by NOMML, containing all General MIDI instruments, constituting 31% of the GigaMIDI dataset, totaling 1,655,649 tracks.
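
The abstract names the DNVR heuristic but does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading: the share of the 127 possible non-zero MIDI note-on velocity levels that a track actually uses. The function name, the sample velocity lists, and the formula itself are assumptions, not the authors' implementation.

```python
def distinct_velocity_ratio(velocities, max_levels=127):
    """Fraction of the possible note-on velocity levels (1..127) used by a track.

    Assumption: this ratio stands in for the DNVR idea; the dossier does not
    state the actual definition.
    """
    distinct = {v for v in velocities if 1 <= v <= max_levels}
    return len(distinct) / max_levels


# Hypothetical tracks: a step-sequenced part with one fixed velocity,
# and a performed part with varied velocities.
quantized_track = [64] * 16
performed_track = [52, 61, 58, 70, 66, 49, 73, 61, 55, 68]

print(distinct_velocity_ratio(quantized_track))
print(distinct_velocity_ratio(performed_track))
```

Under this assumption, a track that reuses a single velocity scores near zero, while a performed track with varied dynamics scores higher. Any threshold separating the two would have to come from the paper itself.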

Authors

  • Keon Ju Maverick Lee
  • Jeff Ens
  • Sara Adkins
  • Pedro Sarmento
  • Mathieu Barthet
  • Philippe Pasquier

Neighborhood labels

Topics

3 labels

Topic labels are imported metadata and can be noisy; use them as coarse navigation hints, not authoritative classifications.

  • Music and Audio Processing
  • Music Technology and Sound Studies
  • Neuroscience and Music Perception

Neighbor surface

Similar papers

6 total neighbors | Embedding v1-title-abstract-1536-cleantext-r3

Similar papers use a separately configured neighbor embedding; it may differ from the embedding version used by the current ranked run.

Next handoff

Best next moves from here

01

Check recommendation families

Use Recommended to see whether this paper behaves like an emerging or under-cited signal in the current ranked feed, or to see how it appears on the bridge preview / diagnostics view.

02

Inspect nearby topics

Use Trends to understand whether its attached labels are heating up or cooling down inside the curated corpus.

03

Cross-check evaluation baselines

Use Evaluation to compare the dossier readout against citation and recency baselines for the same resolved family run.