Applying Object Detection to Automatic Drum Transcription – American Journal of Student Research

Publication Date : Sep-19-2025

DOI: 10.70251/HYJR2348.35282287


Author(s) :

Dylan Li


Volume/Issue :

Volume 3, Issue 5 (Sep - 2025)



Abstract :

Automatic music transcription (AMT) is a fundamental problem in music information retrieval (MIR), involving the conversion of audio recordings into symbolic representations such as MIDI. This study presents a novel approach to automatic drum transcription (ADT) by reframing it as a computer vision object detection problem. Using the YOLO11 model, drum notes are predicted and transcribed as bounding boxes in spectrograms generated from the Expanded Groove MIDI Dataset (E-GMD). Two-second audio segments are extracted via a sliding window, converted into 640×640 grayscale spectrograms, and annotated with bounding boxes corresponding to onset times and instrument classes. The model achieves strong detection performance, with an mAP@0.5 of 0.943, a precision of 0.892, and a recall of 0.846. These results demonstrate YOLO11’s ability to handle polyphonic, temporally dense drum passages without explicit onset separation. This work highlights the potential of adapting computer vision techniques to audio-based event detection, paving the way for broader MIR applications beyond percussion, such as multi-instrument transcription and real-time performance analysis.
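
The windowing and annotation step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The two-second window comes from the abstract, but the hop size, the normalized box width, the full-height box, and the function names `window_starts` and `yolo_labels` are all assumptions made for illustration. Labels are written in the common YOLO text format (`class x_center y_center width height`, each normalized to the range 0 to 1).

```python
from typing import List, Tuple

WINDOW_SEC = 2.0   # segment length stated in the abstract
HOP_SEC = 1.0      # assumption: the abstract does not state the hop size
BOX_WIDTH = 0.02   # assumption: normalized width of the box around an onset


def window_starts(duration: float, window: float = WINDOW_SEC,
                  hop: float = HOP_SEC) -> List[float]:
    """Return the start time of every full window that fits in the recording."""
    if duration < window:
        return []
    count = int((duration - window) // hop) + 1
    return [i * hop for i in range(count)]


def yolo_labels(onsets: List[Tuple[float, int]], start: float,
                window: float = WINDOW_SEC,
                box_width: float = BOX_WIDTH) -> List[str]:
    """Build YOLO-format label lines for the onsets inside one window.

    `onsets` holds (onset_time_in_seconds, class_id) pairs. Time maps to the
    horizontal axis of the spectrogram image.
    """
    lines = []
    for onset, cls in onsets:
        if not (start <= onset < start + window):
            continue  # onset belongs to a different window
        x = (onset - start) / window
        # Clamp the box to the image so it never extends past the edges.
        left = max(0.0, x - box_width / 2)
        right = min(1.0, x + box_width / 2)
        x_center = (left + right) / 2
        width = right - left
        # Assumption: the box spans the full frequency axis (full height).
        lines.append(f"{cls} {x_center:.6f} 0.500000 {width:.6f} 1.000000")
    return lines


if __name__ == "__main__":
    hits = [(0.5, 0), (2.5, 1)]  # hypothetical onsets: (seconds, class id)
    for s in window_starts(5.0):
        print(s, yolo_labels(hits, s))
```

For a 640×640 image, a normalized `x_center` converts to a pixel column by multiplying by 640. A real pipeline would likely restrict each box to a frequency band suited to the instrument class rather than using the full height assumed here.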