December 6, 2025
Journal Article
The Power of Many: An Ensemble Approach to Spectral Similarity
Abstract
Quantifying the similarity between two spectra, a known reference spectrum and an unidentified query spectrum from a sample, is at the heart of compound identification workflows in gas chromatography-mass spectrometry (GC-MS). The reference spectrum most similar to the query is assigned as its identification, so measuring similarity accurately is essential. Significant research has gone into developing metrics for this purpose, each attempting to improve on existing methods by incorporating GC-MS-specific information (e.g., peak ratios or retention times) or by adopting various statistical and algorithmic frameworks. While this active development has produced a plethora of similarity metrics with demonstrated value across different contexts, it has also created confusion about which metric should serve as a global standard. No metric is currently accepted as the standard method, because different metrics have performed best in different contexts. In this work, we propose an ensemble approach to spectral similarity scoring that combines the information from existing similarity metrics into an improved, globally representative similarity metric, as a step towards establishing a global standard method. The resulting ensemble metrics are evaluated on over 88,000 spectra of varying complexity and are better able to rank the correct reference spectrum as the top-matching candidate for a query than the individual similarity scores.
Published: December 6, 2025
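The ensemble idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which metrics are combined or how, so everything here is an assumption made for illustration: the two metrics (cosine similarity and a histogram-intersection overlap), the unweighted mean used to combine them, the function names, and the toy spectra, which are represented as intensity vectors on a shared m/z grid. It is not the authors' method.

```python
import math


def cosine_similarity(query, reference):
    """Cosine similarity between two intensity vectors on a shared m/z grid."""
    dot = sum(q * r for q, r in zip(query, reference))
    norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(r * r for r in reference))
    return dot / norm if norm else 0.0


def overlap_similarity(query, reference):
    """Histogram intersection of the two spectra after scaling each to unit sum."""
    q_total, r_total = sum(query), sum(reference)
    if not q_total or not r_total:
        return 0.0
    return sum(min(q / q_total, r / r_total) for q, r in zip(query, reference))


# Assumed metric set; a real ensemble would likely draw on many more metrics.
METRICS = (cosine_similarity, overlap_similarity)


def ensemble_score(query, reference, metrics=METRICS):
    """Unweighted mean of the individual scores (an assumed combination rule)."""
    return sum(metric(query, reference) for metric in metrics) / len(metrics)


def rank_references(query, library, metrics=METRICS):
    """Return reference names ordered from highest to lowest ensemble score."""
    return sorted(
        library,
        key=lambda name: ensemble_score(query, library[name], metrics),
        reverse=True,
    )


# Hypothetical toy data: one query and a two-entry reference library.
query = [0.0, 10.0, 100.0, 35.0, 0.0]
library = {
    "compound_a": [0.0, 12.0, 100.0, 30.0, 0.0],
    "compound_b": [50.0, 0.0, 20.0, 0.0, 100.0],
}

ranking = rank_references(query, library)
print(ranking[0])  # the top-ranked candidate is assigned as the identification
```

In this toy case `compound_a` ranks first, since its peak pattern closely follows the query under both metrics. Both metrics here already fall in the range 0 to 1, so a plain mean is reasonable; metrics on different scales would need normalization or rank-based aggregation before being combined.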