Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search

AI-generated keywords: Chemistry

AI-generated Key Points

  • Nearest neighbor-based similarity searching in chemistry is crucial for applications like drug discovery.
  • Traditional brute-force methods for this task can be computationally expensive and time-consuming due to the size of chemical databases.
  • Combining low-dimensional chemical embeddings with a k-d tree data structure can achieve rapid nearest neighbor queries while maintaining high performance on standard benchmarks.
  • SmallSA, a learned embedding generated using the SALSA framework trained on 1.4 million chemical compounds, significantly accelerates searches on large databases, executing queries on over one billion chemicals in less than a second on a single CPU core – five orders of magnitude faster than brute-force methods.
  • SmallSA achieves competitive performance on standard chemical similarity benchmarks compared to traditional fingerprinting methods like ECFPs and ECFCs, highlighting the potential of leveraging low-dimensional embeddings and k-d trees for efficient chemical similarity searching in drug discovery and other chemistry applications.

Authors: Kathryn E. Kirchoff, James Wellnitz, Joshua E. Hochuli, Travis Maxfield, Konstantin I. Popov, Shawn Gomez, Alexander Tropsha

License: CC BY 4.0

Abstract: Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet, some of the most commonly used approaches for this task still leverage a brute-force approach. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally-aware embedding -- SmallSA -- for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks.
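
The idea in the abstract, replacing a linear scan with a k-d tree query over low-dimensional embeddings, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the 8-dimensional random vectors stand in for real chemical embeddings, and the function names (`build_kdtree`, `nearest`, `brute_force`) and the dimension `DIM` are assumptions chosen for illustration. The sketch only shows that an exact k-d tree search returns the same nearest neighbor as brute force while pruning subtrees that cannot contain a closer point.

```python
import random


def build_kdtree(points, depth=0):
    """Recursively build a k-d tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point becomes the split node
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared_distance, point) for the nearest neighbor of query."""
    if node is None:
        return best
    point, axis, left, right = node
    d = sq_dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best


def brute_force(points, query):
    """Linear scan over every point, used here as the reference answer."""
    return min((sq_dist(p, query), p) for p in points)


random.seed(0)
DIM = 8  # hypothetical embedding size, chosen only for this illustration
embeddings = [tuple(random.random() for _ in range(DIM)) for _ in range(2000)]
tree = build_kdtree(embeddings)
query = tuple(random.random() for _ in range(DIM))

kd_hit = nearest(tree, query)
bf_hit = brute_force(embeddings, query)
print(kd_hit[0] == bf_hit[0])
```

Pruning of this kind tends to lose effectiveness as dimensionality grows, which is consistent with the abstract's remark that lower-complexity search algorithms struggle with typical high-dimensional chemical embeddings.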

Submitted to arXiv on 12 Feb. 2024


Results of the summarizing process for the arXiv paper: 2402.07970v1

In the field of chemistry, nearest neighbor-based similarity searching is a crucial task, particularly in applications like drug discovery. Traditional approaches to this task often rely on brute-force methods, which can be computationally expensive and time-consuming due to the vast size of modern chemical databases. Previous advancements in computational techniques for this task have been limited in their generalizability and efficiency.

In this study, we explore the potential of combining low-dimensional chemical embeddings with a k-d tree data structure to achieve rapid nearest neighbor queries while maintaining high performance on standard chemical similarity benchmarks. Our approach addresses challenges faced by existing tools such as ChemMine and FPSim2, which utilize partitioning schemes based on Baldi bounds but are limited by binary embeddings and unequal partition sizes. To overcome these limitations, we propose SmallSA, a learned embedding generated using the SALSA framework trained on 1.4 million chemical compounds from the Enamine catalog. We compare SmallSA with commonly used Extended-Connectivity Fingerprints (ECFPs) and Extended-Connectivity Count Vectors (ECFCs) as well as their low-dimensional versions.

Our results demonstrate that our approach significantly accelerates searches on large chemical databases, executing queries on over one billion chemicals in less than a second on a single CPU core, five orders of magnitude faster than brute-force methods. Furthermore, SmallSA achieves competitive performance on standard chemical similarity benchmarks compared to traditional fingerprinting methods like ECFPs and ECFCs. This highlights the potential of leveraging low-dimensional embeddings and k-d trees for efficient chemical similarity searching in drug discovery and other applications within the field of chemistry.
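
For context on the Baldi-bound partitioning mentioned above, the following is a minimal sketch of how such a bound can prune a Tanimoto search over binary fingerprints. It relies on the commonly stated inequality T(A, B) <= min(|A|, |B|) / max(|A|, |B|), where |A| is the number of on bits. It is not the implementation used by ChemMine or FPSim2; the toy fingerprints (sets of on-bit indices), the threshold, and the function names are assumptions made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def search_with_bound(query, database, threshold):
    """Return (hits, compared): hits at or above threshold, and how many
    fingerprints needed a full Tanimoto computation after pruning."""
    hits, compared = [], 0
    q = len(query)
    for i, fp in enumerate(database):
        n = len(fp)
        # Upper bound from bit counts alone: min(|A|, |B|) / max(|A|, |B|)
        bound = min(q, n) / max(q, n) if max(q, n) else 1.0
        if bound < threshold:
            continue  # cannot reach the threshold, skip the full comparison
        compared += 1
        s = tanimoto(query, fp)
        if s >= threshold:
            hits.append((i, s))
    return hits, compared


# Toy data, invented for this example only.
database = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    {1, 2, 3},
    {7, 8},
    {1, 2, 3, 4, 5},
]
query = {1, 2, 3, 4}

hits, compared = search_with_bound(query, database, threshold=0.7)
exhaustive = [(i, tanimoto(query, fp)) for i, fp in enumerate(database)
              if tanimoto(query, fp) >= 0.7]
print(hits == exhaustive, compared)
```

Because the bound depends only on bit counts, it applies to binary fingerprints and groups a database by popcount, which is one way to read the summary's remark about binary embeddings and unequal partition sizes.
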
Created on 02 Oct. 2025


