MoleculeMiner: Extracting and Linking Molecule Figures with Tabular Metadata
Abhisek Dey, Nathaniel H. Stanley
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Demo Track. Pages 11034-11038.
https://doi.org/10.24963/ijcai.2025/1257
Despite an ongoing shift in automated chemical literature search methods, many remain limited in their ability to find specific, relevant information about a drawn molecule and its associated property data. We aim to tackle the challenge of converting drawn molecules to a machine-readable representation and co-referencing any associated molecule data. MoleculeMiner is a system in which a user can supply their own patent or paper and obtain each drawn molecule along with any specific metadata (chemical name, chemical reactivity, yield, purity, etc.) provided anywhere in the PDF in a tabular format, using an interactive, user-friendly environment. We also present MolScribeV2, a molecular image parser that improves upon the original MolScribe by introducing a pixel-based self-attention positional embedding technique. Together with other changes, this makes MolScribeV2 robust to the varied styles of compound drawings commonly found in patents and papers, whether scanned or born-digital. Our extraction and interactive user system can be found at https://github.com/insitro/MoleculeMiner.
Keywords:
Multidisciplinary Topics and Applications: MTA: Bioinformatics
Knowledge Representation and Reasoning: KRR: Applications
Computer Vision: CV: Structural and model-based approaches, knowledge representation and reasoning
Computer Vision: CV: Recognition (object detection, categorization)
