L+M-24: Building a Dataset for Language+Molecules @ ACL 2024

Carl Edwards; Qingyun Wang; Lawrence Zhao; Heng Ji

ACL 2024

Abstract

Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. The datasets released so far fall into three groups: 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the L+M-24 dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, L+M-24 is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Core Methods > Representation Learning; Natural Language Processing > Resources & Methods > Text Representation; Machine Learning > Learning Types > Multi-Modal Learning; Interdisciplinary > Science > Bioinformatics; Deep Learning > Models > Multimodal Learning

Keywords

text representation; natural language; scientific literature; dataset construction; molecular discovery; language-molecule model
