BigBio: A Framework for Data-Centric Biomedical Natural Language Processing

Jason Fries; Leon Weber; Natasha Seelam; Gabriel Altay; Debajyoti Datta; Samuele Garda; Sunny Kang; Rosaline Su; Wojciech Kusa; Samuel Cahyawijaya; Fabio Barth; Simon Ott; Matthias Samwald; Stephen Bach; Stella Biderman; Mario Sänger; Bo Wang; Alison Callahan; Daniel León Periñán; Théo Gigant; Patrick Haller; Jenny Chim; Jose Posada; John Giorgi; Karthik Rangasai Sivaraman; Marc Pàmies; Marianna Nezhurina; Robert Martin; Michael Cullan; Moritz Freidank; Nathan Dahlberg; Shubhanshu Mishra; Shamik Bose; Nicholas Broad; Yanis Labrak; Shlok Deshmukh; Sid Kiblawi; Ayush Singh; Minh Chien Vu; Trishala Neeraj; Jonas Golde; Albert Villanova del Moral; Benjamin Beilharz

NeurIPS 2022

Abstract

Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBio, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BigBio facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few-shot and zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale multi-task learning. BigBio is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical

Topics

Artificial Intelligence > Learning Paradigms > Few-Shot Learning
Natural Language Processing > Resources & Methods > Large Language Models
Healthcare & Medicine > Research > Bioinformatics
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning

Keywords

zero-shot learning; few-shot learning; named entity recognition; instruction tuning; biomedical natural language processing; large language model; biomedical language
