NASA’s Interagency Implementation and Advanced Concepts Team (IMPACT) collaborates with private, non-federal partners through Space Act Agreements. One such collaboration, with International Business Machines (IBM), has produced INDUS, a suite of large language models (LLMs) customized for five science domains: Earth science, biological and physical sciences, heliophysics, planetary sciences, and astrophysics.

The INDUS suite contains encoders and sentence transformers tailored to scientific domains. The encoders were trained on a corpus of 60 billion tokens spanning multiple scientific fields. A custom tokenizer developed by the IMPACT-IBM team improves the recognition of scientific terms, which is part of what makes INDUS specialized. The sentence transformer models were fine-tuned on millions of text pairs and perform strongly on benchmarks covering biomedical tasks, scientific question answering, and Earth science entity recognition.
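To illustrate why a domain-specific vocabulary helps with scientific terms, here is a minimal sketch in Python. It uses a greedy longest-match subword split purely for illustration; the article does not describe the algorithm behind the INDUS tokenizer, and the function name `split_into_subwords` and both toy vocabularies are hypothetical.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match-first subword split (illustrative only).

    Continuation pieces are marked with a leading "##".
    Returns ["[UNK]"] if the word cannot be covered by the vocabulary.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabularies: the domain one adds a whole scientific term.
general_vocab = {"hel", "##io", "##phys", "##ics"}
domain_vocab = general_vocab | {"heliophysics"}

general_pieces = split_into_subwords("heliophysics", general_vocab)
domain_pieces = split_into_subwords("heliophysics", domain_vocab)

print(general_pieces)
print(domain_pieces)
```

With the general vocabulary the term is broken into four fragments, while the domain vocabulary keeps it as a single token, so the model sees the scientific term as one unit.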

INDUS is designed to handle a range of language tasks, including retrieval-augmented generation: it can process a research question, retrieve relevant documents, and generate precise answers. Smaller, faster versions of the encoder and sentence transformer models serve latency-sensitive applications. In validation tests, INDUS performed well at retrieving pertinent information from scientific corpora, which points to its potential for making scientific research more efficient.
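The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the INDUS pipeline: `embed` is a bag-of-words stand-in for a sentence transformer, and `retrieve`, `build_prompt`, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence-embedding model: word counts."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=2):
    """Rank documents by similarity to the query and keep the top_k."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query, passages):
    """Assemble the text that would be handed to a generator model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n{context}\nQuestion: {query}"


# Hypothetical mini-corpus, one sentence per science domain.
documents = [
    "Solar flares release bursts of radiation from the Sun.",
    "Sea surface temperature is measured by satellite radiometers.",
    "Microgravity alters gene expression in plant roots.",
]

query = "How is sea surface temperature measured?"
passages = retrieve(query, documents, top_k=1)
prompt = build_prompt(query, passages)
print(prompt)
```

In a real system the `embed` function would be replaced by a trained sentence transformer and the prompt would be passed to a generative model; the ranking and prompt-assembly steps stay structurally the same.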

As part of the collaboration with IBM, INDUS was evaluated using data from NASA’s Biological and Physical Sciences (BPS) Division. Integrating INDUS with the Open Science Data Repository (OSDR) API paved the way for a chatbot that simplifies dataset navigation. At the NASA Goddard Earth Sciences Data and Information Services Center (GES-DISC), integrating INDUS has improved the categorization of publications that cite GES-DISC data.

NASA’s Science Discovery Engine (SDE) has successfully prototyped the integration of INDUS into its search engine, improving the accuracy and relevance of search results. Beyond search, INDUS can improve the user experience by streamlining data retrieval and offering insights into new research directions. The models developed through this collaboration are openly available on platforms such as Hugging Face, in line with NASA and IBM’s commitment to transparent artificial intelligence.

The collaboration between NASA and IBM has advanced language models tailored for scientific domains. INDUS gives the scientific community natural language processing tools built for its own vocabulary and literature, making research materials easier to find and use. By offering specialized tools and openly sharing their models, NASA and IBM are fostering innovation and collaboration in artificial intelligence.
