Structured information extraction from complex scientific text with fine-tuned large language models

AI-generated keywords: Materials Science, BERT Model, GPT-3, NLP, Information Extraction

AI-generated Key Points

  • Previous work by Huang & Cole fine-tuned a BERT model on battery publications to enhance a database of NLP-extracted battery data.
  • Their approach used a question and answer (Q/A) method for limited device-level information extraction.
  • Their approach had two limitations: it could not handle passages describing multiple devices, and it required training the BERT language model on a large corpus of battery research papers.
  • This work proposes a simple and flexible approach for complex information extraction in scientific text.
  • The method involves fine-tuning the GPT-3 language model to perform document-level named entity recognition and relation extraction simultaneously.
  • The GPT-3 model generates precisely formatted summaries or structured schemas from text passages like research paper abstracts.
  • Researchers only need to define the desired output structure and annotate around 100-500 text passages to use this method.
  • The resulting fine-tuned model accurately extracts desired information from text in the same structured representation.
  • This approach demonstrates strong performance in both sentence-level and document-level materials information extraction tasks.
  • It requires minimal knowledge of how language models work internally, making it accessible even for researchers with little experience in NLP.
  • Intermediate models can be used to pre-suggest entities for annotation, speeding up the annotation process for constructing large training sets.
  • This method's generality suggests its applicability to other domains such as physics or biology.
  • Unlike previous methods relying on fine-tuning on large domain-specific corpora, this approach leverages comprehensive pretraining of language models along with user-provided annotations.

Authors: Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain

License: CC BY 4.0

Abstract: Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
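
As a rough illustration of the prompt/completion pairs the abstract describes, the sketch below turns one annotated passage into a training example for the dopant-host task and serializes it as JSON Lines. The schema keys ("host", "dopants"), the separator and stop strings, the helper names, and the example sentence are all assumptions made for illustration; the paper's actual schema and formatting may differ.

```python
import json

# Hypothetical separator and stop markers; the paper's actual tokens may differ.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND\n\n"


def make_pair(passage: str, records: list[dict]) -> dict:
    """Turn one annotated passage into a prompt/completion training example."""
    return {
        "prompt": passage.strip() + PROMPT_END,
        # The completion is the annotation, serialized as a list of JSON objects.
        "completion": " " + json.dumps(records, ensure_ascii=False) + COMPLETION_END,
    }


def to_jsonl(pairs: list[dict]) -> str:
    """Serialize training examples as JSON Lines, one example per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


# Illustrative (made-up) sentence and annotation for the dopant-host task.
example = make_pair(
    "ZnO films doped with Al and Ga were grown by sputtering.",
    [{"host": "ZnO", "dopants": ["Al", "Ga"]}],
)
jsonl_text = to_jsonl([example])
```

Roughly 500 such lines would make up a training file of the size mentioned in the abstract. Keeping the completion as serialized JSON means the same schema can be decoded directly from the fine-tuned model's output.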

Submitted to arXiv on 10 Dec. 2022

Results of the summarizing process for the arXiv paper: 2212.05238v1

In materials science, Huang & Cole previously fine-tuned a BERT model on battery publications to enhance a database of NLP-extracted battery data. Their approach used a question-and-answer (Q/A) method to extract limited device-level information. It had two limitations: it could not be applied to passages containing information about multiple devices, and it required training the BERT language model on a large corpus of battery research papers.

To address these limitations, this work proposes a simple and flexible approach for complex information extraction from scientific text. The method fine-tunes a large language model (GPT-3) to perform document-level named entity recognition and relation extraction simultaneously. It can handle complex inter-relations, including hierarchical or list-based information, without enumerating all possible relations or running a preliminary named entity recognition step. The GPT-3 model is trained to accept a text passage, such as a research paper abstract, and generate a precisely formatted summary of the knowledge contained in that passage. The output can take the form of English sentences or structured schemas such as JSON documents.

To use this method, researchers only need to define the desired output structure and annotate around 100-500 text passages accordingly. The resulting fine-tuned model can then extract the desired information from new text and return it in the same structured representation. The approach demonstrates strong performance in both sentence-level and document-level materials information extraction tasks. Importantly, it requires minimal knowledge of how language models work internally, making it accessible even to researchers with little experience in natural language processing (NLP). Additionally, intermediate models can be used to pre-suggest entities for annotation, significantly speeding up the annotation process when constructing large training sets.

While the example tasks presented focus on materials science, the method's generality suggests it is applicable to other domains such as physics or biology. Notably, unlike previous methods that relied on fine-tuning on large domain-specific corpora, this approach leverages the comprehensive pretraining of language models along with user-provided annotations to accomplish a wide range of complex tasks. The authors also provide details on their contributions to the development and evaluation of the information extraction method.
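
Because the fine-tuned model returns its answer as generated text, a downstream step usually has to decode and check that text before it goes into a database. The following is a minimal sketch of such a step, assuming the completion is a JSON list of records followed by a stop marker; the marker, the required keys, and the function name are hypothetical and not taken from the paper.

```python
import json

STOP = "\n\nEND"                     # hypothetical stop marker
REQUIRED_KEYS = {"host", "dopants"}  # hypothetical schema for the dopant task


def parse_completion(text: str) -> list[dict]:
    """Decode a model completion into records, dropping anything malformed."""
    # Keep only the text generated before the stop marker.
    body = text.split(STOP, 1)[0].strip()
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return []  # the completion was not valid JSON
    if not isinstance(data, list):
        return []
    # Keep only dict records that contain every required key.
    return [r for r in data if isinstance(r, dict) and REQUIRED_KEYS.issubset(r)]
```

Returning an empty list on malformed output, rather than raising, lets a batch job skip bad completions and flag those passages for manual review. The same parser could also turn an intermediate model's output into pre-suggested annotations for a human to correct.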
Created on 08 Aug. 2023

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.