Structured information extraction from complex scientific text with fine-tuned large language models
AI-generated Key Points
- Previous work by Huang & Cole fine-tuned a BERT model on battery publications to enhance a database of NLP-extracted battery data.
- Their approach used a question-answering (Q/A) method for limited device-level information extraction.
- Their approach had two limitations: it could not handle passages describing multiple devices, and it required training the BERT language model on a large corpus of battery research papers.
- This work proposes a simple and flexible approach for complex information extraction in scientific text.
- The method involves fine-tuning the GPT-3 language model to perform document-level named entity recognition and relation extraction simultaneously.
- The GPT-3 model generates precisely formatted summaries or structured schemas from text passages like research paper abstracts.
- Researchers only need to define the desired output structure and annotate around 100-500 text passages to use this method.
- The resulting fine-tuned model accurately extracts desired information from text in the same structured representation.
- This approach demonstrates strong performance in both sentence-level and document-level materials information extraction tasks.
- It requires minimal knowledge of how language models work internally, making it accessible even for researchers with little experience in NLP.
- Intermediate models can be used to pre-suggest entities for annotation, speeding up the annotation process for constructing large training sets.
- This method's generality suggests its applicability to other domains such as physics or biology.
- Unlike previous methods relying on fine-tuning on large domain-specific corpora, this approach leverages comprehensive pretraining of language models along with user-provided annotations.
Authors: Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain
Abstract: Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor, particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
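The prompt/completion pairing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The separator and stop strings, the `host`/`dopants` field names, the helper names, and the example sentence are all assumptions made up for illustration. The sketch shows how one annotated passage might be turned into a training record whose completion is a list of JSON objects, and how a model completion in that format might be parsed back into structured records.

```python
import json

# Hypothetical markers (assumptions, not taken from the paper): a separator
# that ends every prompt and a stop word that ends every completion.
PROMPT_SUFFIX = "\n\n###\n\n"
STOP_WORD = "END"


def make_training_record(passage, records):
    """Build one prompt/completion pair from a passage and its annotated records."""
    return {
        "prompt": passage.strip() + PROMPT_SUFFIX,
        "completion": " " + json.dumps(records) + "\n\n" + STOP_WORD,
    }


def parse_completion(completion):
    """Parse a completion into a list of records; return [] if it is malformed."""
    text = completion.strip()
    if text.endswith(STOP_WORD):
        text = text[: -len(STOP_WORD)].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


# Made-up example sentence and annotation for the dopant/host linking task.
passage = "ZnO nanowires doped with Al were synthesized for transparent electrodes."
annotation = [{"host": "ZnO", "dopants": ["Al"]}]

record = make_training_record(passage, annotation)
jsonl_line = json.dumps(record)  # one line of a JSONL training file
roundtrip = parse_completion(record["completion"])
```

Writing one such `jsonl_line` per annotated passage would give a training file of the size the abstract mentions (roughly 500 pairs). Returning an empty list on malformed output is only one possible design choice. A real pipeline might instead log or retry such completions.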