Finding Stakeholder-Material Information from 10-K Reports using Fine-Tuned BERT and LSTM Models

AI-generated keywords: Stakeholder-material information

AI-generated Key Points

  • The challenge of efficiently identifying and extracting material information from annual 10-K reports
  • Proposal to use fine-tuned BERT models and RNN models with LSTM layers for identification of stakeholder-material information
  • Stakeholder-material information refers to insights into a company's influence on stakeholders
  • Existing practice involves keyword search, but the author's approach leverages machine learning techniques
  • Training and evaluation of models using expert-labeled training data from 62 10-K reports in 2022
  • Best-performing model achieved an accuracy of 0.904 and an F1 score of 0.899, outperforming the baseline model
  • Replication of work on more granular taxonomies focusing on customers, investors, employees, and community/natural environment stakeholders
  • Fine-tuned BERT models outperformed LSTM models and the baseline in this replication as well
  • Practical applications in industries where analyzing large volumes of textual data is necessary for decision-making processes
  • Future extensions could include further fine-tuning with larger and more diverse datasets, incorporating other NLP techniques, or exploring different architectures

Authors: Victor Zitian Chen

License: CC BY 4.0

Abstract: All public companies are required by federal securities law to disclose their business and financial activities in their annual 10-K reports. Each report typically spans hundreds of pages, making it difficult for human readers to identify and extract the material information efficiently. To solve the problem, I have fine-tuned BERT models and RNN models with LSTM layers to identify stakeholder-material information, defined as statements that carry information about a company's influence on its stakeholders, including customers, employees, investors, and the community and natural environment. The existing practice uses keyword search to identify such information, which is my baseline model. Using business expert-labeled training data of nearly 6,000 sentences from 62 10-K reports published in 2022, the best model has achieved an accuracy of 0.904 and an F1 score of 0.899 in test data, significantly above the baseline model's 0.781 and 0.749 respectively. Furthermore, the same work was replicated on more granular taxonomies, based on which four distinct groups of stakeholders (i.e., customers, investors, employees, and the community and natural environment) are tested separately. Similarly, fine-tuned BERT models outperformed LSTM and the baseline. The implications for industry application and ideas for future extensions are discussed.
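
The keyword-search baseline mentioned in the abstract can be pictured with a minimal Python sketch. The keyword lists, the dictionary keys used as group names, and the helper functions `matched_groups` and `is_stakeholder_material` are illustrative assumptions, not the author's lexicon or code; the abstract does not say which keywords the baseline used.

```python
import re

# Hypothetical keyword lists, one per stakeholder group named in the abstract.
# These words are assumptions chosen for illustration only.
STAKEHOLDER_KEYWORDS = {
    "customers": ["customer", "client", "consumer"],
    "employees": ["employee", "workforce", "staff"],
    "investors": ["investor", "shareholder", "stockholder"],
    "community_environment": ["community", "environment", "emission"],
}


def _compile(keywords):
    # Match each keyword at the start of a word and allow a suffix,
    # so "customer" also matches "customers" and "environment"
    # also matches "environmental".
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\w*"
    return re.compile(pattern, re.IGNORECASE)


PATTERNS = {group: _compile(words) for group, words in STAKEHOLDER_KEYWORDS.items()}


def matched_groups(sentence):
    """Return the stakeholder groups whose keywords appear in the sentence."""
    return [group for group, pattern in PATTERNS.items() if pattern.search(sentence)]


def is_stakeholder_material(sentence):
    """Binary baseline: flag a sentence if any stakeholder keyword matches."""
    return bool(matched_groups(sentence))


# Made-up example sentences, not taken from any 10-K report.
examples = [
    "Our employees and customers are central to our strategy.",
    "We aim to lower emissions across our facilities.",
    "The building is located in Ohio.",
]
for text in examples:
    print(is_stakeholder_material(text), matched_groups(text))
```

A rule like this labels a sentence from surface vocabulary alone, which is one plausible reason a learned classifier could score higher on expert-labeled sentences.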

Submitted to arXiv on 15 Aug. 2023

Results of the summarizing process for the arXiv paper: 2308.07522v1

In this study, the author addresses the challenge of efficiently identifying and extracting material information from annual 10-K reports, which federal securities law requires of all public companies. These reports can be hundreds of pages long, making it difficult for human readers to find the relevant information.
To tackle this problem, the author fine-tunes BERT models and RNN models with LSTM layers to identify stakeholder-material information, defined as statements that carry information about a company's influence on its stakeholders, including customers, employees, investors, and the community and natural environment. The existing practice for identifying such information is keyword search, which serves as the baseline model that the machine learning models are compared against.
To train and evaluate the models, the author used business expert-labeled data consisting of nearly 6,000 sentences from 62 10-K reports published in 2022. The best-performing model achieved an accuracy of 0.904 and an F1 score of 0.899 on test data, significantly above the baseline model (accuracy: 0.781; F1 score: 0.749). The study also replicated the same work on more granular taxonomies, testing four distinct groups of stakeholders separately: customers, investors, employees, and the community and natural environment. Once again, fine-tuned BERT models outperformed both the LSTM models and the baseline.
The implications of this research extend beyond academia to industries where analyzing large volumes of text is part of decision-making. The findings suggest that fine-tuned BERT models can improve both the efficiency and the accuracy of extracting stakeholder-material information from annual reports.
In terms of future extensions to this work, there are several potential avenues for exploration. For instance, the models could be further fine-tuned using larger and more diverse datasets to improve their performance. Additionally, incorporating other advanced natural language processing techniques or exploring different architectures could lead to even better results.
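
The accuracy and F1 figures reported in the summary can be related to confusion-matrix counts with a short Python sketch. It assumes binary labels (1 for stakeholder-material, 0 otherwise) and the F1 score of the positive class; the summary does not state which F1 averaging the paper used, and the function names and toy labels below are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of sentences whose predicted label equals the expert label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate_per_group(labels_by_group):
    """Score each stakeholder group separately, as in the granular replication.

    labels_by_group maps a group name to a (y_true, y_pred) pair.
    """
    return {
        group: (accuracy(y_true, y_pred), f1_score(y_true, y_pred))
        for group, (y_true, y_pred) in labels_by_group.items()
    }


# Toy labels invented for illustration; they are not results from the paper.
toy = {
    "customers": ([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0]),
    "employees": ([1, 0, 0, 1], [1, 0, 0, 1]),
}
for group, (acc, f1) in evaluate_per_group(toy).items():
    print(f"{group}: accuracy={acc:.3f}, F1={f1:.3f}")
```

Scoring each group on its own pair of label lists mirrors the idea of testing the four stakeholder groups separately, so a model and the keyword baseline could be compared group by group.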
Created on 11 Sep. 2023
