In this study, the author addresses the challenge of efficiently identifying and extracting material information from annual 10-K reports, which federal securities law requires of all public companies. These reports can be hundreds of pages long, making it difficult for human readers to find the relevant information. To tackle this problem, the author proposes using fine-tuned BERT models and RNN models with LSTM layers to identify stakeholder-material information: statements that provide insight into a company's influence on its stakeholders, including customers, employees, investors, and the community and natural environment. The existing practice for identifying such information is keyword search, which serves as the baseline model that the machine learning approach aims to improve upon.
To train and evaluate the models, the author used training data labeled by business experts, consisting of approximately 6,000 sentences extracted from 62 10-K reports published in 2022. The best-performing model achieved an accuracy of 0.904 and an F1 score of 0.899 on test data, well above the baseline model (accuracy: 0.781; F1 score: 0.749). The study also replicated the work on a more granular taxonomy covering four distinct groups of stakeholders: customers, investors, employees, and the community/natural environment. Once again, fine-tuned BERT models outperformed both the LSTM models and the baseline.
The implications of this research extend beyond academia to industries where analyzing large volumes of textual data is necessary for decision-making. The findings suggest that fine-tuned BERT models can greatly improve the efficiency and accuracy of extracting stakeholder-material information from annual reports.
In terms of future extensions to this work, there are several potential avenues for exploration. For instance, the models could be further fine-tuned using larger and more diverse datasets to improve their performance. Additionally, incorporating other advanced natural language processing techniques or exploring different architectures could lead to even better results.
- The challenge of efficiently identifying and extracting material information from annual 10-K reports
- Proposal to use fine-tuned BERT models and RNN models with LSTM layers for identification of stakeholder-material information
- Stakeholder-material information refers to insights into a company's influence on stakeholders
- Existing practice involves keyword search, but the author's approach leverages machine learning techniques
- Training and evaluation of models using expert-labeled training data from 62 10-K reports in 2022
- Best-performing model achieved an accuracy of 0.904 and an F1 score of 0.899, outperforming the baseline model
- Replication of work on more granular taxonomies focusing on customers, investors, employees, and community/natural environment stakeholders
- Fine-tuned BERT models outperformed LSTM models and the baseline in this replication as well
- Practical applications in industries where analyzing large volumes of textual data is necessary for decision-making processes
- Future extensions could include further fine-tuning with larger and more diverse datasets, incorporating other NLP techniques, or exploring different architectures
Summary
- The challenge is to find important information from annual reports.
- They want to use special models to find this information.
- Stakeholder-material information means knowing how a company affects people.
- They used machine learning to train the models.
- The best model did a good job finding the information.
Definitions
- Annual 10-K reports: These are documents that companies make every year to show how they are doing.
- Fine-tuned BERT models and RNN models with LSTM layers: These are special computer programs that can help find important information in the reports.
- Stakeholder-material information: This means understanding how a company affects different groups of people who care about it.
- Machine learning techniques: This is when computers learn things on their own by looking at lots of examples.
- Training data: This is information that is used to teach the computer program what to look for in the reports.
- Accuracy and F1 score: These are ways to measure how well the computer program did at finding the information. A higher number means it did better.
- Baseline model: This is the simpler existing method (here, keyword search) that they compared their new models to.
- Granular taxonomies: This means looking at more specific categories or groups of people, like customers, investors, employees, and community/natural environment stakeholders.
- NLP techniques: This stands for natural language processing, which is when computers understand and work with human language.
Efficiently Extracting Stakeholder-Material Information from 10-K Reports Using Machine Learning
In the modern business world, public companies are required by federal securities law to file annual 10-K reports. These reports can be hundreds of pages long and contain a wealth of information about a company's operations and performance. However, it is often difficult for human readers to sift through all this data and identify the relevant stakeholder-material information. This type of information refers to statements that provide insights into a company's influence on its stakeholders, including customers, employees, investors, and the community and natural environment.
To address this challenge, researchers have proposed using machine learning techniques such as fine-tuned BERT models and RNN models with LSTM layers to efficiently identify stakeholder-material information from 10-K reports. In this study, we evaluate these approaches by training them on business expert-labeled training data consisting of approximately 6,000 sentences extracted from 62 10-K reports published in 2022. We compare our results against those obtained by using keyword search as a baseline model.
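To make the keyword-search baseline concrete, the following is a minimal sketch in Python of how such a baseline might label a sentence. The keyword list, the function name `keyword_baseline`, and the example sentences are hypothetical illustrations; the study's actual keyword set and matching rules are not described here.

```python
import re

# Hypothetical keyword list for illustration only; not the study's actual keywords.
STAKEHOLDER_KEYWORDS = [
    "customer", "employee", "investor", "shareholder", "community", "environment",
]

# Match a keyword at the start of a word, so "customers" and "environmental" also match.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, STAKEHOLDER_KEYWORDS)) + r")",
    re.IGNORECASE,
)


def keyword_baseline(sentence: str) -> int:
    """Return 1 if the sentence contains any keyword, otherwise 0."""
    return 1 if _PATTERN.search(sentence) else 0


# Made-up example sentences, not taken from any 10-K report.
sentences = [
    "We reduced workplace injuries among our employees by improving safety training.",
    "Net revenue is recognized when control of the goods transfers.",
]
predictions = [keyword_baseline(s) for s in sentences]
print(predictions)
```

A rule like this is cheap to run but ignores context: a sentence can mention a stakeholder without being material, or be material without using any listed keyword, which is the gap a learned classifier is meant to close.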
Results
The best-performing model achieved an accuracy of 0.904 and an F1 score of 0.899 on test data, well above the keyword-search baseline (accuracy: 0.781; F1 score: 0.749). When we replicated the work on a more granular taxonomy covering four distinct stakeholder groups (customers, investors, employees, and the community/natural environment), fine-tuned BERT models again outperformed both the LSTM models and the baseline. These results indicate that fine-tuned BERT models can extract stakeholder-material information from annual reports quickly and accurately at scale, without the manual review or keyword searches that are time-consuming and error-prone given the volume of text in these documents.
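For readers unfamiliar with the two reported metrics, here is a small sketch in Python of how accuracy and F1 can be computed from true and predicted labels. It assumes a binary task (1 = stakeholder-material, 0 = not) and the F1 of the positive class; the source does not state which F1 variant was used, and the labels below are made up for illustration.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and positive-class F1 for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Made-up labels for eight sentences (not the study's data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, f1)
```

Accuracy counts all correct predictions, while F1 balances missed material sentences (recall) against false alarms (precision), which is why reporting both gives a fuller picture than either alone.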
Implications
The findings suggest that fine-tuned BERT models can greatly improve the efficiency and accuracy of extracting stakeholder-material information from annual report filings. This makes it easier for businesses and other organizations to make informed decisions from such documents without reading each one page by page or relying on keyword searches, which can miss relevant statements because of spelling variations or missing context. The implications extend beyond academia to industries where analyzing large volumes of textual data is necessary for decision-making.
Future Extensions
There are several potential avenues for extending this research. For instance, one could further fine-tune the existing models on larger and more diverse datasets to improve their performance. Additionally, incorporating other advanced natural language processing techniques or exploring different architectures could lead to even better results.