Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

AI-generated keywords: Text document similarity, Term weighting, TF-IDF, Synonym recognition, Kazakh language

AI-generated Key Points

  • Determining text document similarity is important in various fields such as Information Retrieval, Text Mining, NLP, and Computational Linguistics.
  • Converting text into numeric vectors involves steps such as tokenization, stopword filtering, stemming, and term weighting.
  • Term frequency - inverse document frequency (TF-IDF) is a commonly used technique for term weighting to find relevant documents.
  • An extension of TF-IDF that considers synonyms is proposed to improve text document similarity measurement for the Kazakh language.
  • The proposed method is evaluated using functions like Cosine, Dice, and Jaccard to quantify similarity between Kazakh text documents.
  • Previous research introduced methods such as the Synonyms-Depending Term weighting scheme (SBT) and synonym recognition within documents using a matcher based on the TF/IDF measure.
  • The modified TF-IDF method incorporating synonyms aims to provide more accurate results when measuring document similarity in Kazakh text analysis.
  • This study contributes to advancing methods by integrating synonym information into traditional TF-IDF calculations for languages like Kazakh.

Authors: Bakhyt Bakiyev

2022 International Conference on Smart Information Systems and Technologies (SIST)
License: CC BY 4.0

Abstract: The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Transferring data to numeric vectors is a complex task where algorithms such as tokenization, stopword filtering, stemming, and weighting of terms are used. The term frequency - inverse document frequency (TF-IDF) is the most widely used term weighting method to facilitate the search for relevant documents. To improve the weighting of terms, a large number of TF-IDF extensions are made. In this paper, another extension of the TF-IDF method is proposed where synonyms are taken into account. The effectiveness of the method is confirmed by experiments on functions such as Cosine, Dice and Jaccard to measure the similarity of text documents for the Kazakh language.
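For reference, the quantities named in the abstract can be written out as follows. These are the common textbook forms and are an assumption here: this page does not state which TF-IDF variant or which forms of the Dice and Jaccard functions the paper uses. The symbols are illustrative: f_{t,d} is the count of term t in document d, N is the number of documents, n_t is the number of documents containing t, and x, y are the weight vectors of two documents.

```latex
\begin{align*}
\mathrm{tf}(t,d) &= \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \\
\mathrm{idf}(t) &= \log\frac{N}{n_t} \\
w_{t,d} &= \mathrm{tf}(t,d)\cdot\mathrm{idf}(t) \\[6pt]
\mathrm{Cosine}(x,y) &= \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \\
\mathrm{Dice}(x,y) &= \frac{2\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2} \\
\mathrm{Jaccard}(x,y) &= \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}
\end{align*}
```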

Submitted to arXiv on 22 Nov. 2022

Results of the summarizing process for the arXiv paper: 2211.12364v1

Determining the similarity of text documents has received significant attention in fields such as Information Retrieval, Text Mining, Natural Language Processing (NLP), and Computational Linguistics. Converting text into numeric vectors involves steps such as tokenization, stopword filtering, stemming, and term weighting. Among term weighting methods, term frequency - inverse document frequency (TF-IDF) is the most commonly used to aid the search for relevant documents, and numerous extensions have been developed to improve its accuracy.

This study proposes a further extension of TF-IDF that takes synonyms into account when determining the similarity of text documents, with the aim of improving similarity measurement for the Kazakh language. The method is evaluated in experiments that use the Cosine, Dice, and Jaccard functions to quantify the similarity between Kazakh text documents.

In earlier work, Kumar et al. weighted terms based on synonyms for biomedical purposes. They introduced a Synonyms-Depending Term weighting scheme (SBT) that adjusts the Inverse Document Frequency (IDF) according to the cluster of synonyms associated with each term. Gulic et al. recognized synonyms within documents and replaced them with general terms, using a matcher that incorporates the TF/IDF measure.

The proposed method builds on this research by incorporating synonyms into the TF-IDF framework for Kazakh text documents, and its performance is compared with existing techniques. Overall, the study contributes to methods for determining text document similarity by integrating synonym information into traditional TF-IDF calculations, which is particularly beneficial for languages like Kazakh where synonyms play a crucial role in understanding textual content.
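The summary does not spell out how synonyms enter the weighting, so the following is only a minimal sketch of one plausible reading, not the paper's actual algorithm. It assumes that each synonym is mapped to a canonical representative before counting, so that a synonym group shares one TF-IDF weight. The synonym table, the function names, and the `log(N / n_t) + 1` smoothing of IDF are all assumptions made for illustration; documents are assumed to be already tokenized.

```python
import math
from collections import Counter

# Hypothetical synonym table (illustrative only, not taken from the paper):
# each synonym maps to a canonical representative of its group.
SYNONYM_TO_CANONICAL = {
    "зор": "үлкен",   # both roughly "big"
    "сұлу": "әдемі",  # both roughly "beautiful"
}

def normalize(tokens):
    """Replace every token by the canonical member of its synonym group."""
    return [SYNONYM_TO_CANONICAL.get(t, t) for t in tokens]

def tfidf_vectors(docs, use_synonyms=True):
    """Return one {term: weight} dict per tokenized document."""
    docs = [normalize(d) if use_synonyms else list(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d)
        vectors.append({
            # Assumed smoothing: the +1 keeps terms found in every document
            # from receiving a zero weight.
            t: (c / total) * (math.log(n_docs / df[t]) + 1.0)
            for t, c in counts.items()
        })
    return vectors

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0

if __name__ == "__main__":
    docs = [["үлкен", "үй", "әдемі"], ["зор", "үй", "сұлу"]]
    for flag in (False, True):
        a, b = tfidf_vectors(docs, use_synonyms=flag)
        print(flag, cosine(a, b), dice(a, b), jaccard(a, b))
```

In this toy pair the two documents differ only by synonyms, so all three scores rise once the synonym mapping is enabled. Mapping to a canonical term is the simplest design; a scheme that instead adjusts IDF per synonym cluster, as described for SBT above, would need a different weighting step.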
Created on 16 Apr. 2024
