Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

AI-generated keywords: Big Data Analytics, Apache Spark, Hadoop MapReduce, Benchmarking Analysis, Classification Tasks

AI-generated Key Points

  • The study by Taha Tekdogan and Ali Cakmak explores the evolution of Big Data analytics tools, focusing on Apache Spark and Hadoop MapReduce.
  • "Big Data Mining" involves using data mining techniques to extract insights from vast amounts of unstructured data.
  • Challenges arise in managing the shift from small, structured data to large volumes of unstructured and rapidly changing data.
  • Apache Spark trains models five times faster than Hadoop MapReduce, but its performance degrades as the input workload grows.
  • The machine learning utility of Hadoop MapReduce tends to score around 3% higher in accuracy than Spark's, even on small datasets, but MapReduce does not gain from scaling the environment the way Spark does.
  • The choice of a data management framework should account for task-specific concerns, measured here with metrics specialized for classification.

Authors: Taha Tekdogan, Ali Cakmak

ICCBDC 2021. Association for Computing Machinery, New York, NY, USA, pages 15-20 (2021)
2021 5th International Conference on Cloud and Big Data Computing (ICCBDC 2021)
License: CC BY 4.0

Abstract: Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from Big Data led to the term Big Data Mining. Shifting the scope of data from small-size, structured, and stable data to huge-volume, unstructured, and quickly changing data brings many data management challenges. Different tools cope with these challenges in their own way due to their architectural limitations. There are numerous parameters to take into consideration when choosing the right data management framework based on the task at hand. In this paper, we present a comprehensive benchmark for two widely used Big Data analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data mining task, i.e., classification. We employ several evaluation metrics to compare the performance of the benchmarked frameworks, such as execution time, accuracy, and scalability. These metrics are specialized to measure performance on the classification task. To the best of our knowledge, there is no previous study in the literature that employs all these metrics while taking into consideration task-specific concerns. We show that Spark is 5 times faster than MapReduce in training the model. Nevertheless, the performance of Spark degrades when the input workload gets larger. Scaling the environment by additional clusters significantly improves the performance of Spark. However, a similar enhancement is not observed in Hadoop. The machine learning utility of MapReduce tends to have better accuracy scores than that of Spark, by around 3%, even on small data sets.

Submitted to arXiv on 21 Sep. 2022

Results of the summarizing process for the arXiv paper: 2209.10637v1

In "Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification", Taha Tekdogan and Ali Cakmak examine how Big Data analytics tools have evolved to handle vast amounts of unstructured data. Applying data mining techniques to extract useful information from such data gave rise to the term "Big Data Mining". The shift from small, structured, and stable data to large volumes of unstructured and rapidly changing data creates many data management challenges, and different tools cope with them in their own ways because of their architectural limitations.

The study benchmarks two widely used tools, Apache Spark and Hadoop MapReduce, on a common data mining task: classification. It compares them on execution time, accuracy, and scalability, with each metric specialized for classification. Spark trains the model five times faster than MapReduce. Its performance degrades as the input workload grows, but scaling the environment with additional clusters improves it significantly, whereas Hadoop shows no similar improvement when scaled. The machine learning utility of MapReduce, in turn, tends to score around 3% higher in accuracy than Spark's, even on small data sets.

The authors stress that task-specific concerns should guide the choice of a data management framework, and they state that, to the best of their knowledge, no previous study employs all of these metrics together.
Created on 27 Feb. 2025
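
The three metrics the study names (execution time, accuracy, scalability) can be made concrete with a small sketch. The Python below is illustrative only: the `Run` record, the function names, and every number are assumptions made up for this example, not the authors' code or measurements. Scaling efficiency is one common way to quantify scalability and is not necessarily the definition used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run of a classifier (hypothetical record, not from the paper)."""
    framework: str
    nodes: int            # number of worker nodes used
    train_seconds: float  # wall-clock time to train the model
    correct: int          # correctly classified test examples
    total: int            # total test examples


def accuracy(run: Run) -> float:
    """Fraction of test examples classified correctly."""
    return run.correct / run.total


def speedup(baseline: Run, candidate: Run) -> float:
    """How many times faster the candidate trains than the baseline."""
    return baseline.train_seconds / candidate.train_seconds


def scaling_efficiency(small: Run, large: Run) -> float:
    """Observed speedup from adding nodes, divided by ideal linear speedup."""
    observed = small.train_seconds / large.train_seconds
    ideal = large.nodes / small.nodes
    return observed / ideal


# Placeholder values chosen only to exercise the functions; NOT results from the paper.
mapreduce_run = Run("mapreduce", nodes=4, train_seconds=500.0, correct=930, total=1000)
spark_run = Run("spark", nodes=4, train_seconds=100.0, correct=900, total=1000)
spark_scaled = Run("spark", nodes=8, train_seconds=62.5, correct=900, total=1000)

print("speedup:", speedup(mapreduce_run, spark_run))
print("accuracy gap:", round(accuracy(mapreduce_run) - accuracy(spark_run), 4))
print("scaling efficiency:", round(scaling_efficiency(spark_run, spark_scaled), 4))
```

A scaling efficiency near 1.0 would mean that doubling the nodes roughly halves training time. Values well below 1.0 indicate that the added hardware is not translating into proportionally faster training.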
