Review Article | Peer-Reviewed

A Comparative Study of Parallel Processing, Distributed Storage Techniques, and Technologies: A Survey on Big Data Analytics

Received: 29 September 2024     Accepted: 14 October 2024     Published: 12 November 2024
Abstract

The significance of developing Big Data applications has increased in recent years, with numerous organizations across various industries relying more on insights derived from vast amounts of data. However, conventional data techniques and platforms struggle to cope with Big Data, exhibiting sluggish responsiveness and deficiencies in scalability, performance, and accuracy. In response to the intricate challenges posed by Big Data, considerable efforts have been made, leading to the creation of a range of platforms and technologies. This article addresses the critical need for efficient processing and storage solutions in the context of the ever-growing field of big data. It offers a comparative analysis of various parallel processing techniques and distributed storage frameworks, emphasizing their importance in big data analytics. Our study begins with definitions of key concepts, clarifying the roles and interconnections of parallel processing and distributed storage. It further evaluates a range of architectures and technologies, such as MapReduce, CUDA, Storm, Flink, MooseFS, and BeeGFS, among other technologies, discussing their advantages and limitations in managing large-scale datasets. Key performance metrics are also examined, providing a comprehensive understanding of their effectiveness in big data scenarios.
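To make the MapReduce paradigm surveyed above concrete, the following is a minimal, single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It is illustrative only and not taken from the article; real frameworks such as Hadoop MapReduce or Flink distribute these phases across a cluster, and the function names here are the author's own.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Shuffle: group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate (here, sum) the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big storage", "data storage data"]
mapped = list(chain.from_iterable(map_phase(d) for d in documents))
result = reduce_phase(shuffle(mapped))
# result == {'big': 2, 'data': 3, 'storage': 2}
```

In a distributed setting, map tasks run in parallel on partitions of the input, and the shuffle step moves each key's values over the network to the node running its reducer; this sketch collapses all of that into one process.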

Published in International Journal of Data Science and Analysis (Volume 10, Issue 5)
DOI 10.11648/j.ijdsa.20241005.11
Page(s) 86-99
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Parallel Processing Frameworks, Distributed Storage Frameworks, MapReduce, CUDA, Storm, Flink, MooseFS, BeeGFS

Cite This Article
  • APA Style

    Mezzoudj, S., Khelifa, M., & Saadna, Y. (2024). A Comparative Study of Parallel Processing, Distributed Storage Techniques, and Technologies: A Survey on Big Data Analytics. International Journal of Data Science and Analysis, 10(5), 86-99. https://doi.org/10.11648/j.ijdsa.20241005.11

