The significance of developing Big Data applications has increased in recent years, with numerous organizations across various industries relying increasingly on insights derived from vast amounts of data. However, conventional data techniques and platforms struggle to cope with Big Data, exhibiting sluggish responsiveness and deficiencies in scalability, performance, and accuracy. In response to the intricate challenges posed by Big Data, considerable efforts have been made, leading to the creation of a range of distributions and technologies. This article addresses the critical need for efficient processing and storage solutions in the ever-growing field of Big Data. It offers a comparative analysis of various parallel processing techniques and distributed storage frameworks, emphasizing their importance in Big Data analytics. Our study begins with definitions of key concepts, clarifying the roles and interconnections of parallel processing and distributed storage. It then evaluates a range of architectures and technologies, including MapReduce, CUDA, Storm, Flink, MooseFS, and BeeGFS, discussing their advantages and limitations in managing large-scale datasets. Key performance metrics are also examined, providing a comprehensive understanding of their effectiveness in Big Data scenarios.
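The map and reduce phases that give the MapReduce paradigm its name can be illustrated with a minimal single-process word-count sketch in Python (an illustrative toy, not code from the surveyed frameworks; a real MapReduce engine distributes the map and reduce tasks across a cluster and shuffles intermediate pairs between nodes):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit an intermediate (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group intermediate pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data needs big storage", "data drives decisions"]
print(reduce_phase(map_phase(docs)))
```

Because the map function is applied independently to each document and the reduce function only aggregates per key, both phases parallelize naturally, which is the property the frameworks compared in the article exploit at scale.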
Published in | International Journal of Data Science and Analysis (Volume 10, Issue 5) |
DOI | 10.11648/j.ijdsa.20241005.11 |
Page(s) | 86-99 |
Creative Commons |
This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright |
Copyright © The Author(s), 2024. Published by Science Publishing Group |
Parallel Processing Frameworks, Distributed Storage Frameworks, MapReduce, CUDA, Storm, Flink, MooseFS, BeeGFS
[1] | Souibgui, M., Atigui, F., Zammali, S., Cherfi, S., & Yahia, S. B. (2019). Data quality in ETL process: A preliminary study. Procedia Computer Science, 159, 676-687. |
[2] | Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: A review. Japanese Journal of Statistics and Data Science, 3(2), 625-650. |
[3] | Butenhof, D. R. (1993). Programming with POSIX threads. Addison-Wesley Professional. |
[4] | Shen, J. P., & Lipasti, M. H. (2013). Modern processor design: fundamentals of superscalar processors. Waveland. |
[5] | Culler, D., Singh, J. P., & Gupta, A. (1999). Parallel computer architecture: a hardware/software approach. Gulf Professional Publishing. |
[6] | Castelló, A., Gual, R. M., Seo, S., Balaji, P., Quintana-Orti, E. S., & Pena, A. J. (2020). Analysis of threading libraries for high performance computing. IEEE Transactions on Computers, 69(9), 1279-1292. |
[7] | Silberschatz, A., Galvin, P. B., & Gagne, G. (2012). Operating system concepts. |
[8] | OpenMP, A. R. B. (2013, July). OpenMP application program interface version 4.0. In The OpenMP Forum, Tech. Rep. |
[9] | Nielsen, F., & Nielsen, F. (2016). Introduction to MPI: the message passing interface. Introduction to HPC with MPI for Data Science, 21-62. |
[10] | Sur, S., Koop, M. J., & Panda, D. K. (2006, November). High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing (pp. 105-es). |
[11] | Tuomanen, B. (2018). Hands-On GPU Programming with Python and CUDA: Explore high-performance parallel computing with CUDA. Packt Publishing Ltd. |
[12] | Abi-Chahla, F. (2008). Nvidia’s CUDA: The End of the CPU? Tom’s Hardware. |
[13] | Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. |
[14] | Hashem, I. A. T., Anuar, N. B., Gani, A., Yaqoob, I., Xia, F., & Khan, S. U. (2016). MapReduce: Review and open challenges. Scientometrics, 109, 389-422. |
[15] | Laku, L. I. Y., Mohammed, A. F. Y., Fawaz, A. H., & Youn, C. H. (2019, February). Performance Evaluation of Apache Storm With Writing Scripts. In 2019 21st International Conference on Advanced Communication Technology (ICACT) (pp. 728-733). IEEE. |
[16] | Mundkur, P., Tuulos, V., & Flatow, J. (2011, September). Disco: a computing platform for large-scale data analytics. In Proceedings of the 10th ACM SIGPLAN workshop on Erlang (pp. 84-89). |
[17] | Wu, H., & Fu, M. (2021). Heron Streaming: Fundamentals, Applications, Operations, and Insights. Springer Nature. |
[18] | Gürcan, F., & Berigel, M. (2018, October). Real-time processing of big data streams: Lifecycle, tools, tasks, and challenges. In 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) (pp. 1-6). IEEE. |
[19] | Friedman, E., & Tzoumas, K. (2016). Introduction to Apache Flink: stream processing for real time and beyond. “O’Reilly Media, Inc." |
[20] | Baker, M. G., Hartman, J. H., Kupfer, M. D., Shirriff, K. W., & Ousterhout, J. K. (1991, September). Measurements of a distributed file system. In Proceedings of the thirteenth ACM symposium on Operating systems principles (pp. 198-212). |
[21] | Jin, L., Zhai, X., Wang, K., Zhang, K., Wu, D., Nazir, A., … & Liao, W. H. (2024). Big data, machine learning, and digital twin assisted additive manufacturing: A review. Materials & Design, 113086. |
[22] | Abramson, D., Jin, C., Luong, J., & Carroll, J. (2020, February). A BeeGFS-based caching file system for data-intensive parallel computing. In Asian Conference on Supercomputing Frontiers (pp. 3-22). Cham: Springer International Publishing. |
[23] | Liu, M. (2024). Key Technology of Distributed Memory File System Based on High-Performance Computer. International Journal of Cooperative Information Systems, 33(02), 2350019. |
[24] | Mezzoudj, S., Behloul, A., Seghir, R., & Saadna, Y. (2021). A parallel content-based image retrieval system using spark and tachyon frameworks. Journal of King Saud University-Computer and Information Sciences, 33(2), 141-149. |
[25] | Saliha, M., Ali, B., & Rachid, S. (2019). Towards large-scale face-based race classification on spark framework. Multimedia Tools and Applications, 78(18), 26729-26746. |
[26] | Mezzoudj, S. (2020). Towards large scale image retrieval system using parallel frameworks. In Multimedia Information Retrieval. IntechOpen. |
[27] | Saadna, Y., Behloul, A., & Mezzoudj, S. (2019). Speed limit sign detection and recognition system using SVM and MNIST datasets. Neural Computing and Applications, 31(9), 5005-5015. |
[28] | Meriem, K., Saliha, M., Amine, F. M., & Khaled, B. M. (2024). Novel Solutions to the Multidimensional Knapsack Problem Using CPLEX: New Results on ORX Benchmarks. Journal of Ubiquitous Computing and Communication Technologies, 6(3), 294-310. |
APA Style
Mezzoudj, S., Khelifa, M., & Saadna, Y. (2024). A Comparative Study of Parallel Processing, Distributed Storage Techniques, and Technologies: A Survey on Big Data Analytics. International Journal of Data Science and Analysis, 10(5), 86-99. https://doi.org/10.11648/j.ijdsa.20241005.11
ACS Style
Mezzoudj, S.; Khelifa, M.; Saadna, Y. A Comparative Study of Parallel Processing, Distributed Storage Techniques, and Technologies: A Survey on Big Data Analytics. Int. J. Data Sci. Anal. 2024, 10(5), 86-99. doi: 10.11648/j.ijdsa.20241005.11
@article{10.11648/j.ijdsa.20241005.11, author = {Saliha Mezzoudj and Meriem Khelifa and Yasmina Saadna}, title = {A Comparative Study of Parallel Processing, Distributed Storage Techniques, and Technologies: A Survey on Big Data Analytics}, journal = {International Journal of Data Science and Analysis}, volume = {10}, number = {5}, pages = {86-99}, doi = {10.11648/j.ijdsa.20241005.11}, url = {https://doi.org/10.11648/j.ijdsa.20241005.11}, eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdsa.20241005.11}, abstract = {The significance of developing Big Data applications has increased in recent years, with numerous organizations across various industries relying increasingly on insights derived from vast amounts of data. However, conventional data techniques and platforms struggle to cope with Big Data, exhibiting sluggish responsiveness and deficiencies in scalability, performance, and accuracy. In response to the intricate challenges posed by Big Data, considerable efforts have been made, leading to the creation of a range of distributions and technologies. This article addresses the critical need for efficient processing and storage solutions in the ever-growing field of Big Data. It offers a comparative analysis of various parallel processing techniques and distributed storage frameworks, emphasizing their importance in Big Data analytics. Our study begins with definitions of key concepts, clarifying the roles and interconnections of parallel processing and distributed storage. It then evaluates a range of architectures and technologies, including MapReduce, CUDA, Storm, Flink, MooseFS, and BeeGFS, discussing their advantages and limitations in managing large-scale datasets. Key performance metrics are also examined, providing a comprehensive understanding of their effectiveness in Big Data scenarios.}, year = {2024} }
TY - JOUR T1 - A Comparative Study of Parallel Processing, Distributed Storage Techniques, and Technologies: A Survey on Big Data Analytics AU - Saliha Mezzoudj AU - Meriem Khelifa AU - Yasmina Saadna Y1 - 2024/11/12 PY - 2024 N1 - https://doi.org/10.11648/j.ijdsa.20241005.11 DO - 10.11648/j.ijdsa.20241005.11 T2 - International Journal of Data Science and Analysis JF - International Journal of Data Science and Analysis JO - International Journal of Data Science and Analysis SP - 86 EP - 99 PB - Science Publishing Group SN - 2575-1891 UR - https://doi.org/10.11648/j.ijdsa.20241005.11 AB - The significance of developing Big Data applications has increased in recent years, with numerous organizations across various industries relying increasingly on insights derived from vast amounts of data. However, conventional data techniques and platforms struggle to cope with Big Data, exhibiting sluggish responsiveness and deficiencies in scalability, performance, and accuracy. In response to the intricate challenges posed by Big Data, considerable efforts have been made, leading to the creation of a range of distributions and technologies. This article addresses the critical need for efficient processing and storage solutions in the ever-growing field of Big Data. It offers a comparative analysis of various parallel processing techniques and distributed storage frameworks, emphasizing their importance in Big Data analytics. Our study begins with definitions of key concepts, clarifying the roles and interconnections of parallel processing and distributed storage. It then evaluates a range of architectures and technologies, including MapReduce, CUDA, Storm, Flink, MooseFS, and BeeGFS, discussing their advantages and limitations in managing large-scale datasets. Key performance metrics are also examined, providing a comprehensive understanding of their effectiveness in Big Data scenarios. VL - 10 IS - 5 ER -