Author: Setohy, Ali Ali Ahmed./ Title: Big Data Analysis Using Nature-Inspired Metaheuristics /


Title

Big Data Analysis Using Nature-Inspired Metaheuristics /

Author

Setohy, Ali Ali Ahmed.

Preparation committee

Researcher / Ali Ali Ahmed Setohy

Supervisor / عوني عبدالهادي سيد

Supervisor / محمد مسعود عبدالله

Subject

Algorithms. Computer communication systems. Computer science - Mathematics. Numerical analysis.

Publication date

2021.

Number of pages

98 p.

Language

English

Degree

Master's

Specialization

Computer Science (miscellaneous)

Date of approval

1/1/2021

Place of approval

Minia University - Faculty of Science - Computer Science

Index

Only 14 pages are available for public view


Abstract

Analyzing big data has recently become a very common requirement, and it is difficult to meet when the data come from bulk sources such as social networking sites, which generate a large amount of information every day. Big data analytics systems also draw on data from both internal and external sources, such as forecast data or consumer demographic data compiled by third-party information service providers.
Streaming analytics technologies are also becoming more popular in big data environments, as users look to conduct real-time analysis of data fed into Hadoop systems via stream processing engines such as Spark.
Big data analysis is a complex process of examining big data to uncover knowledge such as hidden patterns, associations, industry dynamics and consumer preferences that can help companies make better business decisions.
Apache Spark is a data processing framework that can quickly execute processing tasks on very large data sets and can also distribute those tasks across several machines, either on its own or in combination with other distributed computing resources. These two characteristics are essential to the worlds of big data and machine learning, which require massive processing resources to work through huge data stores. Spark also takes some of the programming burden of these tasks off developers with an easy-to-use API that abstracts away most of the grunt work of distributed computing and big data analysis.
Twitter is a popular microblogging site that provides social networking services. Analyzing large volumes of tweets from Twitter to find patterns and extract relevant information is a major challenge.
Therefore, this thesis proposes a metaheuristic method based on Particle Swarm Optimization and K-means (PSOK) to cluster sentiment content. The main idea of the proposed method is to find the optimal cluster heads from the streaming tweets.
The proposed method consists of five steps. First, we used Spark to collect tweets containing a specific hashtag from the Twitter live stream. Second, we applied a set of preprocessing operations to the collected tweets to reduce noise such as unwanted and ambiguous words, URLs and multiple spaces. Third, the preprocessed tweets were converted into feature vectors. Fourth, the initial population was generated by applying the K-means algorithm N times to produce the initial solutions (S). Finally, we applied the particle swarm optimization algorithm to the generated population to obtain the best solution.
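
The steps after tweet collection can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation: the cleaning rules, the bag-of-words features, the sum-of-squared-errors fitness, the PSO parameters (`w`, `c1`, `c2`) and every function name below are assumptions made for illustration, and the Spark streaming step is replaced by a plain list of strings.

```python
import random
import re


def clean_tweet(text):
    """Remove URLs, non-letter characters and repeated spaces (assumed rules)."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def vectorize(tweets):
    """Bag-of-words counts over the vocabulary of the cleaned tweets (assumed features)."""
    docs = [clean_tweet(t).split() for t in tweets]
    vocab = sorted({w for d in docs for w in d})
    return [[float(d.count(w)) for w in vocab] for d in docs]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points, centroids):
    """Assumed fitness: sum of squared distances to the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations starting from k randomly chosen points."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def psok(points, k, n_particles=8, steps=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """K-means-seeded PSO over flattened centroid sets; returns (centroids, fitness)."""
    rng = random.Random(seed)
    dim = len(points[0])

    def unflatten(s):
        return [s[i:i + dim] for i in range(0, len(s), dim)]

    def fit(s):
        return sse(points, unflatten(s))

    # Initial population: one K-means run per particle.
    swarm = [[x for c in kmeans(points, k, rng) for x in c]
             for _ in range(n_particles)]
    vel = [[0.0] * len(s) for s in swarm]
    pbest = [s[:] for s in swarm]
    pbest_fit = [fit(s) for s in swarm]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(steps):
        for i, s in enumerate(swarm):
            for d in range(len(s)):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - s[d])
                             + c2 * r2 * (gbest[d] - s[d]))
                s[d] += vel[i][d]
            f = fit(s)
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = s[:], f
                if f < gbest_fit:
                    gbest, gbest_fit = s[:], f
    return unflatten(gbest), gbest_fit
```

Calling `psok(vectorize(tweets), k=2)` on a list of tweet strings returns the best centroid set found and its fitness. Because the swarm is seeded with K-means solutions and the global best is only replaced by a strictly better candidate, the returned fitness is never worse than the best K-means run in the initial population.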
The efficiency of the proposed method was tested on four different Twitter datasets, and the results were compared with six existing methods, namely differential evolution (DE), cuckoo search (CS), improved cuckoo search (ICS), particle swarm optimization (PSO), Gauss distribution-based cuckoo search (GCS) and cuckoo search with K-means (CSK).