Abstract Twitter's popularity has made it an important, near-instantaneous source of news and trending events around the world. This popularity has also attracted spammers, who embed malicious content in tweets and in their profile pages. Spammers use diverse and evolving techniques to evade traditional security mechanisms, which creates a need for robust solutions that adapt to these techniques. In this thesis, we explore natural language processing methods for detecting spam from the textual content of tweets. The first model we propose is a character n-gram model, which has the advantage of being robust to spamming techniques that rely on word manipulation. We also explore a set of word embedding models built with popular word embedding techniques. Finally, we study a character embedding model built using deep learning techniques. Using publicly available datasets, we evaluate the performance of multiple machine learning classifiers with the proposed models. Our experiments show that some of our character n-gram models achieve an F-measure of nearly 80%, an improvement over approaches that use classical word n-grams built from tweet tokens. We also show that our technique can detect spam tweets with low latency, which is crucial in a real-time environment like Twitter.
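As an illustration of why character n-grams can be more robust to word manipulation than word tokens, the following minimal sketch compares the two representations on a clean phrase and an obfuscated variant. The n-gram length (3), the cosine similarity measure, the example phrases, and all function names are assumptions chosen for illustration; the abstract does not specify them, and this is not the thesis's actual pipeline.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (n=3 is an illustrative choice)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_tokens(text):
    """Count whitespace-separated word tokens (a simple word unigram model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical example: a spammer swaps letters for digits to evade word filters.
clean = "free money click here"
obfuscated = "fr3e m0ney cl1ck here"

word_sim = cosine(word_tokens(clean), word_tokens(obfuscated))
char_sim = cosine(char_ngrams(clean), char_ngrams(obfuscated))

print(f"word-token similarity:   {word_sim:.3f}")
print(f"char 3-gram similarity:  {char_sim:.3f}")
```

In this toy case only one of the four word tokens survives the obfuscation, while roughly half of the character 3-grams are unchanged, so the character-level representation stays noticeably closer to the original phrase. A classifier trained on such features would then have more shared evidence to work with, under the assumptions stated above.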