Abstract

Social media platforms, online news commenting spaces, and many other public forums have become widely known for issues of abusive behavior such as cyberbullying and online harassment. The offensive content posted to these sites varies in the type of offense and the target to whom it is addressed. The type of offense can be profanity, threat, or identity-based hate, while the target can be an individual, a group of people, or an organization or entity.

In this thesis, different deep learning networks are proposed to predict toxicity types and identify the offense target in users' tweets and comments. The networks' predictions are also combined using weighted soft voting to improve the overall performance of the solution. Additionally, various data augmentation methods are applied to address the imbalanced class distribution present in most offensive language datasets. These methods include back translation, random synonym substitution, and random masking of words in the existing samples. Moreover, to tackle the problem of small dataset size, a pre-trained BERT model is fine-tuned and added to the ensemble of deep learning classifiers.

The proposed solution is evaluated on two different datasets. On the Offensive Language Identification Dataset (OLID), the official dataset of OffensEval 2019, the proposed solution achieved macro-averaged F1-scores of 0.8336, 0.7985, and 0.6963 on sub-tasks A, B, and C respectively. This outperforms the winning solutions on each of the three sub-tasks, which scored 0.829, 0.755, and 0.660 respectively. On Wikipedia's toxic comments dataset, the presented classifier ensemble improved the macro-averaged F1-score of toxic/non-toxic classification and toxicity type prediction from 0.8429 and 0.5800, obtained with the classical NB-SVM baseline, to 0.9127 and 0.6931 respectively.
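The abstract mentions combining the networks' predictions with weighted soft voting. As a rough illustration (the thesis's actual classifiers and weights are not given here; the probability values and weights below are made up for demonstration), soft voting averages the per-class probability vectors of the ensemble members, weighted per classifier, and takes the argmax:

```python
import numpy as np

# Hypothetical class-probability outputs of three classifiers
# for 2 samples over 3 toxicity classes (values are illustrative only).
probs = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],   # classifier 1
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],   # classifier 2
    [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3]],   # classifier 3
])
weights = np.array([0.5, 0.3, 0.2])  # assumed per-classifier weights, summing to 1

# Weighted average of the probability vectors, then argmax per sample.
avg = np.tensordot(weights, probs, axes=1)   # shape: (num_samples, num_classes)
preds = avg.argmax(axis=1)
```

In practice the weights would typically be chosen on a validation set, e.g. proportionally to each classifier's macro F1-score.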
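One of the augmentation methods listed is random masking of words in existing samples. The thesis's exact procedure is not reproduced here; a minimal sketch, assuming each token is independently replaced by a mask symbol with some probability `p`, could look like:

```python
import random

def random_mask(tokens, mask_token="[MASK]", p=0.15, seed=0):
    """Return a copy of `tokens` where each token is independently
    replaced by `mask_token` with probability `p` (illustrative sketch;
    the mask symbol and rate are assumptions, not the thesis's settings)."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < p else tok for tok in tokens]

augmented = random_mask("this post is so toxic".split(), p=0.3)
```

Generating several such masked copies of minority-class samples is one simple way to rebalance a skewed class distribution before training.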