Author: Ali, Mai Ibrahim Hussien Mohammed. / Title: Imbalanced Toxic Comments Classification Using Data Augmentation and Deep Learning

Title

Imbalanced Toxic Comments Classification Using Data Augmentation and Deep Learning

Author

Ali, Mai Ibrahim Hussien Mohammed.

Committee

Researcher / Mai Ibrahim Hussien Mohammed Ali

Supervisor / Nagwa Mostafa El-Makky

nagwamakky@gmail.com

Supervisor / Marwan Abdel Hamid Torki

marwantorki@gmail.com

Examiner / Mohamed Abdel Hamid Ismail Ahmed

drmaismail@gmail.com

Examiner / Saleh Abdel Shakour El-Shehaby

Subject

Computer Engineering.

Publication date

2020.

Number of pages

87 p.

Language

English

Degree

Master's

Specialization

Engineering (miscellaneous)

Date of approval

1/11/2020

Place of approval

Alexandria University - Faculty of Engineering - Computer and Systems Engineering

Index

Only 14 pages are available for public view

Abstract

Social media platforms, online news comment sections, and many other public forums have become widely known for abusive behavior such as cyberbullying and online harassment. The offensive content posted to these sites varies in the type of offense and in the target it addresses. The type of offense can be profanity, a threat, or identity-based hate, while the target can be an individual, a group of people, or an organization or other entity.

In this thesis, different deep learning networks are proposed to predict the toxicity types and identify the offense target in users' tweets and comments. The networks' predictions are also combined using weighted soft voting to improve overall performance. Additionally, various data augmentation methods are applied to address the imbalanced class distribution present in most offensive language datasets. These methods include back translation, random synonym substitution, and random masking of words in the existing samples. Moreover, to tackle the problem of small dataset size, a pre-trained BERT model is fine-tuned and added to the ensemble of deep learning classifiers.

The proposed solution is evaluated on two different datasets. On the Offensive Language Identification Dataset (OLID), the official dataset of OffensEval 2019, the proposed solution achieved macro-averaged F1-scores of 0.8336, 0.7985, and 0.6963 on sub-tasks A, B, and C, respectively. This outperforms the winning solutions on each of the three sub-tasks, which obtained macro-averaged F1-scores of 0.829, 0.755, and 0.660 on sub-tasks A, B, and C, respectively. On Wikipedia's toxic comments dataset, the presented classifier ensemble improved the macro-averaged F1-score of toxic/non-toxic classification and toxicity type prediction from 0.8429 and 0.5800, obtained with the classical NB-SVM, to 0.9127 and 0.6931, respectively.
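The abstract does not give the ensemble weights or the individual network outputs, so the following is only a minimal sketch of weighted soft voting and of the macro-averaged F1-score used for evaluation. The function names, the three example models, their probabilities, and the weights are all hypothetical and chosen for illustration; they are not taken from the thesis.

```python
from typing import List, Sequence


def weighted_soft_vote(
    prob_sets: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> List[int]:
    """Average per-model class probabilities using the given weights and
    return the index of the highest-scoring class for each sample.

    prob_sets[m][i][c] is model m's probability of class c for sample i.
    """
    if not prob_sets or len(prob_sets) != len(weights):
        raise ValueError("need one weight per model")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    predictions = []
    for i in range(len(prob_sets[0])):
        n_classes = len(prob_sets[0][i])
        combined = [
            sum(w * model[i][c] for model, w in zip(prob_sets, weights)) / total
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=combined.__getitem__))
    return predictions


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as
    frequent ones (the reason macro averaging suits imbalanced data)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical outputs of three classifiers on four comments
# (columns: P(not offensive), P(offensive)).
model_a = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]
model_b = [[0.6, 0.4], [0.7, 0.3], [0.4, 0.6], [0.3, 0.7]]
model_c = [[0.8, 0.2], [0.3, 0.7], [0.45, 0.55], [0.6, 0.4]]
example_weights = [0.5, 0.2, 0.3]  # assumed, e.g. tuned on a validation set

ensemble_pred = weighted_soft_vote([model_a, model_b, model_c], example_weights)
print(ensemble_pred)  # [0, 1, 0, 1]
print(macro_f1([0, 1, 1, 1], ensemble_pred))
```

Soft voting averages probabilities rather than hard labels, so a confident model can outweigh two uncertain ones; the weights let stronger classifiers contribute more to the final decision.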