


Language Engineering (هندسة اللغة):
  Paper details
 
Paper number: 9000516
Lexical and Morphological Statistics of an Arabic POS-Tagged Corpus
Research area: NLP for Information Retrieval
  Language Engineering (هندسة اللغة), Issue 1, Volume 1, January 2014
  Hamdy Mubarak (hamdys@sakhr.com)
  Kareem Shaban (kshaban@sakhr.com)
  Forat Adel (forat@sakhr.com)
  Corpus Statistics, Arabic NLP, POS Tagging, Diacritization, MSA
  Part-of-speech (POS) tagging is a basic component of many Natural Language Processing (NLP) applications. Building a manually tagged corpus helps in studying key statistics of a given language, which form the basis for POS tagging systems. In this paper, we present both lexical and morphological statistics for Arabic derived from Sakhr's manually POS-tagged corpus. The corpus covers text (7 million words) from a wide range of Arab countries in different domains over the years 2002-2004. The derived statistics are used as heuristics and preferential rules within a statistical diacritizer that achieves high accuracy in stem diacritization and POS disambiguation. The statistics include information on sentence and word lengths, punctuation marks, and the distribution of Arabic letters and diacritics, as well as lexical and morphological information on POS distribution, stems, prefixes, suffixes, roots, morphological patterns, and morphosyntactic features such as gender, number, person, and case ending. Modern Standard Arabic (MSA) is studied by analyzing the coverage of stems, roots, morphological patterns, prefixes, and suffixes. Comparisons with an arbitrary English corpus are shown where applicable.


 






