Author: El-Barogy, Reda El-Saied Mohamed / Title: An Statistical Approach For Arabic Language Processing

Title

An Statistical Approach For Arabic Language Processing

Author

El-Barogy, Reda El-Saied Mohamed.

Committee

Researcher / رضا السعيد محمد السيد الباروجى

Supervisor / محمد أحمد انور الشهاوى

Supervisor / أحمد عبد الفتاح الحربي

Supervisor / خالد فؤاد شعلان

Examiner / إسماعيل عمرو إسماعيل

Examiner / علاء الدين محمد رياض

Examiner / محمد أحمد أنور الشهاوي

Subject

Natural Language. Arabic Language. Diacritization. Natural Language Processing. Statistical Language Modeling.

Publication date

2006.

Number of pages

126 p.

Language

English

Degree

Master's

Specialization

Mathematics (miscellaneous)

Approval date

1/1/2006

Place of approval

Damietta University - Faculty of Science - Mathematics

Abstract

This thesis presents two automatic systems. The first restores vowels (diacritics) for nondiacritized Qur'an words, using a unigram model and a bigram Hidden Markov Model (HMM). The second identifies the Part Of Speech (POS) tags of Qur'an words using the same two models.

Diacritization is the process of placing special marks above or below the letters of a word in Arabic. Arabic text is either diacritized, as in the Qur'an or children's books, or nondiacritized, as in newspapers, books, and media. Nondiacritized text is ambiguous because a nondiacritized word may have more than one meaning. The first system was designed to carry out the diacritization process. It proved robust and reliable without using morphological analysis methods for diacritics restoration, and HMMs were found to be useful tools for diacritics restoration in Arabic. The techniques used are simple to apply and do not require any language-specific knowledge to be embedded in the model. The Qur'an was used as the corpus; the system was implemented and tested with many parts of the Qur'an as training sets. For instance, on 1366 words starting from the beginning of the Qur'an, the system achieved 94.3% word accuracy with the unigram model and 95.2% word accuracy with the bigram HMM.

The second system performs automatic POS tagging, an area of natural language processing where statistical techniques have been more successful than rule-based methods. POS tagging is the process of assigning a proper POS tag to each word in a text. We designed and implemented a system for tagging Arabic text using a unigram model and a bigram HMM.
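
The unigram model mentioned above for diacritics restoration can be illustrated with a minimal sketch. It assumes, since the abstract does not give details, that each nondiacritized word is mapped to the diacritized form seen most often in training and that unseen words are left unchanged. All function names and the three-word toy corpus are hypothetical and are not taken from the thesis.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Drop combining marks (Unicode category Mn), which covers Arabic diacritics."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def train_unigram(diacritized_words):
    """Count how often each diacritized form occurs for every bare word."""
    counts = defaultdict(Counter)
    for word in diacritized_words:
        counts[strip_diacritics(word)][word] += 1
    return counts


def restore_unigram(model, bare_word):
    """Return the most frequent diacritized form, or the input if unseen."""
    forms = model.get(bare_word)
    if not forms:
        return bare_word  # assumption: unseen words are left undiacritized
    return forms.most_common(1)[0][0]


def word_accuracy(predicted, gold):
    """Fraction of words whose predicted form matches the reference exactly."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy data (invented for illustration), written with escapes to be unambiguous.
KATABA = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kaf+fatha, ta+fatha, ba+fatha
KUTUB = "\u0643\u064f\u062a\u064f\u0628"         # kaf+damma, ta+damma, ba
BARE = "\u0643\u062a\u0628"                      # the same letters, no diacritics

model = train_unigram([KATABA, KATABA, KUTUB])
print(restore_unigram(model, BARE) == KATABA)  # the more frequent form is chosen
```

A unigram model of this kind ignores context, which is the limitation the bigram HMM is meant to address.
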
The methodology enables robust and accurate tagging with few resource requirements: only a dictionary and some tagged training corpora are required. HMMs were found to be useful tools for POS tagging of Arabic words taken from the Holy Qur'an. We achieved an accuracy of 98.4% for the unigram model and 99.4% for the bigram HMM.
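
To make the bigram HMM idea concrete, the following is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm. It is not the thesis implementation: the add-one smoothing, the function names, the `<s>` start symbol, and the tiny transliterated training set are all assumptions made for illustration. The same structure would apply to diacritics restoration if the hidden states were diacritized forms instead of POS tags.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start symbol


def train_hmm(tagged_sentences):
    """Collect tag-bigram (transition) and word-given-tag (emission) counts."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), len(vocab)


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence under the bigram HMM."""
    trans, emit, tags, vocab_size = model
    if not words:
        return []
    n_words = vocab_size + 1  # one extra outcome reserved for unseen words
    # Each column maps tag -> (best log score, backpointer to previous tag).
    best = [{t: (log_prob(trans[START], t, len(tags))
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, len(tags)), p)
                for p in tags)
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy tagged data (invented): V = verb, N = noun, P = preposition.
training = [
    [("kataba", "V"), ("al-waladu", "N"), ("al-darsa", "N")],
    [("qara'a", "V"), ("al-waladu", "N"), ("fi", "P"), ("al-bayti", "N")],
]
hmm = train_hmm(training)
print(viterbi(["qara'a", "al-waladu", "al-darsa"], hmm))  # ['V', 'N', 'N']
```

Working in log space avoids numerical underflow on longer sentences, and the extra smoothing outcome lets the decoder handle words that never appeared in training.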