Abstract

This thesis presents two automatic systems. The first restores vowels (diacritics) for nondiacritized Qur'an words, using a unigram model and a bigram Hidden Markov Model (HMM). The second identifies the part-of-speech (POS) tags of Qur'an words, using the same two models. Diacritization is the process of placing special marks above or below the letters of a word in Arabic. Arabic text may be either diacritized, as in the Qur'an or children's books, or nondiacritized, as in newspapers, books, and media. Nondiacritized text is ambiguous, since a nondiacritized word may have more than one meaning. The first proposed system was designed to carry out the diacritization process. It proved robust and reliable without using morphological analysis methods for diacritics restoration. HMMs were found to be useful tools for diacritics restoration in Arabic. The techniques used are simple to apply and do not require any language-specific knowledge to be embedded in the model. The Qur'an was used as the corpus; the system was implemented and tested with many parts of the Qur'an serving as training sets. For instance, on 1366 words starting from the beginning of the Qur'an, the system achieved 94.3% word accuracy with the unigram model and 95.2% word accuracy with the bigram HMM.

The second proposed system performs automatic POS tagging, an area of natural language processing where statistical techniques have been more successful than rule-based methods. POS tagging is the process of assigning a proper POS tag to each word in a text. We designed and implemented a system for tagging Arabic text using a unigram model and a bigram HMM.
The methodology enables robust and accurate tagging with few resource requirements: only a dictionary and some tagged training corpora are required. HMMs were found to be useful tools for POS tagging of Arabic words taken from the Holy Qur'an. We achieved an accuracy of 98.4% for the unigram model and 99.4% for the bigram HMM.
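
As an illustration of the two models named above, the following is a minimal sketch and not the thesis implementation. It assumes each training sentence is a list of (observation, label) pairs: for diacritics restoration the observation would be the nondiacritized word and the label its diacritized form, and for POS tagging the label would be the tag. The unigram model picks the most frequent label for each word; the bigram HMM scores label sequences with transition and emission counts and decodes with the Viterbi algorithm. The add-one smoothing, the uniform treatment of unseen words, the function names, and the English toy sentences are all assumptions made for the example; the abstract does not specify any of them.

```python
from collections import Counter, defaultdict
import math

def train(sentences):
    """Collect counts from sequences of (observation, label) pairs."""
    emit = defaultdict(Counter)   # emit[label][observation]
    trans = defaultdict(Counter)  # trans[previous_label][label]
    cands = defaultdict(Counter)  # cands[observation][label]
    for sent in sentences:
        prev = "<s>"              # sentence-start pseudo label
        for obs, lab in sent:
            emit[lab][obs] += 1
            trans[prev][lab] += 1
            cands[obs][lab] += 1
            prev = lab
    return emit, trans, cands

def unigram_decode(words, cands):
    """Unigram model: most frequent label per word, None for unseen words."""
    return [cands[w].most_common(1)[0][0] if w in cands else None for w in words]

def viterbi(words, emit, trans, cands, alpha=1.0):
    """Bigram HMM decoding; alpha is an assumed add-alpha smoothing constant."""
    labels = list(emit)

    def log_trans(prev, lab):
        total = sum(trans[prev].values()) + alpha * len(labels)
        return math.log((trans[prev][lab] + alpha) / total)

    def log_emit(lab, word):
        return math.log(emit[lab][word] / sum(emit[lab].values()))

    best = {"<s>": (0.0, [])}     # label -> (log score, best path ending in it)
    for w in words:
        seen = w in cands
        # Seen words keep only the labels observed with them in training;
        # unseen words may take any label with a uniform emission score.
        states = list(cands[w]) if seen else labels
        new = {}
        for lab in states:
            score, path = max(
                ((s + log_trans(p, lab), pth) for p, (s, pth) in best.items()),
                key=lambda t: t[0],
            )
            new[lab] = (score + (log_emit(lab, w) if seen else 0.0), path + [lab])
        best = new
    return max(best.values(), key=lambda t: t[0])[1]

def word_accuracy(pred, gold):
    """Fraction of words whose predicted label matches the reference."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# Hypothetical toy corpus (English, for readability only).
corpus = [
    [("the", "DET"), ("can", "NOUN"), ("rusts", "VERB")],
    [("they", "PRON"), ("can", "AUX"), ("swim", "VERB")],
    [("we", "PRON"), ("can", "AUX"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
]
emit, trans, cands = train(corpus)
sentence = ["the", "can", "rusts"]
gold = ["DET", "NOUN", "VERB"]
print(unigram_decode(sentence, cands), word_accuracy(unigram_decode(sentence, cands), gold))
print(viterbi(sentence, emit, trans, cands), word_accuracy(viterbi(sentence, emit, trans, cands), gold))
```

On this toy data the unigram model labels "can" with its most frequent tag regardless of context, while the bigram HMM uses the preceding label to choose between the candidates. This is the kind of contextual gain that the reported difference between unigram and bigram accuracies reflects, although the toy figures say nothing about the thesis results themselves.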