Author: Hassan, Yosr Hussien./ Title: Incremental Distributed Clustering of Sgml Documents Using Phrase


Title

Incremental Distributed Clustering of SGML Documents Using Phrase-Based Indexing

Author

Hassan, Yosr Hussien.

Committee

Supervisor / مجدى ناجى

Supervisor / محمود جبر

Researcher / Yosr Hussien Hassan

Examiner / امانى سعد

Subject

Incremental. Distributed. Clustering. Documents. Phrase-Based.

Publication date

2013.

Number of pages

102 p.

Language

English

Degree

Master's

Specialization

Computer Science

Date of approval

1/1/2013

Place of approval

Alexandria University - Faculty of Science - Computer Science


Abstract

Document clustering is a text mining task that is useful in many applications, such as automatic categorization of documents, grouping of search engine results, and building document taxonomies. When the document data set to be clustered is large, the clustering process is costly in both time and hardware resources. We have noticed that the available hardware resources are sometimes not fully exploited during document clustering, especially when it is performed sequentially. Consequently, we propose to perform document clustering in a distributed and incremental manner that can exploit whatever hardware resources are provided, with the aim of reducing cost and maximizing hardware utilization.
The work introduced here is divided into two key parts. The first part, which is both distributed and incremental, indexes documents and calculates pairwise similarities. Most document clustering techniques rely on single-term analysis of the document data set, such as the Vector Space Model. However, single-term analysis cannot capture the structure of sentences because it models documents by their individual words only. Our work is therefore based on the model introduced in [1]: besides single-term indexing, we provide phrase-based document indexing to measure document similarity more accurately. The phrase-based index introduced in [1] is called the Document Index Graph. Because it is graph-based, it allows incremental construction of a phrase-based index of the document set and provides efficient phrase matching, which is used to judge the similarity between documents. Working in a distributed and incremental manner, our model indexes documents and combines the contributions of single-term indexing and phrase-based indexing to calculate document similarity. This combination is flexible because it can revert to a compact representation of the vector space model if we choose not to index phrases. The result is a robust and accurate document similarity calculation that leads to much improved document clustering results.
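The idea of combining single-term and phrase-based similarity can be illustrated with a minimal sketch. It is not the Document Index Graph of [1]: word bigrams stand in for matched phrases, and the function names, the Jaccard phrase measure, and the weight `alpha` are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


def term_cosine(a_tokens, b_tokens):
    """Cosine similarity between raw term-frequency vectors (single-term view)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def phrases(tokens, n=2):
    """Set of word n-grams, used here as a simple proxy for shared phrases."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrase_overlap(a_tokens, b_tokens, n=2):
    """Jaccard overlap of word n-grams (hypothetical phrase measure)."""
    pa, pb = phrases(a_tokens, n), phrases(b_tokens, n)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def blended_similarity(doc_a, doc_b, alpha=0.5):
    """Weighted blend of the two views.

    alpha is an assumed weight in [0, 1]; alpha=0 ignores phrases and
    falls back to the single-term (vector space) similarity alone.
    """
    ta, tb = tokenize(doc_a), tokenize(doc_b)
    return alpha * phrase_overlap(ta, tb) + (1 - alpha) * term_cosine(ta, tb)
```

For two texts that contain the same words in a different order, such as "river boat trip" and "trip boat river", the single-term similarity is at its maximum while the bigram overlap is zero, so raising `alpha` lowers the blended score. This is the kind of word-order sensitivity that a phrase-based index is meant to add.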
The second part is incremental but not distributed; it performs document clustering based on the similarity histogram-based clustering algorithm introduced in [1]. This algorithm maximizes the tightness of clusters by carefully monitoring the distribution of pairwise document similarities inside each cluster. In addition, the insertion-order problem, which arises as a side effect of incremental processing, is reduced by periodically re-clustering the documents.
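A rough sketch of incremental, histogram-guided cluster assignment is given below. It is only loosely modeled on the description above and should not be read as the algorithm of [1]: the tightness measure (the fraction of pairwise similarities at or above a threshold), the acceptance rule, and the parameters `threshold`, `min_ratio`, and `eps` are assumptions chosen for illustration, and periodic re-clustering is omitted.

```python
def histogram_ratio(sims, threshold):
    """Fraction of pairwise similarities at or above the threshold."""
    if not sims:
        return 1.0  # a singleton cluster is treated as perfectly tight
    return sum(s >= threshold for s in sims) / len(sims)


class Cluster:
    """Holds member documents and all pairwise similarities among them."""

    def __init__(self, doc):
        self.docs = [doc]
        self.sims = []


def add_incrementally(clusters, doc, similarity,
                      threshold=0.3, min_ratio=0.5, eps=0.1):
    """Assign doc to the first cluster whose tightness it does not degrade
    too much; otherwise open a new cluster. Returns the receiving cluster.

    similarity is any caller-supplied function returning a value in [0, 1].
    """
    for c in clusters:
        new_sims = [similarity(doc, d) for d in c.docs]
        old = histogram_ratio(c.sims, threshold)
        new = histogram_ratio(c.sims + new_sims, threshold)
        # Accept if tightness does not drop, or drops only slightly
        # while staying above the minimum acceptable ratio.
        if new >= old or (new >= min_ratio and old - new < eps):
            c.docs.append(doc)
            c.sims.extend(new_sims)
            return c
    c = Cluster(doc)
    clusters.append(c)
    return c
```

Because each document is tested against existing clusters in arrival order, the result depends on the order of insertion, which is the side effect that the periodic re-clustering step mentioned above is meant to reduce.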
Together, these two components form a robust document clustering model whose main aims are to utilize whatever hardware resources are available and to reduce processing time, with an emphasis on efficiency. The model is also flexible: it allows processing time to be traded against the cost of hardware resources, and vice versa.