Title
Machine Understanding through Unsupervised Web Semantification /
Author
Gerguis, Michel Naim Naguib.
Thesis committee
Researcher / Michel Naim Naguib Gerguis
Supervisor / Mohamed Watheq Ali Kamel El-Kharashi
Supervisor / Sherif Ramzy Salama
Examiner / Hoda Korashy Mohamed
Publication date
2017.
Number of pages
85 p.
Language
English
Degree
Master's
Specialization
Electrical and Electronic Engineering
Approval date
1/1/2017
Awarding institution
Ain Shams University - Faculty of Engineering - Computer Engineering
Table of contents
Only 14 of 124 pages are available for public view

Abstract

Fine-grained classification is now crucial for machine understanding. This thesis introduces ClassifyWiki, a framework that automatically generates Wikipedia-based text classifiers from a small set of positive training articles. The target level of granularity is left to the consumer, allowing ClassifyWiki to build classifiers for persons, sportspersons, or even footballers. The main goal is to simplify the process of collecting hundreds or thousands of Wikipedia pages of the same entity class, starting from a set of positive articles possibly as small as 10 pages. ClassifyWiki draws on many previous classifiers that each tackled a few entity classes, building a generic framework on top of them that can tackle any entity class. ClassifyWiki's output is not a set of models for particular entity classes but a framework, tuned through more than a hundred experiments, that generates a model for any given set of positive articles. To test the framework, we manually tagged a data set of 2,500 Wikipedia pages with the finest-grained types possible. The data set covers 808 unique classes at different levels of granularity. ClassifyWiki was tested on 103 different entity classes, varying in size down to only 5 positive articles.
On our blind set, ClassifyWiki achieved a macro-averaged F1-score of 83% over 13 entity classes at different levels of granularity, with 96% precision and 74% recall, using 50 or more positive articles. For the main classes, ClassifyWiki scored 97% for Person using 299 training instances, 79% for Location using 214 instances, and 65% for Organization using 82 instances.
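The small-positive-set idea can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the actual features and learner used by ClassifyWiki are not specified here, so the sketch assumes a bag-of-words Naive Bayes learner and assumes that negatives are sampled from unrelated Wikipedia pages; the article snippets and class are invented.

```python
# Sketch: train a binary "footballer" classifier from a handful of positive
# Wikipedia articles plus sampled negatives (assumed setup, not thesis code).
from collections import Counter
import math

def tokens(text):
    return [w.lower().strip(".,()") for w in text.split()]

class SmallSetClassifier:
    """Multinomial Naive Bayes over bag-of-words, with Laplace smoothing."""
    def __init__(self):
        self.counts = {"pos": Counter(), "neg": Counter()}
        self.totals = {"pos": 0, "neg": 0}
        self.docs = {"pos": 0, "neg": 0}

    def fit(self, positives, negatives):
        for label, docs in (("pos", positives), ("neg", negatives)):
            for doc in docs:
                self.docs[label] += 1
                for t in tokens(doc):
                    self.counts[label][t] += 1
                    self.totals[label] += 1

    def predict(self, text):
        vocab = len(set(self.counts["pos"]) | set(self.counts["neg"]))
        scores = {}
        for label in ("pos", "neg"):
            score = math.log(self.docs[label] / sum(self.docs.values()))
            for t in tokens(text):
                # Laplace-smoothed token likelihood under this class
                p = (self.counts[label][t] + 1) / (self.totals[label] + vocab)
                score += math.log(p)
            scores[label] = score
        return scores["pos"] > scores["neg"]

positives = [  # invented stand-ins for positive Wikipedia articles
    "John Smith is a footballer who plays as a striker for the club.",
    "Maria Lopez is a footballer known for her goals in the league.",
]
negatives = [  # invented stand-ins for sampled non-class pages
    "The river flows through the valley past several towns.",
    "The company manufactures electronic components and sensors.",
]
clf = SmallSetClassifier()
clf.fit(positives, negatives)
print(clf.predict("Alex Kim is a footballer who scored many goals."))  # True
```

With real Wikipedia text, the positive set would be the user-supplied articles and the negatives would be drawn from the rest of the corpus; the point is only that a usable decision boundary can be learned from very few positives.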
We also present WikiTrends, a new analytics framework for Wikipedia articles. It adds temporal and spatial dimensions to Wikipedia articles in order to visualize the extracted information, turning the big static encyclopedia into a vibrant one. WikiTrends generates aggregated views, as timelines or maps, over any user-defined collection, starting from unstructured text. Data mining techniques were applied to detect the nationality, start and end year of existence, gender, and entity class for around 4.85 million pages. We evaluated our extractors on a random set of 100 manually tagged pages. Heat maps of notable football players' counts over history, or of the dominant occupations in a specific era, are examples of how WikiTrends maps summarize Wikipedia's big data. WikiTrends timelines can easily illustrate interesting fame battles over history between male and female actors, between music genres, or even between American, Italian, and Indian films. Through information visualization and simple configuration, WikiTrends offers a new experience: answering questions through a figure. The framework is designed to be easily extended, so that new extractors for different information types can be integrated.
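The aggregation step behind such timelines and heat maps can be illustrated with a toy example. The record layout and field names below are assumptions for illustration, not the thesis data model: each page is reduced to its extracted entity class, nationality, and start/end year of existence, and views are simple roll-ups over those records.

```python
# Sketch of a WikiTrends-style roll-up (illustrative, not the thesis code).
from collections import Counter

records = [  # hypothetical extractor output for four Wikipedia pages
    {"cls": "footballer", "nationality": "Italy",  "start": 1940, "end": 1990},
    {"cls": "footballer", "nationality": "Italy",  "start": 1960, "end": 2010},
    {"cls": "footballer", "nationality": "Brazil", "start": 1955, "end": 2005},
    {"cls": "actor",      "nationality": "Italy",  "start": 1930, "end": 1985},
]

def timeline(records, cls, years):
    """Per-year counts of entities of a class whose lifespan covers that year."""
    counts = Counter()
    for year in years:
        counts[year] = sum(1 for r in records
                           if r["cls"] == cls and r["start"] <= year <= r["end"])
    return counts

def heat_map(records, cls, year):
    """Per-nationality counts of a class alive in a given year (map view)."""
    return Counter(r["nationality"] for r in records
                   if r["cls"] == cls and r["start"] <= year <= r["end"])

print(timeline(records, "footballer", [1950, 1970, 2000]))
print(heat_map(records, "footballer", 1970))
```

Scaled to millions of pages, the same roll-up over nationality gives the heat maps, and the per-year counts give the fame-battle timelines described above.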
Finally, we present the ASU system submitted to the COLING 2016 W-NUT Twitter Named Entity Recognition (NER) shared task. We present an experimental study on applying deep learning to extracting named entities (NEs) from tweets. We built two Long Short-Term Memory (LSTM) models for the task: the first extracts named entities without types, while the second extracts them and then classifies them into 10 fine-grained entity classes. We report detailed experimental results on the effectiveness of word embeddings, Brown clusters, part-of-speech (POS) tags, shape features, gazetteers, and local context in the tweet input vector representation fed to the LSTM model. We also present a set of experiments aimed at better tuning the network parameters for the Twitter NER task. Our system ranked fifth out of ten participants, with a final F1-score of 39% for the typed classes and 55% for the untyped ones.
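The per-token input representation can be sketched as a concatenation of the feature groups listed above. The dimensions, vocabularies, and helper names below are illustrative assumptions, not the thesis configuration; the LSTM itself is omitted, and only the assembly of one token's window vector (embedding + POS one-hot + word-shape flags + local context) is shown.

```python
# Sketch of assembling a tweet token's input vector (assumed setup).
import re

EMB = {"the": [0.1, 0.2], "apple": [0.9, 0.1]}   # toy 2-d word embeddings
POS_TAGS = ["NOUN", "VERB", "DET", "OTHER"]       # toy POS tag set

def word_shape(token):
    """Collapse character classes: 'McDonald' -> 'XxXx', '2016' -> 'd'."""
    s = "".join("X" if c.isupper() else "x" if c.islower()
                else "d" if c.isdigit() else "-" for c in token)
    return re.sub(r"(.)\1+", r"\1", s)

def one_hot(value, vocab):
    return [1.0 if value == v else 0.0 for v in vocab]

def token_vector(token, pos):
    emb = EMB.get(token.lower(), [0.0, 0.0])                  # OOV -> zeros
    shape = [float(c in word_shape(token)) for c in "Xxd-"]   # coarse shape flags
    return emb + one_hot(pos, POS_TAGS) + shape

def window_vector(tokens, pos_tags, i, window=1):
    """Concatenate vectors for tokens i-window..i+window, zero-padded at edges."""
    parts = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(tokens):
            parts += token_vector(tokens[j], pos_tags[j])
        else:
            parts += [0.0] * (2 + len(POS_TAGS) + 4)  # sentence-boundary padding
    return parts

vec = window_vector(["The", "apple"], ["DET", "NOUN"], 1)
print(len(vec))  # 3 positions x (2 emb + 4 POS + 4 shape) = 30
```

In a real system the embedding dimension, tag set, and window width would be much larger, and Brown-cluster and gazetteer features would be appended in the same way before the vector is fed to the LSTM at each time step.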