Author: Yasmin Amr Ahmed Anwar Badr / Title: Automatic clustering of DNA sequences with intelligent techniques

Title

Automatic clustering of DNA sequences with intelligent techniques

Author

Yasmin Amr Ahmed Anwar Badr

Preparation committee

Researcher / Yasmin Amr Ahmed Anwar Badr

Supervisor / Khaled T. Wassif

Supervisor / Mahmoud S. Othman

Subject

Computer Science

Publication date

2022

Number of pages

78 p.

Language

English

Degree

Master's

Specialization

Computer Science (miscellaneous)

Date of approval

16/4/2021

Place of approval

Cairo University - Faculty of Computers and Information - Yasmin Amr Ahmed Anwar Badr

Abstract

With the discovery of new DNA sequences, a fundamental problem is how to assign those sequences to the correct species. Correctly identifying all data groups and partitioning a set of DNA sequences into k clusters, where k must be predefined, is one of the major difficulties in clustering analysis, especially when the data have many dimensions and the number of clusters is large and hard to guess. Furthermore, finding a similarity measure that preserves functionality and represents both the composition and the distribution of the bases in a DNA sequence is one of the main challenges in computational biology.
In this thesis, a new soft computing metaheuristic framework is introduced for automatic clustering, to generate the optimal cluster formation and to determine the best estimate of the number of clusters. A pulse coupled neural network (PCNN) is used to calculate DNA sequence similarity or dissimilarity. The bat algorithm is hybridized with the well-known genetic algorithm to solve the automatic data clustering problem. Extensive computational experiments are conducted on the expanded Human Oral Microbiome Database (eHOMD). The simulation results show that the hybrid GABAT outperformed two state-of-the-art clustering algorithms, the genetic algorithm and the bat algorithm, as well as other competing metaheuristic algorithms. GABAT achieved better mean and standard deviation values: 0.40954 and 0.0197, respectively, using Euclidean distance, and 0.012312 and 0.003918, respectively, using entropy as the distance measure. A Wilcoxon test was conducted to statistically validate the obtained clusters; it yielded significant p-values below 5%, with the bat algorithm outperforming the genetic algorithm and GABAT outperforming the bat algorithm. These results indicate that GABAT performed better than its competitors.
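
The idea of automatic clustering, where the number of clusters is searched for rather than fixed in advance, can be illustrated with a minimal sketch. It is not the thesis's implementation: the activation-flag encoding, the 0.5 threshold, the compactness-to-separation fitness, and the names decode, assign and fitness are all assumptions chosen for illustration. A metaheuristic such as a genetic algorithm or a bat algorithm would evolve many such candidates and keep those with the lowest fitness.

```python
import math

# Assumed encoding (not from the thesis): a candidate is a list of
# (activation_flag, centre) pairs; only centres whose flag exceeds a
# threshold take part in the clustering, so the number of clusters k
# is part of what the search optimises.

def decode(candidate, threshold=0.5):
    """Return the active cluster centres of a candidate solution."""
    return [centre for flag, centre in candidate if flag > threshold]

def assign(points, centres):
    """Label each point with the index of its nearest centre (Euclidean)."""
    return [
        min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
        for p in points
    ]

def fitness(points, candidate):
    """Hypothetical validity score: mean point-to-centre distance divided
    by the smallest distance between two active centres. Lower is better;
    candidates with fewer than two active centres are rejected."""
    centres = decode(candidate)
    if len(centres) < 2:
        return math.inf
    labels = assign(points, centres)
    compactness = sum(
        math.dist(p, centres[label]) for p, label in zip(points, labels)
    ) / len(points)
    separation = min(
        math.dist(a, b)
        for i, a in enumerate(centres)
        for b in centres[i + 1:]
    )
    return compactness / separation
```

With two well-separated groups of points, a candidate that activates one centre per group scores lower (better) than one that also activates a redundant centre between them, which is how such a search can settle on a value of k without it being supplied.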