Abstract

Data is a very important asset for any organization, and its quality has serious consequences for the business. Data quality is a basic concern for a wide range of information systems, such as data warehouses, business intelligence, customer relationship management and supply chain management. The literature describes data quality as a multidimensional concept that includes completeness, accuracy, timeliness and consistency, among other dimensions.

In this work, data quality is measured by applying data mining algorithms, which can discover previously unknown patterns and relationships in a dataset and can handle both discrete and continuous data. Two algorithms are applied: neural networks and support vector machines. The completeness and timeliness dimensions are selected for measurement because they appear in most proposed sets of data quality dimensions and belong to the basic set of dimensions. The proposed methodology for measuring the two dimensions takes into account an important aspect of the data, the field type: whether a field is mandatory, optional or not applicable.

The measurement is carried out on a real, unbalanced dataset. To train the data mining models, the imbalance is handled by duplicating the minority-class instances. Cross-validation is used to evaluate the performance measures of the two data mining algorithms, and the recorded results are then compared.

First, completeness is assessed with statistical, neural network and support vector machine models, which judge whether each data row is complete; the completeness of the dataset is then calculated. The neural network and support vector machine models act as classifiers of row completeness. Second, timeliness is assessed, again with statistical, neural network and support vector machine models. The models calculate a timeliness value for each row, from which the timeliness of the dataset is then measured.
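
As an illustration of the completeness assessment and the duplication-based balancing described above, a minimal Python sketch follows. The completeness rule used here (a row is complete when every mandatory field has a value, while optional and not-applicable fields are ignored), the per-field encoding of field types, the treatment of blank strings as missing, and all function and field names are assumptions made for the example; the abstract does not specify them, so this is not necessarily the method used in this work.

```python
import random
from collections import Counter

# Hypothetical labels for the three field types named in the abstract.
MANDATORY, OPTIONAL, NOT_APPLICABLE = "mandatory", "optional", "not_applicable"


def is_missing(value):
    """Treat None and blank strings as missing (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def row_is_complete(row, field_types):
    """Assumed rule: a row is complete when every mandatory field has a value.

    Optional and not-applicable fields never make a row incomplete.
    """
    return all(
        not is_missing(row.get(field))
        for field, ftype in field_types.items()
        if ftype == MANDATORY
    )


def dataset_completeness(rows, field_types):
    """Fraction of rows judged complete (0.0 for an empty dataset)."""
    if not rows:
        return 0.0
    return sum(row_is_complete(r, field_types) for r in rows) / len(rows)


def oversample_minority(rows, labels, seed=0):
    """Balance the classes by duplicating randomly chosen minority instances.

    Duplicates are references to the original row objects, not deep copies.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Illustrative data only; field names and values are invented.
field_types = {"id": MANDATORY, "email": OPTIONAL, "fax": NOT_APPLICABLE}
rows = [
    {"id": 1, "email": "a@example.com", "fax": None},
    {"id": None, "email": "", "fax": None},
    {"id": 3, "email": "", "fax": None},
    {"id": 4, "email": None, "fax": None},
]
labels = [row_is_complete(r, field_types) for r in rows]
completeness = dataset_completeness(rows, field_types)
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

Here the rule-based labels play the role of the statistical judgement, and the balanced rows and labels are what a neural network or support vector machine classifier could then be trained on. When cross-validation is used, duplicating minority rows only inside each training fold keeps identical rows from appearing in both the training and the test data.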