Theoretical Utility of Data Value Metric and Genetic Algorithms for Variable Clustering in an Unsupervised Learning Environment
Abstract
Cluster analysis is one of the most important unsupervised learning tasks. It divides data into meaningful groups, known as clusters, using only the information in the data: objects are described in terms of their relationships so that the data's natural structure is captured. The literature offers many traditional performance evaluation metrics for clustering algorithms, but these treat all attributes or variables equally when measuring similarity, even though variables may contribute differently because the amount of information they contain can vary greatly. The Data Value Metric (DVM) is an information-theoretic measure based on mutual information that has been shown to be a good metric for validating data quality and utility, both in a big data ecosystem and in traditional data. Because it uses a forward selection search strategy, DVM is prone to local optima and to a loss of diversity among candidate solutions. Hybridizing it with a Genetic Algorithm addresses this problem, because the evolutionary search balances exploration and exploitation of the search space. This paper proposes a hybrid model of the Genetic Algorithm and DVM as an information-theoretic metric for quantifying the quality and utility of variable clustering selection, applicable to traditional data.
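As a rough illustration of the hybrid idea described in the abstract, the following Python sketch uses a small genetic algorithm to assign variables to clusters and scores each assignment with a mutual-information-based fitness. It is a minimal sketch under stated assumptions, not the paper's method: the abstract does not give the DVM formula, so `score` is a hypothetical stand-in (mean within-cluster mutual information minus mean between-cluster mutual information), and all function names, parameters, and the synthetic data are invented for illustration.

```python
import math
import random
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def pairwise_mi(columns):
    """Precompute mutual information for every unordered pair of variables."""
    return {(i, j): mutual_information(columns[i], columns[j])
            for i, j in combinations(range(len(columns)), 2)}


def score(labels, mi, k):
    """Stand-in fitness (an assumption, NOT the DVM itself):
    mean within-cluster MI minus mean between-cluster MI."""
    if len(set(labels)) < k:  # reject assignments that leave a cluster empty
        return float("-inf")
    within, between = [], []
    for (i, j), value in mi.items():
        (within if labels[i] == labels[j] else between).append(value)
    mean = lambda v: sum(v) / len(v) if v else 0.0
    return mean(within) - mean(between)


def genetic_search(mi, n_vars, k=2, pop_size=30, generations=40,
                   mutation_rate=0.1, seed=0):
    """Evolve cluster-label vectors with elitism, tournaments, crossover, mutation."""
    rng = random.Random(seed)
    fitness = lambda ind: score(ind, mi, k)
    population = [[rng.randrange(k) for _ in range(n_vars)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_gen = [population[0][:], population[1][:]]  # elitism: keep the two best
        while len(next_gen) < pop_size:
            # Tournament selection: the fitter of three random individuals (exploitation).
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, n_vars)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Random relabelling keeps diversity in the population (exploration).
            child = [rng.randrange(k) if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


def make_demo_data(n_rows=400, noise=0.1, seed=1):
    """Synthetic data: variables 0-2 are noisy copies of one hidden bit, 3-5 of another."""
    rng = random.Random(seed)
    columns = [[] for _ in range(6)]
    for _ in range(n_rows):
        hidden = [rng.randint(0, 1), rng.randint(0, 1)]
        for v in range(6):
            bit = hidden[v // 3]
            if rng.random() < noise:
                bit = 1 - bit
            columns[v].append(bit)
    return columns


columns = make_demo_data()
mi = pairwise_mi(columns)
best = genetic_search(mi, n_vars=len(columns))
print(best, round(score(best, mi, 2), 3))
```

On this toy data the search is expected to group variables 0-2 together and 3-5 together, since each group shares a hidden source. A real application would replace `score` with the actual DVM computation and tune the population size, mutation rate, and number of clusters to the data at hand.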