Clustering Based Approach for Ground Truth Inference in Crowdsourced Data


Victor T. Odumuyiwa
Anurika Umeanozie
Oladipupo Sennaike
Olubukola Adekola
Babatunde Sawyerr
Ebun Fasina

Abstract

Crowdsourcing provides a means of gathering data from the public in order to infer the ground truth label of an unfamiliar entity. Such data cannot be used for decision making in their raw form; they must first be processed to infer the ground truth. This paper presents a detailed comparative analysis of the ground truth inference ability of three clustering algorithms on crowdsourced datasets under different experimental scenarios (initializing centroids and extracting class labels). The algorithms are the self-organizing map (SOM), k-means and the expectation maximization clustering algorithm. The approach entails generating a new dataset containing the probability distribution of the class predictions for each example in the noisy dataset, then clustering the data points on these probability features in order to infer their class labels. The three algorithms were implemented and compared with the majority voting algorithm on the datasets used in this research: Adult2, weather sentiments, emotion, valence5 and an employee review dataset. Four experimental scenarios for inferring the ground truth label from the curated dataset were analysed. The first scenario uses the clustering algorithm alone, relying on the inner workings of the algorithm to predict the ground truth. The second scenario uses an extract-class-label mechanism, in which the ground truth label is inferred by performing a further analysis on the clusters produced by the algorithm. In the third scenario, the centroids of the clustering algorithm are pre-initialized by setting the maximum value in each class from the curated data as a centroid, where the meaning of "centroid" depends on the particular algorithm (for the SOM, the weights of its units). The fourth scenario combines the second and third scenarios.
Experimental results show that the self-organizing map (SOM) performs best across all the datasets when the weights of the units in the SOM are pre-initialized. SOM had its best performance on the weather sentiments dataset, with an accuracy of 92.49% and a ROC AUC score of 0.88. It also recorded the best overall average accuracy of 50.2% and the best overall average ROC AUC score of 0.59365 across all the datasets.
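
The approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy votes, the function names, the reading of "maximum value in each class" as the example with the highest probability for that class, and the rule mapping a cluster to the class with the largest centroid component are all assumptions made for illustration. It uses k-means only and compares the result with a majority voting baseline.

```python
from collections import Counter

def label_distribution(votes, classes):
    """Turn one example's crowd votes into a probability vector over classes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]

def majority_vote(votes, classes):
    """Baseline: most frequent label (ties broken by order in `classes`)."""
    counts = Counter(votes)
    return max(classes, key=lambda c: counts[c])

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(features, n_classes):
    """Assumed pre-initialization: for each class k, pick the example whose
    probability for class k is highest and use it as the starting centroid."""
    return [list(max(features, key=lambda f: f[k])) for k in range(n_classes)]

def kmeans(features, centroids, max_iter=50):
    """Plain k-means starting from the given centroids."""
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(f, centroids[k]))
            for f in features
        ]
        if new_assignments == assignments:
            break  # converged: no example changed cluster
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [f for f, a in zip(features, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

def extract_class_labels(assignments, centroids, classes):
    """Assumed extraction rule: a cluster takes the class with the largest
    component in its centroid."""
    cluster_class = [classes[max(range(len(c)), key=lambda j: c[j])]
                     for c in centroids]
    return [cluster_class[a] for a in assignments]

# Hypothetical toy data: each row holds the labels given by three workers.
classes = ["neg", "pos"]
crowd_votes = [
    ["pos", "pos", "neg"],
    ["pos", "pos", "pos"],
    ["neg", "neg", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "neg", "pos"],
]

features = [label_distribution(v, classes) for v in crowd_votes]
assignments, centroids = kmeans(features, init_centroids(features, len(classes)))
inferred = extract_class_labels(assignments, centroids, classes)
baseline = [majority_vote(v, classes) for v in crowd_votes]

print("clustering:", inferred)
print("majority  :", baseline)
```

On data this clean the two methods agree. Differences would be expected only where votes are close to evenly split or where the clusters shift the decision boundary away from a simple majority.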