Evaluating the Significance of Data Engineering Techniques in Multi-Class Prediction: Multi-Factor Educational Data Mining Experiments
Abstract
Artificial Intelligence, particularly predictive modelling, is increasingly influencing education. For instance, one algorithm predicted with 74% accuracy, within the first three weeks of a course, which students would fail. Such results could lead to interventions that promote inclusivity and personalized learning, supporting the UN's goals of quality education and reduced inequalities. While predictive analytics holds great promise for education, educational datasets often suffer from small sample sizes and class imbalances, which can result in inaccurate predictions and biased machine learning models. In this study, we evaluate the significance of various data engineering techniques in the context of educational data mining using a multi-factor supervised learning experiment. We applied data augmentation and class-balancing techniques to assess their impact on model performance. Additionally, we implemented and evaluated data discretization for continuous features and feature selection to identify the features most relevant for model training. The experimental design followed a 2 × 2 × 3 × 3 factorial structure, incorporating different combinations of these techniques. We employed three models: Random Forest, Decision Tree, and Feedforward Neural Network, and measured performance using accuracy and F1 score. The results show that data augmentation and balancing techniques seem to improve testing accuracy and F1 scores slightly, particularly for simpler models such as Decision Trees. Feedforward Neural Networks perform more consistently across different datasets, while Decision Trees and Random Forests are more prone to overfitting, particularly without proper data balancing or augmentation.
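As a rough illustration of the kind of pipeline the abstract describes, the sketch below combines class balancing, discretization of continuous features, and feature selection around a Decision Tree, then reports accuracy and macro F1. This is a minimal, hypothetical example using scikit-learn and synthetic data; the dataset, bin counts, and `k` value are assumptions, not the paper's actual configuration, and balancing is approximated here with the model's `class_weight` option rather than a specific augmentation method.

```python
# Hypothetical sketch of one cell of a multi-factor experiment:
# (balancing) x (discretization) x (feature-selection level) x (model).
# All parameter choices below are illustrative, not taken from the study.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Small, imbalanced multi-class dataset standing in for educational data.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1],
                           n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# One factor combination: discretize continuous features, keep the k most
# relevant ones, and counter class imbalance via class_weight="balanced".
pipe = Pipeline([
    ("discretize", KBinsDiscretizer(n_bins=4, encode="ordinal",
                                    strategy="uniform")),
    ("select", SelectKBest(f_classif, k=5)),
    ("model", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])
pipe.fit(X_tr, y_tr)
pred = pipe.predict(X_te)
print("accuracy:", round(accuracy_score(y_te, pred), 3))
print("macro F1:", round(f1_score(y_te, pred, average="macro"), 3))
```

Iterating this over every combination of the engineering factors and the three models would reproduce the factorial structure the abstract outlines, with each cell scored on held-out accuracy and F1.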