Isaac Kega

and 3 more

In machine learning, feature selection is central to improving the predictive performance of ensemble models. This paper presents a hybrid feature selection framework for ensemble models that combines Rough Set Theory (RST) with Recursive Feature Elimination (RFE) and is complemented by Association Rule Mining to enhance interpretability. The proposed method improves both the predictive accuracy and the comprehensibility of ensemble models, particularly Random Forests and Gradient Boosting Machines. The framework starts with RFE, which iteratively eliminates the least influential features, and then applies RST to refine the feature set further by removing redundant attributes. This two-phase approach yields a feature set that is heavily reduced yet still highly influential. Applying the hybrid method to ensemble models produced significant improvements in predictive accuracy across three diverse datasets: cancer, Pima Indians Diabetes, and a weather dataset from Weather Underground. The accuracies achieved on these datasets were 0.9663, 0.8793, and 0.8427, respectively, highlighting the effectiveness of the proposed approach. This article also proposes using association rule mining to analyze the outputs of the models. This technique makes the models easier to understand by exposing the connections and patterns behind their predictions, and so addresses the difficulty of interpreting complex ensemble models. Our empirical analysis confirms the effectiveness of the proposed hybrid feature selection model. The integration of RFE and RST optimizes the feature selection process and narrows the interpretability gap, offering robust solutions for applications where both accuracy and an understanding of model decisions are crucial.
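The rough-set refinement phase described above can be illustrated with a minimal sketch. It assumes the RFE phase has already run and that the surviving features have been discretized; the feature names, the toy table, and the helper names `dependency` and `drop_redundant` are hypothetical and are not taken from the paper. An attribute is treated as redundant when removing it leaves the rough-set dependency degree of the decision unchanged.

```python
from collections import defaultdict


def dependency(rows, attrs, decision):
    """Share of rows whose decision is fully determined by their values on attrs
    (size of the rough-set positive region divided by the number of rows)."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in attrs)].append(row[decision])
    consistent = sum(len(d) for d in blocks.values() if len(set(d)) == 1)
    return consistent / len(rows)


def drop_redundant(rows, attrs, decision):
    """Backward elimination: drop each attribute whose removal keeps the
    dependency degree equal to that of the full attribute set."""
    target = dependency(rows, attrs, decision)
    kept = list(attrs)
    for a in attrs:
        trial = [b for b in kept if b != a]
        if dependency(rows, trial, decision) == target:
            kept = trial
    return kept


# Hypothetical discretized table standing in for the features that survive RFE.
rows = [
    {"glucose": "high", "bmi": "high", "bmi_band": "obese", "age": "old", "y": 1},
    {"glucose": "high", "bmi": "low", "bmi_band": "normal", "age": "young", "y": 1},
    {"glucose": "low", "bmi": "high", "bmi_band": "obese", "age": "old", "y": 1},
    {"glucose": "low", "bmi": "low", "bmi_band": "normal", "age": "young", "y": 0},
    {"glucose": "low", "bmi": "low", "bmi_band": "normal", "age": "old", "y": 0},
]
features = ["glucose", "bmi", "bmi_band", "age"]

reduced = drop_redundant(rows, features, "y")
print(reduced)  # ['glucose', 'bmi_band']
```

In this toy table `bmi` and `bmi_band` carry the same information, so only one of them is kept, and `age` is dropped because it never changes which decision a row receives. The result depends on the order in which attributes are tried, which is one reason practical reduct algorithms use a ranking heuristic.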

Abdoul Aziz Diallo

and 3 more

Air quality is an important part of environmental health, with serious consequences for human health and well-being. The Air Quality Index (AQI) is a frequently used metric for assessing air quality in different areas and at different times. However, AQI data, like many other types of environmental data, can contain outliers: data points that deviate significantly from other observations and indicate exceptionally good or poor air quality. Detecting them is a critical step in identifying and understanding extreme pollution episodes that can have serious environmental and public health consequences. These outliers can be caused by a variety of factors, including measurement errors, unusual meteorological conditions, and pollution events. While outliers can occasionally give useful information about these unusual conditions, they can also skew studies and models if they are not adequately accounted for. This paper describes a hybrid method for detecting outliers in data, using AQI data as the case study. The method uses a stacked machine learning model that incorporates K-means clustering, Random Forest (RF), and Gradient Boosting Classifier (GBC). K-means is used for initial categorization, followed by RF model training, and finally the RF output is used as input to the GBC to generate the final classification. The performance of this stacked model is evaluated and compared with that of single models using accuracy as the metric. The findings show that the proposed technique is effective, achieving an accuracy of 0.99 and indicating its potential for outlier detection in data.
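The three-stage stack described above can be sketched roughly as follows with scikit-learn. This is an illustration under stated assumptions, not the authors' implementation: the AQI readings are synthetic, the rule that the smaller K-means cluster is the outlier class is an assumption, and the abstract does not say which RF output is passed on, so the sketch uses the predicted outlier probability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for AQI readings (assumption: real data would replace this).
rng = np.random.default_rng(0)
typical = rng.normal(loc=60.0, scale=10.0, size=(300, 1))
extreme = rng.normal(loc=250.0, scale=20.0, size=(30, 1))
X = np.vstack([typical, extreme])

# Stage 1: K-means gives the initial categorization.
# Assumption: the smaller of the two clusters is labelled as the outlier class.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_sizes = np.bincount(kmeans.labels_, minlength=2)
y = (kmeans.labels_ == np.argmin(cluster_sizes)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Stage 2: train the Random Forest on the cluster-derived labels.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)


def rf_output(model, data):
    """Outlier probability from the RF, shaped as a single feature column."""
    return model.predict_proba(data)[:, [1]]


# Stage 3: the Gradient Boosting Classifier takes the RF output as its input.
gbc = GradientBoostingClassifier(random_state=0).fit(rf_output(rf, X_train), y_train)
accuracy = gbc.score(rf_output(rf, X_test), y_test)
print(f"stacked accuracy: {accuracy:.2f}")
```

Because the labels come from K-means itself, the accuracy here measures only how well the stack reproduces the clustering on well-separated synthetic data. Fitting the GBC on in-sample RF probabilities also keeps the sketch short at the cost of some leakage; out-of-fold predictions would be a more careful choice.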

Isaac Kega

and 3 more

The ever-increasing complexity of datasets has necessitated the development of sophisticated techniques for uncovering meaningful patterns and interactions within the data. This paper investigates the synergy between Rough Set Theory and Association Rule Mining as an approach to detecting interactions and enhancing the predictive capabilities of machine learning models. The proposed framework uses the Greedy Heuristic Method for reduct generation, an established technique in Rough Set Theory, to efficiently identify relevant features and reduce the dimensionality of the dataset. Association Rule Mining then extracts association rules from the data, revealing interesting relationships and dependencies among the features. These association rules are transformed into binary values, each representing a detected interaction, to create a concise yet informative representation of the data’s intrinsic relationships. This binary representation is well suited to integration into machine learning models, enabling them to exploit the discovered interactions and capture more of the underlying patterns. To assess the effectiveness of the proposed framework, we carried out a comprehensive experiment on a weather dataset scraped from www.wunderground.com for Kariki farm in the Juja sub-county, Kiambu County, Kenya. We supplied the detected interactions as inputs to base machine learning models, including Naive Bayes, Decision Trees, Support Vector Machines (SVM), and Logistic Regression, and compared the performance of each model with and without the detected interactions. Through extensive experimentation, we demonstrate that the proposed approach is more effective than the same models trained without interaction detection.
Our results indicate that the interaction detection framework significantly improves the prediction accuracy of the tested models on the benchmark datasets. This improvement in accuracy highlights the practical relevance and potential benefits of adopting our approach to uncover valuable insights from datasets.
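The step that turns association rules into binary interaction features can be illustrated with a minimal sketch. It assumes discretized weather observations and a brute-force search over item pairs; the item names, thresholds, and the helpers `mine_pair_rules` and `encode` are hypothetical and stand in for whatever rule miner and encoding the paper actually uses.

```python
from itertools import combinations


def mine_pair_rules(transactions, labels, min_support=0.3, min_confidence=0.8):
    """Return rules ({item_a, item_b} -> label) that meet both thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for pair in combinations(items, 2):
        antecedent = frozenset(pair)
        # Labels of the transactions that contain both items of the pair.
        covered = [lab for t, lab in zip(transactions, labels) if antecedent <= t]
        for target in sorted(set(covered)):
            hits = covered.count(target)
            support = hits / n
            confidence = hits / len(covered)
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, target))
    return rules


def encode(transactions, rules):
    """One binary column per rule: 1 if the row contains the rule's antecedent."""
    return [[int(antecedent <= t) for antecedent, _ in rules] for t in transactions]


# Hypothetical discretized weather observations and outcomes.
transactions = [
    {"humidity=high", "pressure=low", "wind=calm"},
    {"humidity=high", "pressure=low", "temp=high"},
    {"humidity=high", "temp=high", "wind=calm"},
    {"pressure=low", "temp=high", "wind=calm"},
    {"humidity=high", "pressure=low"},
    {"temp=high", "wind=calm"},
]
labels = ["rain", "rain", "dry", "dry", "rain", "dry"]

rules = mine_pair_rules(transactions, labels)
interaction_features = encode(transactions, rules)
print(interaction_features)  # [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]]
```

The resulting columns could be appended to the original feature matrix before training a base model such as Naive Bayes or Logistic Regression, which is one way to compare performance with and without the detected interactions.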