Abstract

Dimensionality reduction techniques streamline machine learning by reducing data complexity, improving model accuracy, and cutting computational costs. They remove noise and irrelevant features, making models faster and more efficient. These techniques also enhance data visualization and interpretation by condensing data into manageable, insightful dimensions. Ultimately, dimensionality reduction leads to simpler, more interpretable models without sacrificing critical information, making it a cornerstone of efficient data analysis and machine learning applications. In theory, feature extraction creates new features by combining multiple existing ones, concentrating more information into fewer, more informative dimensions. In contrast, feature selection chooses a subset of the original features without altering their content. In this paper, a feature selection method based on Principal Component Analysis (PCA) is proposed, together with a comparative study of PCA as a feature extraction technique against the newly proposed feature selection method. The proposed method uses the variance captured by the principal components to identify the original features that contribute most to that variance and selects them as the best feature subset.
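
The abstract does not give the implementation details, but a minimal sketch of one plausible reading of such a PCA-based feature selection, assuming scikit-learn's PCA, is shown below. The function and variable names (e.g. pca_feature_ranking, load_dataset) are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_feature_ranking(X, n_components=1):
    """Rank original features by their contribution to the variance
    captured by the leading principal components."""
    pca = PCA(n_components=n_components).fit(X)
    # components_ has shape (n_components, n_features); weight each
    # component's absolute loadings by the fraction of variance it explains,
    # then sum the contributions per original feature.
    scores = np.abs(pca.components_).T @ pca.explained_variance_ratio_
    return np.argsort(scores)[::-1]  # indices of highest-scoring features first


# Example: keep only the single best-scoring original feature.
# X, y = load_dataset()               # e.g. NSL-KDD or the mushroom dataset
# top = pca_feature_ranking(X, n_components=1)[:1]
# X_selected = X[:, top]
```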

Experimental results demonstrate that the proposed PCA-based feature selection achieves comparable or better performance across various classifiers, maintaining high accuracy and precision even with fewer features. For instance, when the proposed method was applied to the Network Security Laboratory - Knowledge Discovery in Databases (NSL-KDD) dataset to select only one feature, and six classifiers (Decision Tree, Naive Bayes, Logistic Regression, K-Neighbors Classifier, XGBoost, and AdaBoost) were employed to evaluate performance, accuracies of 80.88%, 81.29%, 43.07%, 44.53%, 84.94%, and 82.87% were obtained with the listed classifiers, respectively. When PCA was instead used for feature extraction, the corresponding accuracies, in the same classifier order, were 76.64%, 76.10%, 43.07%, 47.40%, 80.57%, and 82.05%, showing that the proposed method delivers higher accuracies. Similarly, for the mushroom dataset, the proposed method yielded accuracies of 51.38%, 51.38%, 48.62%, 51.06%, 87.08%, and 86.22%, compared to 50.14%, 50.30%, 50.88%, 50.59%, 73.60%, and 71.51% for PCA feature extraction.
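
A hedged sketch of how such a comparison could be run is given below: each classifier is trained on the reduced data and scored on held-out test accuracy. The train/test split, the Gaussian Naive Bayes variant, and the default hyperparameters are assumptions for illustration, not the paper's exact protocol.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

# The six classifiers named above, with default settings
# (the paper's exact hyperparameters are not specified here).
classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Neighbors Classifier": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}

# X_reduced is either the PCA-transformed data (feature extraction) or the
# subset of original columns chosen by the proposed method (feature selection).
# X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.3)
# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     print(name, accuracy_score(y_test, clf.predict(X_test)))
```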
