Pancreatic Cancer Classification Using Missing Data Imputation And Cluster-Based Undersampling Methods: A Comparative Analysis With Multiple Machine Learning Algorithms

Title: Pancreatic Cancer Classification Using Missing Data Imputation And Cluster-Based Undersampling Methods: A Comparative Analysis With Multiple Machine Learning Algorithms

Authors: Wanessa L. B. Sena, Renata F. P. Neves

Abstract: Missing values and class imbalance are frequently found in real-world databases, including those used for cancer classification. If these issues are not properly addressed before the analysis, the performance of Machine Learning (ML) models can suffer. This paper proposes a combined solution for pancreatic cancer classification: missing data imputation using kNN and cluster-based undersampling using k-means. Different data subsets were generated by combining different preprocessing methods, and performance was analyzed using an ML analysis pipeline from a previous study. This pipeline implements ten ML classifiers, including Random Forest (RF), Support Vector Machine (SVM) and Artificial Neural Network (ANN). All data subsets presented a significant improvement (p<0.05, Student's t-test) in the performance of most ML algorithms compared with the results obtained when the pipeline was first evaluated. The results suggest that kNN and k-means can be used in the data preprocessing phase to overcome missing values and class imbalance and to improve classification accuracy.
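The two preprocessing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the distance metric, the number of neighbours, the number of clusters, or how samples are chosen from each cluster. The sketch below assumes mean imputation over the k nearest rows, one cluster per minority sample, and keeping the majority sample closest to each centroid. All function names and the toy data are hypothetical.

```python
import math
import random


def knn_impute(rows, k=3):
    """Replace None entries with the mean of that feature over the k nearest rows."""
    def dist(a, b):
        # Root mean squared difference over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row that has feature j observed.
            donors = [(dist(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # left as None if no row has this feature observed
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns the final centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            c = min(range(n_clusters), key=lambda c: math.dist(p, centroids[c]))
            clusters[c].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_undersample(majority, n_keep, seed=0):
    """Keep one real majority sample per cluster: the one nearest each centroid."""
    centroids = kmeans(majority, n_keep, seed=seed)
    remaining = list(majority)
    kept = []
    for centroid in centroids:
        best = min(remaining, key=lambda p: math.dist(p, centroid))
        kept.append(best)
        remaining.remove(best)
    return kept


# Hypothetical toy data: label 0 is the majority class, label 1 the minority.
features = [[1.0, 2.0], [1.2, None], [0.9, 2.1],
            [5.0, 6.0], [5.1, 6.2], [4.9, 5.9],
            [9.0, 1.0], [9.2, 1.1]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

filled = knn_impute(features, k=2)
majority = [r for r, y in zip(filled, labels) if y == 0]
minority = [r for r, y in zip(filled, labels) if y == 1]
kept = cluster_undersample(majority, n_keep=len(minority))
balanced = kept + minority
```

In practice, library implementations such as scikit-learn's `KNNImputer` and `KMeans` would likely be preferable to hand-written versions; the standard-library code here is only meant to make the two steps explicit.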

Keywords: Machine Learning, k-means clustering, kNN, undersampling, missing data imputation, classification

Pages: 6

DOI: 10.21528/CBIC2023-142

Paper (PDF): CBIC_2023_paper142.pdf

BibTeX file: CBIC_2023_142.bib