This paper presents a feature selection methodology that can be applied to datasets containing a mixture of continuous and categorical variables. Using a Genetic Algorithm (GA), this method explores a dataset and selects a small set of features relevant for the prediction of a binary (1/0) response. Binary classification trees and an objective function based on conditional probabilities are used to measure the fitness of a given subset of features. The method is applied to health data in order to find factors useful for the prediction of diabetes. Results show that our algorithm is capable of narrowing down the set of predictors to around 8 factors that can be validated using reputable medical and public health resources.
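The search loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the conditional-probability objective, so the `fitness` function below substitutes a hypothetical stand-in (training accuracy of a shallow tree minus a per-feature penalty). The tree builder, the GA settings (`pop_size`, `generations`, `mutation`), and the synthetic `toy_data` are likewise assumptions chosen only to keep the example self-contained; categorical variables are assumed to be 0/1 coded.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def build_tree(rows, labels, features, depth):
    """Grow a small binary classification tree restricted to `features`."""
    majority = int(2 * sum(labels) >= len(labels))
    if depth == 0 or gini(labels) == 0.0 or not features:
        return majority  # leaf: predicted class
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:  # candidate thresholds
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # no feature has more than one distinct value
        return majority
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in li], [labels[i] for i in li], features, depth - 1),
            build_tree([rows[i] for i in ri], [labels[i] for i in ri], features, depth - 1))

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fitness(mask, rows, labels, penalty=0.01, depth=2):
    """Stand-in objective (assumption): tree accuracy minus a size penalty."""
    features = [i for i, bit in enumerate(mask) if bit]
    if not features:
        return 0.0
    tree = build_tree(rows, labels, features, depth)
    acc = sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)
    return acc - penalty * len(features)

def select_features(rows, labels, pop_size=20, generations=15, mutation=0.1, seed=0):
    """GA over bit masks: tournament selection, uniform crossover, bit-flip mutation."""
    rng = random.Random(seed)
    n = len(rows[0])
    cache = {}

    def score(mask):
        key = tuple(mask)
        if key not in cache:  # each distinct subset is evaluated once
            cache[key] = fitness(mask, rows, labels)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    pop[0] = [1] * n  # start with the full feature set as a baseline individual
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = [pop[0][:]]  # elitism: carry the best mask forward unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 3), key=score)
            b = max(rng.sample(pop, 3), key=score)
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [bit ^ (rng.random() < mutation) for bit in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=score)
    return [i for i, bit in enumerate(best) if bit]

def toy_data(n_rows=40, n_features=6, seed=1):
    """Synthetic data: column 0 determines the label; the rest are noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n_rows):
        x0 = i / (n_rows - 1)                                   # continuous, informative
        noise = [rng.random() for _ in range(n_features - 2)]   # continuous, uninformative
        flag = rng.randint(0, 1)                                # 0/1-coded categorical, uninformative
        rows.append([x0] + noise + [flag])
        labels.append(int(x0 > 0.5))
    return rows, labels

if __name__ == "__main__":
    rows, labels = toy_data()
    print("selected feature indices:", select_features(rows, labels))
```

Each individual is a bit mask over the candidate features, so the size penalty in the stand-in objective is what pushes the search toward small subsets. On real survey data the fitness would more plausibly be estimated on held-out records rather than on the training rows used here.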
Revised: February 4, 2016
Published: September 1, 2013
Citation
Heredia-Langner A., K.H. Jarman, B.G. Amidan, and J.G. Pounds. 2013. Genetic Algorithms and Classification Trees in Feature Discovery: Diabetes and the NHANES database. In Proceedings of the 2013 World Congress in Computer Science, Computer Engineering, and Applied Computing (WORLDCOMP'13), July 22-25, 2013, Las Vegas, Nevada. Athens, Georgia: CSREA Press. PNNL-SA-94471.