{"id":7509,"date":"2017-11-16T11:22:31","date_gmt":"2017-11-16T10:22:31","guid":{"rendered":"http:\/\/www.ie.edu\/exponential-learning\/blog\/?p=7509"},"modified":"2020-12-02T12:17:01","modified_gmt":"2020-12-02T11:17:01","slug":"evolutionary-feature-selection-machine-learning","status":"publish","type":"post","link":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/data-science\/evolutionary-feature-selection-machine-learning\/","title":{"rendered":"Evolutionary Feature Selection in Machine Learning"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">By <\/span><a href=\"https:\/\/www.linkedin.com\/in\/pablo-villacorta-iglesias\/\" target=\"_blank\" rel=\"noopener\"><b>Pablo Villacorta Iglesias<\/b><\/a><\/p>\n<p><span style=\"font-weight: 400;\">When we want to fit a <strong>Machine Learning (ML) model to a big dataset<\/strong>, it is often recommended to pre-process the input data to<strong> obtain better results carefully<\/strong>. Although the fact that more data leads to better outcomes has a wide acceptance, this is <strong>not necessarily true when referred to the number of variables in our data<\/strong>. <strong>Some variables may be noisy, redundant, and not useful.<\/strong> <\/span><\/p>\n<p><span style=\"font-weight: 400;\">The <strong>process of removing useless variables<\/strong> is known as <strong>Feature Selection (FS)<\/strong> and is a ubiquitous task that <strong>precedes the application of any data analysis technique<\/strong>. 
More formally, it is the process of selecting a subset of the features of a dataset that is optimal according to some criterion, in order to <\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Improve the model obtained by an ML algorithm<\/strong>, because it <strong>reduces overfitting<\/strong> (fewer features lead to a simpler model, which is less prone to overfitting and easier to understand and interpret) and also <strong>reduces the need for many examples when there are many variables<\/strong> (alleviating the curse of dimensionality)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Reduce the size of the dataset<\/strong>, which can enable us to run a non-distributed algorithm on our data if the algorithm is available off-the-shelf in a non-parallel implementation, for instance as an R or Python package. Fewer variables also reduce the time needed to train the algorithm<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">There are roughly three kinds of FS methods: <\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Filter methods<\/strong> are <strong>based on data-related measures and are independent of the algorithm we want to run later on the reduced dataset<\/strong>. They detect useless features according to mathematical or statistical criteria, such as the amount of additional information each feature provides. These are the <strong>fastest methods<\/strong>.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Embedded methods<\/strong>, where we <strong>select features at the same time we fit our ML model<\/strong>. The output of this process is <strong>usually the model together with the best feature subset<\/strong> (for that specific model). 
Regularization techniques are a clear example.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Wrapper methods perform a search in the space of feature combinations, guided by some ML model used to assess candidate subsets of features and make them comparable<\/strong>. Such a model can be different from the one we eventually want to fit to the reduced data after FS ends. They are the <strong>most time-consuming because many candidate subsets are evaluated via an ML algorithm<\/strong> (\u201cmany\u201d does not mean \u201cevery possible feature combination&#8221;, of course). Wrapper methods commonly require an intelligent search (heuristic optimization) procedure to find an optimal feature combination.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Focusing on the wrapper methods, we can imagine candidate solutions as vectors like (0,0,1,1,0), meaning that only features 3 and 4 of our data have been considered relevant and the rest of the columns are discarded. When the number of variables (i.e., the length of the binary vector, N) is high, the number of possible combinations is 2^N, which makes an exhaustive search computationally unfeasible. <strong>Metaheuristics constitute a broad class of search algorithms that can deal with very large search spaces in optimization problems.<\/strong> They explore the solution space to <strong>find good solutions within a reasonable time<\/strong>. Although finding the global optimum is not guaranteed, they work well in practice.<\/span><\/p>\n<h4>The role of Genetic Algorithms and CHC in Feature Selection<\/h4>\n<p><span style=\"font-weight: 400;\">A popular class of metaheuristics, famous for their excellent performance while being easy to implement, are <strong>population-based metaheuristics<\/strong>. 
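Before looking at how a population of such binary vectors evolves, the encoding itself can be sketched in a few lines of Python (using NumPy, with made-up toy data):

```python
import numpy as np

# toy dataset: 6 examples, 5 features (made-up numbers)
X = np.arange(30).reshape(6, 5)

# candidate solution (0,0,1,1,0): keep only the 3rd and 4th features
mask = np.array([0, 0, 1, 1, 0], dtype=bool)

X_reduced = X[:, mask]       # dataset restricted to the selected columns
n_subsets = 2 ** mask.size   # size of the search space: 2^N = 32 here
```

For N in the hundreds, 2^N rules out exhaustive search, which is why a metaheuristic is needed.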
They <strong>build a group of candidate solutions<\/strong> (called chromosomes, i.e., binary vectors like in the previous example) <strong>that evolve together as in natural evolution<\/strong> (as described in Charles Darwin\u2019s <\/span><i><span style=\"font-weight: 400;\">On the Origin of Species<\/span><\/i><span style=\"font-weight: 400;\">, published in 1859). The most well-known subclass of population-based metaheuristics is <strong>Genetic Algorithms<\/strong>, introduced by John Holland in 1975.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">At the initial generation, a set of M (typically between 50 and 200) candidate solutions (in our case, M binary vectors of length N) is generated randomly (the generation mechanism may vary). From them, and using some of the strategies we will comment on later, <strong>M solutions are selected<\/strong> (with replacement, hence some may appear more than once)<strong> to form the parents&#8217; population<\/strong>. They <strong>are mated in random pairs, and each pair will cross with a particular probability Pc<\/strong> (which is usually quite high, between 0.7 and 0.9) <strong>to generate two offspring<\/strong> (when a pair does not cross, the offspring are simply the parents themselves). Then, <strong>each of the M resulting offspring may undergo a spontaneous mutation with a probability Pm<\/strong> (which is usually low, between 0.1 and 0.3). The <strong>resulting individuals are then evaluated according to a specific fitness function<\/strong>, which in our Feature Selection problem is the performance of a k-NN classifier on the input data with the selected feature subset, as will be explained in the next section. Finally, <strong>the new population replaces the old one completely (this is the case for a Generational GA), and the loop repeats<\/strong>. The stopping criterion is usually a maximum number of calls to the fitness function. 
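The generational loop just described can be sketched as follows. This is an illustrative simplification rather than the exact algorithm from the text: binary tournament selection, one-point crossover and single-bit mutation are assumptions of mine, and `fitness` stands for whatever function scores a binary vector:

```python
import random

def genetic_algorithm(fitness, n_features, pop_size=50,
                      pc=0.8, pm=0.2, max_evals=5000):
    """Generational GA with elitism over binary feature-selection vectors."""
    rng = random.Random(42)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    scores = [fitness(c) for c in pop]
    evals = pop_size

    while evals < max_evals:
        def select():  # binary tournament, with replacement
            i, j = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[i] if scores[i] >= scores[j] else pop[j]

        offspring = []
        for _ in range(pop_size // 2):
            p1, p2 = select()[:], select()[:]
            if rng.random() < pc:               # crossover with probability Pc
                cut = rng.randrange(1, n_features)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            offspring += [p1, p2]

        for child in offspring:                 # mutation with probability Pm
            if rng.random() < pm:
                pos = rng.randrange(n_features)
                child[pos] = 1 - child[pos]

        best = max(range(pop_size), key=scores.__getitem__)
        elite, elite_score = pop[best][:], scores[best]

        # generational replacement: offspring replace the old population
        pop, scores = offspring, [fitness(c) for c in offspring]
        evals += pop_size

        worst = min(range(pop_size), key=scores.__getitem__)
        pop[worst], scores[worst] = elite, elite_score  # elitism

    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

With a trivial fitness such as the number of ones in the chromosome, the loop quickly converges towards the all-ones vector, which is a handy sanity check before plugging in a real, classifier-based fitness.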
<strong>Most often, the algorithm incorporates elitism<\/strong>, meaning that <strong>the best solution from the old population is always kept and replaces the worst solution from the newly generated population<\/strong>.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Many components, such as the crossover and mutation operators, are problem-specific and have to be carefully designed to ensure they do not produce infeasible solutions in our problem domain. Moreover, the mechanism to assess the individuals is also problem-specific. In our case, <strong>to measure how good a feature subset is, we employ a classifier trained and evaluated only with the features selected by the solution being assessed<\/strong> (i.e., those features having a 1 in the solution). Note the classifier will be trained many times, once per chromosome evaluated, so<strong> ideally it should be a fast classifier just able to estimate the fitness of candidate solutions<\/strong>. K-fold cross-validation has to be used here to obtain a reasonable estimate of the performance of the classifier on a solution, and the exact same partitions must be used when evaluating every solution. Therefore, the dataset must be partitioned just once, at the beginning of the algorithm, to determine the folds.<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><strong>CHC<\/strong> is a particular type of Genetic Algorithm introduced in a paper first presented at the First Workshop on Foundations of Genetic Algorithms in 1990 (see [1]). It is a <strong>binary-coded genetic algorithm with high selective pressure and an elitist selection strategy, with additional components that introduce diversity<\/strong>. 
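The evaluation scheme described above (a fast classifier scored on partitions that are fixed once, before the search starts) could look like the sketch below. The helper name `make_fitness` is mine, and for self-containment a plain 1-NN classifier and accuracy are used instead of a library implementation:

```python
import numpy as np

def make_fitness(X, y, n_folds=5, seed=0):
    """Return a fitness function that scores a binary feature mask.

    The folds are drawn once from a seeded permutation, so every
    candidate solution is evaluated on exactly the same partitions.
    """
    idx = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)

    def fitness(mask):
        cols = np.flatnonzero(mask)
        if cols.size == 0:                    # empty subset: useless
            return 0.0
        hits = 0
        for f, test in enumerate(folds):
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            Xtr, Xte = X[np.ix_(train, cols)], X[np.ix_(test, cols)]
            # 1-NN: the closest training example decides the class
            d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
            hits += int((y[train][d.argmin(axis=1)] == y[test]).sum())
        return hits / len(X)                  # accuracy in [0, 1]

    return fitness
```

Building the folds inside the closure, before `fitness` is ever called, is what guarantees that two chromosomes are always compared on identical partitions.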
The inventor of CHC, Larry Eshelman, defines it as a Genetic Algorithm with some essential modifications:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Half Uniform Crossover (HUX): <\/strong>randomly select half of the genes in which the two parents differ, and exchange them. This maximizes the Hamming distance between the offspring and their parents (see the paper for the theoretical motivation).<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Elitist population replacement<\/strong>: join the current and the offspring populations together, and then select the M best individuals to form the new population.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Incest prevention:<\/strong> M\/2 pairs are formed with the elements of the population. In each pair, even if the individuals should cross according to the crossover probability, they are only allowed to mate if their Hamming distance exceeds a threshold d (usually initialized to d = L\/2, where L is the chromosome length); otherwise they are considered too similar. The threshold is decremented by 1 when no offspring is obtained in one generation, which indicates that the algorithm is converging.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><strong>Restarting process:<\/strong> when d = 0 (i.e., after several generations without any new offspring), the population is considered stagnated, and we generate a new population: the best individual remains, and the rest are strongly mutated (using the existing individuals as templates for creating the new ones).<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">A proposal for the fitness function classifier is to use the performance of a k-NN classifier on the dataset with the selected features, with Leave-One-Out (LOO) used to classify each example. 
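The HUX crossover and the incest-prevention check from the list above can be sketched as follows (a simplified reading of the paper, in which a threshold `d` is compared against the raw Hamming distance; the function names are mine):

```python
import random

def hux(p1, p2, rng):
    """Half Uniform Crossover: swap a random half of the differing genes."""
    diff = [i for i in range(len(p1)) if p1[i] != p2[i]]
    rng.shuffle(diff)
    c1, c2 = p1[:], p2[:]
    for i in diff[:len(diff) // 2]:
        c1[i], c2[i] = p2[i], p1[i]
    return c1, c2

def mate(p1, p2, d, rng):
    """Return two offspring, or None if the parents are too similar."""
    hamming = sum(a != b for a, b in zip(p1, p2))
    if hamming <= d:        # incest prevention: distance must exceed d
        return None
    return hux(p1, p2, rng)
```

Note that each offspring ends up at the same Hamming distance from both parents, which is what keeps diversity high between restarts.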
Since k-NN is a lazy learner, training does not actually occur &#8211; the prediction phase is the costly step. Furthermore, since the datasets given as input may be imbalanced, it is advisable to use the Area Under the ROC Curve (AUC) instead of accuracy. Finally, because the aim is to select features, the fitness function should be a weighted sum of the AUC and the feature reduction rate:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Fitness = 0.5 &times; AUC + 0.5 &times; Reduction rate = 0.5 &times; AUC + 0.5 &times; (1 &#8211; #F<sub>selected<\/sub>\/#F<sub>total<\/sub>)<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The approach discussed here has exhibited good results in research studies and is not usually available off-the-shelf in most statistical packages \u2013 so why not try implementing it yourself? As always, <strong>what matters most is what you\u2019ll learn during the process<\/strong>, not the result.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">[1] Eshelman, L. J.: The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination. In: G. J. E. Rawlins (Ed.): Foundations of Genetic Algorithms, vol. I, pp. 265&#8211;283. Morgan Kaufmann Publishers Inc., California (1991)<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">_______________________________________________________________________<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This post is a roundup of articles originally published on the <\/span><a href=\"http:\/\/www.stratio.com\/blog\/\" target=\"_blank\" rel=\"noopener\"><b>Stratio<\/b><\/a> <span style=\"font-weight: 400;\">blog. 
You can read the originals <\/span><a href=\"http:\/\/www.stratio.com\/blog\/feature-selection-big-datasets-1\/\" target=\"_blank\" rel=\"noopener\"><b>here<\/b><\/a> <span style=\"font-weight: 400;\">and <\/span><a href=\"http:\/\/www.stratio.com\/blog\/evolutionary-feature-selection-in-big-datasets-part-2\/\" target=\"_blank\" rel=\"noopener\"><b>here<\/b><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h5>Ready to join one of the decade\u2019s hottest professions?<\/h5>\n<p><span style=\"font-weight: 400;\">If you\u2019re eager for more information about the <\/span><a href=\"http:\/\/bootcamp-hst-datascience.ie.edu\/\" target=\"_blank\" rel=\"noopener\"><b>Data Science Bootcamp<\/b><\/a><span style=\"font-weight: 400;\">, get your own copy of our information booklet <\/span><a href=\"http:\/\/landings.ie.edu\/bootcampland-data-science-3?gclid=CK_zyOb37NQCFcMYGwod51wDNw\" target=\"_blank\" rel=\"noopener\"><b>here<\/b><\/a><b>. <\/b><span style=\"font-weight: 400;\">If you\u2019re all set on joining our next edition, <\/span><a href=\"https:\/\/secure.ie.edu\/exedu-app\/?program=IEXL-ENG-DSBC\" target=\"_blank\" rel=\"noopener\"><b>get started on your application<\/b><\/a><span style=\"font-weight: 400;\">. 
<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Pablo Villacorta Iglesias When we want to fit a Machine Learning (ML) model to a big dataset, it is often recommended to pre-process the input<span class=\"excerpt-hellip\"> [\u2026]<\/span><\/p>\n","protected":false},"author":2,"featured_media":7556,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[384],"tags":[189,88,191,109,44,193,194,192],"_links":{"self":[{"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/posts\/7509"}],"collection":[{"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/comments?post=7509"}],"version-history":[{"count":1,"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/posts\/7509\/revisions"}],"predecessor-version":[{"id":7510,"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/posts\/7509\/revisions\/7510"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/media\/7556"}],"wp:attachment":[{"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/media?parent=7509"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/categories?post=7509"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ie.edu\/lifelong-learning\/blog\/wp-json\/wp\/v2\/tags?post=7509"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}