Evolutionary Feature Selection in Machine Learning

16 November 2017

Tags

Evolutionary Feature Selection in Machine Learning

When we want to fit a Machine Learning (ML) model to a big dataset, it is often recommended to pre-process the input data to obtain better results carefully. Although the fact that more data leads to better outcomes has a wide acceptance, this is not necessarily true when referred to the number of variables in our data. Some variables may be noisy, redundant, and not useful.

The process of removing useless variables is known as Feature Selection (FS) and is a ubiquitous task that precedes the application of any data analysis technique. More formally, it is the process of selecting a subset of the features of a dataset that is optimal according to some criterion, to

Improve the model obtained by an ML algorithm, because it reduces overfitting (fewer features lead to a more straightforward model which is less prone to overfitting and easier to understand and interpret), and also reduces the need of many examples when there are many variables (alleviating the curse of dimensionality)
Reduce the size of the dataset, which can enable us to run a non-distributed algorithm on our data if the algorithm is available off-the-shelf in a non-parallel implementation, for instance as an R or Python package. Fewer variables also reduce the amount of time needed for training the algorithm

There are roughly three kinds of FS methods:

Filter methods are based on data-related measures and are independent of the algorithm we want to run later on the reduced dataset. Mainly they detect useless features according to mathematical or statistical criteria or concerning metrics like the amount of additional information each element provides. These are the fastest methods.
Embedded methods, where we select features at the same time we fit our ML model. The output of this process is usually the model together with the best feature subset (for that specific model). Regularization techniques are a clear example.
Wrapper methods perform a search in the space of feature combinations, guided by some ML model to assess candidate subsets of features and make them comparable. Such models can be different from the model we eventually want to fit our reduced data after FS ends. They are the most time-consuming because many candidate subsets evaluated via an ML algorithm (“many” does not mean “every possible feature combination,” of course). Wrapper methods commonly require an intelligent search (heuristic optimization) procedure to find an optimal feature combination.

Focusing on the wrapper methods, we can imagine candidate solutions as vectors like (0,0,1,1,0) meaning that only features 3 and 4 of our data have been considered relevant and the rest of columns are discarded. When the number of variables (i.e., the length of the binary vector, N) is high, the number of possible combinations is 2^N which makes an exhaustive search computationally unfeasible. Metaheuristics constitute a broad class of search algorithms that can deal with substantial search spaces in optimization problems. They explore the solution space to find reasonable solutions within a reasonable time. Although finding the global optimum is not guaranteed, they work well in practice.

The role of Genetic Algorithms and CHC in Feature Selection

A favorite class of metaheuristics that are famous for their excellent performance while being easy to implement are population-based metaheuristics. They build a group of candidate solutions (called chromosomes, i.e., binary vectors like in the previous example) that evolve together as in natural evolution (as proposed by Charles Darwin’s The origin of species in 1859). The most well-known subclass of population-based metaheuristics is Genetic Algorithms, developed in 1975.

At the initial generation, a set of M (typically between 50 and 200) candidate solutions (in our case, M binary vectors of length N) get generated randomly (the generation mechanism may vary). From them, and using some of the strategies we will comment later, M solutions are selected (with replacement, hence some may appear more than once) to form the parent’s population. They get mated in random pairs, and each pair will cross with a particular probability Pc (which usually is quite high, between 0.7 and 0.9) to generate two offspring (when the parents do not intersect, the offspring is assumed to be the parents themselves). Then, each of the M resulting offspring may suffer a spontaneous mutation with a probability Pm (which is usually low, between 0.1 and 0.3). The resulting individuals then get evaluated according to a specific fitness function, which in our Feature Selection problem is the performance of a k-NN classifier on the input data with the selected feature subset, as will be explained in the next section. Finally, the new population replaces the old one completely (this is the case for a Generational GA), and the loop repeats. The stopping criterion consists in calling the fitness function a maximum number of times. Most often, the algorithm incorporates elitism, meaning that the best solution from the old population is always kept and replaces the worst solution from the newly generated population.

Many components are problem-specific and have to be carefully thought out for each problem, such as the crossover and mutation operators to ensure they do not produce unfeasible solutions in our problem domain. Moreover, the mechanism to assess the individuals is also problem-specific. In our case, to measure how good a subset of a feature is, we employ a classifier trained and evaluated only with the subset of elements selected according to the solution assessed at the time (i.e., those features having a 1 in the solution). Note the classifier will be trained many times, one per chromosome being evaluated, so ideally it should be a fast classifier just able to estimate the fitness of candidate solutions. K-fold cross-validation has to be used here to have a reasonable estimation of the performance of the classifier on a solution, and one must use the exact same partitions when evaluating any solution. Therefore the dataset must be partitioned just once to determine the folds at the beginning of the algorithm.

CHC is a particular type of Genetic Algorithm introduced in a paper first presented at the First Workshop on Foundations of Genetic Algorithms in 1990 (see [1]). It is a binary-coding genetic algorithm with a high selective pressure and elitism selection strategy, with additional components that introduce diversity. The inventor of CHC, Larry Eshelman, defines it as a Genetic Algorithm with some essential modifications:

Half Uniform Crossover (HUX): randomly select half of the genes that are different from the parents, and exchange them. This maximizes the Hamming distance between the offspring and their parents (see the paper for the theoretical motivation).
Elitist population replacement: join the current and the offspring populations together, and then select the M best individuals to form the new population.
Incest prevention: M/2 pairs are formed with the elements of the population. In each pair, if the individuals should cross according to the crossover probability, they are not allowed to finally mate if their Hamming distance exceeds a threshold (usually initialized to = /2, where the chromosome length is), meaning they are too similar. The threshold is decremented by 1 when no offspring is obtained in one generation, which indicates that the algorithm is converging.
Restarting process: when = 0 (several generations without any new offspring), the population is considered stagnated, and we generate a new population: the best individual remains, and the rest are strongly mutated (using the existing individuals as templates for creating these new individuals).

A proposal for the fitness function classifier is to use the performance of a k-NN classifier on the dataset with the selected features and Leave-One-Out (LVO) to classify each example. Since it is a lazy learner, training does not actually occur – the prediction phase is the costly step. Furthermore, since the datasets given as input may be imbalanced, it is advisable to use the Area Under the ROC curve (AUC) instead of accuracy. Finally, because the aim is to select features, the fitness function should be a weighted sum of the AUC and the feature reduction rate:

Fitness = 0.5 x AUC + 0.5 x Reduction rate = 0.5 x AUC + 0.5 x (1 – (#Fselected/#Ftotal))

The approach discussed here has exhibited good results in research studies and is not usually readily available in most statistical packages – so why not go and try implementing it yourself? As always, the importance resides in what you’ll learn during the process, not the results.

[1] The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination. In: G. J. E. Rawlins (Ed.): Foundations of Genetic Algorithms, vol I, pp. 265 – 283. California, Morgan Kauffman Publishers Inc. (1991)

_______________________________________________________________________

This post is a roundup of articles originally published on the Stratio blog. You can read the originals here and here.

Ready to become one of the decade’s hottest professions?

If you’re eager for more information about the Data Science Bootcamp, get your own copy of our information booklet here. If you’re all set on joining our next edition, get started on your application.