Feature Selection Based on the Shapley Value Shay Cohen and Eytan Ruppin School of Computer Sciences Tel-Aviv University, Tel-Aviv 69978, Israel cshay,ruppin @post.tau.ac.il ¡ Gideon Dror Department of Computer Science Academic College of Tel-Aviv Yaffo, Tel-Aviv, 64044, Israel gideon@mta.ac.il Abstract containing i.i.d. sampled instances of of the form (#'¤¢ © §¥ £ 20 ) 1 61 5 are available: Train, Validation and Test represent- 4¥ 3 We present and study the Contribution-Selection al- ing the training set, validation set and test set respectively. gorithm (CSA), a novel algorithm for feature se- A) @ 5 Given an induction algorithm and a set , 97 8 #!"! ¥ B ¥ GE6C © F¢ D lection. The algorithm is based on the Multi- stands for a classifier constructed from the training set us- perturbation Shapley Analysis, a framework which ing the induction algorithm, after its input variables were nar- relies on game theory to estimate usefulness. The labels each in- rowed down to the ones in S, namely GH26C © F¢ D algorithm iteratively estimates the usefulness of hfec@ b bd g g stance of the form , , with QVU ¤SF "! RQG¢ ©T TI ¥ PI F a`¤W YX 7 7 features and selects them accordingly, using either a value in the domain of . The task of feature selection is § forward selection or backward elimination. Empir- to choose a subset of the input variables, that would maxi- 7 ical comparison with several other existing feature mize the performance of the classifier on the test set. In what selection methods shows that the backward elimi- follows we shall focus on optimizing accuracy of the classi- nation variant of CSA leads to the most accurate fier, although we could as easily optimize other performance classification results on an array of datasets. measure such as the area under the ROC curve, balanced error rate etc. The rest of this paper is organized as follows: Section 2 in- 1 Introduction troduces the necessary background from game theory with a Feature selection refers to the problem of selecting input vari- detailed description of the CSA algorithm; Section 3 provides ables, otherwise called features, that are relevant to predicting an empirical comparison of CSA with several other feature a target value for each instance in a dataset. Feature selection selection methods, accompanied by an analysis of the results; has several potential benefits: defying the curse of dimension- Section 4 discusses the empirical results and provides further ality to enhance the prediction performance, reducing mea- insights to the success of the backward elimination version of surement and storage requirements and reducing training and the CSA algorithm. prediction times. This paper focuses on the first issue, namely selecting input variables in an attempt to maximize the perfor- 2 Classification as a Coalitional Game mance of a classifier on previously unseen data. Cooperative game theory introduces the concept of "coali- In this paper, we suggest to recast the problem of feature tional games", in which a set of players is associated with a selection in the context of coalitional games, a notion from payoff, a real function that denotes the benefit achieved by game theory. This perspective yields an iterative algorithm different sub-coalitions in a game. Formally, a coalitional for feature selection, the Contribution-Selection algorithm A) @ 5 game is defined by a pair where is the (CSA), intent on optimizing the performance of the classifier r#pH¢ © q¥ i si #ut t¥ B¥ set of all players and , for every , is a real number on unseen data. The algorithm combines both the filter and wvGq © 7¢ ax7 8 i associating a worth with the coalition . Game theory further wrapper approaches. However, unlike filter methods, features 7 pursues the question of representing the contribution of each are reranked on each step by using the classifier as a black box. The ranking is based on the Shapley value [Shapley, player to the game by constructing a value function, which 1953], a well known concept from game theory, to estimate assigns a real-value to each player. The values correspond to the contribution of the players in achieving a high payoff. the importance of each feature for the task at hand, specifi- cally taking into account interactions between features. The contribution value calculation is based on the Shapley value [Shapley, 1953]. An intuitive example of the potential Throughout the paper we use the following notations. The distribution from which the dataset instances are drawn is rep- use of the Shapley value can be provided in an academic set- resented by two variables , where ting. Assume that you are a Professor running a lab, and, once ¨¦¤¢ © §¥ £ &%#!"! £ © $ ¥ ¥ ¢ represents the input variables (vector of features), and and for all, you have decided to distribute the yearly bonus to § your students in a fair manner, that reflects the actual con- represents a discrete target value (class) for . Three sets £ tribution of each student to the academic success of the lab. case, the Shapley value of a feature, that measures its con- During the year, the students form spontaneous "coalitions" tribution to the combined performance measure, is just the sum of the corresponding Shapley values. The linearity of of groups of students, each such group works and publishes the Shapley value is a consequence of this property. Namely, a paper summarizing its work (these coalitions may also be if the payoff function is multiplied by a real number then assembled by the Professor). Every paper gets a rank, (e.g., § 8 @ q 9§ all Shapley values are scaled by namely . its impact factor), composing its "payoff function". Based on 8 8 8 rE ¢ I rH¢ I ©q ©q In other words, multiplying the performance measure by a this annual data of the students' coalitions and their associ- ated payoffs, the Shapley value provides a fair and efficient constant does not change the ranking of the features, a vital property for any scheme that ranks features by their 'impor- way to distribute the bonus to each individual student accord- tance'. ing to his contribution over the year. The Shapley value is defined as follows. Let the marginal 2.1 Estimating Features Contribution Using the importance of player to a coalition , with , be ¡W W 7 'Y 7 MSA ¢ ) 5 (1) ¤vGq £ 7¢ ¦© ¥ ww¤¢ I ©7 W wvGq © 7¢ The calculation of the Shapley value requires summing over all possible subsets of players, which is impractical in our Then, the Shapley value is defined by the payoff case. [Keinan et al., 2004] have presented an unbiased esti- @ ¢ § mator for the Shapley value by uniformly sampling permu- (2) rH#I © q¢ 42 #I ¤¨I ©© ¢ 7¢ tations from . Still, the estimator considers both large and ©B ¨ small features sets to calculate the contribution values. In our feature selection algorithm, we use the Shapley value heuris- where is the set of permutations over , and is the 2¢I 7 i © tically to estimate the contribution value of a feature for the set of players appearing before the th player in permutation W task of feature selection. Since in most realistic cases we as- . The Shapley value of a player is a weighted mean of its sume that the size of significant interactions between fea- marginal value, averaged over all possible subsets of players. A tures is much smaller than the number of features, , we will Transforming these game theory concepts into the arena of B limit ourselves to calculating the contribution value from per- feature selection, in which one attempts at estimating the con- mutations sampled from the whole set of players, with be- tribution of each feature in generating a classifier, the players A ing a bound on the permutation size. Notice that most fil- are mapped to the features of a dataset and the payoff is i @ ter methods are equivalent to using where no inter- , which measures represented by a real-valued function A ¤¢ q 7 © actions are taken into account. Feature selection using Ran- the performance of a classifier generated using the set of fea- dom Forests [Breiman, 2001] is equivalent to . The tures . Finally, the usage of the Shapley value for feature ECA DB B 7 bounded estimated contribution value becomes selection may be justified by its axiomatic qualities: @ ¢ F Axiom 1 (Normalization or Pareto optimality) For any game § Gg g © ¨I q¢ #2 ¨I ¤¨I 7¢ ©© ¢ it holds that H3# ¡ w© ¢ I © 4'H¢ q¥ i q ¦H¢ q ©i ! I where is the set of sampled permutations on subsets of G In the context of feature selection, this axiom implies that size . The usage of bounded sets coupled with the method A the performance on the dataset is divided fully between the for the Shapley value estimation, yields an efficient and ro- different features. bust way to estimate the contribution of a feature to the task Axiom 2 (Permutation invariance or symmetry) For any of classification. For a detailed discussion of the MSA frame- § and permutation on it holds that work and its theoretical background see [Keinan et al., 2004]. © ¢ I © 4'H¢ q¥ i i q ¢ %$I # § " ©q 2.2 The Contribution-Selection Algorithm This axiom implies that the value is not altered by arbitrarily The Contribution-Selection algorithm (CSA), described in renaming or reordering the features. detail in Figure 1, is iterative in nature, and can either adopt Axiom 3 (Preservation of carrier or dummy-property) For a forward selection or backward elimination approach. Us- ) 5 any game such that for every &¤¢ £7 ing the subroutine contribution, it ranks each feature accord- r#pH¢ © q¥ i q W w© w¤¢ q ©7 97 8 (wrH¢ I § it holds that ' ©q ing to its contribution value, and then selects features with I i the highest contribution values with forward selection (using This axiom implies that a dummy feature that does not influ- the sub-routine selection)1 , or eliminates features with the P ence the classifier's performance indeed receives a contribu- lowest contribution values with backward elimination (using tion value 0. elimination). It repeats the phases of calculating the contri- Axiom 4 (Additivity or aggregation) For any two games bution values of the remaining features given those already 2¡#I § § ¨I § 3© and it holds that selected (or eliminated), and selecting (or eliminating) new 0'H¢ )¥ i ) 1 q¢ )¢ 1 © 4'H¢ q¥ i © © 4I q¢ © where features, until the contribution values of all candidate features 54¢ )1q 76w¤¢ ) 1©7 w¤t© © 7¢ a q wv © 7¢ This axiom applies to a combination of two different payoffs 1 Alternatively, the selection sub-routine can use a forward selec- based on the same set of features. For a classification task tion technique instead of ; features are added in ascending order Q these may be, for example, accuracy and area under the ROC of their contribution values, as long as the classifier's performance curve or false positive rate and false negative rate. In such improves. 2 I¥ A¥ ¢ Contribution-Selection-Algorithm( ; ) Name Classes Features Train Size Test Size Reuters1 3 1579 145 145 1. := A P ¥¢9¢P I ¤ £ P¡ ¦ Reuters2 3 1587 164 164 2. for each A P ¥¢9¢P ©§'C ¤ £ P¡ I ¨ Y Arrhythmia 2 278 280 140 Internet Ads 2 1558 2200 800 2.1. := contribution( , ;) A A P ¥¢9¢P I ¤ £ P¡ C ¢ Dexter 2 20000 300 300 3. if Arcene 2 10000 100 100 ¢I ) 5 3.1. := selection( ;, ) A P ¥¢9¢P I ¤ £ P¡ ¡A P ¥¢9¢P I £ ¤ £ P¡ I2000 2 2000 40 22 3.2. goto 2 Table 1: Description of datasets used. else 3.3. return selected higher is, the more likely that features with redundant con- I @ tribution will be selected. Although minimizes the I Figure 1: The Contribution-Selection algorithm in its forward se- redundancy dependencies of the features, increasing accel- I lection version. is the input set of features, is a contribution ! erates the algorithm's convergence. The algorithm's halting value threshold, is the maximal permutation size for calculating ¢ " criterion depends on , which designates a trade-off between the contribution values, is the number of features selected in each Q the number of selected features, and the performance of the phase. The contribution routine calculates the contribution value of feature with the pay off function described in this section. The classifier on the validation set. With the forward selection # ¢ selection routine selects at most features with highest contribu- Q means that CSA selects features as version, choosing ' tion values that exceed . In the backward elimination version, long as there exists a feature that is likely to improve the clas- ! ¢ the selection sub-routine is replaced with an elimination sub-routine sifier's performance, and selects smaller sets of features as ¢ which eliminates features in each phase and the halting criterion is $ is increased. Increasing has the opposite effect on the size changed accordingly. of the final set of features. The more intuitive halting crite- rion, to stop when no further performance gain is achieved, is ¢ too restrictive, while CSA's halting criterion enables the se- exceed a contribution threshold with forward selection (or ¢ lection of features proved useful at later stages, as verified fall below a contribution threshold with backward elimina- empirically over several datasets. tion). The algorithm, without further specification of the contri- 3 Results bution sub-routine, is a generalization of filter methods. How- ever, the main idea of the algorithm is that the contribution 3.1 The Data and Benchmark Algorithms sub-routine, unlike common filter methods, returns a contri- To test CSA empirically we ran a number of experiments bution value for each feature according to its assistance in im- on seven real-world datasets with number of features rang- proving the classifier's performance, which is generated using ing from 278 to 20,000 (Table 1): the Reuters1 dataset and a specific induction algorithm, and in conjunction with other the Reuters2 dataset both constructed following [Koller and features. Using the notation in Section 2 and assuming one Sahami, 1996] using the Reuters-21578 document collection; optimizes the accuracy level of the classifier, the contribution the Arrhythmia database from the UCI repository [Perkins sub-routine for forward selection calculates the contribution et al., 2003] ; the Internet Advertisements database from the values using the following payoff function : ¤¢ q 7 © UCI repository [Blake and Merz, 1998] which was collected 1. := A P ¥¢9¢P I £ ¤ £ P¡ for the research of identifying advertisements in web pages, 7 7 the Dexter text categorization dataset and the Arcene cancer 2. Generate a classifier from the training set, Train © ¢ D C F dataset, both from the NIPS 2003 workshop on feature selec- for all examples of the validation set, 3. Evaluate GH¢ D C ©F tion [Guyon, 2003] and the I2000 microarray colon cancer Validation dataset [Alon et al., 1999]. 4. Return the accuracy level, defined as In principle, CSA can work with any induction algorithm wvGq © 7¢ )(& % %' G I 9%8 3) ¥$@ 7 2 4 ' ¥"8 E3%62 1$ ' " % D#BI 97 76 C $A @ H 50 4 U . However, due to computational constraints we focused GI 7 #FI $A on fast induction algorithms or algorithms that may be ef- The case is an end case which is handled by return- ficiently combined into CSA. We experimented with Naive Gf7 ¦ ing the number of instances in the largest class divided by the Bayes, C4.5 and 1NN. For each of the datasets, we measured total number of instances (a classifier which always selects the training set accuracy of each classifier using ten-fold cross the most frequent class). Backward elimination is quite sim- validation on the whole set features. For each dataset, all sub- H ilar, and the payoff is calculated by sampling permutations sequent work used the induction algorithm , that gave the from the set of features left after each phase of elimination. highest cross validation accuracy, as detailed in Table 2. The maximal permutation size has an important role in de- A Eight different feature selection schemes were then com- ciding the contribution values of the different features, and pared on the datasets described above: should be selected in a way that ensures that different com- H I The induction algorithm without performing feature binations of features that interact together are inspected. Its selection to serve as a baseline. impact is demonstrated in Section 3. @ WUI 8 VT I The number of selected features for the selection sub- Regularized linear SVM using the package I SQ7 RP [Joachims, 1999]. Datasets that had more than two routine controls the redundancies of the selected features; the 3 Dataset L (Fwd.) (Bwd.) I I P A The Reuters1 dataset. Feature selection using Ran- ¤ Reuters1 NB 1 100 20 1500 dom Forests did best, yielding accuracy level of 100% Reuters2 NB 1 100 20 1800 with 30 features. Not too far behind is the CSA in its Arrhythmia C4.5 1 50 20 500 backward elimination version (98.6% with 51 features). Internet Ads 1NN 1 100 20 1500 [Koller and Sahami, 1996], for example, report that the Dexter C4.5 50 50 12 3500 Markov Blanket algorithm yields approximately 600 se- Arcene C4.5 100 100 5 10000 lected features with accuracy levels of 95% to 96% on I2000 C4.5 100 100 3 2000 this dataset. I The Reuters2 dataset. CSA with backward elimination Table 2: The parameters and the classifier used with the CSA al- did best, yielding accuracy level of 93% with 109 fea- gorithm for each dataset. is in the induction algorithm used with tures. For comparison, [Koller and Sahami, 1996] re- CSA (NB being Naive Bayes), is the number of features selected Q port that the Markov Blanket algorithm yields approx- in forward selection in each phase, is the number of features elim- $ inated in backward elimination in each phase, is the permutation imately 600 selected features with accuracy levels of " size and is the number of permutations sampled to estimate the ¡ 89% to 93% on this dataset. I contribution values. For an explanation how hyperparameters are The Arrhythmia dataset. This dataset is considered to chosen, see text. be a difficult one. CSA with backward elimination did best, yielding an accuracy level of 84% with 21 features. Forward selection with higher depth value ( ) did !¨ A '§ classes were split into few binary classification prob- better than wrapper, implying that one should consider lems. many features concomitantly to perform good feature se- I Filtering using mutual information and classification us- lection for this dataset. For comparison, the grafting al- H ing . We binned continuous domains to estimate the gorithm [Perkins et al., 2003] yields an accuracy level mutual information. of approximately 75% on this dataset. I The Internet Ads dataset. All the algorithms did approx- I Filtering using the Pearson correlation coefficient and imately the same, leading to accuracy levels between H classification using . 94% and 96% with CSA slightly outperforming the oth- I Random Forests feature selection [Breiman, 2001] and ers. Interestingly enough, the wrapper algorithm did not H classification using . select any feature; in the first phase, the 1NN algorithm had neighbors from both classes with the same distance I Feature selection using forward selection wrapper. Since for each feature checked, leading to arbitrary selection simple wrapper greedily selects a feature that most of one of the classes, and the classifier's performance improves the classifier's validation performance, it is @ was constant through all the phase, yielding zero contri- . equivalent to forward selection CSA with A bution values. However, when selecting the higher depth H I Classification using after performing feature selection levels, the simple 1NN algorithm was boosted up to out- with forward selection CSA and parameters as described perform classifiers such as SVM. in Table 2. The parameters and were chosen such thatA I The Dexter dataset. For the Dexter dataset, we used ¤ H the expected number of times that each feature is sam- algorithm (C4.5 decision trees) only for the process pled is higher than 5. The contribution value threshold of feature selection, and Linear SVM to perform the ¢ for stopping selection was . termination of fea- ' actual prediction on the features selected. This was ture selection was fixed by choosing a contribution value done because C4.5 did not give satisfying accuracy lev- ( ¢ threshold . No hyperparameter selection was per- ' els for any of the feature selection algorithms, and it ¢ formed on either , or . A is impractical to use SVM with CSA for large datasets. ¤ H To overcome the difference between the classifiers per- I Classification using after performing feature selection forming feature selection and the classifier used for the with backward elimination CSA and parameters as de- actual classification, we added an optimization phase scribed in Table 2. The parameters and were chosen A ¤ for the forward selection algorithm after it stopped. In such that the expected number of times that each fea- this phase, a ten-fold cross-validation is performed on ture is sampled is higher than 5. The contribution value ¢ the dataset in a similar way to the one used to opti- threshold for stopping elimination was . No hy- ' mize filter methods. The simple mutual information fea- perparameter selection was performed on either , or A ¤ ¢ ture selection performed best, followed closely by the Contribution-Selection algorithm in its backward elimi- To avoid overfitting on the validation set used for calcu- nation version and by Random Forests. This implies that lating the payoff with CSA, we used m-fold cross validation in Dexter the contribution of single features significantly instead of a single set. ¦ D¤A W¡ £P ¥ ¤¢ ¢ outweigh the contribution of feature combinations for RW B the task of classification. The forward selection algo- 3.2 Feature Selection and Classification Results rithm did as well as Linear SVM without feature selec- Table 3 summarizes the classifiers' performance on the test tion, but with a significantly lower number of features. I The Arcene dataset. Here, just as in the case of Dexter, set and the number of features selected in each of the ex- we use C4.5 for the process of feature selection, and Lin- periments. The accuracy levels are the fraction of correctly ear SVM to perform the actual prediction on the features classified test set instances: 4 Dataset Wrapper Fwd. Bwd. 0 Arrhythmia (slp=-1.21) Reuters1 92.4 (7) 96.5 (10) 98.6 (51) Dexter (slp=-1.20) -1 Reuters2 91.4 (5) 90.1 (14) 93.2 (109) Arrhythmia 70 (5) 74.2 (28) 84.2 (21) -2 Internet Ads - 95.6 (8) 96.1 (158) Dexter 80 (10) 92.6 (100) 93.3 (717) log frequency -3 Arcene 58 (7) 81 (600) 86 (7200) I2000 86.3 (550) 86.3 (500) 90.9 (1100) -4 No FS SVM Corr. MI RF 84.1 94.4 90.3 (20) 94.4 (20) 100 (30) -5 81.1 91.4 88.4 (20) 90.2 (5) 87.2 (21) 76.4 80 71.4 (20) 70 (20) 80 (40) -6 94.7 93.5 94.2 (15) 95.75 (70) 95.6 (10) -7 92.6 92.6 92.6 (1240) 94 (230) 93.3 (800) -5 -4.5 -4 -3.5 -3 -2.5 -2 83 83 83% (6600) 81 (5600) 82 (6000) log CV 86.3 72.7 81.8 (260) 90.9 (1060) 86.3 (100) Figure 2: Power-law distribution of contribution values. This log- Table 3: Comparison of accuracy levels and number of features se- log plot of the distribution of the contribution values (absolute value) lected in the different datasets. Upper table: Wrapper and Fwd/Bwd in the first phase for Arrhythmia and Dexter, prior to making any (CSA with forward selection/backward elimination with parameters feature selection, demonstrates a power law behavior. The corre- from Table 2). Bottom table: No FS (no feature selection), SVM sponding plots for the other datasets show identical power-law char- (linear SVM without feature selection), Corr (feature selection using acteristics (though with different slopes), and were eliminated for Pearson correlation), MI (feature selection using mutual informa- the sake of clarity. tion), RF (feature selection using Random Forests). Accuracy levels are calculated by counting the number of misclassified instances and given in percentages. The number of features selected is given in while with backward elimination, there is a gradual and rather brackets. stable increase in the contribution values of the eliminated features. The peaks in the graph of the contribution values in Figure 3A demonstrate that the contribution values do change selected. The CSA with backward elimination obtained as the CSA iterates . In this case, the selection of a single fea- better performance than the rest of the algorithms. I ture considerably increased the contribution value of another The I2000 dataset. CSA with backward elimination and feature, pointing at intricate dependencies between features. feature selection using mutual information yielded the Figures 2 and 3 also assist in explaining why back- best results. The poor performance of CSA with forward ward elimination usually outperforms several feature selec- selection can be explained by the poverty of data com- tion methods, including forward selection; due to the high paring to the number of features; the algorithm selected dimensionality of the datasets, a feature that assists in pre- in the first phases features which explain well the train- diction merely by coincidence, may be selected, on the ac- ing data by coincidence, and avoided from selecting fea- count of other truly informative features. Forward selection tures that truly contribution to the task of classification. is penalized severely in such case: among the few signifi- This phenomenon is explained in portrait in Section 3.3. cant features, some will not be chosen. However, backward In summary, in 5 out of the 7 datasets, CSA with backward elimination always maintains the significant features in the elimination achieved the best results. In all other cases, CSA non eliminated set; a feature that truly enhances the classi- achieved the second best result. fier's generalization will do so for the validation set as well, and will not be eliminated. This leads to a more stable gen- 3.3 A Closer Inspection of the Results eralization behavior for backward elimination on the test set The MSA, intent on capturing correctly the contribution of through the algorithm's progress (Figure 3). elements to a task, enables us to examine the distribution of the contribution values of the features. Figure 2 depicts a 4 Final Notes log-log plot of the distribution of the contribution values in the first phase for Arrhythmia and Dexter, prior to making The Contribution-Selection algorithm presented in this paper views the task of feature selection in the context of coali- any feature selection. This distribution follows a scale-free tional games. It uses a wrapper-like technique combined with power law, implying that large contribution values (in abso- a novel ranking method which is based on the Shapley con- lute value) are very rare, while small ones are quite common, tribution values of the features to the classification accuracy. justifying quantitatively the need of feature selection. The The CSA works in an iterative manner, each time selecting other datasets were also observed to possess similar power new features (or eliminating them) while taking into account law characteristic. the features that were selected (or eliminated) so far. The behavior of the algorithm through the process of fea- CSA, similarly to wrapper algorithms, is restricted in the ture selection/elimination is displayed in Figure 3; after the selection of the induction algorithm used for evaluating fea- forward selection algorithm identifies the significant features tures sets, due to time limitations. This problem can be in the first few phases, there is a sharp decrease in the contri- reduced by parallelizing, an advantage not shared by other bution values of the features selected in the following phases, 5 1 features. The CSA was tested on number of datasets, and the re- 0.9 sults show that the algorithm can improve the performance 0.8 of the classifier, and successfully compete with an existing 0.7 array of feature selection methods, especially in cases where Accuracy / CVs the features interact with each other; in such cases perform- 0.6 Accuracy on Validation Accuracy on Test ing feature selection with a permutation size higher than one, 0.5 CV x 10 namely not using the common greedy wrapper approach, can 0.4 enhance the classifier's performance significantly. 0.3 The results successfully demonstrate the value of applying game theory concepts to feature selection. While the forward 0.2 selection version of the algorithm is competitive with other 0.1 feature selection methods, our experiments show that over- 0 all, the backward elimination version is superior to them, and 0 5 10 15 20 25 30 (A) Number of Selected Features produces features sets which can be used to generate a highly 1 performing classifier. 0.8 References [Alon et al., 1999] U. Alon, N. Barkai, D. A. Notterman, 0.6 Accuracy / CVs K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad 0.4 patterns of gene expression revealed by clustering of tu- mor and normal colon tissues probed by oligonucleotide 0.2 arrays. Proc. Natl. Acad. Sci., 96:6745­6750, 1999. [Blake and Merz, 1998] C.L. Blake and C.J. Merz. 0 Accuracy on Validation UCI repository of machine learning databases. Accuracy on Test -0.2 http://www.ics.uci.edu/ mlearn/MLRepository.html, Average CV x 100 1998. University of California, Irvine, Dept. of Informa- -0.4 0 50 100 150 200 250 300 tion and Computer Sciences. (B) Number of Eliminated Features [Breiman, 2001] L. Breiman. Random forests. Machine Figure 3: Prediction accuracy and feature contribution during for- Learning, 45(1):5­32, 2001. ward selection (A) and backward elimination (B) for the Arrhythmia [Guyon, 2003] I. Guyon. Design of experiments for the dataset. Both figures show how the performance of the C4.5 classi- NIPS 2003 variable selection benchmark. NIPS 2003 fier improves on the validation set as the algorithm selects (elim- workshop on feature extraction and feature selection, inates) new features, while the contribution values of the selected 2003. features decrease (increase). The backward elimination generalizes better on the test set through the algorithm's progress. The behavior [Joachims, 1999] T. Joachims. Making large-scale SVM for the other datasets is similar. learning practical. Advances in Kernel Methods - Support Vector Learning, B. Schlkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999. wrapper algorithms which use search methods such as hill climbing; At each phase the permutations can be computed in [Keinan et al., 2004] A. Keinan, B. Sandbank, C. Hilgetag, parallel and upon completion combined to obtain an estimate I. Meilijson, and E. Ruppin. Fair attribution of functional of contribution values. Furthermore, as the algorithm pro- contribution in artificial and biological networks. Neural gresses, the number of candidate features for either selection Computation, 16(9), 2004. (forward selection) or elimination (backward elimination) de- [Koller and Sahami, 1996] D. Koller and M. Sahami. To- creases. Consequently, the number of permutations sampled ward optimal feature selection. Proceedings of the 13th In- may be reduced, speeding up the algorithm significantly. The ternational Conference on Machine Learning (ML), pages restriction in selecting the learning algorithm for CSA does 284­292, 1996. not apply to the prediction once the features are selected. Af- [Perkins et al., 2003] S. Perkins, K. Lacker, and J. Theiler. ter a set of features is found by the CSA, it may be used by Grafting: Fast, incremental feature selection by gradient any induction algorithm as demonstrated in section 3.2 with descent in function space. Journal of Machine Learning the Dexter and Arcene datasets. Research, 3:1333­1356, 2003. We verified that the feature sets selected by CSA are sig- nificantly different than those selected by filter methods, and [Shapley, 1953] L. S. Shapley. A value for n-person games. Random Forests. It turns out that the first "strong" features In H. W. Kuhn and A. W. Tucker, editors, Contributions to are selected by most methods. But within few iterations, fil- the Theory of Games, volume II of Annals of Mathemat- ters and CSA select entirely different features due to the fact ics Studies 28, pages 307­317. Princeton University Press, that the contribution values of the candidate features are mod- Princeton, 1953. ified, sometimes drastically, according to the already selected 6