A flexible and robust similarity measure based on contextual probability Hui Wang Werner Dubitzky School of Computing and Mathematics School of Biomedical Science University of Ulster at Jordanstown University of Ulster at Coleraine Northern Ireland Northern Ireland h.wang@ulster.ac.uk w.dubitzky@ulster.ac.uk Abstract and similarity. This is evidenced by the large volume of AI research in the areas of analogical reasoning and case-based Arguably, analogy is one of the most important reasoning [Vosniadou and Ortony, 1989; Kolodner, 1993]. In aspects of intelligent reasoning. It has been hy- the context of available background knowledge, analogy can pothesized that, given suitable background knowl- be viewed as a logical inference process [Davies and Russell, edge, analogy can be viewed as a logical infer- 1987]. Another way of looking at analogy holds that similar- ence process. This study follows another school ity can provide a probabilistic basis for inference, and that a of thought that argues that similarity can provide a quantitative framework can be developed for the probability probabilistic basis for inference and analogy. Most that an analogy is correct as a function of the degree of sim- similarity measures (which are frequently viewed ilarity measured or observed [Russell, 1986]. This study is as being conceptually equivalent to distance mea- concerned with the latter view of modeling analogy. sures) are restricted to either nominal or ordinal Distance measures (or, equivalently, similarity measures) attributes, and some are confined to classification are central to many areas related to AI, including reasoning tasks. This paper proposes a flexible similarity under uncertainty, knowledge-based systems, machine learn- measure that is task-independent and applies to ing, pattern recognition, data mining, analogical, case-based, both nominal and ordinal data in a conceptually instance-based reasoning, and to other fields such as statis- uniform way. The proposed similarity measure tics, operations research and decision theory. A large number is derived from a probability function and corre- of distance metrics and measures are available and a particu- sponds to the intuition that if we consider all neigh- lar choice may have considerable implications for the success borhoods around a data point, the data points closer of the problem that is to be addressed. to this point should be included in more of these Distance functions are broadly categorized into those that neighborhoods than more distant points. Experi- can handle ordinal input data, nominal input data, and het- ments we have conducted to demonstrate the use- erogeneous input data, consisting of both ordinal and nominal fulness of this measure indicate that it fares very data 2 . Ordinal distance functions make use of the intrinsic to- competitively with commonly used similarity mea- tal ordering relation in the underlying attribute values, which sures. are either continuous (e.g. weight of an object) or discrete (e.g. number of obstacles). A nominal or symbolic attribute 1 Introduction is a discrete attribute whose values do not necessarily exhibit Natural and artificial intelligent agents rely on different in- a total order relation. For example, an attribute representing ference and reasoning strategies to achieve their goals and the role of crew member may have the values scientist, mis- maintain their intended behavior. Arguably, two of the more sion specialist and commander. Using an ordinal distance successful strategies are based on the notions of analogy 1 measurement on such values is meaningless. Because the notion of `distance' is intrinsically numeri- Also LITA, Université de Metz, Ile du Saulcy, 57045 Metz cal, most available distance measures are defined for data Cedex, France. 1 Here we consider the subject of analogy in the following, rel- with ordinal attributes. However, distance measures that can atively narrow, sense: objects are represented by feature vectors 2 Here we consider two scales of measurement: ordinal and nom- rather than the typical graphs; we do not consider transfer of knowl- inal. In an ordinal scale, values (discrete or continuous) are ordered edge from one object to another; we do not consider cross-domain whereas in a nominal scale, (discrete) values are unordered. knowledge but assume all objects share common set of attributes. 10 process nominal attributes are required by many modern AI algorithms. Some measures that handle nominal attributes do 9 exist. The most well known measure of this kind is the Value 8 b Difference Metric (VDM) [Stanfill and Waltz, 1986]. It is de- 7 fined in terms of attribute values that are conditioned by pos- a 6 terior probabilities of a class. Hence, like some other distance t measures, the VDM is restricted to classification tasks. 5 To cope with heterogeneous data, containing both nominal 4 and ordinal attributes, two approaches are commonly taken: 3 (1) The data is transformed into one of the two data types 2 that complies with the distance measure used, and (2) two types of distance measures are combined and handle the data 1 separately in accordance with the data type (see, for example, [Wilson and Martinez, 1997]). However, both approaches are 1 2 3 4 5 6 7 8 9 10 problematic when it comes to the interpretation of the results. Figure 1: An illustrative example. Each of the 7 concentric The novel distance measure proposed in this study is called circles represents a neighborhood of data point t defined by Neighborhood Counting Metric (NCM). It is derived from a some distance measure. Data point a is covered by 5 neigh- probability function and it can handle both nominal and or- borhoods whereas b by only 2. Geometrically, a is clearly dinal attributes in a conceptually uniform way. Intuitively, it closer (more similar) to t than b. can be understood or interpreted as follows (see Figure 1): If we consider all neighborhoods, N , around a data point, t , then those data points closer to t should be included in more of these neighborhoods, N , than points that are not ming and the normalized Euclidean distance functions and close to t . Usually neighborhoods are interpreted in terms can therefore be used on heterogeneous data. of distance. However, if we adopted this interpretation, we The Value Difference Metric (VDM) [Stanfill and Waltz, would define one distance function in terms of another. To 1986] is designed to handle nominal attributes in classifica- avoid this dilemma, we interpret neighborhood without dis- tion tasks only. It uses pre-computed statistical properties tance through the concept of hypertuples [Wang et al., 1999] from available training data and can be interpreted in terms for both nominal and ordinal attributes. As a result, our dis- of probabilities. The Heterogeneous Value Difference Metric tance measure applies to both ordinal and nominal attributes (HVDM) [Wilson and Martinez, 1997] combines the Euclid- in a uniform way. ean distance and the VDM. It is therefore able to handle het- The new NCM is conceptually simple, it is straightforward erogeneous data, but because of its VDM heritage it is con- to implement, and it has the added property that it is inde- fined to classification tasks. pendent of the underlying analytical or reasoning task (e.g. The Interpolated Value Difference Metric (IVDM) [Wilson classification). The measure's clear and unambiguous mean- and Martinez, 1997] extends the VDM to scenarios involving ing is defined by the number of neighborhoods of a query that ordinal attributes. In the learning phase, it employs a dis- include or cover a given data point. cretization step to collect statistics and determine the proba- Paper outline: Section 2 presents a short review of distance bility for the discretized values (Pa,x,c in the VDM formula), measures relevant to this study; it is followed by Sections 3 but then retains the continuous values in the training data for and 4, which present mathematical details of the new simi- later use in the application or testing phase. The IVDM re- larity measure. In Section 5 the empirical evaluation of the quires a non-parametric probability density estimation to de- method is presented and its results discussed. The paper con- termine the probability values for each class. The Discretized cludes with a summary and a brief discussion of future work. Value Difference Metric (DVDM) is the same as the IVDM except that it avoids the retention of the original continuous 2 A brief review of important distance values but uses only the discretized values. functions [Blanzieri and Ricci, 1999] introduced the Minimal Risk Two of the most common distance functions are Euclid- Metric (MRM), a probability-based distance measure for ean distance and the Hamming distance. The former is re- classification and case-based reasoning. It minimizes the risk stricted to ordinal and the latter is usually used for nomi- of misclassification and depends on probability estimation nal attributes. The Heterogeneous Euclidean-Overlap Metric techniques. As a result, the MRM exhibits a high compu- (HEOM) [Wilson and Martinez, 1997] combines the Ham- tational complexity and is task-dependent. 3 A probability function fashion. While this may appear to be computationally expen- sive, it is often possible to derive a simple formula for G. Let V be a (non-empty) universe of discourse. A -field F Depending on what F is and how neighborhoods are de- on V is a collection of subsets of V , such that: fined, we can use the G probability for different tasks. In 1. V F ; Section 4 we take F to be the set of all regions in a multi- F , where A is the complement of A; 2. If A F then A dimensional space definable as hypertuples and our similarity measure is derived from this interpretation. 3. If A, B, ˇ ˇ ˇ F , then A B ˇ ˇ ˇ F . A probability function P over V is a mapping from F to 3.1 Estimation of G [0, 1] satisfying the three axioms of probability [Ash, 1972]. This section describes how G is estimated for a data point If P is restricted to V , it is called a probability mass function from data samples. Let D be a sample of data drawn from V (discrete) or probability density function (continuous). according to an unknown probability distribution P. Our aim Suppose we have a probability function P as defined above. ^ is to calculate G(t |D) for any t V . For X F let f (X ) be a non-negative measure of X satisfying To calculate G we need P, which can be estimated from / f (X1 X2 ) = f (X1 ) + f (X2 ) if X1 X2 = 0. As an example, data assuming the principle of indifference as follows: For we can take f (X ) for the cardinality of X . any E F , Consider a function G : F [0, 1] such that, for X F , ^ P(E |D) = |E |/n P(E ) f (X E )/K G(X ) = where |E | is the number of elements in E and n is the number E F of elements in D. Additionally we assume that f (X ) = |X | where K = E F P(E ) f (E ). It can be easily shown that G is for X F . a probability function. We then have, G(X ) is calculated from all those E F that overlap with ^ P(E |D) × f (E t ) ^ X (i.e., f (X E ) = 0). These E 's are relevant to X and serve definition G(t |D) = K E F as the contexts in which G(X ) is induced. Each such E is ^ P(E |D) × |E t | = specification of f called a neighborhood of X . In other words a neighborhood K E F of X is an element E of F , i.e., a subset of V , such that E ^ P(E |D) = assumption that t is in E so E t = t overlaps X . K E F ,t E In practice P is usually not known, but a sample of data 1 |E | estimation of P by principle of indifference = nK drawn from the data space according to P is commonly avail- E F ,t E 1 able. P can be estimated from the sample by either paramet- cov(t , x) = expansion and then re-organisation nK ric or non-parametric techniques. In some tasks (e.g., clas- xD where K is a normalisation factor, and cov(t , x) is the number sification) the point-wise probability is needed. When there of such E F that covers both t and x. We call this number is insufficient data, which is not uncommon, the point-wise the cover of x with respect to t . probability may be difficult to estimate using non-parametric To obtain cov(t , x), a straightforward approach would be methods. However, it is likely that probability can be esti- to iterate over all E F and check if E covers both t and x. mated for some regions (contexts) in the data space based on Clearly, such a process is undesirable because of its exponen- the sample. Using the G probability function, we can estimate tial complexity. In Section 4 we will present an efficient way the G probability for any single data point from knowledge of calculating cov(t , x). of the P probabilities of various regions or contexts. This provides us with a formal way of inference about probability 4 Measuring distance through counting under incomplete (hence uncertain) situations. To calculate the G probability for X F in practice, we This section considers an interpretation of neighborhood and need to derives a formula for calculating cov(t , x), which is then taken 1. Find the set of all neighborhoods appropriate to the prob- as a measure of similarity or distance. lem at hand, 4.1 Neighborhood 2. Estimate the P probability for every neighborhood, Pursuing a non-distance-based conceptualization of neigh- 3. Finally, estimate (calculate) the G probability for X ac- borhood, we interpret a neighborhood as a hypertuple [Wang cording to its definition. et al., 1999], and a neighbor as a data point (or tuple) cov- ered by some neighborhood. So a neighborhood of a tuple is In classification tasks, we can further calculate the conditional a hypertuple that covers the tuple. G probability of a class given a single data point in a similar to h is x V such that x h. By this definition any simple Hypertuples tuple has a neighborhood. At least the maximal hypertuple Let R = {a1 , a2 , ˇ ˇ ˇ , an } be a set of attributes, and d om(a) def in the extended data space is a neighborhood of any simple be the domain of attribute a R. Furthermore let V = def tuple since the maximal hypertuple covers all simple tuples. n=1 d om(ai ) and L = n=1 2d om(ai ) . V is the data space de- i i For a query t V , not all hypertuples in i Si are neigh- fined by R, and L an extended data space. A (given) data set borhoods of t . For a hypertuple h to be a neighborhood of is D V ­ a sample of V . t we must have t (ai ) h(ai ) for all i. Therefore, to gen- The attributes can be either ordinal or nominal. For sim- erate a neighborhood of t , we can take an si Si such that plicity, we assume the domain of any attribute is finite, but t (ai ) si for all i, resulting in a hypertuple s1 , s2 , ˇ ˇ ˇ , sn . the results are not limited to finite domains. If ai is nominm l, the number of si Si such that t (ai ) si is a If we write an element t V by v1 , v2 , ˇ ˇ ˇ , vn then vi = i = mi -1 -1 2mi -1 since si is any subset of d om(ai ) d om(ai ). If we write h L by s1 , s2 , ˇ ˇ ˇ , sn then si 2d om(ai ) i=0 i i N or si d om(ai ). that is the super set of t (ai ). If ai is ordinal, this number An element of L is called a hypertuple, and an element of is Ni = (max(ai ) - t (ai ) + 1) × (t (ai ) - min(ai ) + 1) since V a simple tuple. The difference between the two is that a (max(ai ) - t (ai ) + 1) is the number of ordinal values above field within a simple tuple is a value while a field within a t (ai ), and (t (ai ) - min(ai ) + 1) is such a number below t (ai ). hypertuple is a set. If we interpret vi d om(ai ) as a singleton Any pair of values from the two parts respectively forms an set {vi }, then a simple tuple is a special hypertuple. Thus we interval. can embed V into L, so V L. To summarize, the number of neighborhoods of t is i Ni , Consider two hypertuples h1 and h2 , where h1 = where s11 , s12 , ˇ ˇ ˇ , s1n and h2 = s21 , s22 , ˇ ˇ ˇ , s2n . We say h1 is (1) covered by h2 (or h2 covers h1 ), written h1 h2 , if for 2mi -1 , if ai is nominal i {1, 2, ˇ ˇ ˇ , n}, i N = (max(ai ) - t (ai ) + 1) × (t (ai ) - min(ai ) + 1), x s1i , min(s2i ) x max(s2i ) if ai is ordinal if ai is ordinal. s1i s2i if ai is nominal Cover of simple tuples def Furthermore the sum of h1 and h2 , written by h1 + h2 = Under the above interpretation of neighborhood, we know ex- s1 , s2 , ˇ ˇ ˇ , sn , is: for each i {1, 2, ˇ ˇ ˇ , n}, actly the number of all neighborhoods of a given simple tuple {x d om(ai ) : min(s1i s2i ) x max(s1i s2i )}, t . Here we present an efficient way of calculating cov(t , x), the number of neighborhoods of t that cover x, which is si = if ai is ordinal needed in Section 3.1. s1i s2i , if ai is nominal Consider two simple tuples t = t1 , t2 , ˇ ˇ ˇ , tn and x = The product operation can be similarly defined. It turns out x1 , x2 , ˇ ˇ ˇ , xn . A neighborhood h of t covers t by defini- that L, , +, is a lattice. tion, i.e., t h. What we need to do is to check if h covers x For a simple tuple t , t (ai ) represents the projection of t onto as well. In other words, we want to find all hypertuples that attribute ai . For a hypertuple h, h(ai ) is similarly defined. cover both t and x. A hypertuple can be generated by taking one subset from Eq.1 specifies the number of all simple tuples that cover each attribute. Let ai be an attribute, i = 1, 2, ˇ ˇ ˇ , n; Si be the t only. We take a similar approach here by looking at each def set of all subsets of the domain of ai ; and Ni = |Si |. Then a attribute and explore the number of subsets that can be used hypertuple h is an element of i Si . Therefore the number of to generate a hypertuple covering both t and x. Multiplying all hypertuples is i Ni . these numbers across all attributes gives rise to the number If ai is nominal, then Si = 2d om(ai ) and so Ni = 2mi , where we require. mi = |d om(ai )|. If ai is ordinal, then an element s Si cor- Consider attribute ai . If ai is ordinal, then the number of responds to an interval [min(s), max(s)]. As a result, some intervals that can be used to generate a hypertuple covering elements of Si may correspond to the same interval, and both xi and ti is as follows: hence become equivalent. In general we have the follow- Ni = (max(ai ) - max({xi , ti }) + 1) × (min({xi , ti }) - min(ai ) + 1). ing number of distinctive intervals for an ordinal attribute: Ni = m=-1 (mi - j) = m=1 j = mi (mi +1) . i i If ai is nominal, the number of subsets for the same purpose j0 j 2 is: Neighborhoods as hypertuples 2 mi -1 , if x = t For any simple tuple t , a neighborhood of t is taken to be a i i Ni = 2mi -2 , otherwise hypertuple h such that t h; and a neighbor of t with respect weighting [Baily and Jain, 1978]. The purpose of the eval- Recall that mi = |d om(ai )|. To summarize, the number of neighborhoods of t covering uation is to compare the new measure with some of the com- x is cov(t , x) = i Ni , where monly used distance measures in a setup involving heteroge- neous data. The evaluation uses public benchmark data sets (2) from UC Irvine Machine Learning Repository, which were (max(ai ) - max({xi , ti }) + 1) × (min({xi , ti }) - min(ai ) + 1), selected with respect to their balance of ordinal and nominal if ai is ordinal attributes. Ni = 2mi -1 , if ai is nominal and xi = ti We implemented a k-NN algorithm with the novel NCM as m -2 i 2 , if ai is nominal and xi = ti well as the measures HEOM, HVDM, IVDM, and DVDM. The computational runtimes of the MRM turned out too ex- cessive, so it was excluded from the study. In the experiments 4.2 Use of cover as similarity measure k was set to 1, 6, 11, 16, 21, 16, 31 and to `MaxK' (i.e., k = the number of data tuples in the training data). We adopted We use cov(t , x) as a measure of similarity between t and x, a 10-fold cross-validation procedure, and for each measure which we call the Neighborhood Counting Metric or simply and each k value we ran the cross-validation 10 times and NCM. That is, for any two tuples x and y, the NCM between recorded the results for subsequent analysis. them is Due to space limitations, we report only some of the results n NCM (x, y) = cov(x, y) = Ni in detail. Table 1 shows the results when weighting was used (3) and k = M axK . i=1 A statistical significance analysis was carried out using where n is the number of attributes and Ni is given by Eq.2. the Student t -test (two samples, assuming unequal variances) It is clear that NCM (x, y) 0, NCM (x, x) NCM (x, y) and with = 0.05 (at a 95% confidence level). For each data NCM (x, y) = NCM (y, x). Therefore the NCM is reflexive and set, each measure and each k value, we ran a 10-fold cross- symmetric, the properties generally required for a similarity validation using k-NN 10 times with random partitioning of measure [Osborne and Bridge, 1997]. data, resulting in a sample of 10 values. For a pair of samples, This measure can be interpreted intuitively as follows. If by two different measures, we have a total of 20 values. So we consider all neighborhoods around a data tuple, those tu- the critical value is 2.1. We then calculate the `t' values. If ples closer to this tuple should be included in more neigh- t 2.1 or t -2.1 the two samples are significantly different borhoods and those farther away should be included in fewer (the classification rate of one sample is significantly higher or neighborhoods (see Figure 1). As a result, closer tuples lower than that of the other). Notice that every value in Table should be assigned higher cover values. 1 is the average of a sample of 10 values. In contrast to its usual interpretation in terms of distance, From Table 1 we can see that based on our experimental we interpret the notion of a neighborhood without distance design the NCM achieved 17 out of 20 `significantly' higher through the concept of hypertuples for both nominal and ordi- classification success rates compared to the other methods nal attributes. As a consequence, our novel distance measure (last row in table). Looking at the subtotals for `nominal', applies to both ordinal and nominal attributes in a conceptu- `ordinal' and `mixture' categories (reflected by subtotals from ally uniform way. top to bottom), the NCM achieves the largest margin over its Incidentally, the NCM intrinsically handles missing values competitors for the `nominal' data sets, followed by `mixture' in a fashion consistent with other measures. Recall, that the and then `ordinal'. This suggests that, when all data tuples are NCM is a product of all Ni 's, where i is attribute index. For taken into consideration, the NCM is clearly superior under two data tuples, t and x, if there is a missing value in t or x for all circumstances. attribute i, then Ni is set to 1. As a result, this attribute does The details of the results involving other values for k are not contribute towards the NCM value. not shown. In terms of k-changes, we observe that with- out weighting there was no significant difference between the 5 Evaluation used measures. When weighting was used, there was still lit- The new Neighborhood Counting Metric is task-independent tle difference for small k values. The general trend for all and can therefore be used for classification, clustering and measures was: as k got larger the classification success rate other analytical tasks involving distances or similarities. We increased slightly but soon started to decline. The difference empirically evaluated the NCM in the context of a classifi- then showed up as the NCM displayed a much slower rate cation task, using the k-nearest neighbor (k-NN) classifica- of decline and, after k > 11, the NCM consistently outper- tion algorithm with and without distance-based (neighbor) formed the other four measures (see k = M axK as discussed Data DVDM NCM HEOM IVDM HVDM above). We can conclude that the NCM produces less vari- Audiology (Standardized) 22.9 66.3 44.2 23.0 22.9 able or more robust results than the other four measures. Bridges2 (Standardized) 44.3 60.8 52.9 44.3 44.3 Primary Tumor 25.6 44.9 26.0 25.6 25.6 Soybean (Large) 84.6 86.7 84.6 84.6 84.6 6 Conclusion TTT 65.8 68.0 65.3 65.8 65.8 Vote 90.9 91.0 88.8 90.9 90.9 Yeast 32.3 33.5 32.0 33.0 32.0 Starting from a probability function, we have developed Zoo 78.8 95.0 69.5 80.0 78.8 a novel similarity or distance measure, the Neighborhood avg (nominal) 59.3 70.5 61.3 59.5 59.3 Diabetes 63.5 63.6 63.5 63.5 63.5 Counting Metric, which can be used with ordinal, nominal Ecoli 61.1 50.8 43.1 50.1 43.1 and heterogeneous attributes in a conceptually uniform way. Glass 47.9 61.1 47.2 55.6 47.2 Iono 68.1 80.5 63.2 64.3 63.2 This measure is defined without reference to any particular Iris 93.2 91.3 90.7 94.3 90.7 Pima 63.0 63.0 63.0 63.1 63.0 analytical task (e.g., classification). This means that the mea- Sonar 74.3 85.7 73.1 71.6 73.1 sure is potentially useful for many reasoning, inferential and Vehicle 46.3 60.1 42.3 48.2 42.3 Wine 94.7 96.7 94.6 98.1 94.6 reasoning tasks and systems modeling analogy or similarity. avg (ordinal) 73.9 77.3 70.1 74.0 70.1 Because the NCM is based on a simple, easy-to-implement Anneal 75.3 75.3 75.3 75.3 75.3 Australian 85.5 87.5 82.8 85.3 82.8 mathematical formulation and has a computational complex- Auto 49.1 67.9 46.4 48.2 46.4 ity in the same order as the Euclidean distance measure, it is Breast-Cancer 75.7 76.0 75.7 75.7 75.7 Bridges1 44.3 62.2 49.1 44.3 44.3 a prime candidate for practical AI methods and tools. Credit Screening 60.8 88.3 82.0 64.7 59.5 German 68.7 69.0 68.7 68.7 68.7 Our empirical evaluation demonstrates that in a k-NN- Heart 79.2 82.3 80.6 81.7 72.0 based classification task, the measure significantly outper- Hepatitis 76.6 81.7 76.6 76.6 76.6 Hors-Colic 77.9 84.7 64.4 79.4 64.4 forms its competitors when distance-based neighbor weight- avg (heterogeneous) 69.3 77.5 70.2 70.0 66.6 ing is used and when k is not too small, in particular when all Sig. count 1 17 0 2 0 data tuples are taken into consideration. As k gets larger, the Table 1: Average results (classification success rates) over 10 NCM displayed a consistently superior performance over the runs of the algorithms, where neighbor weighting was used reference methods. The difference is most significant when and k=`MaxK'. Best values are shown in bold, and statisti- all data tuples in a training data set are considered. This im- cally significant values in bold italic. plies that the NCM is less sensitive to k than the other mea- sures considered in this study. Given an application based on the k-NN algorithm, if the optimal value for k is not [Kolodner, 1993] Janet Kolodner. Case-Based Reasoning. known, we can simply consider all or a relatively large set Morgan Kaufmann Publishers, San Mateo, CA, 1993. of data tuples without a significantly compromising perfor- [Osborne and Bridge, 1997] Hugh Osborne and Derek mance. Therefore, the performance of this measure is more Bridge. Models of similarity for case-based reasoning. predictable than the other methods investigated in this study. In Procs. of the Interdisciplinary Workshop on Similarity In our experiments, all attributes were assumed to have and Categorisation, pages 173­179, 1997. equal weights. Future work will investigate how to determine [Russell, 1986] Stuart J. Russell. Quantitative analysis of the best attribute weights to achieve improved performance, analogy. In Proc of AAAI86, pages 284­288, Philadelphia, as well as an application of the NCM to clustering tasks. PA, 1986. Morgan Kaufmann. [Stanfill and Waltz, 1986] C. Stanfill and D. Waltz. To- References ward memory-based reasoning. Communication of ACM, [Ash, 1972] R.B. Ash. Real Analysis and Probability. Aca- 29:1213­1229, 1986. demic Press, New York, 1972. [Vosniadou and Ortony, 1989] Stella Vosniadou and Andrew [Baily and Jain, 1978] T. Baily and A. K. Jain. A note on Ortony, editors. Similarity and Analogical Reasoning. distance-weighted k-nearest neighbor rules. IEEE Trans. Cambridge University Press, New York, USA, 1989. Syst. Man Cyber., 8(4):311­313, 1978. [Wang et al., 1999] H. Wang, W. Dubitzky, I. Düntsch, and [Blanzieri and Ricci, 1999] Enrico Blanzieri and Francesco D. Bell. A lattice machine approach to automated case- Ricci. Probability based metrics for nearest neighbor clas- base design: Marrying lazy and eager learning. In Proc. sification and case-based reasoning. Lecture Notes in IJCAI99, pages 254­259, Stockholm, Sweden, 1999. Computer Science, 1650:14­29, 1999. [Wilson and Martinez, 1997] D. Randal Wilson and Tony R. [Davies and Russell, 1987] Todd R. Davies and Stuart J. Martinez. Improved heterogeneous distance functions. Russell. A logical approach to reasoning by analogy. In Journal of Artificial Intelligence Research, 6:1­34, 1997. Proc. of IJCAI87, pages 264­270, Milan, Italy, 1987.