Beyond TFIDF Weighting for Text Categorization in the Vector Space Model Pascal Soucy Guy W. Mineau Coveo Université Laval Quebec, Canada Québec, Canada psoucy@coveo.com guy.mineau@ift.ulaval.ca and Liu, 1999; Brank et al., 2002; Dumais et al., 1998]. Abstract While this weighting method seems very appropriate for IR, it is not clear that it is the best choice for TC problems. KNN and SVM are two machine learning Actually, this weighting method does not leverage the approaches to Text Categorization (TC) based on information implicitly contained in the categorization task to the Vector Space Model. In this model, borrowed represent documents. from Information Retrieval, documents are To illustrate this, let us suppose a text collection X and two represented as a vector where each component is categorization tasks A and B. Under the TFIDF weighting associated with a particular word from the representation, each document in X is represented by the vocabulary. Traditionally, each component value is same vector for both A and B. Thus, the importance of a assigned using the information retrieval TFIDF word in a document is seen as independent from the measure. While this weighting method seems very categorization task. However, we believe that this should appropriate for IR, it is not clear that it is the best not be the case in many situations. Suppose that A is the task choice for TC problems. Actually, this weighting that consists of classifying X in two categories: documents method does not leverage the information that pertain to Computers and documents that don't. implicitly contained in the categorization task to Intuitively, words such as computer, intel and keyboard represent documents. In this paper, we introduce a would be very relevant to this task, but not words such as new weighting method based on statistical the and of; for this reason, the former words should have a estimation of the importance of a word for a higher weight than the latter ones. Suppose, now that B specific categorization problem. This method also consists of classifying X in two very different categories: has the benefit to make feature selection implicit, documents written in English and documents written in since useless features for the categorization other languages. It is arguable that in this particular task, problem considered get a very small weight. words such has the (English stop word) and les (French stop Extensive experiments reported in the paper shows word), are very relevant. However, under TFIDF, the would that this new weighting method improves get a very small weight since its IDF (Inverse Document significantly the classification accuracy as Frequency) would be low. In fact, it would get the same measured on many categorization tasks. weight that was assigned for task A. While this example is somewhat an extreme case, we believe that a weighting approach could benefit from the knowledge about the 1 Introduction categorization task at hand. KNN and SVM are two machine learning approaches to In this paper, we introduce a new weighting method based Text Categorization (TC) based on the vector space model on statistical estimation of a word importance for a [Salton et al., 1975], a model borrowed from Information particular categorization problem. This weighting also has Retrieval (IR). Both approaches are known to be among the the benefit that it makes feature selection implicit since most accurate text categorizers [Joachims, 1998a; Yang and useless features for the categorization problem considered Liu, 1999]. In the vector space model, documents are get a very small weight. represented as a vector where each component is associated Section 2 presents both the TFIDF weighting function and with a particular word in the text collection vocabulary. the new weighting method introduced in this paper. Section Generally, each vector component is assigned a value 3 describes our evaluation test bed. In section 4, we report related to the estimated importance (some weight) of the results that show significant improvements in terms of word in the document. Traditionally, this weight was classification accuracy. assigned using the TFIDF measure [Joachims, 1998a; Yang Another approach is presented in [Han 1999]. In this study, 2 Weighting approaches in text categorization vector components are weighted using an iterative approach 2.1 TFIDF weighting involving the classifier at each step. For each iteration, the weights are slightly modified and the categorization is the most common weighting method used to TFIDF accuracy is measured using an evaluation set (a split from describe documents in the Vector Space Model, particularly the training set). Convergence of weights should provide an in IR problems. Regarding text categorization, this optimal set of weights. While appealing (and probably a weighting function has been particularly related to two near optimal solution if the training data is the only important machine learning methods: KNN and SVM. The information available to the classifier), this method is TFIDF function weights each vector component (each of generally much too slow to be used, particularly for broad them relating to a word of the vocabulary) of each document problems (involving a large vocabulary). on the following basis. First, it incorporates the word frequency in the document. Thus, the more a word appears 2.3 A Weighting Methods based on Confidence in a document (e.g., its TF, term frequency is high) the more The weighting method (named ConfWeight in the rest of the it is estimated to be significant in this document. In addition, text) introduced in this paper is based on statistical IDF measures how infrequent a word is in the collection. This value is estimated using the whole training text confidence intervals. Let xt be the number of documents collection at hand. Accordingly, if a word is very frequent in containing the word t in a text collection and n, the size of the text collection, it is not considered to be particularly this text collection. We estimate the proportion of representative of this document (since it occurs in most documents containing this term to be: documents; for instance, stop words). In contrast, if the word is infrequent in the text collection, it is believed to be xt + 0.5 z 2 2 (3) p= % very relevant for the document. TFIDF is commonly used in n + z 2 2 IR to compare a query vector with a document vector using Where ~ is the Wilson proportion estimate [Wilson, 1927] a similarity or distance function such as the cosine similarity p function. There are many variants of TFIDF. The following and z2/2 is a value such that (z/2) = /2, is the common variant was used in our experiments, as found in t-distribution (Student's law) function when n < 30 and the [Yang and Liu, 1999].1 normal distribution one when n 30. So when n 302, ~ is: p n xt + 1.96 lo g ( t f t , d + 1) lo g if tf t ,d 1 (4) p= % w eig h tt ,d = xt (1) n + 3 .8 4 0 o th erw ise Thus, its confidence interval at 95% is: where tft,d is the frequency of word t in document d, n is the p (1 - p ) % % (5) p ± 1 .9 6 % number of documents in the text collection and xt is the n + 3 .8 4 number of documents where word t occurs. Normalization to unit length is generally applied to the resulting vectors Most categorization tasks can be formulated in a way to use (unnecessary with KNN and the cosine similarity function). only binary classifiers (e.g. a classifier that decides whether a document belongs to a specific category or not). Thus, for 2.2 Supervised weighting a task with n categories, there will be n binary classifiers. For a given category, let us name ~ + the equation (4) [Debole and Sebastiani, 2003] have tested and compared p some supervised weighting approaches that leverages on the applied to the positive documents (those who are labeled as training data. These approaches are variants of TFIDF being related to the category) in the training set and ~ - to p weighting where the idf part is modified using common those in the negative class. Now, we use the label MinPos functions used to conduct feature selection. In this paper, for the lower range of the confidence interval of ~ + , and the p their best finding is a variant of the Information Gain, the label MaxNeg for the higher range of that of ~ - according to Gain Ratio. Respective to a category ci, the Gain Ratio of p (5) measured on their respective training set. Let now the term tk is: MinPosRelFreq be: IG( t k , c i ) GR ( t k , c i ) = - P ( c ) lo g 2 P ( c ) (2) c {c i , ci } MinPos (6) MinPosRelFreq = MinPos + MaxNeg 1 In [Joachims, 1998a], a slight variant is used where the tf is used 2 When n < 30 (which occurs for categories with few positive without the logarithm function, but [Yang and Liu, 1999] reports instances), the t-distribution was used instead of the normal law; no significant difference in classification accuracy whether the log thus, equations should be modified accordingly. is applied or not). more frequently (relative to the number of documents) in the We now define the strength of term t for category +: positive category than in the negative one. Therefore, this log ( 2 MinPosRelFreq) if MinPos > MaxNeg weighting method favors features that are proportionally strt ,+ = 2 (7) more frequent in the positive class. This weight decreases as 0 otherwise MaxNeg increases. Eq. (7) scales the weight values linearly into the [0,1] range, so that the resulting weight is 0 when a Therefore, weight 0 iff the word appears proportionally term occurs at the same relative frequency in both classes or more frequently in the + category than in the ­ category, proportionally more frequently in the negative set. Finally, even in the worst (measured by the confidence interval) Eq. (8) makes the decrease faster, to reflect the rate at which estimation error scenario. There might be many categories features lose their "energy" as they are more evenly where weight 0, since the categorization task is divided in distributed among the positives and the negatives. As a n binary classifiers. We name the maximum strength of t: consequence, very predictive features get a high weight, regardless of their absolute frequency (only proportion ( ) 2 differences matter). maxstr(t ) = (8) max ( strt ,c ) As we are interested in weighting all training and testing cCategories documents components in the vector space model, we must use (8) with individual documents, taking the document Maxstr(t) is a global policy technique [Debole and term frequency into account. We define the ConfWeight of t Sebastiani, 2003], that is, the value is that of the best in document d as: classifier and is thereafter used for all n binary classifiers. Using a global policy allows us to use the same document representation for all n binary classifiers. While local ConfWeightt ,d = log(tf t ,d + 1)maxstr(t ) (9) policies seem intuitively more appealing than global policies when the categorization task is divided in n binary Eq. (9) is quite similar to the TFIDF equation in (1): the first problems, [Debole and Sebastiani, 2003] shown that global part weights the term according to its importance for the policies are at least as good as local policies. Note that a document while the second part weights the term globally. value of 0 for maxstr(t) is akin to a feature selection However, unlike TFIDF, ConfWeight uses the categorization deciding to reject the feature. problem to determine the weight of a particular term. Figure 1 presents an example to highlight the behavior of eq. (6) to (8). In this figure, MinPos is set to 0.5, which means that a hypothetic term occurs at least (recall that this 3 Methodology value is the lower range of its relative document frequency confidence interval) in half the documents from the positive 3.1 Corpora set. Then, the curves (labeled (6), (7) and (8) in the graph) In this paper, three data sets previously studied in the consists of the resulting weights for different values of literature have been selected. These datasets are: Reuters- MaxNeg. Eq. (6) gives more weight to terms that occur 21578, Ohsumed and the new Reuters Corpus Vol 1. Let us briefly describe these datasets. Reuters-21578 [Lewis, 1997] is made of categories related 1 to business news report. It is written using a limited 0.9 vocabulary in a succinct manner. We used the ModApte 0.8 [Lewis, 1997] split. There are 90 categories having at least 0.7 ( 6) one training and one testing document. These categories are We ig h t ( 7) 0.6 highly unbalanced. Each document may be categorized in ( 8) 0.5 more than one category. Ohsumed comes from a very large text collection (the 0.4 MedLine Bibliographical Index) and is rarely used with all 0.3 available categories and documents. We have chosen to split 0.2 this text collection as done in [Lewis et al., 1996]. The 0.1 result is a task comprising 49 closely related categories 0 using a very technical vocabulary. Similarly to Reuters, a 0 0.2 0.4 0.6 0.8 1 document may be classified in one or many categories. M axNe g Finally, Reuters Corpus Vol. 1 (RCV1) [Rose et al., 2001] is a newer text collection released by Reuters Corp. that Figure 1: Weight when varying M axN eg with a consists of one full year of new stories. There are about fixed M inPos = 0.5 850,000 documents. 103 categories have documents assigned to them. This collection is very large, thus making it a very challenging task for learning models such as SVM number of documents that have not been labeled by the and KNN, which have polynomial complexity. Particularly, classifier to the category, but that should have. we were not able to use SVM with a large training set since For any category, the classifier precision is defined as SVM does not scale up very well to large text collections. A/(A+C) and the recall as A/(A+B). To combine these two Using our KNN implementation, we have limited the measures in a single value, the F-measure is often used. The training set to the first 100,000 documents and the testing F-measure reflects the relative importance of recall versus set to the next 100,000 documents3. An average of 3.15 precision. When as much importance is granted to precision categories is assigned to each testing document (over as it is to recall we have the F1-measure: 315,000 total assignments). ( precision + recall) (10) F1 = 3.2 Classifiers, feature selection and settings 2 precision recall The weighting method presented in this paper is intended to weight documents in the Vector Space Model. Thus, it can The F1-measure is an estimation of the breakeven point be used only with classifiers using this model. For this where precision and recall meets if classifier parameters are reason, we have evaluated our method using both KNN and tuned to balance precision and recall. Since the F1 can be SVM and compared the results obtained with TFIDF and evaluated for each category, we get n different F1 values. GainRatio [Debole and Sebastiani, 2003] weighting. To compare two methods, it is needed to combine all the We have used the SVMlight package [Joachims, 1998b] F1 values. In order to do that, two approaches are often and the KNN classifier described in [Yang and Liu, 1999]. used: the macro-F1 average and the micro-F1 average. The In our experiments with SVM, we divided each macro-F1 average is the simple average of all F1 values; categorization task into n binary classification problems, as thus each category gets the same weigh in the average. In usual. In contrast, KNN is able to classify a document counterpart, the micro-F1 average weighs large categories among the n categories using one multi-category classifier. more than smaller ones. The micro-F1 is the F1 in (10) To decide whether a document is classified or not in a where A, B, C and D are global values instead of category- particular category, thresholds were learned for each based ones. For instance, A in the micro-F1 is the total category [Yang and Liu, 1999]. TFIDF experiments were number of classifications made by the n classifiers that were weighted using Eq. (1) and then normalized to unit length. good predictions. Micro-F1 has been widely used in Text GainRatio experiments were weighted as done by [Debole Categorization [Lewis et al., 1996; Yang and Liu, 1999; and Sebastiani, 2003]. Joachims, 1998a]. Table 2 includes micro-F1 results for To reach optimal classification accuracy, feature selection SVM while Table 3 includes those of KNN. For each might be required. Thus, we have included feature selection experiment, the best score (among TFIDF, GainRatio and in our tests. The Information Gain measure has been used to ConfWeight) is bolded. rank the features and many thresholds have been used to These results show that at low Information Gain filter features out; with ConfWeight, in addition to the use of thresholds, ConfWeight clearly outperforms both TFIDF and the Information Gain to select features, when maxstr (see GainRatio. When more drastic term selection is conducted, Eq. 8) was 0, the feature was also rejected. Stop words were overall scores tend to decrease for all three term weighting not removed and words were not stemmed. methods. Is it very interesting to note the very large difference between ConfWeight and TFIDF using KNN. This 4 Results and discussion difference is particularly significant for a collection of the To assess classifier accuracy, a confusion matrix is created size of RCV1. for each category: Figure 2, 3 and 4 show the curves resulting from the use of an increasing number of features (decreasing Information Classifier Classifier Gain thresholds) for each weighting method using KNN. positive label negative label Clearly, ConfWeight is the only weighting that doesn't True positive label A B suffer a decrease in accuracy as low-scored features are True negative label C D added. TFIDF results are less stable than ConfWeight and GainRatio, an observation that leads us to claim that TFIDF Table 1: Confusion matrix used to evaluate classifier accuracy is very sensitive to the choice of feature selection settings. For instance, A (the true positives) is the number of While GainRatio is less sensitive to the presence of all documents labeled by the classifier to the category that are terms (relevant or not) than TFIDF, ConfWeight seems not to correct predictions. Similarly, B (the false negatives) is the need term selection at all, arguably due to its inherent term selection mechanism. We believe that ConfWeight can be used without feature selection and produce very good results. 3 At the time these experiments were conducted, the LYRL2004 split was not yet released IGain Reuters 0.87 Weighting Ohsumed threshold 21578 0.86 TFIDF .848 .65 0.85 0 GainRatio .875 .702 0.84 Micro-F1 ConfWeight .877 .706 0.83 TFIDF .851 .679 0.82 0.001 GainRatio .875 .703 0.81 ConfWeight .877 .707 0.8 TFIDF .875 .695 0.79 0.005 GainRatio .877 .701 Nu mb e r of features ConfWeight .697 .882 TFIDF TFIDF .876 .692 GainR at io 0.01 GainRatio .877 .696 ConfW eight ConfWeight .882 .697 TFIDF .693 .874 0.015 GainRatio .871 .694 Figure 2: KNN Micro-F1s on Reuters-21578 as the number of ConfWeight .874 .696 feature increases TFIDF .842 .658 0.03 GainRatio .844 .658 0.7 ConfWeight .836 .658 0. 68 Table 2: SVM Micro-F1s by text collection and weighting method 0. 66 Micro-F1 0. 64 IGain Reuters 0. 62 Weighting Ohsumed RCV1 threshold 21578 0.6 TFIDF .816 .588 .785 0 GainRatio .834 .659 .792 0. 58 ConfWeight .861 .683 .830 Numb e r of features TFIDF .819 .645 .812 TFIDF .001 GainRatio .833 .66 .812 GainR at io ConfWeight .862 .687 .833 ConfW eight TFIDF .843 .621 .828 .005 GainRatio .840 .661 .817 Figure 3: KNN Micro-F1s on Ohsumed as the number of feature ConfWeight .861 .683 .833 increases TFIDF .856 .640 .823 .01 GainRatio .834 .659 .822 ConfWeight .864 .681 .824 0.84 TFIDF .85 .655 .811 0.82 .015 GainRatio .832 .654 .812 0.8 Mcro-F1 ConfWeight .811 .856 .675 TFIDF .640 .757 .832 i 0.78 GainRatio .799 .639 .765 .03 0.76 ConfWeight .824 .646 .749 0.74 Table 3: KNN Micro-F1s by text collection and weighting method Nu mb e r of features Another interesting remark is that the best overall scores on TFIDF each corpora, both using KNN and SVM, are obtained by GainR at io ConfWeight (Reuters-21578: .882 with SVM and .864 with ConfW eight KNN; Ohsumed: .707 with SVM, .687 with KNN; RCV1 Figure 4: KNN Micro-F1s on RCV1 as the number of feature .833 with KNN). increases References Finally, we believe that ConfWeight is able to leverage the many features that get a low Information Gain score, which [Brank et al., 2002] J. Brank, M. Grobelnik, N. Frayling and is not always the case with TFIDF and GainRatio. Let us D. Mladenic. Interaction of Feature Selection Methods and take as an example the TFIDF behavior with SVM in table 2. Linear Classification Models, In Proc. of 19th Conf. on At .005, there is much less features in the feature space than Machine Learning (ICML-02), Workshop on Text Learning. at .001. Adding features scored between .001 and .005 decreases the Micro-F1 for Reuters-21578 and Ohsumed. [Debole and Sebastiani, 2003] F. Debole and F. Sebastiani. On the other hand, the accuracy with ConfWeight increases Supervised term weighting for automated text on Ohsumed if these same low-score features are added to categorization. In Proc. of SAC-03, 18th ACM Symposium the feature space, while results on Reuters-21578 stay about on Applied Computing, Melbourne, US, 2003, pp. 784-788. the same. Using only TFIDF, we might have concluded that features which have an Information Gain lower than 0.005 [Dumais et al., 1998] S. Dumais, J. Platt, D. Heckerman and are harmful for most categorization tasks. Conversely, M. Sahami. Inductive learning algorithms and results so far using ConfWeight tend to show the relevancy representations for text categorization. In Proc. of the 1998 and usefulness of low-score features in some settings. ACM 7th International Conference on Information and Knowledge Management, 148-155, 1998. 5 Conclusions [Han 1999] E.H. Han. Text Categorization Using Weight In this paper, we have presented a new method Adjusted k-Nearest Neighbor Classification. PhD thesis, (ConfWeight) to weight features in the vector-space model University of Minnesota, Oct.1999. for text categorization by leveraging the categorization task. So far, the most commonly used method is TFIDF, which is [Joachims, 1998a] T. Joachims. Text Categorization with unsupervised. To assess our new method, tests have been Support Vector Machines: Learning with Many Relevant conducted using three well known text collections: Reuters- Features. In Proc. of the European Conference on Machine 21578, Ohsumed and Reuters Corpus Vol. 1. As Learning, Springer, 1998. ConfWeight generally outperformed TFIDF and GainRatio on these text collections, our conclusion is that ConfWeight [Joachims, 1998b] T. Joachims, Making Large-Scale SVM could be used as a replacement to TFIDF with significant Learning Practical. LS8-Report, 24, Universität Dortmund, accuracy improvements on the average, as shown in Tables 1998. 2 and 3. Moreover, ConfWeight has the ability to perform very well even if no feature selection is conducted, [Lewis et al., 1996] D.D. Lewis, R. Schapire, J. Callan, and something depicted in the results presented in this paper. R. Papka. Training Algorithms for Linear Text Classifiers, Actually, when a feature is irrelevant to the classification In Proc. of ACM SIGIR, 298-306, 1996. task, the weight it gets from ConfWeight is so low that this is merely equivalent to the feature rejection by a feature [Lewis, 1997] D.D. Lewis. Reuters-21578 text selection process. TFIDF, on the other hand, always yields a categorization test collection, Distrib. 1.0, Sept 26. 1997. score higher than 0 (if the term occurs in the document for which TFIDF is computed) and this score is not related to the [Rose et al., 2001] T.G. Rose, M. Stevenson, and M. categorization problem, but only to the text collection as a Whitehead. The Reuters Corpus Volume 1 - from whole. Since feature selection is not inherent to TFIDF, yesterday's news to tomorrow's language resources. In Proc. many additional parameters (for instance, the feature of the Third International Conference on Language selection function to use and thresholds) need to be tuned to Resources and Evaluation, Spain, 29-31 May. 2001. achieve optimal results. [Debole and Sebastiani, 2003] argue for the use of [Salton et al., 1975] G. Salton, A. Wong, and C.S. Yang. A supervised methods to weight features (GainRatio and vector space model for information retrieval. Journal of the ConfWeight are two such methods). Despite positive results American Society for Information Science, 18(11):613-620, in some settings, GainRatio failed to show that supervised Nov. 1975. weighting methods are generally higher than unsupervised ones. We believe that ConfWeight is a promising supervised [Yang and Liu, 1999] Y. Yang and X. Liu. A re- weighting technique that behaves gracefully both with and examination of text categorization methods. In SIGIR-99, without feature selection. Therefore, we advocate its use in 1999. further experiments. [Wilson, 1927] E.B. Wilson. Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22, 209, 212. 1927.