Extraction of Hierarchies Based on Inclusion of Co-occurring Words with Frequency Information

Eiko Yamamoto, Kyoko Kanzaki, Hitoshi Isahara
Computational Linguistics Group
National Institute of Information and Communications Technology
3-5 Hikari-dai, Seika-cho, Souraku-gun, Kyoto 619-0289, Japan
eiko@nict.go.jp, kanzaki@nict.go.jp, isahara@nict.go.jp

Abstract

In this paper, we propose a method of automatically extracting word hierarchies based on the inclusion relations of word appearance patterns in corpora. We applied the complementary similarity measure (CSM) to determine a hierarchical structure of word meanings. The CSM is a similarity measure developed for recognizing degraded machine-printed text. There are CSMs for both binary and gray-scale images. The CSM for binary images has been applied to estimate one-to-many relations, such as superordinate-subordinate relations, and to extract word hierarchies. However, the CSM for gray-scale images has not been applied to natural language processing. Here, we apply the latter to extract word hierarchies from corpora. To do this, we used frequency information for co-occurring words, which is not considered when using the CSM for binary images. We compared our hierarchies with those obtained using the CSM for binary images, and evaluated them by measuring their degree of agreement with the EDR electronic dictionary.

1 Introduction

The hierarchical relations of words are useful language resources. Progress is being made in lexical database research, notably with hierarchical semantic lexical databases such as WordNet [Miller et al., 1990] and the EDR electronic dictionary [EDR, 1995], which are used for natural language processing (NLP) research worldwide. These databases are essential for enabling computers, and even humans, to fully understand the meanings of words because the lexicon is the origin of language understanding and generation.

In current thesauri with hierarchical relations, words are categorized manually and classified in a top-down manner based on human intuition. This may be a practical way of developing a lexical database for NLP. However, these word hierarchies tend to vary greatly depending on the lexicographer. In fact, each thesaurus includes original hierarchical relations that differ from those in other thesauri. There is often disagreement as to the make-up of a hierarchy. In addition, hierarchical relations based on different data may be needed depending on the user. A statistical method of creating hierarchies from corpora would thus be useful. We therefore attempted to automatically extract hierarchies most suited to the information that a user handles. To do this, we extract hypernym-hyponym relations between two words from corpora and then build hierarchies by connecting these relations. As the initial task, we attempted to extract hierarchies of abstract nouns that co-occur with adjectives in Japanese.

In finding word hierarchies in corpora, it is usual to use patterns, such as "a part of," "is a," "such as," or "and," obtained from the corpora [Hearst, 1992; Berland and Charniak, 1999; Caraballo, 1999]. Methods for extracting hypernyms of entry words from definition sentences in dictionaries [Tsurumaru et al., 1986; Shoutsu et al., 2003] and methods using collocations retrieved from corpora [Nakayama and Matsumoto, 1997] have been described previously. A hybrid method that uses both dictionaries and the dependency relations of words taken from a corpus has also been reported [Matsumoto et al., 1996]. Recently, a similarity measure developed for recognizing degraded machine-printed text [Hagita and Sawaki, 1995] was used to estimate one-to-many relations, such as that of superordinate-subordinate, from a corpus [Yamamoto and Umemura, 2002]. This measure is called the complementary similarity measure (CSM) for binary images and indicates the degree of the inclusion relation between two binary vectors. In that study, each binary vector corresponds to certain appearance patterns for each term in a corpus. The CSM for binary images has also been applied to extract word hierarchies from corpora [Yamamoto et al., 2004] and to trace the distribution of abstract nouns on a self-organizing semantic map [Kanzaki et al., 2004].

In the experiments described in this paper, we attempted to use the CSM for gray-scale images [Sawaki et al., 1997] to extract hypernym-hyponym relations between two words. Specifically, we used not only binary vectors with elements of 0 or 1, but also vectors consisting of weights based on the frequencies of co-occurring words in corpora. We compared the hierarchies extracted using the CSM for gray-scale images with those extracted using the CSM for binary images. Finally, to verify the effectiveness of our approach, we evaluated our hierarchies by measuring the degree to which they agreed with the EDR electronic dictionary.

2 Experimental Data

A good deal of linguistic research has focused on the syntactic and semantic functions of abstract nouns [Nemoto, 1969; Takahashi, 1975; Kanzaki et al., 2003]. In the example, "Yagi (goat) wa seishitsu (nature) ga otonashii (gentle) (Goats have a gentle nature)," Takahashi [1975] recognized that the abstract noun "seishitsu (nature)" is a hypernym of the attribute expressed by the predicative adjective "otonashii (gentle)." To classify adjectives on the basis of these functions, Kanzaki et al. [2003] defined such abstract nouns that co-occur with adjectives as hypernyms of these adjectives. They produced linguistic data for their research by automatically extracting the co-occurrence relations between abstract nouns and adjectives from corpora.

In our experiment, we used the same corpora and abstract nouns as Kanzaki et al. The corpora consist of 100 novels, 100 essays, and 42 years' worth of newspaper articles, including 11 years of the Mainichi Shinbun, 10 years of the Nihon Keizai Shinbun, 7 years of the Sangyoukinyuuryuutsuu Shinbun, and 14 years of the Yomiuri Shinbun. The abstract nouns were selected from 2 years' worth of the Mainichi Shinbun newspaper articles by Kanzaki et al.

However, we produced our experimental data using only the sentences in the corpora that could be parsed using KNP, which is a Japanese parser developed at Kyoto University. The parsed data consisted of 354 abstract nouns and 6407 adjectives and included the following examples (the number after each adjective is the frequency with which the abstract noun co-occurs with the adjective):

    OMOI (feeling): ureshii (glad) 25, kanashii (sad) 396, shiawasena (happy) 6, ...
    KIMOCHI (thought): ureshii (glad) 204, tanoshii (pleasant) 87, hokorashii (proud) 40, ...
    KANTEN (viewpoint): igakutekina (medical) 9, rekishitekina (historical) 17, ...

3 Complementary Similarity Measure

As mentioned above, we used the complementary similarity measure (CSM) to estimate the hierarchical relations between word pairs. The CSM was developed for recognizing degraded machine-printed text [Hagita and Sawaki, 1995; Sawaki et al., 1997]. There are two kinds of CSMs, one for binary images and one for gray-scale images.

3.1 Complementary similarity measure for binary images

The CSM for binary images was developed as a character recognition measure for binary images and is designed to be robust against heavy noise or graphical design [Hagita and Sawaki, 1995]. It was applied to estimate one-to-many relationships between words by Yamamoto and Umemura [2002]. They estimated these relations from the inclusion relations between the appearance patterns of two words. An appearance pattern is expressed as an n-dimensional binary feature vector. Let $F = (f_1, f_2, \ldots, f_i, \ldots, f_n)$, where $f_i$ = 0 or 1, and let $T = (t_1, t_2, \ldots, t_i, \ldots, t_n)$, where $t_i$ = 0 or 1, be the feature vectors of the appearance patterns for two words. The CSM of F to T is then defined as follows:

$$\mathrm{CSM}(F, T) = \frac{ad - bc}{\sqrt{(a + c)(b + d)}}$$

$$a = \sum_{i=1}^{n} f_i t_i, \quad b = \sum_{i=1}^{n} f_i (1 - t_i), \quad c = \sum_{i=1}^{n} (1 - f_i)\, t_i, \quad d = \sum_{i=1}^{n} (1 - f_i)(1 - t_i), \quad n = a + b + c + d$$

In our experiment, each "word" is an abstract noun, and n is the number of adjective types in the corpus (6407). Therefore, a indicates the number of adjective types co-occurring with both abstract nouns and b indicates the number of adjective types co-occurring only with the abstract noun corresponding to F. In contrast, c indicates the number of adjective types co-occurring only with the abstract noun corresponding to T and d indicates the number of adjective types that do not co-occur with either abstract noun.

3.2 Complementary similarity measure for gray-scale images

The CSM for gray-scale images is an extension of the CSM for binary images. Although the CSM for binary images is robust against graphical design, it is strongly affected by binarization or scanning conditions [Sawaki et al., 1997]. The CSM for binary images is a special case of the four-fold point correlation coefficient. Therefore, Sawaki et al. [1997] defined the CSM for gray-scale images as a general form of the four-fold point correlation coefficient. Because it handles gray-scale images directly, this CSM is less affected by binarization or scanning conditions.

Let $F_g = (f_{g1}, f_{g2}, \ldots, f_{gi}, \ldots, f_{gn})$, where $f_{gi}$ = 0 through 1, and let $T_g = (t_{g1}, t_{g2}, \ldots, t_{gi}, \ldots, t_{gn})$, where $t_{gi}$ = 0 through 1, be the feature vectors of two gray-scale patterns. Then, the CSMg of $F_g$ to $T_g$ is defined as follows:

$$\mathrm{CSM}_g(F_g, T_g) = \frac{a_g d_g - b_g c_g}{\sqrt{n T_{g2} - T_g^2}}$$

$$a_g = \sum_{i=1}^{n} f_{gi} t_{gi}, \quad b_g = \sum_{i=1}^{n} f_{gi} (1 - t_{gi}), \quad c_g = \sum_{i=1}^{n} (1 - f_{gi})\, t_{gi}, \quad d_g = \sum_{i=1}^{n} (1 - f_{gi})(1 - t_{gi})$$

$$T_g = \sum_{i=1}^{n} t_{gi}, \quad T_{g2} = \sum_{i=1}^{n} t_{gi}^2$$
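As a minimal illustration, the two measures defined above can be written in Python as follows. This is a sketch under stated assumptions, not the implementation used in the experiments: the function names `csm_binary` and `csm_gray` are hypothetical, the toy vectors are invented, and returning 0.0 when the denominator vanishes is a convention chosen here. The values are returned unnormalized.

```python
import math
from typing import Sequence


def csm_gray(f: Sequence[float], t: Sequence[float]) -> float:
    """CSMg of pattern f to pattern t; elements are expected to lie in [0, 1]."""
    if len(f) != len(t):
        raise ValueError("feature vectors must have the same length")
    n = len(t)
    a = sum(fi * ti for fi, ti in zip(f, t))
    b = sum(fi * (1 - ti) for fi, ti in zip(f, t))
    c = sum((1 - fi) * ti for fi, ti in zip(f, t))
    d = sum((1 - fi) * (1 - ti) for fi, ti in zip(f, t))
    t_sum = sum(t)                   # T_g
    t_sq = sum(ti * ti for ti in t)  # T_g2
    # max() guards against a tiny negative value caused by rounding.
    denom = math.sqrt(max(n * t_sq - t_sum * t_sum, 0.0))
    # Assumption: a constant t (zero denominator) yields 0.0.
    return (a * d - b * c) / denom if denom > 0 else 0.0


def csm_binary(f: Sequence[int], t: Sequence[int]) -> float:
    """CSM of binary pattern f to binary pattern t."""
    if len(f) != len(t):
        raise ValueError("feature vectors must have the same length")
    a = sum(fi * ti for fi, ti in zip(f, t))
    b = sum(fi * (1 - ti) for fi, ti in zip(f, t))
    c = sum((1 - fi) * ti for fi, ti in zip(f, t))
    d = sum((1 - fi) * (1 - ti) for fi, ti in zip(f, t))
    denom = math.sqrt((a + c) * (b + d))
    return (a * d - b * c) / denom if denom > 0 else 0.0


# Invented toy patterns: every adjective seen with T is also seen with F.
F = [1, 1, 1, 1, 0, 0, 0, 0]
T = [1, 1, 0, 0, 0, 0, 0, 0]
print(csm_binary(F, T))  # 8 / sqrt(12), about 2.309
print(csm_binary(T, F))  # 2.0
print(csm_gray(F, T))    # same as csm_binary(F, T) for 0/1 input
```

For 0/1 vectors, $T_g = T_{g2} = a + c$, so $n T_{g2} - T_g^2 = (a + c)(b + d)$ and the gray-scale form reduces to the binary one, which the last line of the example checks numerically.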
In our experiment, $f_{gi}$ and $t_{gi}$ are the weights based on the frequency at which an abstract noun co-occurred with the i-th type of adjective. In this paper, we used the following weighting function, where Freq(noun, adj) is the frequency at which the abstract noun co-occurs with the adjective:

$$\mathrm{Weight}(noun, adj) = \frac{\mathrm{Freq}(noun, adj)}{\mathrm{Freq}(noun, adj) + 1}$$

We paid particular attention to situations in which the noun co-occurred with the adjective. If the noun does not co-occur with the adjective, that is, Freq(noun, adj) is 0, the weight is 0.0. If Freq(noun, adj) is 1, it is 0.5. If Freq(noun, adj) is more than 1, the weight increases gradually toward 1.0. This is because information on whether or not the noun co-occurs with the adjective is more important than information on how many times the noun co-occurs with the adjective.

4 Hierarchy Extraction Process

Word hierarchies were extracted as follows, where "TH" is a threshold value for each word pair under consideration:

1. Compute the similarity between appearance patterns for each pair of words. The hierarchical relation between the two words in a pair is determined by the similarity value. The pair is expressed as (X, Y), where X is a hypernym of Y and Y is a hyponym of X.
2. Sort the pairs by their normalized values and eliminate pairs with values below TH.
3. For each abstract noun C:
   i   Choose the hypernym-hyponym pair (C, D) with the highest value. This pair (C, D) is placed in the initial hierarchy.
   ii  Choose a pair (D, E) such that the hyponym E is not contained in the current hierarchy and (D, E) has the highest value among the pairs where word D, at the bottom of the current hierarchy, is a hypernym. Connect the hyponym E to D at the bottom of the current hierarchy.
   iii Choose another pair (E, F) according to the previous step, and repeat the process until no more such pairs can be chosen.
   iv  Choose a pair (B, C) such that the hypernym B is not contained in the current hierarchy and (B, C) has the highest value among the pairs where the top word C of the current hierarchy is a hyponym. Connect the hypernym B in front of C at the top of the current hierarchy.
   v   Choose another pair (A, B) according to the previous step, and repeat the process until no more such pairs can be chosen.
4. If a short hierarchy is included in a longer hierarchy and the order of the words stays the same, the short one is dropped from the list of hierarchies.

5 Parameters

The conditions of our experiments were set as follows:

    CSM for binary images: TH = 0.2
    CSM for gray-scale images: TH = 0.12

If we set TH to a low value, it is possible to obtain long hierarchies. When the TH is too low, however, the number of word pairs that have to be considered becomes overwhelming and the reliability of the measurement decreases. We experimentally set TH as shown above so as to obtain "koto (matter)" as the top word of all hierarchies. Because "koto" co-occurred with the largest number of adjectives, we predicted that "koto" would be at the top of the hierarchies.

6 Comparison and Evaluation

6.1 Overlap between extracted hierarchies

First, we compared the hierarchies obtained using the CSM for binary images (CSMb) and the CSM for gray-scale images (CSMg). Table 1 lists the number of extracted hierarchies. The CSMb extracted 189 hierarchies, while the CSMg extracted 178. There were only 28 common hierarchies, with most of them having depths ranging from 3 to 6. One of the common hierarchies is shown below.

    koto (matter) ---
    jyoutai (state) ---
    kankei (relation) ---
    tsunagari (relationship) ---
    en (ties/bonds).

The number of hierarchies obtained by the CSMg that included one or more hierarchies obtained by the CSMb was higher than the number obtained by the CSMb that included one or more obtained by the CSMg, i.e., (D) < (E). This suggests that the CSMg might be able to extract longer hierarchies than the CSMb can.

    Type of hierarchy                                          Number
    (A) Hierarchies obtained by CSMb                           189
    (B) Hierarchies obtained by CSMg                           178
    (C) Common hierarchies                                     28
    (D) Inclusion of CSMg hierarchies in CSMb hierarchies      5
    (E) Inclusion of CSMb hierarchies in CSMg hierarchies      38

    Table 1: Number of extracted hierarchies

The CSMg hierarchies shown below include one or more CSMb hierarchies. The underlined nouns are those that appear in one of the CSMb hierarchies.

    koto (matter) ---
    tokoro (point) ---
    imeeji (image) ---
    inshou (impression) ---
    gaiken (appearance) ---
    monogoshi (manner) ---
    kihin (elegance) ---
    hinkaku (grace) ---
    kettou (pedigree) ---
    kakei (family line)

    koto (matter) ---
    tokoro (point) ---
    shigusa (behavior) ---
    omokage (visage) ---
    kawaisa (loveliness)

In fact, the depths of the CSMb hierarchies ranged from 3 to 12, while the depths of the CSMg hierarchies ranged from 3 to 15. We also found that the CSMg extracted a greater number of deep hierarchies than the CSMb but fewer shallow hierarchies. Overall, these results show that the CSMg extracted deeper hierarchies than did the CSMb, though the number of extracted hierarchies was smaller (see Figure 1).

    Depth    3   4   5   6   7   8   9  10  11  12  13  14  15
    CSMb     4  33  45  36  22  16  20   7   2   4   -   -   -
    CSMg     5  29  24  32  29  18  10  16   8   4   1   1   1

    Figure 1: Distribution of hierarchies by depth (x-axis: depth of hierarchy; y-axis: number of hierarchies; annotation in the plot: "CSMg extracted more deep hierarchies than CSMb.")

6.2 Agreement with the concept hierarchies in the EDR electronic dictionary

Next, we compared each of the hierarchies obtained by the CSMb and the CSMg with the concept hierarchies for adjectives in the EDR electronic dictionary.

In this paper, we extracted the hierarchies from corpora consisting mostly of newspaper articles. Because the newspaper articles cover a wide range of topics, the information we obtain from our corpora is general knowledge. Therefore, it is reasonable to compare our hierarchy of abstract nouns with existing general purpose hierarchies such as the EDR hierarchy.

The number of hierarchies for adjectives obtained from the EDR dictionary is 932, and the maximum depth is 14, whereas our hierarchies had maximum depths of 12 and 15, as noted above (Figure 1). Because both the EDR and our two types of hierarchies had similar maximum depths, it was appropriate to evaluate our hierarchies by comparing them with those of the EDR dictionary.

Hierarchy in the EDR electronic dictionary

The EDR electronic dictionary [EDR, 1995], which was developed for advanced processing of natural language by computers, is composed of 11 sub-dictionaries, including a concept dictionary, word dictionaries, bilingual dictionaries, etc.

Although we could verify and analyse our extracted hierarchies by comparing them with the EDR dictionary, there were two problems with this approach. First, many concepts in the EDR dictionary are defined by sentences or phrases, whereas the concepts in our extracted hierarchies are defined by abstract nouns only. Therefore, we replaced the sentences and phrases in the EDR concept definitions with sequences of words for a more accurate comparison. Secondly, the words used to define concepts in the EDR differed from the abstract nouns used for our extracted hierarchies. To solve this problem, we extracted synonyms from the EDR dictionary and added synonyms to both the words in the EDR concept definitions and the abstract nouns in our hierarchies. We thus transformed the conceptual hierarchies of adjectives in the EDR dictionary into hierarchies consisting of sequences of word sets to enable a comparison with our hierarchies consisting of adjective hypernyms.

Measurement of agreement level

For the comparison, we measured the degree of agreement with an EDR hierarchy for each extracted hierarchy, i.e., we counted the number of nodes that agreed with nodes in the corresponding EDR hierarchy, while maintaining the order of each hierarchy. A node in one of our hierarchies is a set of abstract nouns and their synonyms. We represented the node by Node(abstract noun, synonym1, synonym2, ...), where Node was a name identifying the node, while a node in an EDR hierarchy is a set of sequences of words and synonyms and was represented by Node(content word1, synonym11, synonym12, ..., content word2, synonym21, synonym22, ...). Therefore, if a word in a node of one of our hierarchies was included in a node of an EDR hierarchy, we considered that the nodes agreed. For example, if our hierarchy is

    A(a, a', a") ---
    B(b, b', b") ---
    C(c, c', c") ---
    D(d, d', d")

and a corresponding EDR hierarchy is

    P(a, a', x, x') ---
    Q(b, b") ---
    R(r, r', r") ---
    S(s, s', d, f, f', f") ---
    T(t, t', g, g'),

we count three agreement nodes, because A, B, D match to P, Q, S, respectively. The matching words are a and a' (A and P), b and b" (B and Q), and d (D and S). We therefore define the level of agreement as three.
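The node-matching count described above can be sketched in Python as follows. This is one possible reading, not the procedure actually used: it assumes that "maintaining the order" means finding the largest order-preserving one-to-one matching between the two node sequences (a longest-common-subsequence computation in which two nodes match when their word sets intersect). The function name `agreement_level` and the representation of nodes as Python sets are hypothetical.

```python
from typing import List, Set


def agreement_level(ours: List[Set[str]], edr: List[Set[str]]) -> int:
    """Largest number of node pairs that agree while keeping both orders."""
    m, n = len(ours), len(edr)
    # best[i][j]: agreement level between ours[:i] and edr[:j]
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            if ours[i - 1] & edr[j - 1]:  # nodes agree if they share a word
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + 1)
    return best[m][n]


# The schematic example from the text.
ours = [
    {"a", "a'", 'a"'},
    {"b", "b'", 'b"'},
    {"c", "c'", 'c"'},
    {"d", "d'", 'd"'},
]
edr = [
    {"a", "a'", "x", "x'"},
    {"b", 'b"'},
    {"r", "r'", 'r"'},
    {"s", "s'", "d", "f", "f'", 'f"'},
    {"t", "t'", "g", "g'"},
]
print(agreement_level(ours, edr))  # 3
```

Because each node is matched at most once under this reading, two of our nodes that fall into the same EDR node contribute only one agreement.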
In comparing hierarchies, we found cases in which a hypernym and its hyponym in our hierarchy were treated as synonyms in the EDR electronic dictionary. For example, consider the following hierarchy obtained using our approach:

    koto (matter) ---
    tokoro (point) ---
    imeeji (image) ---
    funiki (atmosphere) ---
    kuuki (atmosphere in a place) ---
    kanjyou (feeling/emotion) ---
    shinjyou (one's feelings/one's sentiment) ---
    shinkyou (mental state/one's heart) ---
    kangai (deep emotion) ---
    omoide (memories).

In the EDR electronic dictionary, each word is linked to its concept, and a synonymous relation is defined as words linked to the same concept. That is, we can gather synonyms via the EDR dictionary. In fact, in the above hierarchy, we obtained "shinjyou (one's feelings/one's sentiment)" and "shinkyou (mental state/one's heart)" as synonyms of "kanjyou (feeling/emotion)" from the EDR dictionary. Also, we know that "kuuki (atmosphere in a place)" is a synonym of "funiki (atmosphere)".

If we count the agreement of the above hierarchy with the EDR dictionary strictly, the level of agreement is 6. The agreement nodes are "koto (matter) -- tokoro (point) -- imeeji (image) -- funiki (atmosphere) or kuuki (atmosphere in a place) -- kanjyou (feeling/emotion), shinjyou (one's feelings/one's sentiment), or shinkyou (mental state/one's heart) -- omoide (memories)". However, if we accept hypernym-hyponym relations among synonyms, the agreement level is 9. In this case, the agreement nodes are "koto (matter) -- tokoro (point) -- imeeji (image) -- funiki (atmosphere) -- kuuki (atmosphere in a place) -- kanjyou (feeling/emotion) -- shinjyou (one's feelings/one's sentiment) -- shinkyou (mental state/one's heart) -- omoide (memories)".

Results

Table 2 shows the distribution of CSMb hierarchies for various agreement levels. Table 3 shows the same results for the CSMg. The numbers in italics in the tables indicate the number of hierarchies at that depth which are completely included in an EDR hierarchy. In the last column, we show the average agreement level at each depth. The value in the bottom right-hand corner is the average agreement level for all hierarchies.

    Depth of     Number of hierarchies per agreement level     Ave.
    hierarchy    (scale 1 to 9; nonzero entries only,          level
                 in order of increasing level)
    3            1, 2, 1                                       2.00
    4            6, 18, 9                                      3.09
    5            7, 23, 12, 3                                  3.24
    6            4, 12, 9, 7, 4                                3.86
    7            2, 2, 10, 4, 3, 1                             4.32
    8            1, 6, 6, 3                                    4.69
    9            1, 4, 5, 4, 5, 1                              5.55
    10           2, 2, 2, 1                                    5.29
    11           1, 1                                          4.50
    12           1, 1, 2                                       6.25
    Overall ave.                                               4.28

    Table 2: Distribution of CSMb hierarchies for various agreement levels

    Depth of     Number of hierarchies per agreement level     Ave.
    hierarchy    (scale 1 to 9; nonzero entries only,          level
                 in order of increasing level)
    3            1, 3, 1                                       2.50
    4            6, 13, 10                                     3.14
    5            3, 9, 9, 3                                    3.50
    6            1, 11, 12, 6, 2                               3.91
    7            1, 5, 10, 8, 5                                4.38
    8            4, 5, 7, 2                                    4.39
    9            6, 1, 3                                       4.70
    10           2, 6, 4, 3, 1                                 6.69
    11           1, 2, 1, 3, 1                                 5.13
    12           1, 3                                          8.75
    13           1                                             7.00
    14           1                                             8.00
    15           1                                             9.00
    Overall ave.                                               5.47

    Table 3: Distribution of CSMg hierarchies for various agreement levels

Figure 2 is a graph of the average agreement level at each depth shown in Tables 2 and 3. In Figure 2, except at the depths of 8 and 9, the average agreement levels for the CSMg hierarchies are higher than those of the CSMb hierarchies. As shown in Tables 2 and 3, the deeper hierarchies tended to have higher agreement levels. Therefore, we consider that overall the CSMg hierarchies were closer to the EDR hierarchies than were the CSMb hierarchies. That is, the CSMg hierarchies were more in accordance with human intuition than were the CSMb hierarchies.

    Figure 2: Comparison of CSMb and CSMg hierarchies by average agreement level (x-axis: depth of hierarchy, 3 to 15 and overall; y-axis: average agreement level; series: CSMb, CSMg)

We also verified the ability of the CSM to estimate hypernym-hyponym relations between two nouns. For some noun pairs, the relations estimated by the two CSMs were opposite to each other. Table 4 shows some such pairs. For each of them, the CSMg estimated that the noun on the left was a hypernym and the one on the right was a hyponym. The CSMb estimated the reverse.

    Noun pair
    tokoro (point)                         imeeji (image)
    tokoro (point)                         men (side)
    tokoro (point)                         inshou (impression)
    tokoro (point)                         seikaku (character)
    tokoro (point)                         seishitsu (property)
    tokoro (point)                         kanshoku (touch)
    kimochi (thought/feeling/intention)    omoi (feeling/mind/expectation)
    kagayaki (brightness)                  koutaku (gloss)
    kuukan (space)                         men (side)
    kotoba (speech)                        iken (opinion)
    kokoro (mind)                          shinjyou (one's feelings/one's sentiment)
    hiyori (fine weather)                  ondo (temperature)

    Table 4: Noun pairs estimated oppositely

In our experiment, there were 836 such pairs. As the total number of pairs considered was 17201, these pairs amounted to less than 5%. We also found that in most cases these pairs appeared in the middle of a hierarchy.

Let us consider the hypernym-hyponym relation between "kimochi (thought/feeling/intention)" and "omoi (feeling/mind/expectation)". A CSMg hierarchy including "kimochi" and "omoi" was as follows:

    koto (matter) ---
    tokoro (point) ---
    imeeji (image) ---
    inshou (impression) ---
    kanji (feeling/sense) ---
    kibun (feeling/mood) ---
    kimochi (thought/feeling/intention) ---
    omoi (feeling/mind/expectation) ---
    negai (wish) ---
    nen (desire).

Here, "kimochi" was estimated as a hypernym of "omoi." However, CSMb estimated the opposite, i.e., "omoi" was a hypernym of "kimochi." We examined the values given by the CSMb and CSMg to "kimochi" and "omoi" (see Table 5).

    (F, T)    ("omoi", "kimochi")    ("kimochi", "omoi")    Diff.
    CSMb      0.8094                 0.8064                 0.0030
    CSMg      0.7632                 0.7700                 0.0068

    Table 5: Differences in CSM values for "omoi" and "kimochi"

As shown in Table 5, the values of CSMg in both directions were similar, as was the case for CSMb. As the CSM is a measure of inclusion, if the values of the CSM for two words in both directions are similar, it might mean that the two words are synonymous. Both the CSMg and CSMb estimated that "kimochi" and "omoi" were similar. However, due to the very small differences in values, they extracted opposite results. In fact, in the EDR electronic dictionary, "kimochi" and "omoi" are synonymous because both have the meaning of "feeling". The pairs estimated oppositely by the two CSMs may have a synonymous relation. In future work, we will introduce a CSM-based definition for estimating that two words are synonyms.

A small number of word pairs were estimated oppositely by the two CSMs, with large differences in the CSM value of each direction. For example, CSMg estimated that "tokoro (point)" was a hypernym of "imeeji (image)" and CSMb estimated that "imeeji" was a hypernym of "tokoro" (see Table 6).

    (F, T)    ("tokoro", "imeeji")    ("imeeji", "tokoro")    Diff.
    CSMb      0.6767                  0.7156                  +0.0389
    CSMg      0.6631                  0.6468                  -0.0163

    Table 6: Differences in CSM values for "tokoro" and "imeeji"

The introduction of frequency information for CSMg may be the reason for this difference. In future work, we will analyze this in more detail.

7 Conclusion

We proposed a method of automatically extracting hierarchies based on the inclusion relations of the appearance patterns of words from corpora. In this paper, we described our attempts to extract objective hierarchies of abstract nouns co-occurring with adjectives in Japanese. We applied the complementary similarity measure for gray-scale images (CSMg) to search for better hierarchical word structures. In our experiment, we found that the CSMg could extract word hierarchies from corpora, even though it was developed for recognizing degraded machine-printed text. We also compared the CSMg with the CSM for binary images (CSMb), and found that the CSMg hierarchies were more in accordance with human intuition than were the CSMb hierarchies, as measured by their degree of agreement with the EDR electronic dictionary. As a next step, it would be interesting to compare this method to existing statistical methods such as agglomerative clustering. It would also be necessary to estimate confidence.

In this paper, we thus verified the suitability of the proposed method for extracting hierarchies from corpora. We consider that hierarchies tuned to specific corpora could be used for query expansion in information retrieval for specific domains. Our future work will include extracting hierarchies of key words in corpora and trying to utilize them as a special thesaurus for domain-oriented information retrieval.

References

[Berland and Charniak, 1999] Berland, M. and Charniak, E. Finding parts in very large corpora, In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 57-64, 1999.

[Caraballo, 1999] Caraballo, S. A. Automatic construction of a hypernym-labeled noun hierarchy from text, In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 120-126, 1999.

[EDR, 1995] EDR Electronic Dictionary. 1995. http://www2.nict.go.jp/kk/e416/EDR/index.html

[Hagita and Sawaki, 1995] Hagita, N. and Sawaki, M. Robust recognition of degraded machine-printed characters using complementary similarity measure and error-correction learning, In Proceedings of the SPIE - The International Society for Optical Engineering, 2442: pp. 236-244, 1995.

[Hearst, 1992] Hearst, M. A. Automatic acquisition of hyponyms from large text corpora, In Proceedings of the 14th International Conference on Computational Linguistics, pp. 539-545, 1992.

[Kanzaki et al., 2003] Kanzaki, K., Ma, Q., Yamamoto, E., Murata, M. and Isahara, H. Adjectives and their abstract concepts --- toward an objective thesaurus from semantic map, In Proceedings of the Second International Workshop on Generative Approaches to the Lexicon, pp. 177-184, 2003.

[Kanzaki et al., 2004] Kanzaki, K., Yamamoto, E., Ma, Q. and Isahara, H. Construction of an objective hierarchy of abstract concepts via directional similarity, In Proceedings of 20th International Conference on Computational Linguistics, Vol., pp. 1147-1153, 2004.

[Matsumoto et al., 1996] Matsumoto, Y., Sudo, S., Nakayama, T. and Hirao, T. Thesaurus construction from multiple language resources, In IPSJ SIG Notes NL-93, pp. 23-28, 1996.

[Miller et al., 1990] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K. and Tengi, R. Five Papers on WordNet, Technical Report CSL Report 43, Cognitive Science Laboratory, Princeton University, 1990.

[Nakayama and Matsumoto, 1997] Nakayama, T. and Matsumoto, Y. Positioning nouns in a classification-based thesaurus, In IPSJ SIG Notes NL-120, pp. 103-108, 1997.

[Nemoto, 1969] Nemoto, K. Combination of noun with "ga-case" and adjective, Language Research for the Computer 2, National Language Research Institute, pp. 63-73, 1969 (In Japanese).

[Sawaki et al., 1997] Sawaki, M., Hagita, N. and Ishii, K. Robust character recognition of gray-scaled images with graphical designs and noise, In Proceedings of the International Conference on Document Analysis and Recognition, pp. 491-494, 1997.

[Shoutsu et al., 2003] Shoutsu, Y., Tokunaga, T. and Tanaka, H. The integration of Japanese dictionary and thesaurus, In IPSJ SIG Notes NL-153, pp. 141-146, 2003.

[Takahashi, 1975] Takahashi, T. Various phase related to part-whole relation investigated in sentence, Studies in the Japanese Language 103, The Society of Japanese Linguistics, pp. 1-16, 1975 (In Japanese).

[Tsurumaru et al., 1986] Tsurumaru, H., Hitaka, T. and Yoshida, S. Automatic extraction of hierarchical relation between words, In IPSJ SIG Notes NL-83, pp. 121-128, 1986.

[Yamamoto and Umemura, 2002] Yamamoto, E. and Umemura, K. A similarity measure for estimation of one-to-many relationship in corpus, In Journal of Natural Language Processing, pp. 45-75, 2002.

[Yamamoto et al., 2004] Yamamoto, E., Kanzaki, K. and Isahara, H. Hierarchy extraction based on inclusion of appearance, In ACL04 Companion Volume to the Proceedings of the Conference, pp. 149-152, 2004.