International Journal of Computational Linguistics & Chinese Language Processing
Vol. 7, No. 1, February 2002

A Hybrid Approach for Automatic Classification of Chinese Unknown Verbs

曾慧馨, 劉昭麟, 高照明, 陳克健

In this paper, we present a hybrid approach to the automatic classification of Chinese unknown verbs. The first method of the hybrid approach uses a set of morphological rules summarized from the training data, i.e., the set of compound verbs extracted from the Sinica corpus, to determine the category of an unknown compound verb. If no morphological rule is applicable, instance-based categorization using the k-nearest-neighbor method is employed for the classification. It was observed that some suffix morphemes occur frequently in compound verbs and also uniquely determine the syntactic categories of the resulting compound verbs. By processing the training data, 15 suffix rules with coverage over 2% and category-prediction accuracy higher than 80% were derived. In addition to this type of morphological rule, reduplication rules are also useful for category prediction, such as the well-known Chinese reduplication patterns: "AA" for two-character words, "AAB" and "ABB" for three-character words, and "ABAB" for four-character words. For instance, "喝喝茶" has the same category as "喝茶," and "研究研究" has the same category as "研究." As a result, nine reduplication patterns were generated. Experimenting on the training data, we found that the overall accuracy of the morphological rule classifier is 91.67%, but its coverage is only 23.19%.
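As a minimal sketch of how such reduplication rules can be applied, the function below reduces a candidate word to its base form, whose category the reduplicated form would then inherit. The pattern names follow the abstract; the reduction logic itself is an illustrative assumption, not the paper's actual set of nine patterns:

```python
def base_form(word):
    """Reduce a reduplicated Chinese verb to a base form (sketch).

    Illustrative patterns (the paper derives nine in total):
      AA   -> A    e.g. 看看 -> 看
      AAB  -> AB   e.g. 喝喝茶 -> 喝茶
      ABB  -> AB
      ABAB -> AB   e.g. 研究研究 -> 研究
    """
    n = len(word)
    if n == 2 and word[0] == word[1]:      # AA
        return word[0]
    if n == 3 and word[0] == word[1]:      # AAB
        return word[0] + word[2]
    if n == 3 and word[1] == word[2]:      # ABB
        return word[0] + word[1]
    if n == 4 and word[:2] == word[2:]:    # ABAB
        return word[:2]
    return word                            # no reduplication pattern applies
```

The unknown verb is then assigned the syntactic category of the recovered base form when that form is a known word.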

Since the coverage of the morphological rule classifier is low, an instance-based categorization method is employed to handle the uncovered cases. Instance-based categorization uses similar examples to predict the category of an unknown verb. Lexical similarity is measured by both semantic similarity and syntactic similarity. The semantic similarity between two words is measured by the semantic distance between their HowNet definitions, and the syntactic similarity is measured by the distance between their syntactic categories. The distance between two syntactic categories is the cosine measure of their grammatical feature vectors derived from the Sinica Treebank. The category of an unknown verb is predicted to be that of the examples most similar to it according to the above similarity criteria. When tested on the training data, the optimal accuracy of instance-based categorization is 71.05%, achieved when the similar examples are drawn from both unknown verbs and verbs in the dictionary (known verbs).
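A minimal sketch of the cosine-based nearest-neighbor step is given below. The feature vectors and category labels are toy values; the paper additionally combines this syntactic measure with a HowNet-based semantic distance, which is not reproduced here:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two grammatical feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_category(unknown_vec, examples, k=3):
    """examples: list of (feature_vector, category) pairs.
    Predict the majority category among the k most similar examples."""
    ranked = sorted(examples, key=lambda ex: cosine(unknown_vec, ex[0]),
                    reverse=True)
    votes = Counter(cat for _, cat in ranked[:k])
    return votes.most_common(1)[0][0]
```

In the paper's setting, the example pool would contain both known verbs from the dictionary and previously classified unknown verbs.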

Both the morphological rule classifier and instance-based categorization have the advantage of not only predicting the syntactic categories of unknown words but also recognizing their morphological structures and major semantic classes. The advantage of the morphological rule classifier is its higher accuracy, while that of instance-based categorization is its higher coverage. However, each method has its own drawback: the former cannot be applied to most unknown verbs, while the latter suffers from lower accuracy. For the open test, 1,000 unknown verbs unseen in the training process were tested. The accuracy of the morphological rules is 87.25%, and that of instance-based categorization is 65.04%. The overall accuracy of the hybrid approach is 70.80%.
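The hybrid control flow described above can be sketched as follows: the high-accuracy rules are tried first, and the high-coverage kNN classifier serves as the fallback. The "-化" suffix rule in the usage example is a hypothetical illustration, not one of the paper's actual 15 rules:

```python
def classify_unknown_verb(word, rule_classifier, knn_classifier):
    """Hybrid scheme: apply the high-accuracy morphological rules
    first; fall back to the high-coverage instance-based (kNN)
    classifier when no rule fires."""
    category = rule_classifier(word)   # returns None if no rule applies
    if category is not None:
        return category
    return knn_classifier(word)
```

For example, with a toy rule classifier `lambda w: "VC" if w.endswith("化") else None` and a toy kNN fallback, "現代化" would be classified by the rule and any uncovered verb by the fallback.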

Word Sense Disambiguation and Sense-Based NV Event Frame Identifier

Jia-Lin Tsai, Wen-Lian Hsu, Jeng-Woei Su

Word sense is ambiguous in natural language processing (NLP). This phenomenon is particularly acute in cases involving noun-verb (NV) word pairs. This paper describes a sense-based noun-verb event frame (NVEF) identifier that can effectively disambiguate word sense in Chinese sentences. A knowledge representation system (the NVEF-KR tree) for the NVEF sense-pair identifier is also proposed. We use the word senses of Hownet, a Chinese-English bilingual knowledge-base dictionary.

Our experiment showed that the NVEF identifier was able to achieve 74.8% accuracy for the test sentences studied based only on NVEF sense-pair knowledge. By applying the techniques of longest syllabic NVEF-word-pair first and exclusion word checking, the sense accuracy for the same test sentences could be further improved to 93.7%. There were four major reasons for the incorrect cases: (1) lack of a bottom-up tagger, (2) lack of non-NVEF knowledge, (3) inadequate word segmentation, and (4) lack of a multi-NVEF analyzer. If these four problems could be resolved, the accuracy would reach 98.9%.
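As a hedged sketch of the "longest syllabic NVEF-word-pair first" technique mentioned above, the selection can be viewed as a greedy preference for the candidate noun-verb pair with the greatest combined character (syllable) count; the candidate pairs in the usage example are hypothetical:

```python
def pick_nvef_pair(candidates):
    """Greedy 'longest syllabic NVEF-word-pair first' selection:
    among candidate (noun, verb) pairs found in a sentence, prefer
    the pair with the greatest combined character count."""
    return max(candidates, key=lambda nv: len(nv[0]) + len(nv[1]))
```

Longer word pairs are less ambiguous segmentations, so preferring them reduces spurious sense-pair matches.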

The results of this study indicate that NVEF sense-pair knowledge is effective for word sense disambiguation and is likely to be important for general NLP.

word sense disambiguation, event frame, top-down identifier, Hownet

A Study of Semantic Disambiguation Based on HowNet

Yang Xiaofeng, Li Tangqiu

This thesis describes a semantic disambiguation model applied in the syntactic parsing process of a machine translation system.

The model uses Hownet as its main semantic resource. Hownet is a common-sense knowledge base that reveals inter-conceptual relations and inter-attribute relations among concepts, as connoted in Chinese lexical items and their English equivalents. It provides rich semantic information for our disambiguation.

The model performs word sense and structural disambiguation by "preference": preference scoring is applied to the results produced by the parsing process, combining rule-based and statistics-based methods.

First, we extract the co-occurrence information of each sense-atom from a large corpus. The corpus is untagged, so the extraction process is unsupervised. We construct restricted rules from the co-occurrence information according to certain transfer templates. Since the semantic entry of a word in Hownet is made up of sense-atoms, we can derive restricted rules for each entry of any word.

During disambiguation, the model constructs a context-related word set for each notional word in the input sentence. The semantic collocation relations between notional words play a very important role in syntactic structure disambiguation. Our evaluation of candidate parses is based on the tightness of the match between the notional words in the structure. We compare the context-related word set of a word in the current structure with all the restricted rules of that word in the lexicon and find the best match. The entry with the best match is then taken as the word's interpretation. The degree of similarity shows how well the word in the structure matches the other notional words in it, so it can be taken as the word's score. Because different candidate parses of a structure differ, the same word has different context-related word sets and therefore receives different scores. We select the best parse according to the scores of all the notional words in the sentence. In this way, most word sense disambiguation and structural disambiguation can be resolved at the same time.
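A minimal sketch of the best-match step, under the assumption that each restricted rule can be represented as a set of co-occurring words for one sense entry (the sense labels and word sets below are hypothetical, not Hownet entries):

```python
def best_sense(context_words, sense_rules):
    """sense_rules: dict mapping a sense label to the set of
    co-occurring words in its restricted rule. Score each sense by
    its overlap with the context-related word set and pick the best."""
    def score(rule_words):
        return len(context_words & rule_words) / max(len(rule_words), 1)
    return max(sense_rules, key=lambda s: score(sense_rules[s]))
```

Run once per candidate parse, the per-word scores can then be summed over all notional words to rank the parses themselves, as described above.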

The semantic disambiguation model proposed in this thesis has been implemented in the MTG system. Our experiments show that the model is very effective for this purpose, and it is clearly more tolerant than the traditional yes-or-no, clear-cut method.

In this thesis, we first put forward the general idea of the method and give a brief introduction to the Hownet dictionary. We then describe the methods for extracting co-occurrence information for each sense-atom from the corpus and for transforming this information into restricted rules. Next, the disambiguation algorithm is presented in detail, including the construction of context-related word sets and the calculation of similarity between sense-atoms and between restricted rules and context-related sets. The experimental results given at the end of the paper show that the method is effective.

Word Sense Disambiguation, Hownet, Interlingua, Sense Atom, Corpus, Semantic Environment

Cross-Language Text Filtering Based on Text Concepts and kNN

Weifeng Su, Shaozi Li, Tanqiu Li, Wenjian You

The WWW is increasingly being used as a source of information, and users access this growing volume of information with direct-manipulation tools. Clearly, we would like a tool that keeps the texts we want and removes the texts we do not want from the large flow of information reaching us. This paper describes a module that sifts through the large number of texts retrieved by a user.

The module is based on HowNet, a knowledge dictionary developed by Mr. Zhendong Dong. In this dictionary, the concept of a word is divided into sememes. In the philosophy of HowNet, all concepts in the world can be expressed by combinations of more than 1,500 sememes. The sememe is a very useful concept for settling the problem of synonymy, which is the most difficult problem in text filtering. We divided the set of sememes into two subsets: classifiable sememes and unclassifiable sememes. Classifiable sememes are those that are more useful in distinguishing a document's class from other documents; unclassifiable sememes are those that appear similarly in all documents. The classifiable set includes about 800 sememes, which we used to build the Classifiable Sememe Vector Space (CSVS).

A text is represented as a vector in the CSVS after the following steps:

  1. text preprocessing: judge the language of the text and perform processing appropriate to that language;
  2. part-of-speech tagging;
  3. keyword extraction;
  4. keyword sense disambiguation based on context, by calculating the relevance between a keyword's classifiable sememes and the classifiable sememes of its surrounding words: we increase the weight of a semantic item if its classifiable sememes also appear in the semantic items of the surrounding words. This is not a strict disambiguation algorithm; we merely adjust the weights of the semantic items;
  5. the keywords are reduced to sememes, and the weights of the classifiable sememes across all of the keywords' semantic items are accumulated to form the feature weights of the vector.
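The final reduction step can be sketched as a sparse bag-of-sememes vectorizer; the sememe labels and lexicon entries in the usage example are hypothetical placeholders, not actual HowNet data:

```python
def text_to_csvs_vector(keywords, sememe_lexicon, classifiable):
    """keywords: content words extracted from a text.
    sememe_lexicon: maps a word to the sememes of its chosen sense.
    classifiable: the set of ~800 classifiable sememes spanning CSVS.
    Returns a sparse vector mapping each sememe to its accumulated weight."""
    vec = {}
    for w in keywords:
        for s in sememe_lexicon.get(w, ()):
            if s in classifiable:           # drop unclassifiable sememes
                vec[s] = vec.get(s, 0) + 1
    return vec
```

Restricting the vector to classifiable sememes is what keeps the space at roughly 800 dimensions.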
A user provides some texts expressing the kind of text he is interested in. These are all expressed as vectors in the CSVS, and together they represent the user's preferences. The relevance of two texts can be measured by the cosine of the angle between their vectors. When a new text arrives, it too is expressed as a vector in the CSVS. We find its k nearest neighbours among the texts provided by the user and calculate the relevance of the new text to them; if the relevance is greater than a certain threshold, the text is of interest to the user, and if smaller, it is not. The value of k is determined by examining, for every training vector, its neighbours.

Information filtering based on classifiable sememes has several advantages:

  1. Low-dimensional input space: we use 800 sememes instead of some 10,000 words.
  2. Few irrelevant features remain after keyword extraction and the removal of unclassifiable sememes.
  3. The feature weights of document vectors are large.
We used documents from eight different users in our experiments. All of these users provided texts in both Chinese and English. Taking the users' feedback into account, we obtained about 88 percent recall and precision, which demonstrates that the method is successful.

Classifiable Sememe, Vector Space, kNN, Text Representation, HowNet