Author:
Abstract:
In this paper we present a hybrid approach for the automatic classification of Chinese unknown verbs. The first method of the hybrid approach applies a set of morphological rules, summarized from the training data (the set of compound verbs extracted from the Sinica corpus), to determine the category of an unknown compound verb. If no morphological rule is applicable, instance-based categorization using the k-nearest neighbor method is employed instead. It was observed that some suffix morphemes occur frequently in compound verbs and also uniquely determine the syntactic categories of the resulting compound verbs. By processing the training data, 15 suffix rules with coverage over 2% and category prediction accuracy higher than 80% were derived. In addition to this type of morphological rule, reduplication rules are also useful for category prediction: in the well-known Chinese reduplication patterns, a reduplicated verb keeps the category of its base form. For instance, the reduplicated form "AA" has the same category as the monosyllabic verb "A", and "ABAB" has the same category as the disyllabic verb "AB". As a result, nine reduplication patterns are generated. Testing on the training data shows that the overall accuracy of the morphological rule classifier is 91.67%, but its coverage is only 23.19%.
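A minimal sketch (in Python, not the authors' code) of how such a rule classifier might be organized is given below: a suffix table is consulted first, then the reduplication patterns, and None is returned for uncovered cases so that the caller can fall back to instance-based categorization. The suffix entries, the category labels, and the lexicon mapping from base verbs to their categories are illustrative assumptions.

SUFFIX_RULES = {
    # suffix morpheme -> predicted verb category (illustrative entries only,
    # not the 15 rules actually derived in the paper)
    "化": "VC",
    "到": "VC",
}

def classify_by_morphology(word, lexicon):
    """Return a category if a suffix or reduplication rule applies, else None."""
    # 1. Suffix rules: the final morpheme of the compound decides the category.
    category = SUFFIX_RULES.get(word[-1])
    if category is not None:
        return category

    # 2. Reduplication rules: an "AA" form keeps the category of its base "A",
    #    and an "ABAB" form keeps the category of its base "AB".
    if len(word) == 2 and word[0] == word[1] and word[0] in lexicon:
        return lexicon[word[0]]
    if len(word) == 4 and word[:2] == word[2:] and word[:2] in lexicon:
        return lexicon[word[:2]]

    return None  # uncovered case: fall back to instance-based categorization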
Since the coverage of the morphological rule classifier is low, an instance-based categorization method is employed to handle the uncovered cases. The instance-based categorization uses similar examples to predict the category of an unknown verb. Lexical similarity is measured by both semantic similarity and syntactic similarity. The semantic similarity between two words is measured by the semantic distance between their HowNet definitions, and the syntactic similarity is measured by the distance between their syntactic categories. The distance between two syntactic categories is the cosine measure of their grammatical feature vectors derived from the Sinica Treebank. The category of an unknown verb is predicted to be that of the examples most similar to it according to the above similarity criteria. Testing on the training data, the optimal accuracy of instance-based categorization is 71.05%, obtained when the similar examples are drawn from both unknown verbs and verbs in the dictionary (known verbs).
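As a rough illustration of this step, the following sketch (under stated assumptions, not the paper's implementation) takes a k-nearest-neighbor vote over a supplied word-to-word similarity and shows a cosine measure over grammatical feature vectors for the syntactic part; the value of k, the vector representation, and the way the semantic and syntactic similarities are combined into one score are assumptions.

import math
from collections import Counter

def cosine(u, v):
    """Cosine measure between two grammatical feature vectors, represented as
    dicts mapping feature name -> weight; used as the similarity of two
    syntactic categories."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_category(unknown, examples, similarity, k=3):
    """Predict the category of `unknown` from its k most similar examples.

    examples   : list of (word, category) pairs drawn from known and unknown verbs.
    similarity : word-to-word similarity combining the HowNet-based semantic
                 similarity with the category-based syntactic similarity; how the
                 two parts are weighted is left to the caller.
    """
    neighbours = sorted(examples, key=lambda ex: similarity(unknown, ex[0]), reverse=True)[:k]
    votes = Counter(category for _, category in neighbours)
    return votes.most_common(1)[0][0]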
Both the morphological rule classifier and the instance-based categorization have the advantage of not only predicting the syntactic categories of unknown words but also recognizing their morphological structures and major semantic classes. The advantage of the morphological rule classifier is its higher accuracy, while that of the instance-based categorization is its higher coverage. However, each method has its own drawback: the former cannot be applied to most unknown verbs, and the latter suffers from lower accuracy. For the open test, 1,000 unknown verbs unseen during training were tested. The accuracy of the morphological rules is 87.25% and that of the instance-based categorization is 65.04%. The overall accuracy of the hybrid approach is 70.80%.
Author:
Jia-Lin Tsai, Wen-Lian Hsu, Jeng-Woei Su
Abstract:
Word sense ambiguity is pervasive in natural language processing (NLP), and it is particularly acute in cases involving noun-verb (NV) word pairs. This paper describes a sense-based noun-verb event frame (NVEF) identifier that can effectively disambiguate word sense in Chinese sentences. A knowledge representation system (the NVEF-KR tree) for the NVEF sense-pair identifier is also proposed. We use the word senses of Hownet, a Chinese-English bilingual knowledge-base dictionary.
Our experiment showed that the NVEF identifier achieved 74.8% accuracy on the test sentences using NVEF sense-pair knowledge alone. By applying the techniques of longest syllabic NVEF-word-pair first and exclusion word checking, the sense accuracy on the same test sentences could be further improved to 93.7%. There were four major reasons for the incorrect cases: (1) lack of a bottom-up tagger, (2) lack of non-NVEF knowledge, (3) inadequate word segmentation, and (4) lack of a multi-NVEF analyzer. If these four problems were resolved, the accuracy would reach 98.9%.
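The sketch below illustrates, under assumptions, how the two reported techniques could be applied to candidate NVEF word pairs already matched against the NVEF-KR tree: exclusion word checking discards candidates that contain an exclusion word, and among the remaining candidates the pair covering the most syllables is chosen first. The function name and data layout are hypothetical, and the tree matching itself is not shown.

def select_nvef_pair(candidate_pairs, exclusion_words):
    """Choose one noun-verb sense pair for a sentence from NVEF candidates.

    candidate_pairs : list of (noun, verb, noun_sense, verb_sense) tuples already
                      matched against the NVEF-KR tree (matching not shown here).
    exclusion_words : words signalling that a candidate pair should be discarded.
    """
    # Exclusion word checking: drop candidates that involve an exclusion word.
    valid = [p for p in candidate_pairs
             if p[0] not in exclusion_words and p[1] not in exclusion_words]
    if not valid:
        return None

    # Longest syllabic NVEF-word-pair first: prefer the pair whose noun and verb
    # together cover the most syllables (characters) of the sentence.
    return max(valid, key=lambda p: len(p[0]) + len(p[1]))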
The results of this study indicate that NVEF sense-pair knowledge is effective for word sense disambiguation and is likely to be important for general NLP.
Keyword:
word sense disambiguation, event frame, top-down identifier, Hownet
Author:
Yang Xiaofeng, Li Tangqiu
Abstract:
This thesis describes a semantic disambiguation model applied in the syntactic parsing process of a machine translation system.
The model uses Hownet as its main semantic resource. Hownet is a common-sense knowledge base that unveils the inter-conceptual relations and inter-attribute relations of the concepts connoted by Chinese words and their English equivalents, and it provides rich semantic information for our disambiguation.
The model performs word sense and structure disambiguation by way of "referring": the "referring" step is applied to the results produced by the parsing process. It combines the rule-based method and the statistics-based method.
First, we extract the co-occurrence information of each sense-atom from a large corpus. The corpus is untagged, so the extraction process is unsupervised. We construct restricted rules from the co-occurrence information according to certain transfer templates. Since the semantic entry of a word in Hownet is made up of sense-atoms, we can derive restricted rules for each entry of any word.
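A minimal sketch of this extraction step, assuming a simple window-based notion of co-occurrence and a simplified word-to-sense-atom mapping, might look as follows; the window size and data structures are illustrative rather than the thesis' actual implementation.

from collections import defaultdict

def collect_cooccurrence(sentences, word_to_sense_atoms, window=5):
    """Count, for every sense-atom, the words co-occurring with it in the corpus.

    sentences           : iterable of tokenised (but otherwise untagged) sentences.
    word_to_sense_atoms : mapping from a word to the sense-atoms of its Hownet
                          entries (a simplification of the real dictionary format).
    """
    cooc = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
            for atom in word_to_sense_atoms.get(word, ()):
                for neighbour in context:
                    cooc[atom][neighbour] += 1
    return cooc

A restricted rule for a particular entry of a word can then be assembled from the co-occurrence sets of the sense-atoms in that entry, following the transfer templates mentioned above.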
During disambiguation, the model constructs a context-related word set for each notional word in the input sentence. The semantic collocation relations between notional words play a very important role in syntactic structure disambiguation, so our evaluation of the candidate parses is based on how tightly the notional words in a structure match one another. We compare the context-related word set of a word in the current structure with all the restricted rules of that word in the lexicon and find the best match. The entry with the best match is taken as the word's interpretation, and since the degree of similarity shows how well the word in the structure matches the other notional words, it is taken as the word's score. Because the candidate parses of a structure differ, the same word has different context-related word sets in different parses and therefore receives different scores. We select the best parse according to the total score of all the notional words in the sentence. In this way, most word sense disambiguation and structural disambiguation can be solved at the same time.
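The scoring procedure described above might be sketched as follows; the overlap-ratio match between a context-related word set and a restricted rule stands in for the thesis' actual similarity calculation, and all names here are hypothetical.

def entry_score(context_words, restricted_rule):
    """Degree of match between a word's context-related word set and one
    restricted rule, computed here as a simple overlap ratio over sets."""
    if not context_words:
        return 0.0
    return len(context_words & restricted_rule) / len(context_words)

def score_parse(parse, rules):
    """Score one candidate parse and pick the best-matching entry for each word.

    parse : list of (word, context_related_word_set) pairs for its notional words.
    rules : mapping word -> {entry_id: restricted_rule_word_set}.
    """
    total, chosen_entries = 0.0, {}
    for word, context in parse:
        best_entry, best_score = None, 0.0
        for entry_id, rule in rules.get(word, {}).items():
            score = entry_score(context, rule)
            if score > best_score:
                best_entry, best_score = entry_id, score
        chosen_entries[word] = best_entry
        total += best_score
    return total, chosen_entries

# The candidate parse with the highest total score is kept, which settles word
# sense and structural ambiguity at the same time.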
The semantic disambiguation model proposed in this thesis has been implemented in the MTG system. Our experiments show that the model is very effective for this purpose, and it is clearly more tolerant than, and superior to, the traditional yes-or-no clear-cut method.
In this thesis we first put forward the general idea of the method and give a brief introduction to the Hownet dictionary. We then present the methods for extracting the co-occurrence information of each sense-atom from the corpus and for transferring this information into restricted rules. Next, the disambiguation algorithm is described in detail, including the construction of the context-related word sets and the calculation of the similarity between sense-atoms and between restricted rules and the context-related sets. The experimental results given at the end of the paper show that the method is effective.
Keyword:
Word Sense Disambiguation, Hownet, Interlingua, Sense Atom, Corpus, Semantic Environment
Author:
Weifeng Su, Shaozi Li, Tangqiu Li, Wenjian You
Abstract:
The WWW is increasingly being used as a source of information, and users access this huge volume of information with direct manipulation tools. Obviously, we would like a tool that keeps the texts we want and removes the texts we do not want from the flood of information reaching us. This paper describes a module that sifts through the large number of texts retrieved by the user.
The module is based on HowNet, a knowledge dictionary developed by Mr. Zhendong Dong. In this dictionary, the concept of a word is decomposed into sememes; in the philosophy of HowNet, all concepts in the world can be expressed by combinations of more than 1,500 sememes. The sememe is a very useful notion for settling the synonym problem, which is the most difficult problem in text filtering. We divided the set of sememes into two subsets: classifiable sememes and unclassifiable sememes. Classifiable sememes are those that are more useful in distinguishing a document's class from those of other documents, while unclassifiable sememes are those that appear similarly across all documents. The classifiable subset includes about 800 sememes, and we used these 800 classifiable sememes to build the Classifiable Sememe Vector Space (CSVS).
A text is then represented as a vector in the CSVS. Information filtering based on classifiable sememes has several advantages.
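As an illustration of how a text could be mapped into the CSVS, the sketch below expands each word into its HowNet sememes and keeps only the classifiable sememes as vector dimensions; the raw-count weighting and the data structures are assumptions rather than the module's actual steps.

from collections import Counter

def text_to_csvs_vector(tokens, word_to_sememes, classifiable_sememes):
    """Map a tokenised text onto the Classifiable Sememe Vector Space.

    Each word is expanded into its HowNet sememes, and only the roughly 800
    classifiable sememes are kept as dimensions; raw counts are used as weights.
    """
    counts = Counter()
    for word in tokens:
        for sememe in word_to_sememes.get(word, ()):
            if sememe in classifiable_sememes:
                counts[sememe] += 1
    return [counts[s] for s in sorted(classifiable_sememes)]

The resulting vectors can then be compared with a method such as kNN, as suggested by the keyword list, to decide whether a retrieved text should be kept or filtered out.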
Keyword:
Classifiable Sememe, Vector Space, kNN, Text Representation, HowNet