陳祖舜(Zusun Chen), 周強(Qiang Zhou), 趙強(Qiang Zhao)
A characteristic advantage of natural language is that, as a symbolic system, it has an internal logical framework for organizing and positioning conceptual knowledge: its lexicon system. This framework implements the fundamental function of natural language, which is to condense, absorb, organize and position conceptual knowledge, and it progressively creates a very large and complex built-in knowledge system in the language. It is also the basis of the other two fundamental functions of natural language, namely serving as a tool for communication and as a medium for conceptual thought. Natural language semantics should reproduce this basic framework in its theoretical realm in order to represent these three functions and their relationships. Lexical semantics thereby becomes its core.
A word is the symbolic embodiment of a concept, and a concept is generated in a particular cognitive scheme, which we call its generating scheme. We cannot describe and define a concept clearly unless we put it into its generating scheme. Meanwhile, applying the concept involves a procedure that contrasts, restores, and refers to its generating scheme in a specific environment, which we call its application scheme.
We propose to use the situation as a mathematical model for describing a cognitive scheme. Situation theory can therefore serve as a unified theoretical framework for constructing lexical semantics and the natural language semantics built upon it, and many new viewpoints follow from this. In this paper, only some elementary questions are discussed, including: 1) using a situation to express a scheme and using a situation to describe a concept (this is the key point of the paper); 2) formulating a situation algebra that describes relations, transformations, and operations on situations, so as to simulate conceptual thinking by means of algebraic calculus; 3) constructing a situation network to implement a scheme structure and a conceptual structure, where the key point is the constitution and organization of a semantic dictionary. We use some practical cases to illustrate these methods. The relevant mathematical theory will be presented in future papers.
Concept, lexical meaning, situation, situation algebra, semantic dictionary, lexical semantics
Keh-Jiann Chen, Jia-Ming You
There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example-based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy. Taxonomy approaches are largely semantics-based and do not consider syntactic similarities. However, in real applications, both semantic and syntactic similarities are required, and they are weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarities.
In this paper, we propose using only syntactically related co-occurrences as context vectors and adopt information-theoretic models to solve the problems of data sparseness and characteristic precision. The probability distribution of co-occurrence context features is derived by parsing the contextual environment of each word, and all the context features are adjusted according to their IDF (inverse document frequency) values. An agglomerative clustering algorithm is applied to group similar words according to their similarity values. It turns out that words with similar syntactic categories and semantic classes are grouped together.
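The pipeline described above can be illustrated with a minimal sketch. The toy co-occurrence counts, the feature names, the use of cosine similarity, average linkage, and the stopping threshold are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

def idf_weights(vectors):
    """IDF-style weight per feature: log(N / number of words having the feature)."""
    n = len(vectors)
    df = Counter(f for vec in vectors.values() for f in vec)
    return {f: math.log(n / df[f]) for f in df}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(f, 0.0) for f, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering; stops below `threshold`."""
    idf = idf_weights(vectors)
    weighted = {w: {f: c * idf[f] for f, c in vec.items()}
                for w, vec in vectors.items()}
    clusters = [[w] for w in sorted(weighted)]

    def link(c1, c2):
        total = sum(cosine(weighted[a], weighted[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, i, j = max((link(clusters[i], clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical syntactic co-occurrence counts (relation:word -> count).
vectors = {
    "apple": {"obj_of:eat": 4, "mod_by:fresh": 2, "conj:and": 5},
    "bread": {"obj_of:eat": 3, "mod_by:fresh": 1, "conj:and": 4},
    "run":   {"mod_by:quickly": 3, "subj:dog": 2, "conj:and": 6},
    "walk":  {"mod_by:quickly": 2, "subj:dog": 1, "conj:and": 3},
}
clusters = agglomerate(vectors, threshold=0.5)
print(clusters)
```

In this toy data the feature `conj:and` occurs with every word, so its IDF weight is zero and it no longer contributes to similarity, which is the effect the IDF adjustment is meant to have.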
劉群 (Qun LIU), 李素建 (Sujian LI)
Word similarity is broadly used in many applications, such as information retrieval, information extraction, text classification, word sense disambiguation, example-based machine translation, etc. There are two different methods used to compute similarity: one is based on ontology or a semantic taxonomy; the other is based on collocations of words in a corpus.
As a lexical knowledge base with rich semantic information, How-net has been employed in various research projects. Unlike other thesauri, such as WordNet and Tongyici Cilin, in which word similarity is defined based on the distance between words in a semantic taxonomy tree, How-net defines a word in a complicated multi-dimensional knowledge description language. As a result, a series of problems arise in the process of word similarity computation using How-net. The difficulties are outlined below:
The description of each word consists of a group of sememes. For example, the Chinese word “暗箱(camera obscura)” is described as: “part|部件, #TakePicture|拍攝, %tool|用具, body|身”, and the Chinese word “寫信(write a letter)” is described as: “write|寫, ContentProduct=letter|信件”;
The meaning of a word is not a simple combination of these sememes. Sememes are organized using a specific knowledge description language.
To meet these challenges, our work includes:
A study on the How-net knowledge description language. We rewrite the How-net definition of a word in a more structural format, using the abstract data structure of set and feature structure.
A study on the algorithm used to compute word similarity based on How-net. The similarity between sememes, that between sets, and that between feature structures are given. To compute the similarity between two sememes, we use the distance between the sememes in the semantic taxonomy, as is done in WordNet and Tongyici Cilin. To compute the similarity between two sets or two feature structures, we first establish a one-to-one mapping between the elements of the sets or the feature structures. Then, the similarity between the sets or feature structures is defined as the weighted average of the similarity between their elements. For feature structures, a one-to-one mapping is established according to the attributes. For sets, a one-to-one mapping is established according to the similarity between their elements.
Finally, we give experimental results to show the validity of the algorithm and compare them with results obtained using other algorithms. Our results for word similarity agree with people’s intuition to a large extent, and they are better than the results of two comparative experiments.
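The three levels of similarity described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the toy taxonomy `PARENT`, the distance-to-similarity formula with the constant `ALPHA`, the greedy pairing strategy for sets, the uniform weights in the average, and the score of 0 for unmatched elements are all hypothetical choices.

```python
ALPHA = 1.6  # assumed smoothing constant for converting distance to similarity

# Hypothetical toy sememe taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None, "thing": "entity", "event": "entity",
    "tool": "thing", "part": "thing", "letter": "thing",
    "write": "event", "TakePicture": "event",
}

def path_to_root(sememe):
    path = []
    while sememe is not None:
        path.append(sememe)
        sememe = PARENT[sememe]
    return path

def sememe_distance(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    for i, node in enumerate(pa):
        if node in pb:
            return i + pb.index(node)
    raise ValueError("sememes share no ancestor")

def sememe_sim(a, b):
    return ALPHA / (sememe_distance(a, b) + ALPHA)

def set_sim(xs, ys, elem_sim, unmatched=0.0):
    """Greedy one-to-one mapping: pair the most similar remaining elements."""
    xs, ys = list(xs), list(ys)
    scores = []
    while xs and ys:
        s, x, y = max(((elem_sim(x, y), x, y) for x in xs for y in ys),
                      key=lambda t: t[0])
        scores.append(s)
        xs.remove(x)
        ys.remove(y)
    scores += [unmatched] * (len(xs) + len(ys))  # leftovers have no partner
    return sum(scores) / len(scores) if scores else 1.0

def feature_sim(f, g, value_sim, unmatched=0.0):
    """Feature structures are aligned by attribute name, then averaged."""
    attrs = sorted(set(f) | set(g))
    scores = [value_sim(f[a], g[a]) if a in f and a in g else unmatched
              for a in attrs]
    return sum(scores) / len(scores) if scores else 1.0

print(sememe_sim("tool", "part"))
print(set_sim({"tool", "write"}, {"part", "write"}, sememe_sim))
print(feature_sim({"ContentProduct": "letter"},
                  {"ContentProduct": "letter", "instrument": "tool"},
                  sememe_sim))
```

A weighted average would replace the uniform mean with per-element or per-attribute weights; uniform weights are used here only to keep the sketch short.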
How-net, Word Similarity Computing, Natural Language Processing
王惠 (WANG Hui)
Word sense disambiguation (WSD) plays an important role in many areas of natural language processing, such as machine translation, information retrieval, sentence analysis, and speech recognition. Research on WSD has great theoretical and practical significance. The main purposes of this study were to investigate the kind of knowledge that is useful for WSD, and to establish a new WSD model based on syntagmatic features, which can be used to disambiguate noun senses in Mandarin Chinese effectively.
A close correlation has been found between a lexical meaning and its distribution. According to a study in the field of cognitive science [Choueka, 1983], people often disambiguate word sense using only a few other words in a given context (frequently only one additional word). Thus, the relationships between one word and others can be effectively used to resolve ambiguity. Based on a descriptive study of more than 4,000 Chinese noun senses, a multi-level framework of syntagmatic analysis was designed to describe the syntactic and semantic constraints of Chinese nouns. All of these polysemous nouns were surveyed, and it was found that different senses have different and complementary distributions at the syntax and/or collocation levels. This served as a foundation for establishing a WSD model by using grammatical information and a thesaurus provided by linguists.
The model uses the Grammatical Knowledge-base of Contemporary Chinese [Yu Shiwen et al. 2002] as one of its main machine-readable dictionaries (MRDs). It can provide rich grammatical information for the disambiguation of Chinese words, such as parts of speech (POS) and syntactic functions.
Another resource of the model is the Semantic Dictionary of Contemporary Chinese [Wang Hui et al. 1998], which provides a thesaurus and semantic collocation information for more than 20,000 nouns. These resources were employed to analyze 635 Chinese polysemous nouns.
By making full use of these two MRD resources and a very large POS-tagged corpus of Mandarin Chinese, a multi-level WSD model based on syntagmatic features was developed. The experiment described at the end of the paper verifies that the approach achieves high levels of efficiency and precision.
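The idea of a multi-level model, in which senses are first separated by syntactic distribution and then by collocation, can be sketched roughly as below. The sense inventory `SENSES`, the choice of measure words as the syntactic-level feature, and the semantic class labels are hypothetical placeholders; the abstract does not list the features the model actually uses.

```python
# Hypothetical description of one polysemous noun with two senses.
SENSES = {
    "sense_A": {"classifiers": {"個"}, "collocate_classes": {"physical_action"}},
    "sense_B": {"classifiers": {"種"}, "collocate_classes": {"mental_state"}},
}

def disambiguate(senses, classifier, collocate_classes):
    """Pick a sense using the syntactic level first, then the collocation level.

    classifier: measure word observed with the noun, or None if absent.
    collocate_classes: semantic classes of the words collocating with the noun.
    """
    # Level 1: keep senses whose syntactic constraints admit the context.
    candidates = [name for name, desc in senses.items()
                  if classifier is None or classifier in desc["classifiers"]]
    if not candidates:
        candidates = list(senses)  # no sense fits, so fall back to all senses
    if len(candidates) == 1:
        return candidates[0]
    # Level 2: prefer the sense sharing the most semantic collocation classes.
    return max(candidates,
               key=lambda name: len(senses[name]["collocate_classes"]
                                    & collocate_classes))

print(disambiguate(SENSES, "種", set()))
print(disambiguate(SENSES, None, {"physical_action"}))
```

Because the senses are assumed to have complementary distributions, the first level alone often leaves a single candidate, and the collocation level is consulted only when it does not.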
Word Sense Disambiguation, syntagmatic features, noun sense, Chinese Language Information Processing
亢世勇 (Shiyong Kang)
We introduce the development of the Electronic Lexicon of Contemporary Newborn Chinese Words: (1) the definition of a newborn word, (2) the main principle behind constructing the lexicon, (3) the collection of newborn words and their feature descriptions, and (4) the classification of 40,000 newborn words. In our opinion, a newborn word is a character string that appeared after 1978 in a new form, with a new meaning and with a new usage. In addition, it must be frequently used and accepted; the names of people and places are not newborn words according to our definition. The approach to collecting newborn words is quite unrestricted, that is, the more the better. Based on the Contemporary Chinese Grammatical Knowledge Base of the Institute of Computational Linguistics at Peking University, we have finished compiling a lexicon of almost 40,000 newborn words semi-automatically. The lexicon, we believe, is a worthy resource for research on Chinese word-building rules and natural language processing. Classification is first done based on the preponderant grammatical characteristics of each word, and then the detailed features are described in an ACCESS database. The lexicon contains a total base and three grammatical bases (i.e., a noun base, a verb base and an adjective base); in addition, it has an old word base, a loanword base and an acronym base. The total base is related to the sub-bases through the word, phonetic notation and semantics fields, which form a hypernymy hierarchy that is quite convenient for searching. In total, there are more than 200 fields in the bases, which give information regarding phonetic notation, semantics, sources, word building, syntax and pragmatics. Without doubt, this lexicon is one of the largest domestic lexicons available, with the most detailed descriptions of newborn Chinese words.
Chinese information processing, New words, Electronic dictionary
宋柔, 許勇 (Song Rou, Xu Yong)
The typical approaches to extracting text knowledge are sentential parsing and pattern matching. Theoretically, text knowledge extraction should be based on complete understanding, so the technology of sentential parsing is used in the field. However, the fragility of systems and highly ambiguous parse results are serious problems. On the other hand, by avoiding thorough parsing, pattern matching becomes highly efficient. However, different expressions of the same information will dramatically increase the number of patterns and nullify the simplicity of the approach.
Parsing Chinese encounters greater barriers than parsing English. Firstly, Chinese lacks morphology. For example, recognition of base-NPs in Chinese is more difficult than in English because the left boundary is hard to discern. Secondly, there are many stream sentences in Chinese that lack subjects and cause parsing to fail. Finally, the absence of verbs is also pervasive in Chinese. Sentential parsing centered on verbs, which is used with English, is not always successful with Chinese.
We are engaged in research on knowledge extraction from the Electronic Chinese Great Encyclopedia. Our goal is to extract unstructured knowledge from it and to generate a well-structured database so as to provide information services to users. The pattern-matching approach is adopted.
The experiment was divided into two steps: (1) classifying entries based on lexicon semantics; (2) establishing a formal system based on lexicon semantics and extracting knowledge by means of pattern matching.
Classification of entries is important because the texts of entries in different categories use different kinds of patterns to express knowledge. Our experiment demonstrated that an entry of the encyclopedia can be classified precisely merely according to the characters in the entry and the words in the first sentence of the entry’s text. Some specific categories, e.g., organization names and Chinese place names, can be classified satisfactorily merely according to the suffix of the entry, for suffixes are closely related to semantic categories in Chinese.
The formal system designed for knowledge extraction consists of four kinds of meta-knowledge: concepts, mappings, relations and rules, which reflect lexical semantic attributes. The present experiment focused on the extraction of knowledge about areas from the texts regarding administrative places of China (how large a place or its subdivisions are). The results of the experiment show that the design of the formal system is practical. It can accurately and completely denote various expressions of simple knowledge in a Chinese encyclopedia. However, when the focus of knowledge changes, e.g., from administrative areas to habits of animals, renewing the formal system is a labor-intensive task. Therefore, the study of automatic or semi-automatic generation of this kind of formal system is required.
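Suffix-based classification of entries can be sketched as a longest-suffix lookup. The suffix table below is a small illustrative assumption and is not the inventory used in the experiment.

```python
# Hypothetical suffix table: entry suffix -> semantic category.
SUFFIX_CATEGORY = {
    "公司": "organization",
    "大學": "organization",
    "省": "place",
    "市": "place",
    "縣": "place",
}

def classify_by_suffix(entry, table=SUFFIX_CATEGORY):
    """Return the category of the longest matching suffix, or None."""
    for length in range(len(entry), 0, -1):  # try the longest suffix first
        category = table.get(entry[-length:])
        if category is not None:
            return category
    return None

for entry in ["北京市", "北京大學", "熊貓"]:
    print(entry, classify_by_suffix(entry))
```

Entries whose suffix is not in the table return `None` and would need the other evidence mentioned above, such as the words in the first sentence of the entry's text.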
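As a rough illustration of extracting area knowledge by pattern matching, the sketch below uses a single regular expression and returns a structured record. The pattern, the record layout, and the example sentences (about an unnamed county, with made-up figures) are assumptions for illustration only; the paper's formal system of concepts, mappings, relations and rules is not reproduced here.

```python
import re

# Assumed pattern: "面積" + optional hedge word + number + optional "萬" + "平方公里".
AREA_PATTERN = re.compile(r"面積(?:約|為|是)?(\d+(?:\.\d+)?)(萬)?平方公里")

def extract_area(entry, text):
    """Return an area record for `entry`, or None if the pattern does not match."""
    match = AREA_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2):          # "萬" multiplies the number by ten thousand
        value *= 10_000
    return {"entry": entry, "relation": "area", "value_km2": value}

print(extract_area("某縣", "某縣位於某省北部，面積約1.2萬平方公里。"))
print(extract_area("某縣", "全縣面積350平方公里。"))
print(extract_area("某縣", "全縣面積達350平方公里。"))  # wording not covered
```

The last call returns `None` because the wording "面積達" is not covered by the pattern, which shows on a small scale why different expressions of the same information increase the number of patterns that must be maintained.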