International Journal of Computational Linguistics & Chinese Language Processing
Vol. 3, No. 1, February 1998


Title:
 
Analyzing the Performance of Message Understanding Systems

Author:
Amit Bagga, Alan W. Biermann

Abstract:
In this paper we describe a method of classifying facts (information) into categories, or levels, where each level signifies a different degree of difficulty in extracting the fact from a piece of text containing it. Based on this classification mechanism, we propose a method of evaluating a domain by assigning to it a "domain number" based on the levels of a set of standard facts present in the articles of that domain. We then use the classification mechanism to analyze the performance of three MUC systems (BBN, NYU, and SRI) based on their ability to extract a set of standard facts (at different levels) from two different MUC domains. This analysis is then extended to examine the role of coreferencing in the performance of message understanding systems.

Evaluating a domain by the "domain number" assigned to it is a substantial improvement over earlier methods, which relied on surface measures such as vocabulary size, average sentence length, and the number of sentences per document. Moreover, using the classification mechanism as a tool to analyze the performance of message understanding systems provides deeper insight into these systems than the precision and recall statistics of each system alone.

Keyword:
Information Extraction, Domain Complexity, Analysis of Systems, Message Understanding Conferences


Title:
 
Unknown Word Detection for Chinese by a Corpus-based Learning Method

Author:
Keh-Jiann Chen, Ming-Hong Bai

Abstract:
One of the most prominent problems in computer processing of the Chinese language is the identification of the words in a sentence. Since there are no blanks to mark word boundaries, identifying words is difficult because of segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words). In this paper, a corpus-based learning method is proposed which derives sets of syntactic rules that are applied to distinguish monosyllabic words from monosyllabic morphemes which may be parts of unknown words or typographical errors. The corpus-based learning approach has three advantages: (1) automatic rule learning, (2) automatic evaluation of the performance of each rule, and (3) balancing of recall and precision rates through dynamic rule set selection. The experimental results show that the rule set derived using the proposed method outperformed hand-crafted rules produced by human experts in detecting unknown words.
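
The trade-off between recall and precision through rule set selection can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes each learned rule has already been scored on a training corpus, and it selects rules with a simple precision threshold. The names Rule, select_rules, and rule_set_scores are hypothetical, and the pooled scores assume that no two rules fire at the same position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A learned rule with its evaluation counts (hypothetical structure)."""
    name: str      # illustrative identifier for the rule
    hits: int      # how often the rule fired on the training corpus
    correct: int   # how many of those firings were true detections

    @property
    def precision(self) -> float:
        return self.correct / self.hits if self.hits else 0.0


def select_rules(rules, min_precision):
    """Keep only the rules whose individual precision reaches the threshold."""
    return [r for r in rules if r.precision >= min_precision]


def rule_set_scores(selected, total_targets):
    """Pooled precision and recall of a rule set.

    Assumes rules fire at disjoint positions, so counts can simply be summed.
    total_targets is the number of true targets in the corpus.
    """
    hits = sum(r.hits for r in selected)
    correct = sum(r.correct for r in selected)
    precision = correct / hits if hits else 0.0
    recall = correct / total_targets if total_targets else 0.0
    return precision, recall


if __name__ == "__main__":
    rules = [Rule("r1", 100, 95), Rule("r2", 200, 150), Rule("r3", 50, 20)]
    for threshold in (0.9, 0.7):
        chosen = select_rules(rules, threshold)
        print(threshold, [r.name for r in chosen], rule_set_scores(chosen, 300))
```

Under these assumptions, raising the threshold keeps fewer but more reliable rules, which favors precision; lowering it admits more rules, which favors recall.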


Title:
 
Meaning Representation and Meaning Instantiation for Chinese Nominals

Author:
Kathleen Ahrens, Li-li Chang, Keh-Jiann Chen, Chu-Ren Huang

Abstract:
The goal of this paper is to explicate the nature of Chinese nominal semantics, and to create a paradigm for nominal semantics in general that will be useful for natural language processing purposes. We first point out that a lexical item may have two meanings simultaneously, and that current models of lexical semantic representation cannot handle this phenomenon. We then propose a meaning representation that deals with this problem, and also discuss how the meanings involved are instantiated. In particular, we posit that in addition to the traditional notion of sense differentiation, each sense may have different meaning facets. These meaning facets are linked to their sense or to other meaning facets in one of two ways: meronymic or metonymic extension.


Title:
 
Towards a Representation of Verbal Semantics -- An Approach Based on Near-Synonyms

Author:
Mei-Chih Tsai, Chu-Ren Huang, Keh-Jiann Chen, Kathleen Ahrens

Abstract:
In this paper we propose using the distributional differences in the syntactic patterns of near-synonyms to deduce the relevant components of verb meaning. Our method involves determining the distributional differences in syntactic patterns, deducing the semantic features from the syntactic phenomena, and testing the semantic features in new syntactic frames. We determine the distributional differences in syntactic patterns through the following five steps. First, we search for all instances of the verb in the corpus. Second, we classify each of these instances by its type of syntactic function. Third, we classify each of these instances by its argument structure type. Fourth, we determine the aspectual type that is associated with each verb. Lastly, we determine each verb's sentential type. Once the distributional differences have been determined, the relevant semantic features are postulated. Our goal is to tease out the lexical semantic features that explain and motivate the syntactic contrasts.
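
As a rough illustration of what "distributional differences" between two near-synonyms could look like in practice, the sketch below compares the relative frequencies of one classification (for example, syntactic function) across the corpus instances of two verbs. It assumes the instances have already been found and labeled by hand; the function names and the labels used in the example are hypothetical and are not taken from the paper.

```python
from collections import Counter


def distribution(labels):
    """Relative frequency of each label among a verb's corpus instances."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def distributional_difference(labels_a, labels_b):
    """Per-label difference in relative frequency (verb A minus verb B).

    Labels seen for only one verb are treated as frequency 0 for the other.
    """
    dist_a = distribution(labels_a)
    dist_b = distribution(labels_b)
    all_labels = sorted(set(dist_a) | set(dist_b))
    return {label: dist_a.get(label, 0.0) - dist_b.get(label, 0.0)
            for label in all_labels}


if __name__ == "__main__":
    # Hypothetical syntactic-function labels for instances of two near-synonyms.
    verb_a = ["predicate", "predicate", "predicate", "modifier"]
    verb_b = ["predicate", "modifier"]
    print(distributional_difference(verb_a, verb_b))
```

The same comparison could be repeated for each of the other classifications (argument structure, aspectual type, sentential type); large differences are the contrasts a semantic feature would then need to explain.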


Title:
 
White Page Construction from Web Pages for Finding People on the Internet

Author:
Hsin-Hsi Chen, Guo-Wei Bian

Abstract:
This paper proposes a method to automatically extract proper names and their associated information from web pages for Internet/Intranet users. The information extracted from World Wide Web documents includes proper nouns, e-mail addresses and home page URLs. Natural language processing techniques are employed to identify and classify proper nouns, which are usually unknown words. The information (i.e., home page URLs or e-mail addresses) for those proper nouns appearing in the anchor parts can be easily extracted using the associated anchor tags. For those proper nouns in the non-anchor part of a web page, different kinds of clues, such as the spelling method, adjacency principle and HTML tags, are used to relate proper nouns to their corresponding e-mail addresses and/or URLs. Because the method draws on the semantics of the content and the HTML tags, the extracted information is more accurate than the results obtained using traditional search engines. The results can be used to construct white pages for Internet/Intranet users or to build databases for finding people and organizations on the Internet. Such searching services are very useful for human communication and dissemination of information.

Keyword:
proper name identification, information extraction, white pages, World Wide Web
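
The anchor-tag case described in the abstract above can be sketched with Python's standard html.parser module. This is only an illustration under simplifying assumptions, not the authors' system: it treats the anchor text as the name without any proper-noun identification, and the class and function names (AnchorExtractor, extract_anchor_info) are hypothetical.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (anchor text, kind, value) for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse internal whitespace in the anchor text.
            text = " ".join("".join(self._text).split())
            if self._href.lower().startswith("mailto:"):
                value = self._href[len("mailto:"):]
                self.records.append((text, "email", value))
            else:
                self.records.append((text, "url", self._href))
            self._href = None


def extract_anchor_info(html):
    """Return the anchor records found in an HTML string."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    page = ('<p><a href="mailto:jane@example.org">Jane Doe</a> and '
            '<a href="http://example.org/~jane/">home page</a></p>')
    for record in extract_anchor_info(page):
        print(record)
```

Names that occur outside anchors would need the additional clues the abstract mentions (spelling, adjacency, surrounding HTML tags), which this sketch does not attempt.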


Title:
 
Human Judgment as a Basis for Evaluation of Discourse-Connective-Based Full-Text Abstraction in Chinese

Author:
Benjamin K T'sou, Hing-Lung Lin, Tom B Y Lai, Samuel W K Chan

Abstract:
In Chinese text, discourse connectives constitute a major linguistic device available for a writer to explicitly indicate the structure of a discourse. This set of discourse connectives, consisting of a few hundred entries in modern Chinese, is relatively stable and domain independent. In a recently published paper [T'sou 1996], a computational procedure was introduced to generate the abstract of an input text using mainly the discourse connectives appearing in the text. This paper attempts to demonstrate the validity of this approach to full-text abstraction by means of an evaluation method, which compares human efforts in text abstraction with the performance of an experimental system called ACFAS. Specifically, we are concerned with the relationship between the perceived importance of each individual sentence, as judged by human beings, and the sentences containing discourse connectives within an argumentative discourse.

Keyword:
text abstraction, discourse connectives, performance evaluation, experiment design, correlation analysis
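
One plausible way to quantify the relationship between human importance judgments and the presence of discourse connectives is a Pearson correlation between a 0/1 connective indicator and the human scores. The sketch below is an assumption about how such an analysis might be set up, not the evaluation actually reported in the paper. The short connective list (因为 "because", 所以 "therefore", 但是 "but", 虽然 "although") and all function names are illustrative only.

```python
import math

# Illustrative sample only; a real list would contain a few hundred entries.
CONNECTIVES = ("因为", "所以", "但是", "虽然")


def has_connective(sentence, connectives=CONNECTIVES):
    """True if the sentence contains any of the listed connectives."""
    return any(c in sentence for c in connectives)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def connective_importance_correlation(sentences, scores):
    """Correlate connective presence (0/1) with human importance scores."""
    flags = [1.0 if has_connective(s) else 0.0 for s in sentences]
    return pearson(flags, scores)


if __name__ == "__main__":
    # Toy data: hypothetical sentences and made-up importance ratings.
    sentences = ["因为下雨", "今天星期一", "所以取消", "天气很好"]
    scores = [5, 1, 5, 1]
    print(connective_importance_correlation(sentences, scores))
```

With a binary indicator this is the point-biserial correlation; substring matching is a simplification, since it ignores word segmentation.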