Author:
Xiwu Han and Tiejun Zhao
Abstract:
Automatically acquired lexicons with subcategorization information have been shown to be accurate and useful for some purposes, but their accuracy still leaves room for improvement, and their usefulness in many applications remains to be investigated. This paper proposes a two-fold filtering method, which in experiments remarkably improved the performance of a Chinese acquisition system, raising the precision rate to 76.94% and the recall rate to 83.83%, and making the acquired lexicon much more practical for further manual proofreading and other NLP uses. To the best of our knowledge, these figures currently represent the best overall performance achieved in Chinese subcategorization acquisition and in similar research on other languages.
Keywords:
Filter, Chinese, SCF, Diathesis Alternation
Author:
Guang-Lu Sun, Chang-Ning Huang, Xiao-Long Wang, and Zhi-Ming Xu
Abstract:
This paper presents a new Chinese chunking method based on maximum entropy Markov models. We first present two types of Chinese chunking specifications and data sets, on which the chunking models are applied. We then describe the hidden Markov chunking model and the maximum entropy chunking model. Based on our analysis of the two models, we propose a maximum entropy Markov chunking model that combines the transition probabilities and conditional probabilities of states. Experimental results on the two types of data sets show that this approach achieves impressive accuracy in terms of F-score: 91.02% and 92.68%, respectively. On the same data sets, the new chunking model outperforms both the hidden Markov chunking model and the maximum entropy chunking model.
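The abstract does not give the model's formulas; purely as an illustration, the sketch below shows how Viterbi decoding in a maximum entropy Markov model folds the transition and conditional probabilities of states into a single score P(tag | prev_tag, word). The tags, words, and hand-set probabilities are invented stand-ins for a trained maximum entropy model, not the paper's actual chunking model.

```python
import math

TAGS = ["B-NP", "I-NP", "O"]

# Hand-set stand-in for a trained maxent model P(tag | prev_tag, word);
# in a real MEMM these probabilities come from feature-based training.
PROBS = {
    ("<s>", "B-NP", "the"): 0.8,
    ("B-NP", "I-NP", "dog"): 0.7,
    ("I-NP", "O", "runs"): 0.6,
}

def memm_prob(prev_tag, tag, word):
    # A small floor stands in for a smoothing algorithm on unseen events.
    return PROBS.get((prev_tag, tag, word), 0.05)

def viterbi(words):
    # delta[t] = best log-probability of any tag sequence ending in tag t
    delta = {t: math.log(memm_prob("<s>", t, words[0])) for t in TAGS}
    backpointers = []
    for word in words[1:]:
        new_delta, ptr = {}, {}
        for t in TAGS:
            scores = {p: delta[p] + math.log(memm_prob(p, t, word)) for p in TAGS}
            best = max(scores, key=scores.get)
            new_delta[t], ptr[t] = scores[best], best
        delta = new_delta
        backpointers.append(ptr)
    # Trace the best path backwards from the highest-scoring final tag.
    tag = max(delta, key=delta.get)
    path = [tag]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "runs"]))  # -> ['B-NP', 'I-NP', 'O']
```

Unlike a hidden Markov model, which multiplies a transition probability by a separate emission probability, the single conditional distribution here scores each state given both the previous state and the observation, which is the combination the abstract describes.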
Keywords:
Chinese Chunking, Maximum Entropy Markov Models, Chunking Specification, Feature Template, Smoothing Algorithm
Author:
Yan Wu, Xiukun Li and Caesar Lun
Abstract:
In this paper, we present an integrated method for machine translation from Cantonese to English text. Our method combines example-based and rule-based techniques that rely on example translations kept in a small Example Base (EB). One bottleneck in example-based Machine Translation (MT) is missing or redundant knowledge in the bilingual knowledge base. In our method, a flexible comparison algorithm, based mainly on the content words of the source sentence, is applied to overcome this problem: it selects sample sentences from the small Example Base, which keeps only Cantonese sentences with distinct phrase structures; for sentences sharing the same phrase structure, the EB keeps only the simplest one. Target English sentences are constructed with rules and bilingual dictionaries. In addition, we provide a segmentation algorithm for MT whose distinguishing feature is that it considers not only the source language itself but also its corresponding target language. Experimental results show that this segmentation algorithm effectively decreases the complexity of the translation process.
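The abstract leaves the comparison algorithm's details open; as one plausible reading of a comparison "based mainly on the content words of the source sentence", the sketch below scores each stored example by its content-word overlap with the input and returns the best match. The romanized Cantonese tokens and the function-word list are invented for the example and are not taken from the paper.

```python
# Toy function-word list; a real Cantonese stop list would differ.
STOP = {"ge3", "laa3", "zo2"}

def content_words(tokens):
    # Keep only content words, dropping function words.
    return {t for t in tokens if t not in STOP}

def best_example(source_tokens, example_base):
    # Score each stored example by content-word overlap (Jaccard) with input.
    src = content_words(source_tokens)
    def score(example):
        ex = content_words(example)
        union = src | ex
        return len(src & ex) / len(union) if union else 0.0
    return max(example_base, key=score)

# Invented romanized examples: "I ate rice" / "he goes to school".
EB = [
    ["ngo5", "sik6", "zo2", "faan6"],
    ["keoi5", "heoi3", "hok6haau6"],
]
print(best_example(["ngo5", "sik6", "min6"], EB))
```

Matching on content words rather than full strings is what lets a small EB cover many inputs: function words and minor variations do not block retrieval of a structurally similar example.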
Keywords:
Example-Based Machine Translation (EBMT), Rule-Based Machine Translation (RBMT), Example Base (EB)
Author:
Bin Ma and Haizhou Li
Abstract:
In this paper, we compare four typical spoken language identification (LID) systems. We introduce a novel acoustic segment modeling approach for the LID system frontend. It is assumed that the overall sound characteristics of all spoken languages can be covered by a universal collection of acoustic segment models (ASMs) without imposing strict phonetic definitions. The ASMs are used to decode spoken utterances into strings of segment units in parallel phone recognition (PPR) and universal phone recognition (UPR) frontends. We also propose a novel approach to LID backend design, in which the statistics of ASMs and their co-occurrences form ASM-derived feature vectors in a vector space modeling (VSM) approach, as opposed to the traditional language modeling (LM) approach, in order to discriminate between individual spoken languages. Four LID systems are built to evaluate the effects of the two frontends and the two backends. We evaluate the four systems on the 1996, 2003, and 2005 NIST Language Recognition Evaluation (LRE) tasks. The results show that the proposed ASM-based VSM framework significantly reduces the LID error rate compared with the widely used parallel PRLM method. Among the four configurations, the PPR-VSM system demonstrates the best performance across all of the tasks.
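The abstract describes the VSM backend only at a high level; the sketch below illustrates the general idea with invented ASM unit names and utterances: each decoded unit string becomes a vector of unit counts plus adjacent co-occurrence counts, and a test utterance is assigned to the language whose pooled training vector is closest by cosine similarity. The real system's features, training, and classifier are more elaborate.

```python
import math
from collections import Counter

# Invented ASM decodings: each "language" is a few example unit strings.
TRAIN = {
    "lang_A": [["a1", "a2", "a1", "a3"], ["a1", "a3", "a2"]],
    "lang_B": [["b1", "a2", "b1", "b2"], ["b2", "b1", "a2"]],
}

def features(units):
    # ASM unigram counts plus adjacent co-occurrence (bigram) counts.
    counts = Counter(units)
    counts.update("_".join(b) for b in zip(units, units[1:]))
    return counts

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# One pooled feature vector per language.
LANG_VECS = {
    lang: sum((features(u) for u in utts), Counter())
    for lang, utts in TRAIN.items()
}

def identify(units):
    v = features(units)
    return max(LANG_VECS, key=lambda lang: cosine(v, LANG_VECS[lang]))

print(identify(["a1", "a3", "a1"]))  # -> lang_A
```

The design choice the abstract highlights is visible even in this toy: the backend compares whole feature vectors in one space rather than scoring each utterance under per-language n-gram language models.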
Keywords:
Automatic Language Identification, Acoustic Segment Models, Universal Phone Recognizer, Parallel Phone Recognizers, Vector Space Modeling