International Journal of Computational Linguistics & Chinese Language Processing
Vol. 6, No. 1, February 2001

Improving Translation Selection with a New Translation Model Trained by Independent Monolingual Corpora
Ming Zhou, Yuan Ding, Changning Huang
The Use of Clustering Techniques for Language Modeling – Application to Asian Language
Jianfeng Gao, Joshua T. Goodman, Jiangbo Miao
Locating Boundaries for Prosodic Constituents in Unrestricted Mandarin Texts
Min Chu, Yao Qian
Automatic Translation Template Acquisition Based on Bilingual Structure Alignment
Yajuan Lü, Ming Zhou, Sheng Li, Changning Huang, Tiejun Zhao
Improving the Effectiveness of Information Retrieval with Clustering and Fusion
Jian Zhang, Jianfeng Gao, Ming Zhou, Jiaxing Wang

Title:
Improving Translation Selection with a New Translation Model Trained by Independent Monolingual Corpora

Author:
Ming Zhou, Yuan Ding, Changning Huang

Abstract:
We propose a novel statistical translation model to improve translation selection for collocations. In the statistical approach widely applied to translation selection, bilingual corpora are used to train the translation model. However, acquiring large-scale bilingual corpora is a formidable bottleneck, in particular for language pairs involving Chinese. In this paper, we propose a new approach that trains the translation model on unrelated monolingual corpora. First, a Chinese corpus and an English corpus are each parsed with a dependency parser, and two dependency triple databases are generated. Then, the similarity between a Chinese word and an English word is estimated from the two monolingual dependency triple databases with the help of a simple Chinese-English dictionary. This cross-language word similarity is used to simulate the word translation probability. Finally, the resulting translation model is used together with a language model trained on the English dependency database to translate Chinese collocations into English. To demonstrate the effectiveness of this method, we performed various experiments with verb-object collocation translation. The experiments produced very promising results.

Keyword:
Translation selection, Statistical machine translation, Chinese-English machine translation, Cross-language word similarity
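
The idea of estimating cross-language word similarity from two monolingual dependency triple databases and a dictionary can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the triple format (head, relation, dependent), the toy data, the function names, and the coverage-style score are not taken from the paper, whose actual similarity formula is not given in the abstract.

```python
from collections import defaultdict

def contexts(triples):
    """Map each word to the set of (relation, other_word) contexts it occurs in."""
    ctx = defaultdict(set)
    for head, rel, dep in triples:
        ctx[head].add((rel, dep))            # head seen with this dependent
        ctx[dep].add((rel + "^-1", head))    # dependent seen with this head
    return ctx

def cross_similarity(zh_word, en_word, zh_ctx, en_ctx, dictionary):
    """Fraction of the Chinese word's contexts that, once translated with the
    dictionary, also occur as contexts of the English word (hypothetical score)."""
    zh_set = zh_ctx.get(zh_word, set())
    en_set = en_ctx.get(en_word, set())
    if not zh_set or not en_set:
        return 0.0
    matched = sum(
        1
        for rel, other in zh_set
        if any((rel, t) in en_set for t in dictionary.get(other, ()))
    )
    return matched / len(zh_set)

# Toy monolingual triple databases (assumed data, not from the paper).
zh_triples = [("吃", "obj", "饭"), ("吃", "obj", "苹果"), ("看", "obj", "书")]
en_triples = [("eat", "obj", "meal"), ("eat", "obj", "apple"),
              ("read", "obj", "book"), ("see", "obj", "film")]
dictionary = {"饭": {"meal", "rice"}, "苹果": {"apple"}, "书": {"book"}}

zh_ctx = contexts(zh_triples)
en_ctx = contexts(en_triples)

sim_read = cross_similarity("看", "read", zh_ctx, en_ctx, dictionary)
sim_see = cross_similarity("看", "see", zh_ctx, en_ctx, dictionary)
```

In this toy setting, "看" with the object "书" scores higher against "read" than against "see", which is the kind of preference a translation selection step would exploit.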

Title:
The Use of Clustering Techniques for Language Modeling – Application to Asian Language

Author:
Jianfeng Gao, Joshua T. Goodman, Jiangbo Miao

Abstract:
Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling. It attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used to improve the performance (i.e., perplexity) of language models as well as to compress language models. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus and on a Chinese heterogeneous corpus. While the majority of previous research on word clustering has focused on how to get the best clusters, we have concentrated our research on the best way to use the clusters. Experimental results show that some novel techniques we present work much better than previous methods, and achieve more than 40% size reduction at the same level of perplexity.
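
For readers unfamiliar with cluster-based n-gram models, the following minimal sketch shows the classic class-based bigram factorization, P(w | prev) ≈ P(c(w) | c(prev)) · P(w | c(w)). It is a textbook baseline rather than the novel techniques the abstract refers to; the hand-assigned clusters, the toy corpus, and the function name `train_class_bigram` are assumptions, and no smoothing is applied.

```python
from collections import Counter

def train_class_bigram(tokens, word2cluster):
    """Return prob(word, prev) for an unsmoothed class-based bigram model."""
    clusters = [word2cluster[w] for w in tokens]
    word_count = Counter(tokens)
    cluster_count = Counter(clusters)
    cluster_bigram = Counter(zip(clusters, clusters[1:]))
    history_count = Counter(clusters[:-1])   # clusters that have a successor

    def prob(word, prev):
        c_prev, c_word = word2cluster[prev], word2cluster[word]
        if history_count[c_prev] == 0 or cluster_count[c_word] == 0:
            return 0.0
        p_cluster = cluster_bigram[(c_prev, c_word)] / history_count[c_prev]
        p_word = word_count[word] / cluster_count[c_word]
        return p_cluster * p_word

    return prob

# Toy corpus and clusters (assumed data).
tokens = "monday meeting tuesday meeting monday lunch".split()
word2cluster = {"monday": "DAY", "tuesday": "DAY",
                "meeting": "EVENT", "lunch": "EVENT"}

prob = train_class_bigram(tokens, word2cluster)
p_unseen_pair = prob("lunch", "tuesday")   # word bigram never observed
```

The word pair "tuesday lunch" never occurs in the toy corpus, yet it receives a nonzero probability because the cluster pair DAY followed by EVENT does occur. Sharing statistics across similar words in this way is what allows a cluster-based model to be smaller than a word-based one.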

Title:
Locating Boundaries for Prosodic Constituents in Unrestricted Mandarin Texts

Author:
Min Chu, Yao Qian

Abstract:
This paper proposes a three-tier prosodic hierarchy for Mandarin, comprising prosodic word, intermediate phrase and intonational phrase tiers, that emphasizes the use of the prosodic word instead of the lexical word as the basic prosodic unit. Both the surface difference and the perceptual difference show that this is helpful for achieving high naturalness in text-to-speech conversion. Three approaches, the basic CART approach, the bottom-up hierarchical approach and the modified hierarchical approach, are presented for locating the boundaries of the three prosodic constituents in unrestricted Mandarin texts. Two sets of features are used in the basic CART method: one contains syntactic phrasal information and the other does not. The one with syntactic phrasal information results in about a 1% increase in accuracy and an 11% decrease in error cost. The modified hierarchical method achieves the highest accuracy, 83%, and the lowest error cost when no syntactic phrasal information is provided. It shows advantages in detecting the boundaries of intonational phrases at locations without breaking punctuation. 71.1% precision and 52.4% recall are achieved. Experiments on acceptability reveal that only 26% of the mis-assigned break indices are real infelicitous errors, and that the perceptual difference between the automatically assigned break indices and the manually annotated break indices is small.

Title:
Automatic Translation Template Acquisition Based on Bilingual Structure Alignment

Author:
Yajuan Lü, Ming Zhou, Sheng Li, Changning Huang, Tiejun Zhao

Abstract:
Knowledge acquisition is a bottleneck in machine translation and many NLP tasks. A method for automatically acquiring translation templates from bilingual corpora is proposed in this paper. Bilingual sentence pairs are first aligned in syntactic structure by combining language parsing with a statistical bilingual language model. The alignment results are used to extract translation templates, which turn out to be very useful in real machine translation.

Keyword:
Bilingual corpus, Translation template acquisition, Structure alignment, Machine translation

Title:
Improving the Effectiveness of Information Retrieval with Clustering and Fusion

Author:
Jian Zhang, Jianfeng Gao, Ming Zhou, Jiaxing Wang

Abstract:
Fusion and clustering are two approaches to improving the effectiveness of information retrieval. In fusion, ranked lists are combined by various means. The motivation is that different IR systems will complement each other, because they usually emphasize different query features when determining relevance and retrieve different sets of documents. In clustering, documents are clustered either before or after retrieval. The motivation is that similar documents tend to be relevant to the same query, so this approach is likely to retrieve more relevant documents by identifying clusters of similar documents. In this paper, we present a novel fusion technique that can be combined with clustering to achieve consistent improvements over conventional approaches. Our method involves three steps: (1) clustering similar documents, (2) re-ranking retrieval results, and (3) combining retrieval results.
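
As an illustration of what combining ranked lists can look like, here is a minimal sketch of one conventional fusion scheme: summing min-max normalized scores, often called CombSUM. It is not the novel technique described in the abstract, and it leaves out the clustering and re-ranking steps; the run format (document id mapped to score), the toy scores, and the function names are assumptions.

```python
def normalize(run):
    """Min-max normalize one system's scores into [0, 1]."""
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}

def comb_sum(runs):
    """Fuse several runs by summing normalized scores; return ranked doc ids."""
    fused = {}
    for run in runs:
        for doc, score in normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    # Sort by descending fused score; break ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Two toy retrieval runs on different score scales (assumed data).
run_a = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
run_b = {"d2": 0.9, "d4": 0.8, "d1": 0.1}

ranking = comb_sum([run_a, run_b])
```

Document "d2" is ranked highly by both systems and so rises to the top of the fused list, while "d4", retrieved by only one system, still enters the ranking. This reflects the complementarity argument made in the abstract.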