International Journal of Computational Linguistics & Chinese Language Processing
Vol. 8, No. 1, February 2003


Title:
Customizable Segmentation of Morphologically Derived Words in Chinese

Author:
Andi Wu

Abstract:
The output of Chinese word segmentation can vary according to different linguistic definitions of words and different engineering requirements, and no single standard can satisfy all linguists and all computer applications. Most of the disagreements in language processing come from the segmentation of morphologically derived words (MDWs). This paper presents a system that can be conveniently customized to meet various user-defined standards in the segmentation of MDWs. In this system, each MDW is associated with a word tree whose root node corresponds to the maximal word and whose leaf nodes correspond to minimal words. Each non-terminal node in the tree is associated with a resolution parameter that determines whether its daughters are to be displayed as a single word or as separate words. Different segmentation outputs can then be obtained from different cuts of the tree, which the user specifies through different value combinations of those resolution parameters. We thus have a single system that can be customized to meet different segmentation specifications.

Keyword:
segmentation standards, morphologically derived words, customizable systems, word-internal structures
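The tree-cut mechanism described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Node` class and the boolean `merge` parameter are hypothetical stand-ins for the word trees and resolution parameters the abstract describes.

```python
class Node:
    """One node of a word tree for a morphologically derived word (MDW)."""
    def __init__(self, label, children=None, merge=True):
        self.label = label              # surface string spanned by this node
        self.children = children or []  # daughter nodes; empty list => minimal word
        self.merge = merge              # resolution parameter: True => one word

def segment(node):
    """Flatten one word tree into output words under the current parameters."""
    if not node.children or node.merge:
        return [node.label]             # cut the tree here: emit a single word
    words = []
    for child in node.children:
        words.extend(segment(child))    # descend: emit daughters separately
    return words

# A derived word "AB" built from minimal words "A" and "B":
tree = Node("AB", [Node("A"), Node("B")], merge=False)
```

With `merge=False` the cut falls below the root and `segment` yields `["A", "B"]`; flipping the parameter to `merge=True` yields the maximal word `["AB"]`, which is how one system can serve multiple segmentation standards.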


Title:
Chinese Word Segmentation as Character Tagging

Author:
Nianwen Xue

Abstract:
In this paper we report results of a supervised machine-learning approach to Chinese word segmentation. A maximum entropy tagger is trained on manually annotated data to automatically assign to Chinese characters, or hanzi, tags that indicate the position of a hanzi within a word. The tagged output is then converted into segmented text for evaluation. Preliminary results show that this approach is competitive with other supervised machine-learning segmenters reported in previous studies, achieving precision and recall rates of 95.01% and 94.94%, respectively, when trained on a 237K-word training set.

Keyword:
Chinese word segmentation, supervised machine-learning, maximum entropy, character tagging
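The conversion from tagged output to segmented text can be illustrated with a short sketch. The four-tag scheme used here (B = word-initial, M = word-medial, E = word-final, S = single-character word) is one common instantiation of positional character tagging, assumed here for illustration rather than taken from the paper itself.

```python
def tags_to_words(chars, tags):
    """Convert per-character position tags into a list of segmented words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":          # a character that is a word by itself
            words.append(ch)
        elif tag == "B":        # start accumulating a multi-character word
            current = ch
        elif tag == "M":        # continue the current word
            current += ch
        else:                   # "E": close off the current word
            words.append(current + ch)
            current = ""
    return words

# e.g. a four-character sentence tagged as one two-character word
# followed by two single-character words:
print(tags_to_words(list("abcd"), ["B", "E", "S", "S"]))
```

Casting segmentation this way turns it into a standard sequence-tagging problem, which is what makes a maximum entropy tagger directly applicable.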


Title:
Measuring and Comparing the Productivity of Mandarin Chinese Suffixes

Author:
Eiji Nishimoto

Abstract:
The present study attempts to measure and compare the morphological productivity of five Mandarin Chinese suffixes: the verbal suffix -hua, the plural suffix -men, and the nominal suffixes -r, -zi, and -tou. These suffixes are predicted to differ in their degree of productivity: -hua and -men appear to be productive, being able to systematically form a word with a variety of base words, whereas -zi and -tou (and perhaps also -r) may be limited in productivity. Baayen [1989, 1992] proposes the use of corpus data in measuring productivity in word formation. Based on word-token frequencies in a large corpus of texts, his token-based measure expresses productivity as the probability that a new word form of an affix will be encountered in a corpus. We first use the token-based measure to examine the productivity of the Mandarin suffixes. The present study then proposes a type-based measure of productivity that employs the deleted estimation method [Jelinek & Mercer, 1985] in defining unseen words of a corpus and expresses productivity as the ratio of unseen word types to all word types. The proposed type-based measure yields the productivity ranking “-men, -hua, -r, -zi, -tou,” where -men is the most productive and -tou is the least productive. The effects of corpus-data variability on a productivity measure are also examined. The proposed measure is found to obtain a consistent productivity ranking despite variability in corpus data.

Keyword:
Mandarin Chinese word formation, Mandarin Chinese suffixes, morphological productivity, corpus-based productivity measure
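The idea of defining "unseen" types via two corpus halves can be sketched roughly as follows. This is a simplified illustration in the spirit of deleted estimation, not the paper's exact formula: here a type attested in one half but absent from the other counts as unseen, and the two directions are averaged.

```python
def unseen_type_ratio(half_a, half_b):
    """Rough type-based productivity sketch: the ratio of word types of one
    corpus half that are unseen in the other half, averaged both ways."""
    types_a, types_b = set(half_a), set(half_b)
    ratio_ab = len(types_b - types_a) / len(types_b)  # b-types unseen in a
    ratio_ba = len(types_a - types_b) / len(types_a)  # a-types unseen in b
    return (ratio_ab + ratio_ba) / 2

# Toy example with suffixed forms from two corpus halves (hypothetical data):
half_a = ["haizi", "zhuozi", "xiandaihua"]
half_b = ["haizi", "lüsehua", "quanqiuhua"]
print(unseen_type_ratio(half_a, half_b))
```

A productive suffix keeps generating forms not yet attested, so its unseen-type ratio stays high; an unproductive suffix's forms are mostly shared by both halves, driving the ratio toward zero.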


Title:
Extension of Zipf's Law to Word and Character N-grams for English and Chinese

Author:
Le Quan Ha, E. I. Sicilia-Garcia, Ji Ming and F. J. Smith

Abstract:
It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words of rank greater than about 5,000 and for Chinese characters of rank greater than about 1,000. However, when single words or characters are combined with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf's law approximately, with a slope close to -1 on a log-log plot for all n-grams, down to the lowest frequencies in both languages. This behaviour is also found for English 2-byte and 3-byte word fragments. It only occurs when all n-grams are used, including semantically incomplete n-grams. Previous theories do not predict this behaviour, possibly because conditional probabilities of tokens have not been properly represented.

Keyword:
Zipf's law, Chinese character, Chinese compound word, n-grams, phrases
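The combined-list analysis described in the abstract can be sketched as follows: count all n-grams up to some order in a single frequency list, sort by frequency, and fit the log-log slope, which Zipf's law predicts to be close to -1. This is a minimal illustration under those assumptions; the function names are hypothetical and the fit is an ordinary least-squares line, not necessarily the fitting method used in the paper.

```python
import math
from collections import Counter

def combined_rank_freq(tokens, max_n=3):
    """Count all n-grams (n = 1..max_n) in one combined list and
    return (rank, frequency) pairs sorted by descending frequency."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a Zipfian distribution, where frequency is inversely proportional to rank, `loglog_slope` returns approximately -1; the paper's finding is that the combined unigram-plus-n-gram list recovers this slope even where single words or characters alone fall away from it.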