International Journal of Computational Linguistics & Chinese Language Processing
                                                                                          Vol. 11, No. 4, December 2006


Title:
Tokenization and Morphological Analysis for Malagasy

Author:
Mary Dalrymple, Maria Liakata, and Lisa Mackie

Abstract:
The authors present a tokenizer and finite-state morphological analyzer [Beesley and Karttunen 2003] for Malagasy, based primarily on the discussion of Malagasy morphology in Keenan and Polinsky [1998] and Randriamasimanana [1986]. Words in Malagasy are built from roots by means of a variety of morphological operations such as compounding, affixation and reduplication. The authors analyze productive patterns of nominal and verbal morphology, and describe genitive compounding and suffixation for nouns and various derivational processes involving compounding and affixation for verbs. This work offers a computational analysis of Malagasy morphology, and forms the basis of a computational grammar and lexicon of Malagasy within the framework of the PARGRAM project.

Keyword:
Malagasy, Austronesian, Morphological Analyzer, Finite-State Morphology, PARGRAM


Title:
Multiply Quantified Internally Headed Relative Clause in Japanese: A Skolem Term Based Approach

Author:
Rui Otake and Kei Yoshimoto

Abstract:
This paper presents an analysis of the Internally Headed Relative Clause (IHRC) construction in Japanese within the framework of Combinatory Categorial Grammar [Steedman 2000]. Shimoyama [1999] argues that when an IHRC appears within the scope of a universal quantifier, the interpretation of the IHRC exemplifies E-type anaphora and that the LF representation of the IHRC should have a variable bound by the quantifier in the matrix clause. To accommodate this argument, Shimoyama posits a free variable of a functional type to which the bound variable is applied, and whose denotation is determined by the context-dependent assignment function. However, since there is in principle no limit to the number of quantifiers in the matrix clause (and accordingly to the number of bound variables in the IHRC), the semantic type of the free variable would be highly ambiguous if the IHRC occurs within the scope of multiple quantifiers. The current analysis assumes that the interpretation of IHRCs is an instance of a generalized Skolem term [Steedman 2005], a term whose denotation varies with the value of bound variables introduced by scope-taking operators, but which is interpreted as a constant in the absence of such operators. This paper provides a straightforward account of the semantics of the construction without invoking the complexities of the type ambiguity of free variables.

Keyword:
Combinatory Categorial Grammar, Generalized Skolem Term, Internally Headed Relative Clause, Japanese, Quantification


Title:
Data Management in QRLex, an Online Aid System for Volunteer Translators'

Author:
Youcef Bey, Kyo Kageura, and Christian Boitet

Abstract:
This paper proposes a new framework for a system which will help online volunteers to perform translations on their PCs while sharing resources and tools and communicating via websites. The current status of such online volunteer translators and their translation practices and tools are examined, and related work is discussed. General requirements are derived from these considerations. The approach taken in this study for dealing with heterogeneous linguistic resources relies on an XML structure maximizing efficiency and enabling all of the desired functionalities. The QRLex environment, which implements this new framework, is under development.

Keyword:
Computer-Aided Translation, Web Search for Translation, Memory Translation, Helping Volunteer Translators, Linguistic Resources


Title:
Using a Small Corpus to Test Linguistic Hypotheses: Evaluating ‘People’ in the State of the Union Addresses

Author:
Kathleen Ahrens

Abstract:
This paper argues that small corpora are useful in testing specific linguistic hypotheses, particularly those dealing with rhetoric, stylistics, and sociolinguistics. Specifically, we hypothesize that creating a database of U.S. presidential speeches will allow for a diachronic exploration of language use at the highest political level, and enable a contrast to be drawn between legislative advances for minorities in the United States and the integration of those advances into the presidential lexicon. In order to test this hypothesis, we examine the corpora of State of the Union Addresses from 1945 to 2006. We demonstrate that while there was clearly a shift two decades ago to systematically portraying human beings as being made up of two genders, or being subsumed under a gender-neutral term, other aspects of gender, such as parenthood, are still stereotyped by American presidents. In short, analyzing lexical instances related to ‘people’ in the State of the Union Addresses allows us not only to reflect on the values held by U.S. presidents, but also to systematically uncover how they use language to exercise power over the very people they are elected to serve.

Keywords:
Small Corpora, Politics, Language, Gender, Diachronic Analysis
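
Illustrative note: the diachronic comparison described in the abstract above can be pictured as counting selected lexical items per decade in a dated collection of speeches. The sketch below is only an illustration and is not taken from the paper: the function name `term_counts_by_decade`, the tokenization rule, the term list, and the placeholder sentences (invented stand-ins, not quotations from any address) are all assumptions.

```python
import re
from collections import Counter, defaultdict


def term_counts_by_decade(speeches, terms):
    """Count occurrences of the chosen terms, grouped by decade.

    speeches: iterable of (year, text) pairs (hypothetical input format).
    terms: set of lowercase word forms to track (an assumed term list).
    """
    counts = defaultdict(Counter)
    for year, text in speeches:
        decade = (year // 10) * 10
        # Naive tokenization: runs of ASCII letters and apostrophes.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in terms:
                counts[decade][token] += 1
    # Plain dicts, ordered by decade, for easy inspection.
    return {decade: dict(c) for decade, c in sorted(counts.items())}


# Placeholder texts, invented for illustration only.
toy_speeches = [
    (1946, "Every man must work."),
    (1995, "Men and women work together."),
    (1998, "Every person counts."),
]
toy_terms = {"man", "men", "women", "person"}
print(term_counts_by_decade(toy_speeches, toy_terms))
```

Raw counts of this kind would normally be normalized by speech length before decades are compared; that step is omitted here to keep the sketch short.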


Title:
A Pragmatic Chinese Word Segmentation Approach Based on Mixing Models

Author:
Wei Jiang, Yi Guan, and Xiao-Long Wang

Abstract:
This paper presents a pragmatic Chinese word segmentation approach based on mixing language models. Chinese word segmentation is composed of several hard sub-tasks, each of which poses different difficulties. The authors apply a suitable language model to each sub-task, so as to take advantage of the strengths of each model. First, a class-based trigram model is adopted for basic word segmentation, with the Absolute Discount Smoothing algorithm applied to overcome data sparseness. The Maximum Entropy Model (ME) is also used to identify Named Entities. Second, the authors propose the application of rough sets and average mutual information, etc. to extract special features. Finally, some features are extended through the combination of the word cluster and the thesaurus. The authors’ system participated in the Second International Chinese Word Segmentation Bakeoff, and achieved F-measures of 96.7 and 97.2 in the PKU and MSRA open tests, respectively.

Keyword:
Word Segmentation, N-Gram, Maximum Entropy Model, Rough Sets, Word Cluster, Machine Learning
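
Illustrative note: the last abstract above mentions absolute discount smoothing as the remedy for data sparseness in the n-gram model. As a rough sketch of that idea only, the code below applies interpolated absolute discounting to a plain word bigram model rather than the class-based trigram the authors describe. The function name `train_bigram_absolute_discount`, the discount value 0.75, the sentence-boundary markers, and the toy sentences are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter, defaultdict


def train_bigram_absolute_discount(sentences, discount=0.75):
    """Return prob(cur, prev) for a bigram model with absolute discounting.

    sentences: iterable of token lists (already segmented).
    discount: constant subtracted from every seen bigram count (assumed value).
    """
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(tokens[1:])  # count only tokens that get predicted
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[prev][cur] += 1
    total = sum(unigrams.values())

    def prob(cur, prev):
        p_uni = unigrams[cur] / total
        followers = bigrams.get(prev)
        if not followers:
            # Unseen context: fall back to the unigram estimate.
            return p_uni
        context_count = sum(followers.values())
        # Subtract the discount from the observed bigram count.
        discounted = max(followers[cur] - discount, 0.0) / context_count
        # The mass removed from seen bigrams is redistributed via unigrams.
        backoff_weight = discount * len(followers) / context_count
        return discounted + backoff_weight * p_uni

    return prob


# Toy, hand-segmented sentences used purely as placeholder training data.
toy_corpus = [["我", "爱", "北京"], ["我", "爱", "中国"]]
prob = train_bigram_absolute_discount(toy_corpus)
print(prob("爱", "我"), prob("中国", "我"))
```

With this scheme an unseen bigram still receives a small non-zero probability, and the probabilities for any seen context sum to one over the vocabulary of predicted tokens.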