International Journal of Computational Linguistics & Chinese Language Processing
Vol. 8, No. 2, August 2003


Title:
A Class-based Language Model Approach to Chinese Named Entity Identification

Author:
Jian Sun, Ming Zhou, and Jianfeng Gao

Abstract:
This paper presents a method of Chinese named entity (NE) identification using a class-based language model (LM). Our NE identification concentrates on three types of NEs: personal names (PERs), location names (LOCs), and organization names (ORGs). Each type of NE is defined as a class. Our language model consists of two sub-models: (1) a set of entity models, each of which estimates the generative probability of a Chinese character string given an NE class; and (2) a contextual model, which estimates the generative probability of a class sequence. The class-based LM thus provides a statistical framework for incorporating Chinese word segmentation and NE identification in a unified way. This paper also describes methods for identifying nested NEs and NE abbreviations. Evaluation on a broad-coverage test set shows that the proposed model performs on par with state-of-the-art Chinese NE identification systems.

Keyword:
Named entity identification, class-based language model, contextual model, entity model
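
The interplay of the two sub-models described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the probability tables, the single WORD class for ordinary words, the bigram form of the contextual model, and the names CONTEXT_BIGRAM, ENTITY_MODEL, and log_score are all assumptions made for illustration. A candidate segmentation is scored as the probability of its class sequence times the probability of each string given its class, and the highest-scoring candidate is kept.

```python
import math

# Hypothetical toy parameters, not taken from the paper.
# Contextual model: bigram probabilities over class labels.
CONTEXT_BIGRAM = {
    ("<s>", "PER"): 0.2,
    ("PER", "WORD"): 0.6,
    ("<s>", "WORD"): 0.7,
    ("WORD", "WORD"): 0.5,
    ("WORD", "</s>"): 0.3,
}
# Entity models: P(character string | class).
ENTITY_MODEL = {
    ("PER", "张三"): 0.01,
    ("WORD", "张"): 0.001,
    ("WORD", "三"): 0.002,
    ("WORD", "说"): 0.01,
}
FLOOR = 1e-8  # crude stand-in for smoothing of unseen events


def log_score(segments):
    """Log-probability of a list of (string, class) pairs covering a sentence."""
    classes = ["<s>"] + [cls for _, cls in segments] + ["</s>"]
    total = 0.0
    # Contextual model: probability of the class sequence.
    for prev, cur in zip(classes, classes[1:]):
        total += math.log(CONTEXT_BIGRAM.get((prev, cur), FLOOR))
    # Entity models: probability of each string given its class.
    for string, cls in segments:
        total += math.log(ENTITY_MODEL.get((cls, string), FLOOR))
    return total


# Two competing analyses of the same three characters.
as_name = [("张三", "PER"), ("说", "WORD")]
as_chars = [("张", "WORD"), ("三", "WORD"), ("说", "WORD")]
best = max([as_name, as_chars], key=log_score)
```

Because both candidates are scored by the same model, choosing a segmentation and choosing NE labels happen in one step, which is the unification the abstract refers to.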


Title:
Chinese Named Entity Recognition Using Role Model

Author:
Hua-Ping ZHANG, Qun LIU, Hong-Kui YU, Xue-Qi CHENG, Shuo BAI

Abstract:
This paper presents a stochastic model to tackle the problem of Chinese named entity recognition. In this research, we unify the component tokens of named entities and their contexts into a generalized role set, which is analogous to a part-of-speech (POS) tag set. The probabilities of role emission and transition are learned from a role-labeled data set, which is derived from a hand-corrected corpus after word segmentation and POS tagging have been performed. Given an input string, Viterbi role tagging is applied to the tokens produced by an initial segmentation pass. Named entities are then identified and classified through maximum matching on the best role sequence. In addition, role-based named entity recognition is integrated with the unified class-based bigram model for word segmentation, so that named entity candidates can be further selected in the final stage of Chinese lexical analysis. Evaluations conducted on one month of news from the People's Daily and on the MET-2 data set demonstrate that the role model can achieve competitive performance in Chinese named entity recognition. We then investigate the relationship between named entity recognition and Chinese lexical analysis through comparative experiments on a 1,105,611-word corpus. We found that, on the one hand, Chinese named entity recognition substantially contributes to the performance of lexical analysis; on the other hand, the subsequent process of word segmentation greatly improves the precision of Chinese named entity recognition. We have applied the role model to named entity identification in our Chinese lexical analysis system, ICTCLAS, which is free software available at the Open Platform of Chinese NLP (www.nlp.org.cn). ICTCLAS ranked first, with 97.58% word segmentation precision, in a recent official evaluation held by the National 973 Fundamental Research Program of China.

Keyword:
Chinese named entity recognition, word segmentation, role model, ICTCLAS
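
The tagging and matching steps described in the abstract above can be sketched as follows. This is a toy illustration rather than the ICTCLAS implementation: the three-role set (OTHER, SURNAME, GIVEN), all probability values, the floor used for unseen events, and the function names viterbi and extract_names are assumptions. The sketch runs standard Viterbi decoding over roles and then extracts the longest SURNAME + GIVEN... spans from the best role sequence.

```python
import math
import re


def viterbi(tokens, roles, start, trans, emit, floor=1e-8):
    """Return the most probable role sequence for the given tokens."""
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][r] = (log-prob of best path ending in role r at token i, backpointer)
    best = [{r: (lp(start, r) + lp(emit, (r, tokens[0])), None) for r in roles}]
    for tok in tokens[1:]:
        column = {}
        for r in roles:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans, (p, r))) for p in roles),
                key=lambda item: item[1],
            )
            column[r] = (score + lp(emit, (r, tok)), prev)
        best.append(column)

    # Trace back from the best final role.
    role = max(roles, key=lambda r: best[-1][r][0])
    path = [role]
    for column in reversed(best[1:]):
        role = column[role][1]
        path.append(role)
    return path[::-1]


def extract_names(tokens, path):
    """Longest-match extraction of SURNAME followed by one or more GIVEN roles."""
    codes = {"OTHER": "O", "SURNAME": "S", "GIVEN": "G"}
    code = "".join(codes[r] for r in path)
    return ["".join(tokens[m.start():m.end()]) for m in re.finditer(r"SG+", code)]


# Hypothetical role set and probabilities, for illustration only.
ROLES = ["OTHER", "SURNAME", "GIVEN"]
START = {"OTHER": 0.8, "SURNAME": 0.15, "GIVEN": 0.05}
TRANS = {
    ("OTHER", "OTHER"): 0.7, ("OTHER", "SURNAME"): 0.25, ("OTHER", "GIVEN"): 0.05,
    ("SURNAME", "GIVEN"): 0.9, ("SURNAME", "OTHER"): 0.1,
    ("GIVEN", "OTHER"): 0.8, ("GIVEN", "GIVEN"): 0.2,
}
EMIT = {
    ("OTHER", "记者"): 0.05, ("SURNAME", "张"): 0.1, ("OTHER", "张"): 0.001,
    ("GIVEN", "三"): 0.02, ("OTHER", "三"): 0.005, ("OTHER", "说"): 0.05,
}

tokens = ["记者", "张", "三", "说"]
path = viterbi(tokens, ROLES, START, TRANS, EMIT)
names = extract_names(tokens, path)
```

Encoding each role as one character lets an ordinary greedy regular expression perform the longest-match step; a real role set would also cover context roles and other entity types.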


Title:
Building A Chinese WordNet Via Class-Based Translation Model

Author:
Jason S. Chang, Tracy Lin, Geeng-Neng You, Thomas C. Chuang, Ching-Ting Hsieh

Abstract:
Semantic lexicons are indispensable to research in lexical semantics and word sense disambiguation (WSD). For the study of WSD for English text, researchers have been using different kinds of lexicographic resources, including machine readable dictionaries (MRDs), machine readable thesauri, and bilingual corpora. In recent years, WordNet has become the most widely used resource for the study of WSD and lexical semantics in general. This paper describes the Class-Based Translation Model and its application in assigning translations to nominal senses in WordNet in order to build a prototype Chinese WordNet. Experiments and evaluations show that the proposed approach can potentially be adopted to speed up the construction of WordNet for Chinese and other languages.


Title:
Preparatory Work on Automatic Extraction of Bilingual Multi-Word Units from Parallel Corpora

Author:
Boxing Chen and Limin Du

Abstract:
Automatic extraction of bilingual multi-word units is an important research topic in the field of automatic bilingual corpus alignment. A single source word often corresponds to a multi-word unit in the target language. This paper presents an algorithm for automatically aligning single source words with target multi-word units in a sentence-aligned parallel corpus of spoken language. The output can also be used to extract bilingual multi-word units. The problem with previous approaches is that the retrieval results depend mainly on the identification of suitable bigrams to initiate the iterative process. To extract multi-word units, this algorithm uses the normalized association score difference of multiple target words corresponding to the same single source word, and then uses the average association score to align the single source words with target multi-word units. The algorithm is based on the Local Bests algorithm, supplemented by two heuristic strategies: excluding words in a stop-list and preferring longer multi-word units.

Keyword:
bilingual alignment; multiword unit; translation lexicon; average association score; normalized association score difference
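
The abstract above does not define its association score, so the following is only a rough sketch of the general idea of averaging word-level association scores over a candidate target multi-word unit. The Dice coefficient over sentence-level co-occurrence is a stand-in chosen for illustration, and the function names dice_scores and average_association and the toy corpus are hypothetical. The normalized association score difference is not modeled here.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Dice coefficient for every (source word, target word) co-occurring pair."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def average_association(scores, source_word, target_unit):
    """Mean association between one source word and each word of a target unit."""
    return sum(scores.get((source_word, t), 0.0) for t in target_unit) / len(target_unit)


# Hypothetical sentence-aligned toy corpus (source tokens, target tokens).
pairs = [
    (["谢谢"], ["thank", "you"]),
    (["谢谢", "大家"], ["thank", "you", "all"]),
    (["大家", "好"], ["hello", "all"]),
]
scores = dice_scores(pairs)
tight = average_association(scores, "谢谢", ("thank", "you"))
loose = average_association(scores, "谢谢", ("thank", "you", "all"))
```

In this toy corpus the two-word unit scores higher than the three-word candidate, because the extra word is only weakly associated with the source word and pulls the average down.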


Title:
Semantic Representation of Chinese Compound Nouns Based on WordNet

Author:
柯淑津

Abstract:
WordNet provides a wealth of lexical meanings and is therefore very helpful in natural language processing research. Each lexical meaning in Princeton WordNet is expressed in English. In this work, we attempt to use a bilingual dictionary as the backbone for automatically mapping English WordNet to a Chinese form. However, the preliminary results of linking English WordNet to the bilingual dictionary reveal many barriers between the two languages. The mapping causes the Chinese translations of an English synonym set (synset) to correspond to unstructured Chinese compound words, phrases, and even long sentence-like strings rather than independent Chinese lexical words. This violates the aim of Chinese WordNet, which is to take the lexical word as the basic component. This research therefore performs further processing to study this phenomenon.

The objectives of this paper are as follows. First, we identify the core lexical words and characteristic words within Chinese compound words. Next, we express those lexical words by means of conceptual representations. We locate core lexical words through grammatical structure analysis, and we represent the lexical meanings of characteristic words using sememes from HowNet. Ambiguity inevitably arises when Chinese lexical words are mapped to their lexical meanings. To resolve this problem, we use the parts of speech of the words and WordNet hypernyms to reduce lexical ambiguity. We experimented on nouns, and the experimental results show that sense disambiguation achieves an applicability rate of 93.8% and an accuracy of 93.5%.