International Journal of Computational Linguistics & Chinese Language Processing
Vol. 4, No. 2, August 1999


Title:
A Model for Word Sense Disambiguation

Author:
Li Juanzi, Huang Changning

Abstract:
Word sense disambiguation is one of the most difficult problems in natural language processing. This paper proposes a model that maps the structured semantic space of a thesaurus into a multi-dimensional, real-valued vector space, and presents a word sense disambiguation method based on this mapping. The model acquires its disambiguation knowledge through unsupervised learning, which not only saves extensive manual work but also makes it possible to sense-tag a large number of content words. First, the Chinese thesaurus Cilin and a very large-scale corpus are used to construct the structure of the semantic space. Then, a dynamic disambiguation model is developed to disambiguate an ambiguous word according to the vectors of the monosemous words in each of its possible categories. To address the problem of data sparseness, a method is proposed to make the model more robust. Test results show that the model performs relatively well and can also be applied to other languages.

Keyword:
Natural language processing, Word sense disambiguation, Unsupervised learning, Vector space, Language modeling


Title:
Resolving Translation Ambiguity and Target Polysemy in Cross-Language Information Retrieval

Author:
Hsin-Hsi Chen, Guo-Wei Bian and Wen-Cheng Lin

Abstract:
This paper deals with the translation ambiguity and target polysemy problems together. Two monolingual balanced corpora are employed to learn word co-occurrence for translation ambiguity resolution and augmented translation restrictions for target polysemy resolution. Experiments show that the model achieves 62.92% of monolingual information retrieval performance, which is 40.80% better than that of the select-all model. When target polysemy resolution is added, retrieval performance increases by approximately 10.11% over that of the model which resolves translation ambiguity only.

Keyword:
Cross-language information retrieval, Query translation, Translation ambiguity, Target polysemy, Augmented translation restriction


Title:
General Knowledge Annotation Based on How-net (基於知網的常識知識標注)

Author:
Gan Kok Wee, Tham Wai Mun (顏國偉, 譚慧敏)

Abstract:
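How-net is a bilingual general-knowledge base that describes the various relations between concepts, including hypernym-hyponym relations, synonymy, antonymy, part-whole relations, attribute-host relations, material-product relations, converse relations, dynamic role relations, and concept co-occurrence relations. In this paper, How-net is used to annotate a corpus of 30,000 words. The corpus consists of newspaper reports on social crime taken from the Academia Sinica Balanced Corpus (version 3). We summarize the annotation method, the problems discovered during annotation, and our solutions to them.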

Keyword:
Machine Translation, Mandarin, Speech Synthesis, Taiwanese, Min Nan, Tone Sandhi.


Title:
Project Report: Sinica Treebank (中文句結構樹資料庫的構建)

Author:
Feng-Yi Chen, Pi-Fang Tsai, Keh-Jiann Chen, Chu-Ren Huang (陳鳳儀, 蔡碧芳, 陳克健, 黃居仁)

Abstract:
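The main purpose of building the Sinica Treebank is to provide research on Chinese natural language processing with an annotated corpus as research material. Grammatical knowledge can be extracted from the treebank, and extracting and understanding this knowledge in turn helps to improve our parsing system. This paper describes the method and steps used to construct the Sinica Treebank. Sentences are extracted from the five-million-word Academia Sinica Balanced Corpus (Sinica Corpus) and, with the representation of Information-based Case Grammar (ICG) as the basic framework, are automatically parsed into structure trees by computer, which keeps the structural annotation as consistent as possible. The trees are then manually corrected and checked to maintain the accuracy of the annotation. We also propose principles for handling ambiguous syntactic structures and part-of-speech tags.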