International Journal of Computational Linguistics & Chinese Language Processing
Vol. 13, No. 2, June 2008


Title:
Multiple Document Summarization Using Principal Component Analysis Incorporating Semantic Vector Space Model

Author:
Om Vikas, Akhil K Meshram, Girraj Meena, and Amit Gupta

Abstract:
Text summarization is very effective in relevance assessment tasks. The Multiple Document Summarizer presents a novel approach that selects sentences from documents according to several heuristic features. Summaries are generated by modeling the set of documents as a Semantic Vector Space Model (SVSM) and applying Principal Component Analysis (PCA) to extract topic features. A purely statistical VSM assumes that terms are independent of each other, which may lead to inconsistent results. The vector space is enhanced semantically by modifying the weights of the word vector according to Appearance and Disappearance (Action class) words. The knowledge base of Action words is maintained by classifying each word as Appearance or Disappearance with the help of WordNet. The weights of the Action words are modified in accordance with the Object list, which is compiled from the nouns corresponding to the Action words. The summary thus generated is more informative because the semantics of natural language have been taken into consideration.

Keywords:
Principal Component Analysis (PCA), Semantic Vector Space Model (SVSM), Summarization, Topic Feature, WordNet
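
The abstract above outlines a pipeline of vector space modeling, semantic re-weighting, and PCA-based topic extraction. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes whitespace tokenization, raw term counts, a hypothetical `term_weights` dictionary standing in for the Action-word re-weighting, and sentence scoring by the absolute projection onto the first principal axis, which is approximated by power iteration. All function names are illustrative.

```python
import math
from collections import Counter


def term_sentence_matrix(sentences, term_weights=None):
    """Build a sentence-by-term count matrix.

    term_weights is a hypothetical stand-in for the semantic re-weighting
    described in the abstract: a dict mapping a term to a multiplier.
    """
    term_weights = term_weights or {}
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[w] * term_weights.get(w, 1.0) for w in vocab])
    return rows, vocab


def first_principal_component(rows, iterations=200):
    """Approximate the leading principal axis by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Arbitrary non-zero start vector, normalised to unit length.
    v = [float(j + 1) for j in range(d)]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        proj = [sum(x * y for x, y in zip(row, v)) for row in centred]  # X v
        w = [sum(proj[i] * centred[i][j] for i in range(n)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data: keep the current vector
            break
        v = [x / norm for x in w]
    return v, centred


def summarize(sentences, k=2, term_weights=None):
    """Return the k sentences with the largest |projection|, in document order."""
    rows, _ = term_sentence_matrix(sentences, term_weights)
    axis, centred = first_principal_component(rows)
    scores = [abs(sum(x * y for x, y in zip(row, axis))) for row in centred]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Passing a dictionary such as `{"vanished": 2.0}` as `term_weights` would raise the weight of that term before PCA is applied; how the weights are actually derived from WordNet and the Object list is not specified in the abstract and is left out of this sketch.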


Title:
A Study on Consistency Checking Method of Part-Of-Speech Tagging for Chinese Corpora

Author:
Hu Zhang and Jiaheng Zheng

Abstract:
Ensuring the consistency of Part-Of-Speech (POS) tagging plays an important role in the construction of high-quality Chinese corpora. After analyzing the POS tagging of multi-category words in large-scale corpora, we propose a novel classification-based method for checking the consistency of POS tagging. The method builds a vector model of the context of multi-category words and uses the k-NN algorithm to classify context vectors constructed from POS tagging sequences and to judge their consistency. The method is evaluated on our 1.5M-word corpus. The experimental results indicate that the proposed method is feasible and effective.

Keywords:
Multi-Category Words, Consistency Checking, Part of Speech Tagging, Chinese Corpus, Classification
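
The classification step described in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' method: the context vector is assumed to be a bag of (offset, POS tag) features within a fixed window, the similarity measure is a simple feature overlap, and a tag is judged consistent when it agrees with the k-NN majority vote. The function names, the window size, and the tag symbols used in the example are all hypothetical.

```python
from collections import Counter


def context_vector(tags, index, window=2):
    """Collect (offset, tag) features around tags[index], excluding the word itself."""
    features = Counter()
    for offset in range(-window, window + 1):
        pos = index + offset
        if offset != 0 and 0 <= pos < len(tags):
            features[(offset, tags[pos])] += 1
    return features


def similarity(a, b):
    """Overlap similarity: count of (offset, tag) features shared by both contexts."""
    return sum(min(a[f], b[f]) for f in a if f in b)


def knn_tag(query, labelled, k=3):
    """Majority tag among the k labelled contexts most similar to the query."""
    ranked = sorted(labelled, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(tag for _, tag in ranked[:k])
    return votes.most_common(1)[0][0]


def is_consistent(query, assigned_tag, labelled, k=3):
    """Judge an assigned tag consistent when it matches the k-NN prediction."""
    return knn_tag(query, labelled, k) == assigned_tag


# Toy training contexts for one multi-category word at position 2.
train = [
    (context_vector(["p", "m", "?", "v"], 2), "n"),
    (context_vector(["p", "m", "?", "d"], 2), "n"),
    (context_vector(["r", "d", "?", "n"], 2), "v"),
    (context_vector(["r", "d", "?", "u"], 2), "v"),
]
query = context_vector(["p", "m", "?", "v"], 2)
flagged = not is_consistent(query, "v", train)  # the tag "v" disagrees with its neighbours
```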


Title:
Constructing a Temporal Relation Tagged Corpus of Chinese Based on Dependency Structure Analysis

Author:
Yuchang CHENG, Masayuki ASAHARA, and Yuji MATSUMOTO

Abstract:
This paper describes an annotation guideline for a temporal relation-tagged corpus of Chinese. Our goal is to construct corpora for corpus-based analysis of temporal relations among events. Since annotating all combinations of events is inefficient, we examine the use of dependency structure to recognize temporal relations efficiently. We annotate a part of Treebank based on our guidelines. Then, we survey a small tagged data set to investigate the coverage of our method. We find that the use of dependency structure drastically reduces the manual effort of constructing a tagged corpus with temporal relations, although the coverage of the method reaches about 63%.

Keywords:
Temporal Entities, Event Entities, Temporal Reasoning, Event Semantics, Dependency Structure
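
To illustrate why restricting annotation to dependency-linked events saves effort, the sketch below compares the number of event pairs under exhaustive pairing with the number obtained when each event is paired only with its governing event. This is a simplified assumption about how dependency structure limits the candidates, not the guideline from the paper. The `governor` mapping, the `coverage` definition (share of gold relations that fall among the candidate pairs), and all names are hypothetical.

```python
from itertools import combinations


def all_event_pairs(events):
    """Every unordered pair of events: n * (n - 1) / 2 candidates."""
    return list(combinations(events, 2))


def dependency_event_pairs(governor):
    """Pair each event only with the event that governs it (None marks the root)."""
    return [(head, event) for event, head in governor.items() if head is not None]


def coverage(candidate_pairs, gold_pairs):
    """Fraction of gold relations that appear among the candidate pairs."""
    candidates = {frozenset(p) for p in candidate_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    return len(candidates & gold) / len(gold)


# Toy example: five events arranged in a small dependency tree.
governor = {"e1": None, "e2": "e1", "e3": "e1", "e4": "e3", "e5": "e3"}
exhaustive = all_event_pairs(list(governor))   # 10 pairs to annotate
restricted = dependency_event_pairs(governor)  # 4 pairs to annotate
```

With n events, exhaustive pairing grows quadratically while the tree-based pairing yields at most n - 1 pairs; the price is that relations between events not directly linked in the tree are missed, which is what a coverage figure measures.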


Title:
The Effects of Formal Schema on Reading Comprehension—An Experiment with Chinese EFL Readers

Author:
Xiaoyan Zhang

Abstract:
This study explores the effects of formal schemata, or rhetorical patterns, on reading comprehension through a detailed analysis of a case study of 45 non-English majors from X University. The subjects were selected from three classes of comparable English level and were divided into three groups. Each group was asked to recall the text and complete a cloze test after reading one of three versions of a passage with identical content but different formal schemata: description schema, comparison and contrast schema, and problem-solution schema. Both quantitative and qualitative analyses of the recall protocols indicate that subjects recalled the text with a highly structured schema better than the text with a loosely controlled schema, which suggests that formal schemata have a significant effect on written communication and that teaching formal schemata to students is necessary to enhance their writing ability.

Keywords:
Formal Schema, Schema Theory, Reading Comprehension


Title:
A Cross-Linguistic Study of Voice Onset Time in Stop Consonant Productions

Author:
Kuan-Yi Chao and Li-mei Chen

Abstract:
This study examines voice onset time (VOT) for phonetically voiceless word-initial stops in Mandarin Chinese and in English, as spoken by 11 Mandarin speakers and 4 British English speakers. The purpose of this paper is to compare Mandarin and English VOT patterns and to categorize their stop realizations along the VOT continuum. As expected, the findings reveal that voiceless aspirated stops in Mandarin and in English occur at different places along the VOT continuum, and the differences are statistically significant. The results also suggest that the three universal VOT categories (i.e., long lead, short lag, and long lag) are not fine enough to distinguish the voiceless stops of these two languages.

Keywords:
Voice Onset Time (VOT), Voiceless Stops


Title:
Data Driven Approaches to Phonetic Transcription with Integration of Automatic Speech Recognition and Grapheme-to-Phoneme for Spoken Buddhist Sutra

Author:
Min-Siong Liang, Ren-Yuan Lyu, and Yuang-Chin Chiang

Abstract:
We propose a new approach for performing phonetic transcription of text that utilizes automatic speech recognition (ASR) to help traditional grapheme-to-phoneme (G2P) techniques. This approach was applied to transcribe Chinese text into Taiwanese phonetic symbols. By augmenting the text with speech and using automatic speech recognition with a sausage searching net constructed from multiple pronunciations of text, we are able to reduce the error rate of phonetic transcription. Using a pronunciation lexicon with multiple pronunciations for each item, a transcription error rate of 12.74% was achieved. Further improvement can be achieved by adapting the pronunciation lexicon with pronunciation variation (PV) rules derived manually from corrected transcription in a speech corpus. The PV rules can be categorized into two kinds: knowledge-based and data-driven rules. By incorporating the PV rules, an error rate of 10.56% could be achieved. Although this technique was developed for Taiwanese speech, it could easily be adapted to other Chinese spoken languages or dialects.

Keywords:
Automatic Phonetic Transcription, Phone Recognition, Grapheme-to-Phoneme (G2P), Pronunciation Variation, Chinese Text, Taiwanese (Min-Nan), Dialect, Buddhist Sutra
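
The idea of a sausage net built from multiple pronunciations, as mentioned in the abstract above, can be sketched as a sequence of slots, one per character, where each slot holds the candidate pronunciations from the lexicon and the recognizer picks one candidate per slot. The sketch below is a simplified assumption, not the authors' system: the per-slot scores are hypothetical stand-ins for ASR evidence, the lexicon entries are placeholder symbols rather than real Taiwanese phonetic symbols, and the error rate is computed slot by slot against a reference of equal length.

```python
def build_sausage(text, lexicon):
    """One slot per character; each slot lists that character's candidate pronunciations."""
    return [list(lexicon[ch]) for ch in text]


def best_path(sausage, slot_scores):
    """Pick the highest-scoring candidate in every slot.

    slot_scores[i] maps a candidate in slot i to a score (hypothetical ASR
    evidence). Candidates without a score are treated as negative infinity.
    """
    path = []
    for candidates, scores in zip(sausage, slot_scores):
        path.append(max(candidates, key=lambda p: scores.get(p, float("-inf"))))
    return path


def error_rate(hypothesis, reference):
    """Share of slots whose chosen pronunciation differs from the reference."""
    errors = sum(h != r for h, r in zip(hypothesis, reference))
    return errors / len(reference)


# Placeholder lexicon: three characters with one to three pronunciations each.
lexicon = {"X": ["x1", "x2"], "Y": ["y1"], "Z": ["z1", "z2", "z3"]}
sausage = build_sausage("XYZ", lexicon)
scores = [{"x1": -2.0, "x2": -0.5}, {"y1": -1.0}, {"z1": -3.0, "z3": -0.1}]
chosen = best_path(sausage, scores)
```

Because the slots of a sausage net are independent, the best path reduces to a per-slot maximum; adapting the lexicon with pronunciation variation rules would correspond to adding or removing candidates in `lexicon` before the net is built.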