International Journal of Computational Linguistics & Chinese Language Processing
Vol. 12, No. 2, June 2007



Title:
Using a Generative Model for Sentiment Analysis

Author:
Yi Hu, Ruzhan Lu, Yuquan Chen, and Jianyong Duan

Abstract:
This paper presents a generative model based on the language modeling approach for sentiment analysis. By characterizing the semantic orientation of documents as “favorable” (positive) or “unfavorable” (negative), the method captures the subtle sentiment information needed in text retrieval. A language-model-based method is proposed to preserve the dependency between a “term” and the ordinary words in its context through a triggered language model: first, a batch of terms in a domain is identified; second, two different language models representing the classifying knowledge for every term are built from subjective sentences; finally, a classifying function based on the generation of a test document is defined for sentiment analysis. Compared with the Support Vector Machine, a popular discriminative model, the language modeling approach performs better on a Chinese digital product review corpus under 3-fold cross-validation. This result motivates the search for more suitable language models for sentiment detection in future research.

Keywords: Sentiment Analysis, Subjective Sentence, Language Modeling, Supervised Learning.
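The two-model generative classification described in the abstract can be sketched as follows. This is a minimal unigram illustration with add-one smoothing, not the authors' triggered language model; the function names and toy data are hypothetical:

```python
import math
from collections import Counter

def train_unigram_lm(docs):
    """Estimate a unigram language model (counts, total, vocabulary)."""
    counts = Counter(w for d in docs for w in d)
    return counts, sum(counts.values()), set(counts)

def log_likelihood(doc, model, vocab_size):
    """Log probability of generating doc, with add-one smoothing."""
    counts, total, _ = model
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in doc)

def classify(doc, pos_lm, neg_lm, vocab_size):
    """Label a document by which language model generates it more probably."""
    if log_likelihood(doc, pos_lm, vocab_size) >= log_likelihood(doc, neg_lm, vocab_size):
        return "positive"
    return "negative"
```

One model is trained on positive subjective sentences and one on negative ones; a test document is assigned the label of the model under which it is more likely to have been generated.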


Title:
An Empirical Study of Non-Stationary Ngram Model and its Smoothing Techniques

Author:
Jinghui Xiao, Bingquan Liu and Xiaolong Wang

Abstract:
Recently, many new techniques have been proposed for language modeling, such as ME, MEMM, and CRF. However, the ngram model remains a staple in practical applications, and it is well worth studying how to improve its performance. This paper enhances the traditional ngram model by relaxing the stationary hypothesis on the Markov chain and exploiting word positional information. The assumption is made that the probability of the current word is determined not only by the history words but also by the word's position in the sentence. The non-stationary ngram model (NS ngram model) is proposed. Several related issues are discussed in detail, including the definition of the NS ngram model, the representation of word positional information, and the estimation of the conditional probability. In addition, three smoothing approaches are proposed to solve the data sparseness problem of the NS ngram model, with several smoothing algorithms presented for each approach. In the experiments, the NS ngram model is evaluated on the pinyin-to-character conversion task, the core technique of Chinese text input methods. Experimental results show that the NS ngram model significantly outperforms the traditional ngram model by exploiting word positional information. Moreover, the proposed smoothing techniques effectively alleviate the data sparseness problem of the NS ngram model, yielding a substantial error rate reduction.

Keywords:
Ngram, Stationary Hypothesis, Pinyin-to-character Conversion, Smoothing
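The core idea of conditioning on word position can be sketched as a position-dependent bigram, interpolated with a stationary bigram as a simple stand-in for the paper's smoothing approaches. All names and the interpolation scheme here are illustrative, not the authors' formulation:

```python
from collections import Counter, defaultdict

class NSBigram:
    """Position-dependent bigram P(w_i | w_{i-1}, i), linearly interpolated
    with a stationary bigram to combat the extra data sparseness that
    conditioning on position introduces."""

    def __init__(self, lam=0.7):
        self.lam = lam                       # weight on the positional model
        self.pos_bi = defaultdict(Counter)   # (prev_word, position) -> next-word counts
        self.bi = defaultdict(Counter)       # prev_word -> next-word counts

    def train(self, sentences):
        for sent in sentences:
            padded = ["<s>"] + sent
            for i in range(1, len(padded)):
                prev, cur = padded[i - 1], padded[i]
                self.pos_bi[(prev, i)][cur] += 1
                self.bi[prev][cur] += 1

    def prob(self, prev, cur, pos, vocab_size):
        """Interpolated, add-one-smoothed conditional probability."""
        ns = self.pos_bi[(prev, pos)]
        st = self.bi[prev]
        p_ns = (ns[cur] + 1) / (sum(ns.values()) + vocab_size)
        p_st = (st[cur] + 1) / (sum(st.values()) + vocab_size)
        return self.lam * p_ns + (1 - self.lam) * p_st
```

The positional counts are sparser than the stationary ones, which is why the abstract devotes three smoothing approaches to the problem; the interpolation above is only the simplest possible placeholder for them.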


Title:
Hierarchical Web Catalog Integration with Conceptual Relationships in a Thesaurus

Author:
Ing-Xiang Chen, Jui-Chi Ho, and Cheng-Zen Yang

Abstract:
Web catalog integration has become an integral aspect of current digital content management for Internet and e-commerce environments. The Web catalog integration problem concerns integrating the documents of a source catalog into a destination catalog. Many investigations have focused on flattened (one-dimensional) catalogs, but few works address hierarchical Web catalog integration. This study presents an enhanced hierarchical catalog integration (EHCI) approach based on conceptual thesauri extracted from the source and destination catalogs to improve performance. Experiments involving real-world catalog integration are performed to measure the performance of the improved hierarchical catalog integration scheme. Experimental results demonstrate that the EHCI approach consistently improves the average accuracy of each hierarchical category.

Keywords:
Hierarchical catalog integration, conceptual relationships, thesaurus, Support Vector Machines (SVMs)


Title:
MiniJudge: Software for Small-Scale Experimental Syntax

Author:
James Myers

Abstract:
MiniJudge is free online open-source software to help theoretical syntacticians collect and analyze native-speaker acceptability judgments in a way that combines the speed and ease of traditional introspective methods with the power and statistical validity afforded by rigorous experimental protocols. This paper shows why MiniJudge is useful, what it feels like to use it, and how it works.

Keywords:
Syntax, Experimental Linguistics, JavaScript, R, Generalized Linear Mixed Effect Modeling


Title:
Improve Parsing Performance by Self-Learning

Author:
Yu-Ming Hsieh, Duen-Chi Yang, and Keh-Jiann Chen

Abstract:
There are many methods to improve the performance of statistical parsers, and resolving structural ambiguities is a major task of these methods. In the proposed approach, the parser produces a set of n-best trees based on a feature-extended PCFG grammar and then selects the best tree structure according to the association strengths of dependency word-pairs. However, no Treebank is sufficiently large to produce reliable statistical distributions for all word-pairs. This paper provides a self-learning method to resolve this problem: word association strengths are automatically extracted and learned by parsing a giga-word corpus. Although the automatically learned word associations were not perfect, the constructed structure evaluation model improved the bracketed f-score from 83.09% to 86.59%. We believe that the above iterative learning process can improve parsing performance automatically by continuously learning word-dependence information from the Web.

Keywords:
Parsing, Word Association, Knowledge Extraction, PCFG, PoS Tagging, Semantics.
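The n-best reranking step, selecting a tree by the association strengths of its dependency word-pairs, can be sketched as follows. Sentence-level PMI serves here as a simple stand-in for associations mined from a giga-word corpus, and the scoring function and names are hypothetical, not the authors' evaluation model:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(corpus_sents):
    """Pointwise mutual information for word pairs co-occurring in a sentence;
    a toy stand-in for association strengths learned from parsed text."""
    word_c, pair_c = Counter(), Counter()
    for sent in corpus_sents:
        words = set(sent)
        word_c.update(words)
        pair_c.update(frozenset(p) for p in combinations(sorted(words), 2))
    n = len(corpus_sents)
    return {pair: math.log((cnt / n) / ((word_c[a] / n) * (word_c[b] / n)))
            for pair, cnt in pair_c.items()
            for a, b in [tuple(pair)]}

def rerank(nbest, pmi, weight=1.0):
    """Pick the tree maximizing PCFG log-score plus the weighted sum of
    association strengths of its dependency word-pairs.  Each candidate is
    (pcfg_log_score, list_of_dependency_pairs)."""
    def score(tree):
        logp, pairs = tree
        return logp + weight * sum(pmi.get(frozenset(p), 0.0) for p in pairs)
    return max(nbest, key=score)
```

Unseen pairs fall back to a neutral score of zero, so the PCFG ranking is preserved wherever the learned associations are silent.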
 


Title:
A Comparative Study of Histogram Equalization (HEQ) for Robust Speech Recognition

Author:
Shih-Hsiang Lin, Yao-Ming Yeh, and Berlin Chen

Abstract:
The performance of current automatic speech recognition (ASR) systems often deteriorates radically when the input speech is corrupted by various kinds of noise. Quite a few techniques have been proposed over the past several years to improve ASR robustness. Histogram equalization (HEQ) is one of the most efficient techniques for reducing the mismatch between training and test acoustic conditions. This paper presents a comparative study of various HEQ approaches for robust ASR. Two representative HEQ approaches, namely, the table-based histogram equalization (THEQ) and the quantile-based histogram equalization (QHEQ), were first investigated. Then, a polynomial-fit histogram equalization (PHEQ) approach was proposed, which uses a data fitting scheme to efficiently approximate the inverse of the cumulative distribution function of the training speech for HEQ. Moreover, a temporal average (TA) operation was also performed on the feature vector components to alleviate the influence of sharp peaks and valleys caused by non-stationary noise. All experiments were carried out on the Aurora 2 database and task, and the results were very encouraging. The best recognition performance was achieved by combining PHEQ with TA, yielding relative word error rate reductions of 68% and 40% over the MFCC-based baseline system for clean- and multi-condition training, respectively.

Keywords:
Automatic Speech Recognition, Robustness, Histogram Equalization, Data Fitting, Temporal Average
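The PHEQ idea, fitting a polynomial to approximate the inverse cumulative distribution function of the training feature distribution and mapping each test value through its empirical CDF, can be sketched per feature dimension as follows. This is a simplified illustration, not the authors' exact parameterization or polynomial order:

```python
import numpy as np

def fit_pheq(train_feats, order=7):
    """Fit a polynomial approximating the inverse CDF of the training
    feature distribution: feature value as a function of cumulative
    probability.  Returns the polynomial coefficients."""
    x = np.sort(train_feats)
    # empirical cumulative probabilities of the sorted training values
    p = (np.arange(1, len(x) + 1) - 0.5) / len(x)
    return np.polyfit(p, x, order)

def equalize(test_feats, coeffs):
    """Map each test value through its own empirical CDF, then through the
    fitted inverse training CDF, so the test distribution is reshaped to
    resemble the training distribution."""
    ranks = np.argsort(np.argsort(test_feats))
    p = (ranks + 0.5) / len(test_feats)   # empirical CDF of the test sequence
    return np.polyval(coeffs, p)
```

Compared with the table lookup of THEQ, storing only a handful of polynomial coefficients per dimension is what makes the approach efficient; a temporal average over neighboring frames could then be applied to the equalized features.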