Author:
Wei-Yun Ma and Kathleen McKeown
Abstract:
Statistical machine translation has made tremendous progress over the past ten years. The output of even the best systems, however, is often ungrammatical because of the lack of sufficient linguistic knowledge. Even when systems incorporate syntax in the translation process, syntactic errors still result. To address this issue, we present a novel approach for detecting and correcting ungrammatical translations. In order to simultaneously detect multiple errors and their corresponding words in a formal framework, we use feature-based lexicalized tree adjoining grammars, where each lexical item is associated with a syntactic elementary tree in which each node carries a set of feature-value pairs that define the lexical item's syntactic usage. Our syntactic error detection works by checking the feature values of all lexical items within a sentence in a unification framework. In order to simultaneously detect multiple error types and track their corresponding words, we propose a new unification method that allows the unification procedure to continue when unification fails and propagates the failure information to the relevant words. Once error types and their corresponding words are detected, errors can be corrected based on a unified consideration of all related words under the same error types. In this paper, we present simple mechanisms to handle a subset of the detected cases. We use our approach to detect and correct the translations of six statistical machine translation systems. The results show that most of the corrected translations are improved.
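To make the fail-propagating unification concrete, here is a minimal Python sketch, assuming flat feature structures represented as dictionaries attached to words; the actual FB-LTAG elementary trees and feature paths are richer than this illustration.

    def unify(features_a, features_b, word_a, word_b, failures):
        """Merge two feature structures; on a clash, record the clashing
        feature and the words involved instead of aborting, then continue."""
        merged = dict(features_a)
        for feat, val in features_b.items():
            if feat in merged and merged[feat] != val:
                # Unification failure: remember which feature clashed and
                # which words carried the conflicting values, then go on.
                failures.append((feat, word_a, word_b))
            else:
                merged[feat] = val
        return merged

    # Hypothetical example: a subject-verb agreement clash in "he walk".
    failures = []
    subj = {"num": "sg", "pers": "3"}   # features of "he"
    verb = {"num": "pl"}                # features of a non-3sg verb form
    unify(subj, verb, "he", "walk", failures)
    print(failures)  # [('num', 'he', 'walk')] -> error type and its words

Because each recorded failure carries both the error type (the clashing feature) and the offending words, a corrector can consider all words involved in the same error type at once, as the abstract describes.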
Keywords:
Machine Translation, Syntactic Error, Post Editing, Tree Adjoining Grammar, Feature Unification
Author:
Long-Yue WANG, Derek F. WONG, and Lidia S. CHAO
Abstract:
This paper proposes an integrated approach for Cross-Language Information Retrieval (CLIR) that combines four statistical models: a translation model, a query generation model, a document retrieval model, and a length filter model. Given a document in the source language, it is first translated into the target language by the statistical machine translation model. The query generation model then selects the most relevant words in the translated version of the document as a query. Instead of retrieving all target documents with the query, the length-based model filters out a large number of irrelevant candidates according to their length information. Finally, the remaining documents in the target language are scored by the document retrieval model, which computes the similarity between the query and each document.
Different from the traditional parallel corpora-based model, which relies on the IBM algorithm, we divide our CLIR model into four independent parts that nevertheless work together to handle term disambiguation, query generation, and document retrieval. The TQDL method can efficiently address translation ambiguity and query expansion for disambiguation, which are major issues in Cross-Language Information Retrieval. Another contribution is the length filter, which is trained from a parallel corpus according to the length ratio between the two languages. It not only improves recall by dynamically filtering out many useless documents, but also increases efficiency by shrinking the search space. Precision is therefore improved, but not at the cost of recall.
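As an illustration of the query generation and length filter components, here is a minimal Python sketch, assuming TF-IDF term ranking for query selection (TF-IDF appears in the keywords) and a length-ratio band trained from a parallel corpus; the threshold k and the exact formulas are assumptions, not the paper's precise formulation.

    import math
    from collections import Counter

    def length_filter(src_len, candidates, ratio_mean, ratio_std, k=2.0):
        """Keep only target documents whose length ratio to the source
        document lies within k standard deviations of the corpus ratio."""
        lo, hi = ratio_mean - k * ratio_std, ratio_mean + k * ratio_std
        return [d for d in candidates if lo <= len(d) / src_len <= hi]

    def top_query_terms(translated_doc, doc_freq, n_docs, query_size=10):
        """Select the highest-TF-IDF terms of the translated document."""
        tf = Counter(translated_doc)
        def tfidf(term):
            return tf[term] * math.log(n_docs / (1 + doc_freq.get(term, 0)))
        return sorted(tf, key=tfidf, reverse=True)[:query_size]

The documents surviving the filter would then be scored against the selected query terms, for instance by cosine similarity over TF-IDF vectors, as the document retrieval model does.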
In order to evaluate the retrieval performance of the proposed model on cross-language document retrieval, a number of experiments were conducted under different settings. First, the Europarl corpus, a collection of parallel texts in 11 languages from the proceedings of the European Parliament, was used for evaluation. We also tested the models extensively on a difficult case in which the texts are of uneven length and some have similar content under the same topic, which makes them hard to distinguish and makes it hard to fully exploit the resources.
After comparing different strategies, the experimental results show that the method performs well. Precision is normally above 90% when a larger query size is used. The length-based filter plays a very important role in improving the F-measure and optimizing efficiency.
This fully illustrates the discriminative power of the proposed method, which is of great significance both for cross-language search on the Internet and for producing parallel corpora for statistical machine translation systems. In future work, the TQDL system will be evaluated on Chinese, which poses a greater challenge and is more meaningful to CLIR.
Keywords:
Cross-Language Document Retrieval, Statistical Machine Translation, TF-IDF, Document Translation-Based, Length-Based Filter
Author:
Ho-Cheng Yu, Ting-Hao (Kenneth) Huang, and Hsin-Hsi Chen
Abstract:
Research on sentiment analysis aims to explore the emotional states of writers. Such analysis depends highly on the application domain, and analyzing the sentiment of articles in different domains may yield different results. In this study, we focus on corpora from three different domains in Traditional and Simplified Chinese, namely real estate, hotels, and restaurants; examine the polarity degrees of words in these three domains; and propose methods to capture sentiment differences. Finally, we apply the results to sentiment classification with LIBSVM (linear kernel). The experiments show that the proposed method, TF-S-S-IDF, which integrates TF-IDF, the NTU Sentiment Dictionary, and each word's domain-specific sentiment orientation degree, can effectively improve sentiment classification performance.
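A minimal sketch of sentiment-weighted features is given below, assuming the domain sentiment orientation degree simply scales the TF-IDF weight; the paper's exact TF-S-S-IDF formula and the NTU Sentiment Dictionary lookup are not reproduced here, and scikit-learn's LinearSVC stands in for LIBSVM with a linear kernel.

    import math
    from collections import Counter
    from sklearn.svm import LinearSVC

    def tf_s_s_idf_vector(doc, vocab, doc_freq, n_docs, sent_degree):
        """TF-IDF weights boosted by each word's domain sentiment degree."""
        tf = Counter(doc)
        vec = []
        for term in vocab:
            idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
            boost = 1.0 + abs(sent_degree.get(term, 0.0))  # assumed boost
            vec.append(tf[term] * idf * boost)
        return vec

    # Hypothetical toy data: two hotel reviews, labels 1 (positive) / -1.
    docs = [["room", "clean", "great"], ["room", "dirty", "awful"]]
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}
    sent = {"great": 0.9, "clean": 0.4, "dirty": -0.5, "awful": -0.8}
    X = [tf_s_s_idf_vector(d, vocab, df, len(docs), sent) for d in docs]
    clf = LinearSVC().fit(X, [1, -1])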
Keywords:
Document Sentiment Classification, Word Polarity Analysis, Machine Learning
Author:
Wan-Chen Lin, Tsung-Ting Kuo, Tung-Jia Chang, Chueh-An Yen, Chao-Ju Chen and Shou-de Lin
Abstract:
This paper exploits machine learning methods to distinguish robbery from intimidation cases and to predict their sentencing by considering defined legal factors. We introduce a framework that extracts 21 legal factor labels from robbery and intimidation cases and then uses these labels for case classification and sentencing prediction. Our experiments show that the legal factor labels can indeed improve the results of case classification and sentencing prediction. We then discuss the influence of these legal factors on both tasks.
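As a sketch of how the extracted labels can feed the downstream tasks, the snippet below encodes each case as a 21-dimensional binary factor vector for classification; the two example factors and the logistic regression learner are hypothetical placeholders, not the paper's exact setup.

    from sklearn.linear_model import LogisticRegression

    N_FACTORS = 21  # one 0/1 feature per legal factor label

    # Hypothetical toy cases: 1 = robbery, 0 = intimidation.
    X = [[1] + [0] * (N_FACTORS - 1),     # e.g., a "use of force" factor set
         [0, 1] + [0] * (N_FACTORS - 2)]  # e.g., a "verbal threat" factor set
    y = [1, 0]
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[1] + [0] * (N_FACTORS - 1)]))  # -> [1]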
Keywords:
Case Classification, Sentencing Prediction, Robbery, Intimidation
Author:
Hsin-Ju Hsieh, Jeih-weih Hung, and Berlin Chen
Abstract:
Histogram equalization (HEQ) of speech features has received considerable attention in the field of robust speech recognition due to its simplicity and excellent performance. This paper continues this general line of research, presenting a novel HEQ-based feature normalization framework that takes advantage of joint equalization of the spatial-temporal contextual statistics of speech features. In doing so, we explore simple differencing and averaging operations to capture the contextual statistics of feature vector components for speech feature normalization. All experiments are conducted on the Aurora-2 database and task. Experimental results show that, for clean-condition training, the methods instantiated from this framework achieve considerable word error rate reductions over the baseline system, which are quite comparable to those of other conventional methods.
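For illustration, here is a minimal Python sketch of per-dimension histogram equalization together with the differencing and averaging operations mentioned above; the standard normal reference distribution and the one-frame context window are assumptions of this sketch.

    import numpy as np
    from scipy.stats import norm

    def heq(features):
        """Map each feature dimension (column) through its empirical CDF
        onto a standard normal reference distribution."""
        n_frames, n_dims = features.shape
        out = np.empty_like(features, dtype=float)
        for d in range(n_dims):
            ranks = features[:, d].argsort().argsort()  # 0 .. n_frames-1
            cdf = (ranks + 0.5) / n_frames              # keep strictly in (0, 1)
            out[:, d] = norm.ppf(cdf)                   # inverse Gaussian CDF
        return out

    def temporal_context(features):
        """Differencing and averaging of adjacent frames, capturing the
        contextual statistics that are then jointly equalized."""
        diff = features[1:] - features[:-1]
        avg = 0.5 * (features[1:] + features[:-1])
        return diff, avg

    frames = np.random.randn(200, 13) * 3.0 + 1.0  # toy MFCC-like features
    diff, avg = temporal_context(frames)
    equalized = heq(np.hstack([frames[1:], diff, avg]))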
Keywords:
Speech Recognition, Noise Robustness, Histogram Equalization, Feature Contextual Statistics