Author:
Yi-Hsuan Chuang, Chao-Lin Liu, and Jing-Shin Chang
Abstract:
We studied a special case of the translation of English verbs in verb-object pairs. Researchers have studied the effects of the linguistic information of the verbs being translated, and many have reported how considering the objects of the verbs improves the quality of translation. In this study, we took an extreme approach: assuming the availability of the Chinese translation of the English object. In a related exploration, we used analogous procedures to examine how the availability of the Chinese translation of the English verb influences the translation quality of the English nouns in verb phrases. We explored the issue with 35 thousand VN pairs that we extracted from the training data obtained from the 2011 NTCIR PatentMT workshop and with 4.8 thousand VN pairs that we extracted from a bilingual version of Scientific American magazine. The results indicated that, when the English verbs and objects were known, the additional information about the Chinese translations of the English verbs (or nouns) could improve the translation quality of the English nouns (or verbs), but not significantly. Further experiments were conducted to compare the quality of translation achieved by our programs and by human subjects. Given the same set of information for translation decisions, human subjects did not outperform our programs, reconfirming that good translations depend heavily on contextual information of wider ranges.
Keywords:
Machine Translation, Feature Comparison, Near Synonyms in Chinese, E-HowNet, Human Judgments
Author:
Chia-Hui Chang, Shu-Yen Lin, Meng-Feng Tsai, Shu-Ping Li, Hsiang-Mei Liao, and Norden E. Huang
Abstract:
In recent years, a considerable number of new immigrants have arrived in Taiwan. Although these people are in a good position to learn Chinese, the advantages are limited to speaking and listening. Recognizing Chinese characters is a tough task, since one has to memorize the shape, meaning, and pronunciation at the same time. Therefore, the cost of learning a single character is relatively high compared with languages that use an alphabetic system. The goal of this study is to organize the 80% of characters that are pictophonetic more systematically, such that the pronunciation of most pictophonetic characters can be inferred automatically. We evaluate the importance of Chinese components by considering the pronunciation strength, occurring frequency, and number of strokes using linear sum, product, and harmonic mean, respectively. Furthermore, we discover pronunciation rules by association mining with priority grouping. Three groups of high-reliability rules and five groups of high-support rules are demonstrated in this paper to show the effectiveness of pronunciation rule discovery.
Keywords:
Picto-phonetic Character, Pronunciation Strength of Phonetic Component, Component-based Teaching Method, Learning Curve, Association Rule
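The rule discovery described in the abstract above relies on association mining, where a rule such as "phonetic component X implies pronunciation Y" is usually scored by its support and confidence. The following is a minimal sketch of those two standard measures only. The function name `rule_stats`, the record layout, and the toy data are assumptions made for illustration, and the paper's priority grouping step is not reproduced here.

```python
def rule_stats(records, component, pronunciation):
    """Return (support, confidence) for the rule: component -> pronunciation.

    records is a list of (phonetic_component, pronunciation) pairs,
    one pair per character (hypothetical layout, tones ignored).
    """
    total = len(records)
    # Characters that contain the component at all.
    with_component = sum(1 for c, _ in records if c == component)
    # Characters that contain the component AND have the pronunciation.
    with_both = sum(
        1 for c, p in records if c == component and p == pronunciation
    )
    support = with_both / total if total else 0.0
    confidence = with_both / with_component if with_component else 0.0
    return support, confidence


# Toy data for illustration only, not taken from the paper.
toy_records = [
    ("青", "qing"), ("青", "qing"), ("青", "qing"), ("青", "qing"),
    ("青", "jing"), ("青", "cai"),
    ("馬", "ma"), ("馬", "ma"),
]

support, confidence = rule_stats(toy_records, "青", "qing")
```

In this toy set the rule holds for 4 of the 8 characters overall and for 4 of the 6 characters that contain the component, which is the distinction between a high-support rule and a high-reliability (high-confidence) rule.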
Author:
Mike Tian-Jian Jiang, Cheng-Wei Shih, Ting-Hao Yang, Chan-Hung Kuo, Richard Tzong-Han Tsai, and Wen-Lian Hsu
Abstract:
This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS) and its variation of left-right co-existed feature (LRAVS), term-contributed frequency (TCF), and term-contributed boundary (TCB) with a specific manner of boundary overlapping. For the experiments, the baseline is the 6-tag, a state-of-the-art labeling scheme of CRF-based CWS, and the data set is acquired from the 2005 CWS Bakeoff of Special Interest Group on Chinese Language Processing (SIGHAN) of the Association for Computational Linguistics (ACL) and SIGHAN CWS Bakeoff 2010. The experimental results show that all of these features improve the performance of the baseline system in terms of recall, precision, and their harmonic average as F1 measure score, on both accuracy (F) and out-of-vocabulary recognition (F_OOV). In particular, this work presents compound features involving LRAVS/AVS and TCF/TCB that are competitive with other types of features for CRF-based CWS in terms of F and F_OOV, respectively.
Keywords:
Conditional Random Fields, Word Segmentation, Accessor Variety, Term-contributed Frequency, Term-contributed Boundary
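To make the accessor variety idea in the abstract above concrete, the sketch below uses one common definition: the accessor variety of a string is the smaller of the number of distinct characters seen immediately to its left and immediately to its right in a corpus. This is an illustrative simplification, not the paper's AVS or LRAVS feature extraction. The function name `accessor_variety` and the boundary markers `<S>` and `</S>` are assumptions, and each sentence boundary is counted as a single neighbor type here.

```python
def accessor_variety(corpus, s):
    """Return min(distinct left neighbors, distinct right neighbors) of s.

    corpus is a list of unsegmented sentences (plain strings).
    Sentence boundaries are counted as one neighbor type each
    (a simplifying assumption of this sketch).
    """
    left, right = set(), set()
    for sentence in corpus:
        start = sentence.find(s)
        while start != -1:
            end = start + len(s)
            # Character just before the occurrence, or a start marker.
            left.add(sentence[start - 1] if start > 0 else "<S>")
            # Character just after the occurrence, or an end marker.
            right.add(sentence[end] if end < len(sentence) else "</S>")
            # Continue searching, allowing overlapping occurrences.
            start = sentence.find(s, start + 1)
    return min(len(left), len(right))


# Toy corpus with Latin letters standing in for characters.
toy_corpus = ["abcd", "xbcy", "abcz"]
av = accessor_variety(toy_corpus, "bc")
```

A string that appears in many different contexts on both sides gets a high value, which is why such counts from unlabeled data can serve as evidence that the string is a word candidate.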
Author:
Chuan-Jie Lin, Jia-Cheng Zhan, Yen-Heng Chen, and Chien-Wei Pao
Abstract:
This paper proposes an approach to identify word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when doing word segmentation on Traditional Chinese text. When handling personal names, a probability model concerning formats of names is introduced. We also propose a method to map Japanese Kanji into the corresponding Traditional Chinese characters. The same method can also be used to detect words written in character variants. After integrating generation rules for various types of special words, as well as their probability models, the F-measure of our word segmentation system rises from 94.16% to 96.06%. Another experiment shows that 83.18% of the 862 Japanese names in a set of 109 human-annotated documents can be successfully detected.
Keywords:
Semantic Chinese Word Segmentation, Japanese Name Identification, Character Variants
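The F-measure reported in the abstract above is the usual way to score word segmentation. As a minimal sketch, assuming the common convention that a predicted word counts as correct only when both of its boundaries match the reference, precision, recall, and F-measure can be computed from character spans. The helper names `word_spans` and `segmentation_prf` are hypothetical, and the paper's exact scoring script is not known from the abstract.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def segmentation_prf(gold_words, predicted_words):
    """Return (precision, recall, F-measure) for one segmented sentence."""
    gold = word_spans(gold_words)
    pred = word_spans(predicted_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example with Latin letters standing in for characters.
p, r, f = segmentation_prf(["ab", "c", "de"], ["ab", "cd", "e"])
```

Here only the first word matches in both boundaries, so one of three predicted words and one of three reference words is correct.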
Author:
Yu-Yun Chang
Abstract:
This paper explores the relationship between intelligibility and comprehensibility in speech synthesizers, and it designs an appropriate comprehension task for evaluating the speech synthesizers' comprehensibility. Previous studies have predicted that a speech synthesizer with higher intelligibility will have higher performance in comprehension. Also, since the two most popular speech synthesis methods are HMM-based and unit selection, this study tries to compare whether the HTS-2008 (HMM-based) or Multisyn (unit selection) speech synthesizer has better performance in application. Natural speech is applied in the experiment as a control group to the speech synthesizers. The results in the intelligibility test show that natural speech is better than HTS-2008, which, in turn, is much better than the Multisyn system. In the comprehension task, however, all three of the speech systems display minimal differences in the speech comprehension process. This is because the two speech synthesizers have reached the threshold of having enough intelligibility to provide high speech comprehension quality. Therefore, although there is equal comprehensible speech quality between the HTS-2008 and Multisyn systems, the HTS-2008 speech synthesizer is recommended due to its higher intelligibility.
Keywords:
Speech Synthesizers, Intelligibility Evaluation, Comprehension Evaluation, HTS-2008, Multisyn