International Journal of Computational Linguistics & Chinese Language Processing
Vol. 9, No. 2, August 2004


Title:
Toward Constructing A Multilingual Speech Corpus for Taiwanese (Min-nan), Hakka, and Mandarin Chinese

Author:
Ren-Yuan Lyu, Min-Siong Liang, Yuang-Chin Chiang

Abstract:
The Formosa speech database (ForSDat) is a multilingual speech corpus collected at Chang Gung University and sponsored by the National Science Council of Taiwan. The corpus is expected to cover the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and Mandarin. This 3-year project has the goal of collecting a phonetically abundant speech corpus of more than 1,800 speakers and hundreds of hours of speech. Recently, the first version of this corpus, containing the speech of 600 speakers of Taiwanese and Mandarin, was completed and is ready to be released. It contains about 49 hours of speech and 247,000 utterances.

Keyword:
Phonetic Alphabet, Pronunciation Lexicon, Phonetically Balanced Word, Speech Corpus


Title:
Multiple-Translation Spotting for Mandarin-Taiwanese Speech-to-Speech Translation

Author:
Jhing-Fa Wang, Shun-Chieh Lin, Hsueh-Wei Yang, and Fan-Min Li

Abstract:
The critical issues involved in speech-to-speech translation are obtaining proper source segments and synthesizing accurate target speech. This article therefore develops a novel multiple-translation spotting method to deal with these issues efficiently. The term multiple-translation spotting refers to the task of extracting the target-language synthesis patterns that correspond to a given set of source-language spotted patterns from multiple pairs of speech patterns known to be translations of each other. Based on the extracted synthesis patterns, the target speech can be properly synthesized using a waveform segment concatenation-based synthesis method. Experiments were conducted on Mandarin and Taiwanese. The results reveal that the proposed approach achieves average translation understanding rates of 80% and 76% for Mandarin-to-Taiwanese and Taiwanese-to-Mandarin translation, respectively.
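
The spotting idea can be illustrated with a minimal sketch, which is not the authors' system: the pattern table, the pattern strings, and the synthesize() stand-in below are hypothetical placeholders, and the actual method operates on speech patterns with waveform-segment concatenation rather than text.

```python
from typing import Dict, List

# Hypothetical table of aligned translation pattern pairs: each
# source-language pattern maps to a target-language synthesis pattern
# whose pre-recorded segments could be reused.
PATTERN_PAIRS: Dict[str, str] = {
    "where is": "ti to-ui",          # illustrative entries only
    "how much": "gua-tse tsinn",
}

def spot_patterns(source_tokens: List[str], pairs: Dict[str, str]) -> List[str]:
    """Return the target synthesis patterns whose source-side pattern is
    spotted (all of its tokens appear) in the recognized source utterance."""
    vocab = set(source_tokens)
    return [tgt for src, tgt in pairs.items() if set(src.split()) <= vocab]

def synthesize(target_patterns: List[str]) -> str:
    """Stand-in for waveform segment concatenation: join the pattern text."""
    return " ".join(target_patterns)

recognized = ["where", "is", "the", "train", "station"]
print(synthesize(spot_patterns(recognized, PATTERN_PAIRS)))   # -> "ti to-ui"
```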

Keyword:
Multiple-Translation Spotting, Speech-to-Speech Translation


Title:
Latent Semantic Modeling and Smoothing of Chinese Language

Author:
Jen-Tzung Chien, Meng-Sung Wu, and Hua-Jui Peng

Abstract:
Language modeling plays a critical role in automatic speech recognition. Typically, n-gram language models suffer from the lack of a good representation of historical words and an inability to estimate unseen parameters due to insufficient training data. In this study, we explore the application of latent semantic information (LSI) to language modeling and parameter smoothing. Our approach adopts latent semantic analysis to transform all words and documents into a common semantic space. The word-to-word, word-to-document and document-to-document relations are, accordingly, exploited for language modeling and smoothing. For language modeling, we present a new representation of historical words based on retrieval of the most relevant document. We also develop a novel parameter smoothing method in which the language models of seen and unseen words are estimated by interpolating the k nearest seen words in the training corpus. The interpolation coefficients are determined according to the closeness of the words in the semantic space. As shown by experiments, the proposed modeling and smoothing methods can significantly reduce the perplexity of language models at moderate computational cost.
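
The smoothing idea can be sketched as follows; this is a hedged, toy illustration rather than the authors' exact formulation. A truncated SVD of a word-document count matrix gives word coordinates in a latent semantic space, and an unseen bigram probability is approximated by interpolating the k nearest seen successor words, weighted by cosine closeness in that space.

```python
import numpy as np

def lsa_word_vectors(word_doc_counts: np.ndarray, rank: int) -> np.ndarray:
    """Rows of U * S give word coordinates in the latent semantic space."""
    U, S, _ = np.linalg.svd(word_doc_counts, full_matrices=False)
    return U[:, :rank] * S[:rank]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def smooth_bigram(w: int, seen_probs: dict, word_vecs: np.ndarray, k: int = 3) -> float:
    """Estimate P(w | h) for a word w unseen after history h by interpolating
    the k seen successor words closest to w in the semantic space."""
    scored = [(cosine(word_vecs[w], word_vecs[v]), p) for v, p in seen_probs.items()]
    scored.sort(reverse=True)
    top = scored[:k]
    total = sum(sim for sim, _ in top) + 1e-12
    return sum(sim / total * p for sim, p in top)

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(6, 10)).astype(float)   # toy word-document counts
vecs = lsa_word_vectors(counts, rank=2)
seen_after_h = {1: 0.5, 2: 0.3, 3: 0.2}                  # P(v | h) for seen successors v
print(smooth_bigram(0, seen_after_h, vecs, k=2))         # smoothed P(w=0 | h)
```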

Keyword:
language modeling, parameter smoothing, speech recognition, and latent semantic analysis.


Title:
Multi-Modal Emotion Recognition from Speech and Text

Author:
Ze-Jing Chuang and Chung-Hsien Wu

Abstract:
This paper presents an approach to emotion recognition from speech signals and textual content. In the analysis of speech signals, thirty-three acoustic features are extracted from the speech input. After Principal Component Analysis (PCA) is performed, 14 principal components are selected for discriminative representation. In this representation, each principal component is a combination of the 33 original acoustic features, and together the components form a feature subspace. Support Vector Machines (SVMs) are adopted to classify the emotional states. In text analysis, all emotional keywords and emotion modification words are manually defined. The emotion intensity levels of emotional keywords and emotion modification words are estimated based on a collected emotion corpus. The final emotional state is determined based on the emotion outputs from the acoustic and textual analyses. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.
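
A minimal sketch of the acoustic half of such a pipeline is given below, using scikit-learn: PCA reduces 33 acoustic features to 14 principal components and an SVM classifies the emotional state. The toy data, the number of emotion classes, and the simple score-averaging fusion are assumptions for illustration, not the authors' configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 33))           # 200 utterances x 33 acoustic features (toy data)
y = rng.integers(0, 4, size=200)         # 4 emotion classes, e.g. neutral/happy/angry/sad

# 33 features -> 14 principal components -> SVM emotion classifier.
acoustic_clf = make_pipeline(StandardScaler(), PCA(n_components=14), SVC(probability=True))
acoustic_clf.fit(X, y)

# The textual analysis would produce its own per-class scores from
# emotional keywords; a simple fusion could average the two score vectors.
acoustic_scores = acoustic_clf.predict_proba(X[:1])[0]
text_scores = np.full(4, 0.25)           # placeholder keyword-based scores
fused = (acoustic_scores + text_scores) / 2.0
print("predicted emotion class:", int(np.argmax(fused)))
```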


Title:
Multiband Approach to Robust Text-independent Speaker Identification

Author:
Wan-Chen Chen, Ching-Tang Hsieh, and Eugene Lai

Abstract:
This paper presents an effective method for improving the performance of a speaker identification system. Based on the multiresolution property of the wavelet transform, the input speech signal is decomposed into various frequency bands so that noise distortions are not spread over the entire feature space. To capture the characteristics of the vocal tract, the linear predictive cepstral coefficients (LPCCs) of each band are calculated. Furthermore, the cepstral mean normalization technique is applied to all computed features in order to provide similar parameter statistics in all acoustic environments. To effectively utilize these multiband speech features, we use feature recombination and likelihood recombination methods for text-independent speaker identification. The feature recombination scheme combines the cepstral coefficients of each band to form a single feature vector used to train the Gaussian mixture model (GMM). The likelihood recombination scheme combines the likelihood scores of the independent GMMs for each band. Experimental results show that both proposed methods achieve better performance than a GMM using full-band LPCCs or mel-frequency cepstral coefficients (MFCCs) when speaker identification is evaluated in both clean and noisy environments.
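
The likelihood recombination scheme can be sketched as follows, with random placeholder features standing in for the wavelet-derived, mean-normalized per-band LPCCs: one GMM is trained per speaker per band, and the per-band log-likelihoods are summed before the speaker decision. The band count, GMM size, and data are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cepstral_mean_normalize(feats: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean of each cepstral coefficient."""
    return feats - feats.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
n_bands, n_speakers, dim = 3, 2, 12

# Train one GMM per (speaker, band) on that speaker's band features.
models = [[GaussianMixture(n_components=4, random_state=0)
           .fit(cepstral_mean_normalize(rng.normal(loc=s, size=(300, dim))))
           for _ in range(n_bands)] for s in range(n_speakers)]

# Identify a test utterance: sum the per-band log-likelihoods per speaker.
test_bands = [cepstral_mean_normalize(rng.normal(loc=1, size=(100, dim)))
              for _ in range(n_bands)]
scores = [sum(models[s][b].score(test_bands[b]) for b in range(n_bands))
          for s in range(n_speakers)]
print("identified speaker:", int(np.argmax(scores)))
```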

Keyword:
speaker identification, wavelet transform, linear predictive cepstral coefficient (LPCC), mel-frequency cepstral coefficient (MFCC), Gaussian mixture model (GMM).


Title:
An Innovative Distributed Speech Recognition Platform for Portable, Personalized and Humanized Wireless Devices

Author:
Yin-Pin Yang

Abstract:
In recent years, the rapid growth of wireless communications has undoubtedly increased the need for speech recognition techniques. In wireless environments, the portability of a computationally powerful device can be realized by distributing data/information and computation resources over wireless networks. Portability can then evolve through personalization and humanization to meet people's needs. An innovative distributed speech recognition (DSR) [ETSI, 1998], [ETSI, 2000] platform, configurable DSR (C-DSR), is thus proposed here to enable various types of wireless devices to be remotely configured and to employ sophisticated recognizers on servers operated over wireless networks. For each recognition task, a configuration file, which contains information regarding types of services, types of mobile devices, speaker profiles and recognition environments, is sent from the client side with each speech utterance. Through this configurability, personalization and humanization can be easily achieved by allowing both ordinary and advanced users to be involved in designing unique speech interaction functions for wireless devices.
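
A minimal sketch of what such a per-utterance configuration record might look like is shown below; the field names, values, and framing scheme are assumptions drawn from the categories the abstract lists (service type, device type, speaker profile, recognition environment), not the actual C-DSR or ETSI formats.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CDSRConfig:
    service_type: str        # e.g. "voice_dialing", "name_query"
    device_type: str         # e.g. "pda", "feature_phone"
    speaker_profile: str     # identifier of a server-side speaker profile
    environment: str         # e.g. "in_car", "office", "street"

def build_request(config: CDSRConfig, utterance_features: bytes) -> bytes:
    """Bundle the configuration with the encoded speech features so the
    server can select an appropriately configured recognizer."""
    header = json.dumps(asdict(config)).encode("utf-8")
    return len(header).to_bytes(4, "big") + header + utterance_features

cfg = CDSRConfig("name_query", "pda", "user_0042", "office")
packet = build_request(cfg, b"\x00\x01\x02")   # placeholder feature payload
print(len(packet), "bytes")
```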

Keyword:
Distributed speech recognition, configurable, wireless, portable, personalized, humanized.