International Journal of Computational Linguistics & Chinese Language Processing
Vol. 11, No. 1, March 2006


Title:
Using Duration Information in Cantonese Connected-Digit Recognition

Author:
Yu Zhu and Tan Lee

Abstract:
This paper presents an investigation of the use of explicit statistical duration models for Cantonese connected-digit recognition. Cantonese is a major Chinese dialect. The phonetic compositions of Cantonese digits are generally very simple; some digits contain only a single vowel or nasal segment. This makes it difficult to attain high accuracy in the automatic recognition of Cantonese digit strings, and recognition errors are mainly due to the insertion or deletion of short digits. It is widely acknowledged that the hidden Markov model does not impose effective control on the duration of the speech segments being modeled. Our approach uses a set of statistical duration models that are built explicitly from automatically segmented training data. They parametrically describe the distributions of various absolute and relative duration features. The duration models are used to assess recognition hypotheses and produce probabilistic duration scores, which are added to the acoustic score with an empirically determined weight. In this way, a hypothesis that is competitive in acoustic likelihood but unfavorable in temporal organization will be pruned. The conventional Viterbi search algorithms for connected-word recognition are modified to incorporate both state-level and word-level duration features. Experimental results show that absolute state duration gives the most noticeable improvement in digit recognition accuracy. With the use of duration information, insertion errors are much reduced, while deletion errors increase slightly. It is also found that explicit duration models are more effective for slow speech than for fast speech.
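The score combination described in the abstract can be pictured with a small Python sketch. It only illustrates the general idea, not the paper's implementation: the Gaussian form of the duration model, the state names, and the weight of 0.1 are all assumptions.

import math

def log_gaussian(x, mean, var):
    """Log-density of a univariate Gaussian duration model (illustrative choice)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def rescore(acoustic_logprob, state_durations, duration_models, weight=0.1):
    """Add a weighted duration score to a hypothesis's acoustic score.

    duration_models maps a state name to the (mean, variance) of its duration
    in frames, as estimated from segmented training data; the weight is the
    kind of value one would tune empirically on development data.
    """
    duration_logprob = sum(log_gaussian(d, *duration_models[s])
                           for s, d in state_durations)
    return acoustic_logprob + weight * duration_logprob

# Two hypotheses with equal acoustic scores: the one whose state durations are
# implausibly short is pushed down by the duration score.
models = {"s1": (12.0, 9.0), "s2": (8.0, 4.0)}
plausible = rescore(-250.0, [("s1", 11), ("s2", 7)], models)
too_short = rescore(-250.0, [("s1", 2), ("s2", 1)], models)
print(plausible > too_short)   # True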

Keyword:
Explicit Duration Modeling, Duration Features, Connected-Digit Recognition, Cantonese, Hidden Markov Models


Title:
Modeling Cantonese Pronunciation Variations for Large-Vocabulary Continuous Speech Recognition

Author:
Tan Lee, Patgi Kam and Frank K. Soong

Abstract:
This paper presents different methods of handling pronunciation variations in Cantonese large-vocabulary continuous speech recognition (LVCSR). An LVCSR system involves three knowledge sources: a pronunciation lexicon, acoustic models, and language models. In addition, a decoding algorithm is used to search for the most likely word sequence. Pronunciation variation can be handled by explicitly modifying the knowledge sources or by improving the decoding method. Two types of pronunciation variation are defined, namely phone changes and sound changes. A phone change means that one phoneme is realized as another phoneme; a sound change happens when the acoustic realization is ambiguous between two phonemes. Phone changes are handled by constructing a pronunciation variation dictionary to include alternative pronunciations at the lexical level, or by dynamically expanding the search space to include those pronunciation variants. Sound changes are handled by adjusting the acoustic models through sharing or adaptation of the Gaussian mixture components. Experimental results show that the use of a pronunciation variation dictionary and the method of dynamic search space expansion can improve speech recognition performance substantially. The methods of acoustic model refinement were found to be relatively less effective in our experiments.
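A pronunciation variation dictionary of the kind described above can be sketched as a mapping from words to alternative phone sequences. The toy lexicon below is hypothetical (Jyutping-style labels, with the well-known Cantonese n-/l- initial merger as the example variant) and shows only static expansion; it is not the paper's dictionary or decoder.

# Illustrative toy fragment of a pronunciation variation dictionary: each word
# maps to its canonical phone sequence plus observed surface variants.
variation_lexicon = {
    "nei5": [("n", "ei"), ("l", "ei")],   # canonical first, n-/l- variant second
    "hai6": [("h", "ai")],
}

def expand_search_space(word_sequence, lexicon):
    """Statically enumerate all pronunciation paths for a word sequence.

    A decoder doing dynamic expansion would instead add these variant arcs
    on the fly during the search.
    """
    paths = [[]]
    for word in word_sequence:
        paths = [path + list(variant)
                 for path in paths
                 for variant in lexicon[word]]
    return paths

print(expand_search_space(["nei5", "hai6"], variation_lexicon))
# [['n', 'ei', 'h', 'ai'], ['l', 'ei', 'h', 'ai']]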

Keyword:
Automatic Speech Recognition, Pronunciation Variation, Cantonese


Title:
A Maximum Entropy Approach for Semantic Language Modeling

Author:
Chuang-Hua Chueh, Hsin-Min Wang and Jen-Tzung Chien

Abstract:
The conventional n-gram language model exploits only the immediate context of historical words without exploring long-distance semantic information. In this paper, we present a new information source extracted from latent semantic analysis (LSA) and adopt the maximum entropy (ME) principle to integrate it into an n-gram language model. With the ME approach, each information source serves as a set of constraints, which should be satisfied to estimate a hybrid statistical language model with maximum randomness. For comparative study, we also carry out knowledge integration via linear interpolation (LI). In the experiments on the TDT2 Chinese corpus, we find that the ME language model that combines the features of trigram and semantic information achieves a 17.9% perplexity reduction compared to the conventional trigram language model, and it outperforms the LI language model. Furthermore, in evaluation on a Mandarin speech recognition task, the ME and LI language models reduce the character error rate by 16.9% and 8.5%, respectively, over the bigram language model.
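The maximum-entropy combination of information sources has the familiar log-linear form P(w | h) = exp(sum_i lambda_i * f_i(h, w)) / Z(h). The sketch below shows that form in Python; the feature functions and weights are hypothetical placeholders, and in practice the weights would be trained (e.g. with an iterative scaling algorithm) rather than supplied by hand.

import math

def me_language_model(history, vocab, feature_fns, lambdas):
    """Maximum-entropy combination of information sources:
        P(w | h) = exp(sum_i lambda_i * f_i(h, w)) / Z(h)
    Each feature function f_i encodes one constraint (e.g. an n-gram feature
    or an LSA-based semantic feature); feature_fns and lambdas are
    hypothetical here.
    """
    scores = {w: sum(lam * f(history, w)
                     for lam, f in zip(lambdas, feature_fns))
              for w in vocab}
    z = sum(math.exp(s) for s in scores.values())      # partition function Z(h)
    return {w: math.exp(s) / z for w, s in scores.items()}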

Keyword:
Language Modeling, Latent Semantic Analysis, Maximum Entropy, Speech Recognition


Title:
Robust Target Speaker Tracking in Broadcast TV Streams

Author:
Junmei Bai, Hongchen Jiang, Shilei Zhang, Shuwu Zhang and Bo Xu

Abstract:
This paper addresses the problem of audio change detection and speaker tracking in broadcast TV streams. A two-pass audio change detection algorithm is proposed, in which potential change boundaries are first detected and then refined. Speaker tracking is performed based on the results of speaker change detection. In speaker tracking, Wiener filtering, pitch-based endpoint detection, and segmental cepstral feature normalization are applied to obtain more reliable results. The algorithm has low complexity, and our experiments show that it achieves very satisfactory results.
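Of the front-end steps listed above, segmental cepstral feature normalization is straightforward to sketch. The following is a generic version rather than the paper's exact recipe; the 300-frame segment length and the mean-only normalization are assumptions.

import numpy as np

def segmental_cepstral_normalization(cepstra, segment_len=300):
    """Mean-normalize cepstral features within fixed-length segments rather
    than over the whole stream, so that channel or background changes in a
    long broadcast do not contaminate neighbouring segments.
    cepstra: array-like of shape (frames, coefficients)."""
    cepstra = np.asarray(cepstra, dtype=float)
    normalized = np.empty_like(cepstra)
    for start in range(0, len(cepstra), segment_len):
        segment = cepstra[start:start + segment_len]
        normalized[start:start + segment_len] = segment - segment.mean(axis=0)
    return normalized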

Keyword:
Speaker Tracking, Audio Segmentation, Entropy, GMM


Title:
A Fast Framework for the Constrained Mean Trajectory Segment Model by Avoidance of Redundant Computation on Segment

Author:
Yun Tang, Wenju Liu, Yiyan Zhang and Bo Xu

Abstract:
The segment model (SM) is a family of methods that use segmental distributions rather than frame-based densities (as in the HMM) to represent the underlying characteristics of the observation sequence. It has been shown to be more precise than the HMM. However, its high computational complexity prevents it from being used in practical systems. In this paper, we propose a framework that reduces the computational complexity of the Constrained Mean Trajectory Segment Model (CMTSM), one type of SM, by fixing the number of regions in a segment so that intermediate computation results can be shared. Our work is twofold. First, we compare the complexity of the SM with that of the HMM and identify the source of the SM's complexity. Second, a fast CMTSM framework is proposed, and two examples are used to illustrate it. The fast CMTSM achieves a 95.0% string accuracy rate in a speaker-independent test on our Mandarin digit string corpus, which is much higher than the performance obtained with an HMM-based system. At the same time, the computational complexity of the SM is kept at the same level as that of the HMM.
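The idea of avoiding redundant computation by fixing the number of regions per segment can be pictured with a toy Python sketch: once region boundaries are deterministic, region-level scores over the same frame span can be cached and reused across overlapping segment hypotheses. The per-frame score table, the three-region split, and the caching scheme below are illustrative assumptions, not the CMTSM algorithm itself.

from functools import lru_cache
import numpy as np

# Hypothetical per-frame log-likelihood table: frame_loglik[t, r] is the score
# of frame t under region model r (three regions per segment as an example).
rng = np.random.default_rng(0)
frame_loglik = rng.normal(size=(200, 3))

@lru_cache(maxsize=None)
def region_score(start, end, region):
    """Log-likelihood of frames [start, end) under one region model; caching
    these spans is the kind of redundant computation that a fixed number of
    regions per segment lets different segment hypotheses share."""
    return float(frame_loglik[start:end, region].sum())

def segment_score(start, end, n_regions=3):
    """Score a candidate segment by splitting it evenly into n_regions regions;
    overlapping hypotheses reuse the cached region scores."""
    bounds = np.linspace(start, end, n_regions + 1).astype(int)
    return sum(region_score(int(b), int(e), r)
               for r, (b, e) in enumerate(zip(bounds[:-1], bounds[1:])))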

Keyword:
Speech Recognition, Segment Model, Mandarin Digit String Recognition


Title:
Voice Activity Detection Based on Auto-Correlation Function Using Wavelet Transform and Teager Energy Operator

Author:
Bing-Fei Wu and Kun-Ching Wang

Abstract:
In this paper, a new robust wavelet-based voice activity detection (VAD) algorithm derived from the discrete wavelet transform (DWT) and the Teager energy operator (TEO) is presented. The speech signal is decomposed into four subbands using the DWT. By means of the multi-resolution analysis property of the DWT, the voiced, unvoiced, and transient components of speech can be distinctly discriminated. In order to develop a robust feature parameter called the speech activity envelope (SAE), the TEO is then applied to the DWT coefficients of each subband. The periodicity of the speech signal is further exploited by using the subband signal auto-correlation function (SSACF). Experimental results show that the proposed SAE feature parameter can extract speech activity under poor SNR conditions and that it is also insensitive to variable noise levels.
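A minimal sketch of the kind of processing chain described above (DWT subband decomposition, TEO, subband auto-correlation) is given below using PyWavelets and NumPy. The wavelet choice, the frame-level scoring, and the thresholding left to the caller are assumptions, not the paper's exact SAE definition.

import numpy as np
import pywt  # PyWavelets, assumed available

def teager(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def speech_activity_envelope(frame, wavelet="db4", level=3):
    """Toy SAE-style score for one frame: a 3-level DWT yields four subbands,
    the TEO is applied to each subband's coefficients, and each subband's
    auto-correlation is used to reward periodic (voiced) structure."""
    subbands = pywt.wavedec(np.asarray(frame, dtype=float), wavelet, level=level)
    score = 0.0
    for band in subbands:
        t = teager(band)
        if t.size < 3:
            continue
        acf = np.correlate(t, t, mode="full")[t.size - 1:]   # SSACF, lags >= 0
        score += acf[1:].max() / (acf[0] + 1e-12)            # periodicity measure
    return score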

Keyword:
Voice Activity Detection, Auto-Correlation, Wavelet, Teager Energy