International Journal of Computational Linguistics & Chinese Language Processing
Vol. 12, No. 3, September 2007


Title:
Speaker Identification Method Using Earth Mover’s Distance for CCC Speaker Recognition Evaluation 2006

Author:
Shingo Kuroiwa, Satoru Tsuge, Masahiko Kita, and Fuji Ren

Abstract:
In this paper, we present a non-parametric speaker identification method using Earth Mover’s Distance (EMD), designed for text-independent speaker identification, and its evaluation results for the CCC Speaker Recognition Evaluation 2006, organized by the Chinese Corpus Consortium (CCC) for the 5th International Symposium on Chinese Spoken Language Processing (ISCSLP 2006). EMD-based speaker identification (EMD-IR) was originally designed for a distributed speaker identification system, in which feature vectors are compressed by vector quantization at a terminal and sent to a server that executes the pattern-matching process. In this architecture, speaker models must be trained on quantized data, so we adopted a non-parametric speaker model and EMD. In experiments on a Japanese speech corpus, EMD-IR showed higher robustness to quantized data than the conventional GMM technique. Moreover, it achieved higher accuracy than GMM even when the data were not quantized. Hence, we took on the challenge of the CCC Speaker Recognition Evaluation 2006 using EMD-IR. Since the identification tasks defined in the evaluation were on an open-set basis, we introduced a new speaker verification module. Evaluation results show that EMD-IR achieves a 99.3% Identification Correctness Rate in a closed-channel speaker identification task.

Keyword:
Speaker Identification, Earth Mover’s Distance, Non-Parametric, Vector Quantization, Chinese Speech Corpus
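
As an illustrative aside (not taken from the paper), the sketch below shows how an Earth Mover’s Distance between two vector-quantized speaker representations can be computed by solving the underlying transportation problem. The function name, the Euclidean ground distance, and the toy codebooks are assumptions for the example only.

```python
"""Minimal sketch: EMD between two VQ codebooks, solved as a transportation LP.
All names and toy data are illustrative, not the authors' implementation."""
import numpy as np
from scipy.optimize import linprog


def emd(centroids_a, weights_a, centroids_b, weights_b):
    """EMD between two signatures {(centroid, weight)}; ground distance: Euclidean."""
    m, n = len(weights_a), len(weights_b)
    # Pairwise ground distances between codebook centroids.
    dist = np.linalg.norm(centroids_a[:, None, :] - centroids_b[None, :, :], axis=2)
    c = dist.ravel()                      # objective: total transportation cost

    a_ub = np.zeros((m + n, m * n))
    for i in range(m):                    # flow out of each source cluster <= its weight
        a_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                    # flow into each target cluster <= its weight
        a_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([weights_a, weights_b])

    # Total flow must equal the smaller of the two total weights.
    a_eq = np.ones((1, m * n))
    b_eq = [min(weights_a.sum(), weights_b.sum())]

    res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    flow = res.x
    return float(np.dot(flow, c) / flow.sum())


# Toy usage: two speakers represented by 4-centroid VQ codebooks in 2-D.
rng = np.random.default_rng(0)
spk_a = rng.normal(0.0, 1.0, size=(4, 2))
spk_b = rng.normal(0.5, 1.0, size=(4, 2))
w = np.full(4, 0.25)                      # uniform cluster weights
print(emd(spk_a, w, spk_b, w))
```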


Title:
A Novel Characterization of the Alternative Hypothesis Using Kernel Discriminant Analysis for LLR-Based Speaker Verification

Author:
Yi-Hsiang Chao, Hsin-Min Wang, and Ruei-Chuan Chang

Abstract:
In a log-likelihood ratio (LLR)-based speaker verification system, the alternative hypothesis is usually difficult to characterize a priori, since the model should cover the space of all possible impostors. In this paper, we propose a new LLR measure in an attempt to characterize the alternative hypothesis in a more effective and robust way than conventional methods. This LLR measure can be further formulated as a non-linear discriminant classifier and solved by kernel-based techniques, such as the Kernel Fisher Discriminant (KFD) and Support Vector Machine (SVM). The results of experiments on two speaker verification tasks show that the proposed methods outperform classical LLR-based approaches.

Keyword:
Kernel Fisher Discriminant, Log-likelihood Ratio, Speaker Verification, Support Vector Machine.
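
As a hedged illustration of the general idea (not the authors’ formulation), the sketch below treats the vector of log-likelihoods from a target model and several background models as input to a kernel classifier, so that the alternative hypothesis is learned discriminatively rather than fixed a priori. Model sizes, data, and labels are toy assumptions.

```python
"""Illustrative sketch: score-space vectors from GMMs fed to a kernel SVM."""
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC


def score_vector(frames, target_gmm, background_gmms):
    """Per-utterance vector: average frame log-likelihood under the target model
    and under each background model."""
    scores = [target_gmm.score(frames)]
    scores += [g.score(frames) for g in background_gmms]
    return np.array(scores)


# Toy data: 2-D "cepstral" frames for a target speaker and two background speakers.
rng = np.random.default_rng(1)
target_frames = rng.normal(0.0, 1.0, size=(500, 2))
bg_frames = [rng.normal(mu, 1.0, size=(500, 2)) for mu in (2.0, -2.0)]

target_gmm = GaussianMixture(n_components=4, random_state=0).fit(target_frames)
background_gmms = [GaussianMixture(n_components=4, random_state=0).fit(f) for f in bg_frames]

# Build training vectors: label 1 for genuine trials, 0 for impostor trials.
X, y = [], []
for _ in range(20):
    genuine = rng.normal(0.0, 1.0, size=(200, 2))     # trial from the target speaker
    impostor = rng.normal(2.0, 1.0, size=(200, 2))    # trial from an impostor
    X.append(score_vector(genuine, target_gmm, background_gmms))
    y.append(1)
    X.append(score_vector(impostor, target_gmm, background_gmms))
    y.append(0)

# A kernel SVM plays the role of the non-linear discriminant on the score space.
clf = SVC(kernel="rbf", gamma="scale").fit(np.array(X), y)
print(clf.decision_function(np.array(X[:2])))         # verification scores for two trials
```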


Title:
Integrating Complementary Features from Vocal Source and Vocal Tract for Speaker Identification

Author:
Nengheng Zheng, Tan Lee, Ning Wang and P. C. Ching

Abstract:
This paper describes a speaker identification system that uses complementary acoustic features derived from the vocal source excitation and the vocal tract system. Conventional speaker recognition systems typically adopt cepstral coefficients, e.g., Mel-frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC), as the representative features. These cepstral features aim at characterizing the formant structure of the vocal tract system. This study proposes a new feature set, named the wavelet octave coefficients of residues (WOCOR), to characterize the vocal source excitation signal. WOCOR is derived by wavelet transformation of the linear predictive (LP) residual signal and is capable of capturing the spectro-temporal properties of the vocal source excitation. WOCOR and MFCC contain complementary information for speaker recognition since they characterize two physiologically distinct components of speech production. The complementary contributions of MFCC and WOCOR in speaker identification are investigated. A confidence-measure-based score-level fusion technique is proposed to take full advantage of these two complementary features for speaker identification. Experiments show that an identification system using both MFCC and WOCOR significantly outperforms one using MFCC only. Compared with the identification error rate of 6.8% obtained with the MFCC-based system, an error rate of 4.1% is obtained with the proposed confidence-measure-based integration system.

Keyword:
Speaker Identification, Vocal Source Feature, Vocal Tract Feature, Information Fusion, Confidence Measure
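
The following is a rough, hedged sketch of vocal-source feature extraction in the spirit of the abstract (LP inverse filtering followed by a wavelet decomposition of the residual). The LPC order, wavelet, decomposition depth, and subband measure are illustrative choices, not the values or the exact octave grouping used in the paper.

```python
"""Rough sketch of WOCOR-like vocal-source features; all parameters are assumptions."""
import numpy as np
import librosa
import pywt
from scipy.signal import lfilter


def wocor_like_features(frame, lpc_order=12, wavelet="db4", levels=4):
    # 1. Linear-predictive analysis of the frame.
    a = librosa.lpc(frame.astype(float), order=lpc_order)   # a[0] == 1
    # 2. Inverse filtering gives the LP residual (vocal-source excitation estimate).
    residual = lfilter(a, [1.0], frame)
    # 3. Wavelet decomposition of the residual into octave subbands.
    coeffs = pywt.wavedec(residual, wavelet, level=levels)
    # 4. One energy-like measure per subband (a stand-in for the octave grouping).
    return np.array([np.linalg.norm(c) for c in coeffs])


# Toy usage on a synthetic voiced-like frame (impulse train plus noise, 8 kHz, 30 ms).
frame = np.zeros(240)
frame[::80] = 1.0                              # crude 100 Hz excitation
frame += 0.01 * np.random.default_rng(2).normal(size=240)
print(wocor_like_features(frame))
```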


Title:
Performance of Discriminative HMM Training in Noise

Author:
Jun Du, Peng Liu, Frank K. Soong, Jian-Lai Zhou, and Ren-Hua Wang

Abstract:
In this study, discriminative HMM training and its performance are investigated in both clean and noisy environments. Recognition errors are defined at the string, word, phone, and acoustic levels and treated in a unified framework in discriminative training. Based on an acoustic-level, high-resolution error measure, a discriminative criterion of minimum divergence (MD) is proposed. Using the speaker-independent continuous digit database Aurora2, the recognition performance of recognizers trained with different error measures and different training modes is evaluated under various noise and SNR conditions. Experimental results show that discriminatively trained models perform better than the maximum likelihood (ML) baseline systems. Specifically, in minimum word error (MWE) and MD training, relative error reductions of 13.71% and 17.62%, respectively, are obtained with multi-condition training on Aurora2. Moreover, compared with ML training, MD training becomes more effective as the SNR increases.

Keywords:
Noise Robustness, Minimum Divergence, Minimum Word Error, Discriminative Training
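
The MD criterion described in the abstract rests on an acoustic-level divergence between competing hypotheses. As a rough illustration (not the paper’s exact criterion, which aggregates such terms over state alignments), the snippet below computes the KL divergence between two diagonal-covariance Gaussian state models; all values are toy numbers.

```python
"""Minimal sketch of an acoustic-level dissimilarity between two HMM states
modeled as diagonal-covariance Gaussians (KL divergence)."""
import numpy as np


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) for diagonal covariances."""
    term_log = np.sum(np.log(var_q / var_p))
    term_trace = np.sum(var_p / var_q)
    term_mean = np.sum((mu_q - mu_p) ** 2 / var_q)
    d = mu_p.size
    return 0.5 * (term_log + term_trace + term_mean - d)


# Toy usage: divergence between a "reference" and a "competing" state.
mu_ref, var_ref = np.array([0.0, 1.0]), np.array([1.0, 0.5])
mu_hyp, var_hyp = np.array([0.5, 1.2]), np.array([1.5, 0.5])
print(kl_diag_gaussians(mu_ref, var_ref, mu_hyp, var_hyp))
```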


Title:
Multilingual Spoken Language Corpus Development for Communication Research

Author:
Toshiyuki Takezawa, Genichiro Kikui, Masahide Mizushima, and Eiichiro Sumita

Abstract:
Multilingual spoken language corpora are indispensable for research on spoken language communication, such as speech-to-speech translation. The speech and natural language processing essential to multilingual spoken language research requires a unified structure and annotation, such as tagging. In this study, we describe our experience with multilingual spoken language corpus development at our research institution, focusing in particular on speech recognition and natural language processing for speech translation of travel conversations. An integrated speech and language database, the Spoken Language DataBase (SLDB), was planned and constructed. The Basic Travel Expression Corpus (BTEC) was planned and constructed to cover a wide variety of situations and expressions. BTEC and SLDB are designed to be complementary: BTEC is a collection of Japanese sentences and their translations, while SLDB is a collection of transcriptions of bilingual spoken dialogs. Whereas BTEC covers a wide variety of travel domains, SLDB covers a limited domain, i.e., hotel situations. BTEC contains approximately 588k utterance-style expressions, while SLDB contains about 16k utterances. Machine-aided Dialogs (MAD) was developed as a development corpus, and both BTEC and SLDB can be used to handle MAD-type tasks. Field Experiment Data (FED) was developed as the evaluation corpus. We conducted a field experiment, and analysis of the follow-up questionnaire indicated that roughly half of the subjects felt they could understand and make themselves understood by their partners.

Keyword:
Multilingual Corpus, Spoken Language, Speech Translation, Dialog, Communication.


Title:
Exploiting Pinyin Constraints in Pinyin-to-Character Conversion Task: a Class-Based Maximum Entropy Markov Model Approach

Author:
Jinghui Xiao, Bingquan Liu, and Xiaolong Wang

Abstract:
The Pinyin-to-Character Conversion task is the core process of Chinese pinyin-based input methods. Statistical language modeling techniques, especially n-gram-based models, are usually adopted to solve this task. However, the n-gram model focuses only on the constraints between characters and ignores the pinyin constraints in the input pinyin sequence. This paper improves the performance of a Pinyin-to-Character Conversion system by exploiting these pinyin constraints. The Maximum Entropy Markov Model (MEMM) framework is used to describe both the pinyin constraints and the character constraints, and a Class-based MEMM (C-MEMM) is proposed to address the efficiency problem of MEMM in the Pinyin-to-Character Conversion task. The C-MEMM probability functions are rigorously derived and formulated according to Bayes’ rule and the Markov property, and both the hard-class and soft-class cases are discussed. In the experiments, C-MEMM significantly outperforms the traditional n-gram model by exploiting the pinyin constraints in the Pinyin-to-Character Conversion task. In addition, C-MEMM can exploit the syntactic and semantic information carried by word classes to further improve system performance.

Keyword:
Pinyin-to-Character Conversion, MEMM, Class-Based
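
As an illustrative aside (not the model trained in the paper), the toy sketch below shows the flavor of hard-class decoding for pinyin-to-character conversion: the transition probability is factored through word classes, P(c_i | c_{i-1}, y_i) ≈ P(cls(c_i) | cls(c_{i-1}), y_i) · P(c_i | cls(c_i)). The lexicon, class assignments, and probability tables are invented for the example.

```python
"""Toy sketch of hard-class Viterbi decoding for pinyin-to-character conversion.
All tables below are illustrative assumptions, not trained model parameters."""
import math
from collections import defaultdict

# Hypothetical lexicon: each pinyin syllable maps to candidate characters.
CANDIDATES = {"ma": ["妈", "马", "吗"], "shang": ["上", "商"]}
# Hypothetical hard word classes for the candidate characters.
CLS = {"妈": "NOUN", "马": "NOUN", "吗": "PART", "上": "VERB", "商": "NOUN"}
# P(class_i | class_{i-1}, pinyin_i): a tiny toy transition table with a flat default.
P_CLS = defaultdict(lambda: 0.1, {("NOUN", "VERB", "shang"): 0.6})
# P(character | class): toy emission table with a flat default.
P_CHAR = defaultdict(lambda: 0.2, {"马": 0.6, "上": 0.7})


def decode(pinyin_seq):
    """Viterbi over characters; a path scores the sum of
    log P(cls(c_i) | cls(c_{i-1}), y_i) + log P(c_i | cls(c_i))."""
    beams = [("<s>", 0.0, [])]                  # (previous class, log prob, chars so far)
    for syllable in pinyin_seq:
        new_beams = {}
        for prev_cls, logp, chars in beams:
            for ch in CANDIDATES[syllable]:
                cls = CLS[ch]
                score = (logp + math.log(P_CLS[(prev_cls, cls, syllable)])
                         + math.log(P_CHAR[ch]))
                # Keep only the best-scoring path ending in each class.
                if cls not in new_beams or score > new_beams[cls][1]:
                    new_beams[cls] = (cls, score, chars + [ch])
        beams = list(new_beams.values())
    return max(beams, key=lambda b: b[1])[2]


print("".join(decode(["ma", "shang"])))          # prints "马上" with these toy tables
```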