Author:
Ren-Yuan Lyu, Min-Siong Liang, Yuang-Chin Chiang
Abstract:
The Formosa
speech database (ForSDat) is a multilingual speech corpus collected at Chang
Gung University and sponsored by the National Science Council of Taiwan. The corpus is
expected to cover the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and
Mandarin. This three-year project aims to collect a phonetically abundant
speech corpus of more than 1,800 speakers and hundreds of hours of speech.
Recently, the first version of this corpus, containing speech from 600 speakers of
Taiwanese and Mandarin, was completed and is ready to be released. It contains
about 49 hours of speech and 247,000 utterances.
Keyword:
Phonetic Alphabet, Pronunciation Lexicon, Phonetically Balanced Word, Speech Corpus
Author:
Jhing-Fa Wang, Shun-Chieh Lin, Hsueh-Wei Yang, and Fan-Min Li
Abstract:
The critical issues involved in speech-to-speech translation
are obtaining proper source segments and synthesizing accurate target speech.
Therefore, this article develops a novel multiple-translation spotting method to
deal with these issues efficiently. The term multiple-translation spotting refers to
the task of extracting the target-language synthesis patterns that correspond to a
given set of source-language spotted patterns, within multiple pairs of
speech patterns known to be translations of each other. Based on the extracted
synthesis patterns, the target speech can be properly synthesized by using a
waveform segment concatenation-based synthesis method. Experiments were
conducted on Mandarin and Taiwanese. The results reveal that
the proposed approach can achieve translation understanding rates of 80% and 76%
on average for Mandarin/Taiwanese translation and Taiwanese/Mandarin
translation, respectively.
Keyword:
Multiple-Translation Spotting, Speech-to-Speech Translation
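
A minimal Python sketch of the general idea described above, translation spotting followed by waveform-segment concatenation, is given below. The pattern table, the greedy longest-match spotting rule, and the placeholder waveforms are assumptions made for illustration, not the authors' implementation.

import numpy as np

# Each entry pairs a source-language pattern with a target-language pattern
# and a pre-recorded target waveform segment (placeholder sample arrays here).
PATTERN_TABLE = [
    (("how", "much"), ("gua", "loa", "tsinn"), np.zeros(8000)),
    (("thank", "you"), ("to", "sia"),          np.zeros(6000)),
]

def spot_patterns(source_words):
    """Greedy longest-match spotting of source patterns in the recognized input."""
    matches, i = [], 0
    while i < len(source_words):
        best = None
        for src, tgt, wave in PATTERN_TABLE:
            if tuple(source_words[i:i + len(src)]) == src:
                if best is None or len(src) > len(best[0]):
                    best = (src, tgt, wave)
        if best is not None:
            matches.append(best)
            i += len(best[0])
        else:
            i += 1  # unmatched word: skip it (a real system would back off)
    return matches

def synthesize(matches):
    """Concatenate the waveform segments of the spotted target patterns."""
    if not matches:
        return np.zeros(0)
    return np.concatenate([wave for _, _, wave in matches])

target_speech = synthesize(spot_patterns(["how", "much"]))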
Author:
Jen-Tzung Chien, Meng-Sung Wu, and Hua-Jui Peng
Abstract:
Language modeling plays a critical role for automatic speech
recognition. Typically, n-gram language models suffer from a poor
representation of historical words and from an inability to estimate unseen
parameters when training data are insufficient. In this study, we explore the
application of latent semantic information (LSI) to language modeling and
parameter smoothing. Our approach adopts latent semantic analysis to transform
all words and documents into a common semantic space. The word-to-word,
word-to-document and document-to-document relations are, accordingly, exploited
for language modeling and smoothing. For language modeling, we present a new
representation of historical words based on retrieval of the most relevant
document. We also develop a novel parameter smoothing method, where the language
models of seen and unseen words are estimated by interpolating the k
nearest seen words in the training corpus. The interpolation coefficients are
determined according to the closeness of words in the semantic space. As shown
by experiments, the proposed modeling and smoothing methods can significantly
reduce the perplexity of language models with moderate computational cost.
Keyword:
language modeling, parameter smoothing, speech recognition, latent semantic analysis
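
As a rough sketch of the smoothing idea (under our own simplifying assumptions, not the authors' exact formulation), the snippet below projects words into a latent semantic space via truncated SVD of a word-by-document count matrix and then interpolates an unseen word's probability from its k nearest seen words, weighted by cosine closeness in that space.

import numpy as np

def lsa_word_vectors(word_doc_counts, rank=50):
    """Project words into a latent semantic space via truncated SVD."""
    U, S, _ = np.linalg.svd(word_doc_counts, full_matrices=False)
    r = min(rank, len(S))
    return U[:, :r] * S[:r]          # one row per word

def smoothed_prob(unseen_vec, seen_vecs, seen_probs, k=5):
    """Interpolate the k nearest seen words; weights follow cosine closeness."""
    sims = seen_vecs @ unseen_vec / (
        np.linalg.norm(seen_vecs, axis=1) * np.linalg.norm(unseen_vec) + 1e-12)
    nearest = np.argsort(sims)[-k:]
    weights = np.clip(sims[nearest], 0.0, None)
    if weights.sum() == 0.0:
        weights = np.ones_like(weights)
    weights = weights / weights.sum()
    return float(weights @ seen_probs[nearest])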
Author:
Ze-Jing Chuang and Chung-Hsien Wu
Abstract:
This paper presents an approach to emotion recognition from
speech signals and textual content. In the analysis of speech signals,
thirty-three acoustic features are extracted from the speech input. After
Principal Component Analysis (PCA) is performed, 14 principal components are
selected for discriminative representation. In this representation, each
principal component is a combination of the 33 original acoustic features and
forms a feature subspace. Support Vector Machines (SVMs)
are adopted to classify the emotional states. In text analysis, all emotional
keywords and emotion modification words are manually defined. The emotion
intensity levels of emotional keywords and emotion modification words are
estimated based on a collected emotion corpus. The final emotional state is
determined based on the emotion outputs from the acoustic and textual analyses.
Experimental results show that the emotion recognition accuracy of the
integrated system is better than that of either of the two individual
approaches.
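
A pipeline along the lines described above could be sketched as follows. The dimensions (33 features, 14 components) follow the abstract, while the keyword lexicon, emotion labels, and fusion weight are invented for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Acoustic branch: 33-dimensional feature vectors -> 14 principal components -> SVM.
# (Training data not shown; acoustic_model.fit(X, y) expects X of shape (n, 33).)
acoustic_model = make_pipeline(PCA(n_components=14), SVC(probability=True))

# Textual branch: hypothetical keyword intensities, e.g. estimated from an emotion corpus.
KEYWORD_INTENSITY = {"happy": {"joy": 0.9}, "furious": {"anger": 0.8}}

def text_scores(words, emotions=("joy", "anger", "neutral")):
    """Accumulate keyword intensities and normalize them into a score per emotion."""
    scores = {e: 0.0 for e in emotions}
    for w in words:
        for emo, val in KEYWORD_INTENSITY.get(w, {}).items():
            scores[emo] = scores.get(emo, 0.0) + val
    total = sum(scores.values()) or 1.0
    return {e: v / total for e, v in scores.items()}

def fuse(acoustic_probs, textual_scores, alpha=0.6):
    """Weighted combination of the acoustic and textual emotion outputs."""
    emotions = set(acoustic_probs) | set(textual_scores)
    return {e: alpha * acoustic_probs.get(e, 0.0)
               + (1.0 - alpha) * textual_scores.get(e, 0.0) for e in emotions}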
Author:
Wan-Chen Chen, Ching-Tang Hsieh, and Eugene Lai
Abstract:
This paper presents an effective method for improving the
performance of a speaker identification system. Based on the multiresolution
property of the wavelet transform, the input speech signal is decomposed into
various frequency bands in order not to spread noise distortions over the entire
feature space. To capture the characteristics of the vocal tract, the linear
predictive cepstral coefficients (LPCCs) of each band are calculated.
Furthermore, the cepstral mean normalization technique is applied to all
computed features in order to provide similar parameter statistics in all
acoustic environments. In order to effectively utilize these multiband speech
features, we use feature recombination and likelihood recombination methods to
evaluate the task of text-independent speaker identification. The feature
recombination scheme combines the cepstral coefficients of each band to form a
single feature vector used to train the Gaussian mixture model (GMM). The
likelihood recombination scheme combines the likelihood scores of the
independent GMM for each band. Experimental results show that both proposed
methods achieve better performance than GMMs using full-band LPCCs or mel-frequency
cepstral coefficients (MFCCs) when speaker identification is evaluated in both
clean and noisy environments.
Keyword:
speaker identification, wavelet transform, linear predictive cepstral coefficient (LPCC), mel-frequency cepstral coefficient (MFCC), Gaussian mixture model (GMM).
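
The two recombination schemes can be illustrated compactly with scikit-learn Gaussian mixtures, as in the sketch below. The wavelet decomposition and LPCC extraction are abstracted away: band_feats is assumed to be a list of per-band (frames x dimension) LPCC matrices for one speaker, an assumption made for this example rather than the authors' setup.

import numpy as np
from sklearn.mixture import GaussianMixture

def cepstral_mean_norm(feats):
    """Subtract the per-dimension cepstral mean (rows are frames)."""
    return feats - feats.mean(axis=0, keepdims=True)

def train_feature_recombination(band_feats, n_mix=16):
    """Feature recombination: concatenate band LPCCs and train a single GMM."""
    full = np.hstack([cepstral_mean_norm(f) for f in band_feats])
    return GaussianMixture(n_components=n_mix).fit(full)

def train_likelihood_recombination(band_feats, n_mix=16):
    """Likelihood recombination: train one independent GMM per frequency band."""
    return [GaussianMixture(n_components=n_mix).fit(cepstral_mean_norm(f))
            for f in band_feats]

def score_likelihood_recombination(band_gmms, test_band_feats):
    """Combine per-band average log-likelihoods of a test utterance by summation."""
    return sum(g.score(cepstral_mean_norm(f))
               for g, f in zip(band_gmms, test_band_feats))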
Author:
Yin-Pin Yang
Abstract:
In recent years, the rapid growth of wireless communications
has undoubtedly increased the need for speech recognition techniques. In
wireless environments, the portability of a computationally powerful device can
be realized by distributing data/information and computation resources over
wireless networks. Portability can then evolve through personalization and
humanization to meet people's needs. An innovative distributed speech
recognition (DSR) [ETSI, 1998], [ETSI, 2000] platform, configurable DSR (C-DSR),
is thus proposed here to enable various types of wireless devices to be remotely
configured and to employ sophisticated recognizers on servers operated over
wireless networks. For each recognition task, a configuration file, which
contains information regarding types of services, types of mobile devices,
speaker profiles and recognition environments, is sent from the client side with
each speech utterance. Through this configurability, personalization and
humanization can easily be achieved by allowing both ordinary and advanced
users to take part in designing the speech interaction functions of their
wireless devices.
Keyword:
Distributed speech recognition, configurable, wireless, portable, personalized, humanized.
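
As an illustration only, a per-utterance configuration payload of the kind described in the abstract might look like the following Python sketch. The field names and values are hypothetical and are not taken from the C-DSR platform or from the ETSI DSR standards.

import json

# Hypothetical per-utterance configuration covering the categories named in the
# abstract: service type, device type, speaker profile, and recognition environment.
config = {
    "service_type": "voice_dialing",      # type of service requested
    "device_type": "pda",                 # type of mobile device
    "speaker_profile": {"id": "user_042", "gender": "F", "dialect": "Taiwanese"},
    "environment": "in_car",              # recognition environment
}

def build_request(utterance_bytes, config):
    """Pair one speech utterance with its configuration for transmission to the server."""
    return {"config": json.dumps(config), "audio": utterance_bytes}

request = build_request(b"\x00\x01", config)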