Author:
Cheng-Yuan Lin, Jyh-Shing Roger Jang and Kuan-Ting Chen
Abstract:
Precise phone/syllable boundary labeling
of the utterances in a speech corpus plays an important role in constructing a
corpus-based TTS (text-to-speech) system. However, automatic labeling based on
Viterbi forced alignment does not always produce satisfactory results. Moreover,
a suitable labeling method for one language does not necessarily produce
desirable results for another language. Hence in this paper, we propose a new
procedure for refining the boundaries of utterances in a Mandarin speech corpus.
This procedure employs different sets of acoustic features for four different
phonetic categories. In addition, a new scheme is proposed to deal with the
"periodic voiced + periodic voiced" case, which produced most of the
segmentation errors in our experiment. Several experiments were conducted to
demonstrate the feasibility of the proposed approach.
Keyword:
speech assessment methods phonetic
alphabet, speech corpus, sequential forward selection, k-nearest neighbor rule,
leave-one-out, speaker-adapted model, context-dependent hidden Markov model
(HMM)
Author:
Elizabeth Zeitoun and Ching-Hua Yu
Abstract:
In this paper, we deal with the linguistic
analysis approach adopted in the Formosan Language Corpora, one of the three
main information databases included in the Formosan Language Archive, and the
language processing programs that have been built upon it. We first discuss
problems related to the transcription of different language corpora. We then
deal with annotation rules and standards. We go on to explain the linguistic
identification of clauses, sentences and paragraphs, and the computer programs
used to obtain an alignment of words, glosses and sentences in Chinese and
English. We finally show how we try to cope with analytic inconsistencies
through programming. This paper is a complement to Zeitoun et al. [2003]
in which we provided an overview of the whole architecture of the Formosan
Language Archive.
Keyword:
Formosan languages, Formosan Language
Archive, corpora, linguistic analysis, language processing
Author:
Shu-Chuan Tseng
Abstract:
This paper describes the collection and
processing of a pilot speech corpus annotated in dialogue acts. The Mandarin
Topic-oriented Conversational Corpus (MTCC) consists of annotated transcripts
and sound files of conversations between two familiar persons. Particular
features of spoken Mandarin, such as discourse particles and paralinguistic
sounds, are taken into account in the orthographical transcription. In addition,
the dialogue structure is annotated using an annotation scheme developed for
topic-specific conversations. Using the annotated materials, we present the
results of a preliminary analysis of dialogue structure and dialogue acts.
Related transcription tools and web query applications are also introduced in
this paper.
Keyword:
Taiwan Mandarin, dialogue act, speech corpus
Author:
Hsin-Min Wang, Berlin Chen, Jen-Wei Kuo and Shih-Sian Cheng
Abstract:
The MATBN Mandarin Chinese broadcast news corpus
contains a total of 198 hours of broadcast news from the Public Television
Service Foundation (Taiwan) with corresponding transcripts. The primary purpose
of this collection is to provide training and testing data for continuous speech
recognition evaluation in the broadcast news domain. In this paper, we briefly
introduce the speech corpus and report on some preliminary statistical analysis
and speech recognition evaluation results.
Keyword:
broadcast news, corpus, speech recognition, Mandarin Chinese,
transcription, annotation
Author:
Hsien-Chang Wang, Chung-Hsien Yang, Jhing-Fa Wang,
Chung-Hsien Wu and Jen-Tzung Chien
Abstract:
This paper describes a project that aims to
create a Mandarin speech database for the automobile setting (TAICAR). A group
of researchers from several universities and research institutes in Taiwan have
participated in the project. The goal is to generate a corpus for the
development and testing of various speech-processing techniques. There are six
recording sites in this project. Various words, sentences, and spontaneous
queries uttered in the vehicular navigation setting have been collected in this
project. A preliminary corpus of utterances from 192 speakers was created from
utterances generated in different vehicles. The database contains more than
163,000 files, occupying 16.8 gigabytes of disk space.
Keyword:
TAICAR, in-car speech, speech database, multi-channel
recording, corpus collection and annotation
Abstract:
This paper describes our initial attempt
to design and develop a bilingual reading comprehension corpus (BRCC). RC is a
task that conventionally evaluates the reading ability of an individual. An RC
system can automatically analyze a passage of natural language text and generate
an answer for each question based on information in the passage. The RC task can
be used to drive advancements of natural language processing (NLP) technologies
imparted in automatic RC systems. Furthermore, an RC system presents a novel
paradigm of information search, when compared to the predominant paradigm of
text retrieval in search engines on the Web. Previous works on automatic RC
typically involved English-only language learning materials (Remedia and
CBC4Kids) designed for children/students, which included stories, human-authored
questions, and answer keys. These corpora are important for supporting empirical
evaluation of RC performance. In the present work, we attempted to utilize RC as
a driver for NLP techniques in both English and Chinese. We sought parallel
English and Chinese learning materials and incorporated annotations deemed
relevant to the RC task. We measured the comparative levels of difficulty among
the three corpora by means of the baseline bag-of-words (BOW) approach. Our
results show that the BOW approach achieves better RC performance in BRCC (67%)
when compared to Remedia (29%) and CBC4Kids (63%). This reveals that BRCC has
the highest degree of word overlap between questions and passages among the
three corpora, which artificially simplifies the RC task. This result suggests
that additional effort should be devoted to authoring questions with various
grades of difficulty in order for BRCC to better support RC research across the
English and Chinese languages.
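The baseline bag-of-words (BOW) approach mentioned in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the tokenizer, the stop-word list, the sentence splitter, and the function names (`bag_of_words`, `split_sentences`, `answer_sentence`) are all assumptions introduced here. The sketch picks, as the answer, the passage sentence that shares the most words with the question, which is why a high word overlap between questions and passages makes the task easier.

```python
import re

# Hypothetical stop-word list for illustration; the abstract does not
# specify which words (if any) the authors filtered out.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "was", "did",
              "what", "who", "where", "when", "why", "how"}


def bag_of_words(text):
    """Return the set of lowercase word tokens in text, minus stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def split_sentences(passage):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [p for p in parts if p]


def answer_sentence(passage, question):
    """Return the passage sentence with the largest word overlap with the
    question. On ties, the earliest such sentence is returned."""
    q_bag = bag_of_words(question)
    sentences = split_sentences(passage)
    return max(sentences, key=lambda s: len(q_bag & bag_of_words(s)))


passage = ("Tom went to the market on Monday. "
           "He bought three apples. The apples were red.")
print(answer_sentence(passage, "Where did Tom go on Monday?"))
```

Because the matching is purely lexical and uses no stemming, a question phrased with different words than the passage (for example "buy" versus "bought") gets no credit for the correct sentence, which is the kind of difficulty that lower-overlap corpora expose.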
Keyword:
bilingual, reading comprehension, corpus
Author:
Chang-Shing Lee, Yau-Hwang Kuo, Chia-Hsin Liao and Zhi-Wei
Jian
Abstract:
In order to efficiently manage and use
knowledge, ontology technologies are widely applied to various kinds of domain
knowledge. This paper proposes a Chinese term clustering mechanism for
generating semantic concepts of a news ontology. We utilize the parallel fuzzy
inference mechanism to infer the conceptual resonance strength of a
Chinese term pair. There are four input fuzzy variables, consisting of a
Part-of-Speech (POS) fuzzy variable, Term Vocabulary (TV)
fuzzy variable, Term Association (TA) fuzzy variable, and
Common Term Association (CTA) fuzzy variable, and one output fuzzy
variable, the Conceptual Resonance Strength (CRS), in the
mechanism. In addition, the CKIP tool is used in Chinese natural language
processing tasks, including POS tagging, refining tagging, and stop word
filtering. The fuzzy compatibility relation approach to the semantic
concept clustering is also proposed. Simulation results show that our approach
can effectively cluster Chinese terms to generate the semantic concepts of a
news ontology.
Keyword:
Ontology, Chinese Natural Language Processing, Fuzzy
Inference, Feature Selection, Concept Clustering