International Journal of Computational Linguistics & Chinese Language Processing
                                                                                          Vol. 18, No. 1, March 2013


Title:
Lexical Coverage in Taiwan Mandarin Conversations

Author:
Shu-Chuan Tseng

Abstract:
Information about the lexical capacity of the speakers of a specific language is indispensable for empirical and experimental studies on the human behavior of using speech as a communicative means. Unlike the increasing number of gigantic text- or web-based corpora that have been developed in recent decades, publicly distributed spoken resources, especially conversations, are few in number. This article studies the lexical coverage of a corpus of Taiwan Mandarin conversations recorded in three speaking scenarios. A wordlist based on this corpus has been prepared and provides information about frequency counts of words and parts of speech processed by an automatic system. Manual post-editing of the results was performed to ensure the usability and reliability of the wordlist. Syllable information was derived by automatically converting the Chinese characters to a conventional romanization scheme, followed by manual correction of conversion errors and disambiguation of homographs. As a result, the wordlist contains 405,435 ordinary words and 57,696 instances of discourse particles, markers, fillers, and feedback words. Lexical coverage in Taiwan Mandarin conversation is revealed and is compared with a balanced corpus of texts in terms of words, syllables, and word categories.

Keywords: Taiwan Mandarin, Conversation, Frequency Counts, Lexical Coverage, Discourse Items


Title:
Learning to Find Translations and Transliterations on the Web based on Conditional Random Fields

Author:
Joseph Z. Chang, Jason S. Chang, and Jyh-Shing Roger Jang

Abstract:
In recent years, state-of-the-art cross-linguistic systems have been based on parallel corpora. Nevertheless, it is difficult at times to find translations of a certain technical term or named entity even with a very large parallel corpus. In this paper, we present a new method for learning to find translations on the Web for a given term. In our approach, we use a small set of terms and translations to obtain mixed-code snippets returned by a search engine. We then automatically annotate the data with translation tags, automatically generate features to augment the tagged data, and automatically train a conditional random fields model for identifying translations. At runtime, we obtain mixed-code webpages containing the given term and run the model to extract translations as output. Preliminary experiments and evaluation results show that our method cleanly combines various features, resulting in a system that outperforms previous work.

Keywords:
Machine Translation, Cross-lingual Information Extraction, Wikipedia, Conditional Random Fields


Title:
Machine Translation Approaches and Survey for Indian Languages

Author:
Antony P. J.

Abstract:
The term Machine Translation is a standard name for computerized systems responsible for the production of translations from one natural language into another, with or without human assistance. It is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. Many attempts are being made all over the world to develop machine translation systems for various languages using rule-based as well as statistical approaches. Development of a full-fledged bilingual machine translation (MT) system for any two natural languages with limited electronic resources and tools is a challenging and demanding task. In order to achieve reasonable translation quality in open-domain tasks, corpus-based machine translation approaches require large amounts of parallel corpora that are not always available, especially for less resourced language pairs. On the other hand, the rule-based machine translation process is extremely time consuming and difficult, and it fails to accurately analyze large corpora of unrestricted text. Even though there have been efforts towards building English-to-Indian-language and Indian-language-to-Indian-language translation systems, we unfortunately do not have an efficient translation system as of today. The literature shows that there have been many attempts at MT for English to Indian languages and Indian languages to Indian languages. At present, a number of government and private sector projects are working towards developing a full-fledged MT system for Indian languages. This paper gives a brief description of the various approaches and major machine translation developments in India.

Keywords:
Corpus, Computational Linguistics, Statistical Approach, Interlingua Approach, Dravidian Languages


Title:
Emotion Co-referencing - Emotional Expression, Holder, and Topic

Author:
Dipankar Das, and Sivaji Bandyopadhyay

Abstract:
The present approach aims to identify the emotional expression, holder, topic, and their co-reference from Bengali blog sentences. Two techniques are employed: one is a rule-based baseline system and the other is a supervised system that uses different syntactic, semantic, rhetorical, and overlapping features. Different error cases have been resolved using rule-based post-processing techniques. The evaluative vectors containing emotional expressions, holders, and topics are prepared from annotated blog posts as well as from system-generated output. The evaluation metric, Krippendorff's α, achieves agreement scores of 0.53 and 0.67 for the baseline and supervised co-reference classification systems, respectively.

Keywords:
Emotional Expression, Holder, Topic, Co-reference Agreement
