Author:
Qian-Xiang Lin, Chia-Hui Chang, and Chen-Ling Chen,
Abstract:
In many Chinese text processing tasks, Chinese word segmentation is a vital and required step. Various methods based on machine learning algorithms have been proposed to address this problem in previous studies. To achieve high performance, many studies have combined external resources with various machine learning algorithms to aid segmentation. The goal of this paper is to construct a simple and effective Chinese word segmentation tool without external resources, that is, a closed test for Chinese word segmentation. We use the training data to construct a vocabulary and combine maximum matching segmentation results with sequence labeling methods, including the hidden Markov model (HMM) and conditional random fields (CRF). The main idea is to provide the machine learning algorithm with ambiguity information via forward and backward maximum matching, as well as unknown word information via vocabulary masking. The experimental results show that maximum matching and vocabulary masking can significantly improve the performance of HMM segmentation (F-measure: 0.812 → 0.948 → 0.953). Meanwhile, combining maximum matching with CRF achieves an F-measure of 0.953, which improves to 0.963 with vocabulary masking.
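The ambiguity signal mentioned above comes from comparing forward and backward maximum matching: when the two greedy passes disagree, the span is ambiguous. The following is a minimal sketch of both passes; the tiny vocabulary and the four-character example are invented for illustration and are not the authors' training-data lexicon.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedily match the longest vocabulary word from left to right."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + j]
            if j == 1 or word in vocab:  # fall back to a single character
                tokens.append(word)
                i += j
                break
    return tokens

def backward_max_match(text, vocab, max_len=4):
    """Greedily match the longest vocabulary word from right to left."""
    tokens, i = [], len(text)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            word = text[i - j:i]
            if j == 1 or word in vocab:
                tokens.insert(0, word)
                i -= j
                break
    return tokens

vocab = {"研究", "研究生", "生命", "命"}
print(forward_max_match("研究生命", vocab))   # ['研究生', '命']
print(backward_max_match("研究生命", vocab))  # ['研究', '生命']
```

The disagreement between the two passes on this classic example is exactly the kind of ambiguity information that can be handed to the HMM or CRF as an extra feature.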
Keywords:
Chinese Word Segmentation, Maximal Matching, Hidden Markov Model, Conditional Random Field, Vocabulary Masking
Author:
Liang-Chih Yu, Chung-Hsien Wu, and Jui-Feng Yeh
Abstract:
Word sense disambiguation (WSD) is a technique used to identify the correct sense of polysemous words, and it is useful for many applications, such as machine translation (MT), lexical substitution, information retrieval (IR), and biomedical applications. In this paper, we propose the use of multiple contextual features, including the predicate-argument structure and named entities, to train two commonly used classifiers, Naïve Bayes (NB) and Maximum Entropy (ME), for word sense disambiguation. Experiments are conducted to evaluate the classifiers' performance on the OntoNotes corpus and are compared with classifiers trained using a set of baseline features, such as the bag-of-words, n-grams, and part-of-speech (POS) tags. Experimental results show that incorporating both predicate-argument structure and named entities yields higher classification accuracy for both classifiers than does the use of the baseline features, resulting in accuracy as high as 81.6% and 87.4%, respectively, for NB and ME.
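To make the classification setup concrete, here is a toy Naïve Bayes disambiguator over bag-of-words context features with add-one smoothing. The senses and training contexts for "bank" are invented for demonstration; the paper's richer features (predicate-argument structure, named entities) would simply be additional entries in each feature list.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (sense, context_words). Returns priors, counts, vocab."""
    priors, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for sense, words in samples:
        priors[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def classify_nb(words, priors, word_counts, vocab):
    """Pick the sense maximizing log P(sense) + sum log P(word|sense)."""
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for sense in priors:
        lp = math.log(priors[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in words:
            lp += math.log((word_counts[sense][w] + 1) / denom)  # add-one smoothing
        if lp > best_lp:
            best, best_lp = sense, lp
    return best

samples = [
    ("finance", ["money", "deposit", "loan"]),
    ("finance", ["account", "loan", "interest"]),
    ("river", ["water", "shore", "fishing"]),
    ("river", ["erosion", "water", "mud"]),
]
model = train_nb(samples)
print(classify_nb(["loan", "account"], *model))  # finance
```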
Keywords:
Word Sense Disambiguation, Predicate-Argument Structure, Named Entity, Natural Language Processing.
Author:
Wei-Ti Kuo, Chao-Shainn Huang, and Chao-Lin Liu
Abstract:
We investigated the problem of classifying short essays used in comprehension tests for senior high school students in Taiwan. The tests were for first and second year students, so the answers included only four categories, each for one semester of the first two years. A random-guess approach would achieve only 25% in accuracy for our problem. We analyzed three publicly available scores for readability, but did not find them directly applicable. By considering a wide array of features at the levels of word, sentence, and essay, we gradually improved the F measure achieved by our classifiers from 0.381 to 0.536.
Keywords:
Computer-assisted Language Learning, Readability Analysis, Document Classification, Short Essays for Reading Comprehension.
Author:
An-Ta Huang, Tsung-Ting Kuo, Ying-Chun Lai, and Shou-De Lin
Abstract:
This paper describes a framework that extracts effective correction rules from a sentence-aligned corpus and shows a practical application: auto-editing using the discovered rules. The framework exploits the methodology of finding the Levenshtein distance between sentences to identify the key parts of the rules and uses the editing corpus to filter, condense, and refine the rules. We have produced rule candidates of the form A → B, where A stands for the erroneous pattern and B for the correct pattern.
The developed framework is language independent; therefore, it can be applied to other languages. The evaluation of the discovered rules reveals that 67.2% of the top 1500 ranked rules are annotated as correct or mostly correct by experts. Based on the rules, we have developed an online auto-editing system for demonstration at http://ppt.cc/02yY.
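The core of rule candidate extraction is aligning an erroneous sentence with its corrected counterpart and keeping only the differing span. The sketch below uses a simplified common prefix/suffix trim as a stand-in for full Levenshtein alignment; the example sentence pair is invented, and the paper's framework additionally filters and refines such candidates over a large editing corpus.

```python
def diff_pattern(src, tgt):
    """Trim the common prefix and suffix of two token lists and return
    the differing spans (A, B) as a candidate correction rule A -> B."""
    i = 0
    while i < min(len(src), len(tgt)) and src[i] == tgt[i]:
        i += 1
    j = 0
    while j < min(len(src), len(tgt)) - i and src[len(src) - 1 - j] == tgt[len(tgt) - 1 - j]:
        j += 1
    return src[i:len(src) - j], tgt[i:len(tgt) - j]

src = "he go to school yesterday".split()
tgt = "he went to school yesterday".split()
print(diff_pattern(src, tgt))  # (['go'], ['went'])
```

Because the procedure operates on token sequences rather than language-specific features, it reflects the language independence claimed above.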
Keywords:
Edit Distance, Erroneous Pattern, Correction Rules, Auto Editing
Author:
Kuang-hua Chen
Abstract:
The Internet has become a major channel for academic information dissemination in recent years. As a matter of fact, academic information, e.g., "call for papers", "call for proposals", "advances of research", etc., is crucial for researchers, since they have to publish research outputs and capture new research trends. This study focuses on the extraction of academic conference information, including topics, temporal information, spatial information, etc., with the aim of reducing the overhead of searching and managing conference information for researchers and improving the efficiency of publishing research outputs. An automatic procedure for conference information retrieval and extraction is proposed first, and a sequence of experiments is carried out. The experimental results show the feasibility of the proposed procedure: the F1 measure for text classification is over 80%, and the F1 measure and recall for extraction of named entities are over 86% and 70%, respectively. A system platform for academic conference information retrieval and extraction is implemented to demonstrate its practicality. This system features document retrieval, named entity extraction, faceted browsing, and a calendar that fuses academic activities with researchers' daily life.
Keywords:
Academic Information, Information Extraction, Information Retrieval, Named Entities