Author:
Yunming Ye, Hongbo Li, Xiaobai Deng, and Joshua Zhexue Huang
Abstract:
Search interface detection is an essential task for extracting information from the hidden Web. The challenge in this task is that search interface data are represented by high-dimensional, sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to this problem. In this approach, we extend the random forest algorithm with a weighted feature selection method to build the individual classifiers. With this improved random forest algorithm (IRFA), each classifier is learned from a weighted subset of the feature space, so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We have compared our ensemble approach with other well-known classification algorithms, namely SVM, C4.5, Naïve Bayes, and the original random forest algorithm (RFA). The experimental results show that our method is more effective in detecting search interfaces of the hidden Web.
Keywords:
Search Interface Detection, Random Forest, Hidden Web, Form Classification
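Illustrative Example:
The following minimal sketch shows the weighted feature-subspace idea behind an IRFA-style ensemble: each tree is grown on a bootstrap sample and on a feature subset drawn with probability proportional to a per-feature relevance score. The chi-square scorer, the subspace size, and all function names are illustrative assumptions, not the paper's actual design.

# Hedged sketch only: chi-square feature weights are an assumption; the
# paper's IRFA weighting scheme may differ. Assumes non-negative features X
# (e.g., term frequencies) and integer class labels y as numpy arrays.
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.tree import DecisionTreeClassifier

def weighted_subspace_forest(X, y, n_trees=100, subspace_size=50, seed=0):
    rng = np.random.default_rng(seed)
    scores, _ = chi2(X, y)                  # per-feature relevance scores
    scores = np.nan_to_num(scores)          # constant features score 0
    p = scores / scores.sum()               # feature sampling probabilities
    trees, subspaces = [], []
    n_samples, n_features = X.shape
    for _ in range(n_trees):
        # draw a weighted feature subspace and a bootstrap sample
        feats = rng.choice(n_features, size=subspace_size, replace=False, p=p)
        rows = rng.integers(0, n_samples, size=n_samples)
        tree = DecisionTreeClassifier().fit(X[rows][:, feats], y[rows])
        trees.append(tree)
        subspaces.append(feats)
    return trees, subspaces

def forest_predict(trees, subspaces, X):
    votes = np.stack([t.predict(X[:, f]) for t, f in zip(trees, subspaces)])
    # majority vote across trees (assumes integer labels)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)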
Author:
Liang-Chih Yu, Chung-Hsien Wu, Jui-Feng Yeh, and Eduard Hovy
Abstract:
Word sense annotated corpora are useful resources for many text mining applications. Such corpora are only useful if their annotations are consistent. Most large-scale annotation efforts take special measures to reconcile inter-annotator disagreement. To date, however, nobody has investigated how to automatically determine exemplars in which the annotators agree but are wrong. In this paper, we use OntoNotes, a large-scale corpus of semantic annotations, including word senses, predicate-argument structure, ontology linking, and coreference. To detect mistaken agreements in word sense annotation, we employ word sense disambiguation (WSD) to select a set of suspicious candidates for human evaluation. Experiments are conducted along three dimensions (precision, cost-effectiveness ratio, and entropy) to examine the performance of WSD. The experimental results show that WSD is most effective at identifying erroneous annotations for highly ambiguous words, while a baseline is better for other cases. The two methods can be combined to improve the cleanup process. This procedure allows us to find approximately 2% of the remaining erroneous agreements in the OntoNotes corpus. A similar procedure can easily be defined to check other annotated corpora.
Keywords:
Corpus Cleanup, Word Sense Disambiguation, Semantic Analysis, Entropy
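Illustrative Example:
A minimal sketch of the cleanup idea: instances whose gold sense disagrees with a WSD model's prediction are flagged, restricted to words whose sense distribution has high entropy (the regime where the abstract reports WSD to be most effective). The data layout, the entropy threshold, and the wsd_predict callable are stand-ins, not OntoNotes' actual format or the paper's actual components.

# Hedged sketch, not the paper's implementation.
import math
from collections import Counter

def sense_entropy(sense_labels):
    """Shannon entropy (bits) of a word's observed sense distribution."""
    counts = Counter(sense_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def suspicious_candidates(instances, wsd_predict, entropy_threshold=1.0):
    """instances: list of (word, context, annotated_sense) tuples.
    wsd_predict: assumed callable (word, context) -> predicted sense."""
    by_word = {}
    for word, _, sense in instances:
        by_word.setdefault(word, []).append(sense)
    flagged = []
    for word, context, sense in instances:
        # only examine highly ambiguous words (high sense entropy)
        if sense_entropy(by_word[word]) >= entropy_threshold:
            predicted = wsd_predict(word, context)
            if predicted != sense:
                flagged.append((word, context, sense, predicted))
    return flagged  # candidates for human re-evaluation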
Author:
Cheng-Zen Yang, Ing-Xiang Chen, Cheng-Tse Hung, and Ping-Jung Wu
Abstract:
In recent years, the hierarchical taxonomy integration problem has received considerable attention in many research studies. Many types of implicit information embedded in the source taxonomy have been explored to improve integration performance. The semantic information embedded in the source taxonomy, however, has not been discussed in previous research. In this paper, an enhanced integration approach called SFE (Semantic Feature Expansion) is proposed to exploit the semantic information of category-specific terms. Experiments on two hierarchical Web taxonomies show that integration performance can be further improved with the SFE scheme.
Keywords:
Hierarchical Taxonomy Integration, Semantic Feature Expansion, Category-Specific Terms, Hierarchical Thesauri Information
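Illustrative Example:
A rough sketch of the semantic feature expansion idea: a document's term vector is enriched with thesaurus neighbours of its category-specific terms before classification against the target taxonomy. WordNet stands in here for the hierarchical thesaurus, and the damping weight is an arbitrary assumption; the paper's SFE scheme and term-selection criteria may differ.

# Hedged sketch only; requires nltk with the WordNet data installed.
from collections import Counter
from nltk.corpus import wordnet as wn

def expand_features(tokens, category_terms, weight=0.5):
    """Return a term-frequency vector expanded with synonyms of
    category-specific terms; expanded terms get a damped weight."""
    vec = Counter(tokens)
    for term in tokens:
        if term not in category_terms:
            continue
        for syn in wn.synsets(term):
            for lemma in syn.lemma_names():
                if lemma != term:
                    # add the semantically related term with reduced weight
                    vec[lemma.replace('_', ' ')] += weight
    return vec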
Author:
Jen-Liang Chou, and Shih-Hung Wu
Abstract:
Wikipedia is the world's largest collaboratively edited source of encyclopedic knowledge. Wikibook is a sub-project of Wikipedia that is intended to create books that can be edited by various contributors, similar to how Wikipedia is composed and edited. Editing a book, however, requires more effort than editing separate articles. Therefore, how to quickly prototype a book is a new research issue. In this paper, we investigate how to automatically extract content from Wikipedia and generate a prototype of a Wikibook as a starting point for further editing. Applying search technology, our system retrieves relevant articles from Wikipedia, and a table of contents is built automatically based on a two-stage search method. Our experiments show that, given a keyword as the title of a book, our system can generate a table of contents, which can be treated as a prototype of a Wikibook. Such a system can facilitate free textbook editing. We propose an evaluation method based on comparing system results to a traditional textbook, and we report the coverage our system achieves.
Keywords:
Wikipedia, Wikibook, Table of Contents Generation
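Illustrative Example:
A minimal sketch of a two-stage search for table-of-contents prototyping: stage one retrieves candidate chapter titles for the book keyword, and stage two re-searches each chapter title to propose its sections. The search callable is a placeholder for a real Wikipedia retrieval component; the paper's actual two-stage method may differ.

# Hedged sketch, not the paper's system.
from typing import Callable, Dict, List

def build_toc(title: str,
              search: Callable[[str], List[str]],
              n_chapters: int = 10,
              n_sections: int = 5) -> Dict[str, List[str]]:
    """Two-stage TOC prototype: chapters from a search on the book title,
    sections from a second search on each chapter title."""
    chapters = search(title)[:n_chapters]              # stage 1
    toc = {}
    for chapter in chapters:
        # stage 2: re-search with the chapter title to find sub-topics
        sections = [s for s in search(f"{title} {chapter}")
                    if s != chapter][:n_sections]
        toc[chapter] = sections
    return toc  # {chapter title: [section titles]}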