Workshop on Text Mining Techniques


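The digital information held on the web, by enterprises and organizations, and even by individuals is growing rapidly, and text mining is an important field of technology for extracting, organizing, and using that information. Google's recently launched paper search engine, Google Scholar, effectively extracts and organizes the papers and citation information scattered across the web, giving paper search an entirely new look. Few academic events in Taiwan have so far been devoted to text mining techniques, so this workshop invites scholars and experts from Taiwan to discuss this important topic. Anyone interested in this research topic is welcome to attend.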

Date: Tuesday, December 28, 2004 (ROC year 93)

Venue: Institute of Information Science, Academia Sinica

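Organizers: Institute of Information Science, Academia Sinica; The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)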

Agenda: Registration form

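Time	Topic	Speaker
09:00 ~ 09:25	Registration
09:25 ~ 09:30	Opening remarks	簡立峰 (Institute of Information Science, Academia Sinica)
09:30 ~ 10:20	Information Relevance and Novelty Detection and Tracking	Prof. 陳信希 (Dept. of Computer Science and Information Engineering, National Taiwan University)
10:20 ~ 10:40	Break
10:40 ~ 11:20	Clustering and classifying patterns in text collections	Dr. 鄭卜壬 (Institute of Information Science, Academia Sinica)
11:20 ~ 12:00	Information Extraction Techniques for Text Mining	Dr. 王正豪 (Institute of Information Science, Academia Sinica)
12:00 ~ 13:30	Lunch
13:30 ~ 14:20	A Unified Framework for Large Vocabulary Speech Recognition of Mutually Unintelligible Chinese "Regionalects"	Dr. 許鈞南 (Institute of Information Science, Academia Sinica)
14:20 ~ 15:10	From Text Mining to Audio Mining	Dr. 王新民 (Institute of Information Science, Academia Sinica)
15:10 ~ 15:30	Break
15:30 ~ 16:20	Personalized Document Clustering: A Collaborative-Filtering-Based Approach	Prof. 魏志平 (Dept. of Information Management, National Sun Yat-sen University)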

Information Relevance and Novelty Detection and Tracking
陳信希

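The information explosion is one of the most challenging issues of the new information age, and retrieving relevant information from large data collections is an indispensable step in many applications. In a typical information retrieval system the smallest unit of processing is the document: retrieval returns only the documents that satisfy the user's information need, and because the relevant sentences are not marked, the user must read each document in full to find the relevant information. Traditional information retrieval systems also cannot tell which sentence (or sentences) contributes new information, so the user has to browse the entire passage to uncover what is new. Filtering out redundant information and marking new information has become the basis of several important applications, such as automatic summarization, question answering, and bioinformatics text mining, which makes sentence-level relevance and novelty analysis increasingly important. This talk examines the techniques and applications of information relevance and novelty detection and tracking at two levels, the document and the sentence.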

Clustering and classifying patterns in text collections

鄭卜壬

Several research areas are already engaged in extracting useful patterns from text collections. The extracted patterns generally provide clues for tasks such as document summarization and question answering. This talk turns to how such patterns can be clustered and classified automatically, which makes it possible to discover interesting knowledge. For example, one may create thematic overviews of the text collections or generate category metadata for the patterns of interest. The talk will outline the ideas, survey related work, and present applications that benefit from these techniques.

Information Extraction Techniques for Text Mining
王正豪

As the amount of data in various forms keeps growing, it is critical to extract relevant knowledge from data collections. In particular, text mining focuses on the automatic discovery, extraction, and organization of knowledge embedded in textual documents. To extract structural and semantic information from unstructured texts, several information extraction approaches have been proposed. This talk presents a survey of these information extraction techniques for text mining.

A Unified Framework for Large Vocabulary Speech Recognition of Mutually Unintelligible Chinese "Regionalects"
許鈞南

This talk presents a new approach to recognizing speech in mutually unintelligible spoken Chinese "regionalects", based on a unified three-layer framework and a one-stage search strategy. Unlike traditional approaches, the new approach avoids searching for intermediate, locally optimal syllable sequences or lattices. Instead, by using Hanzi as the search nodes, it searches directly for the globally optimal character sequences. The talk reports experiments on two regionalects widely used in Taiwan, namely Holo Taiwanese and Mandarin. The results show that the unified framework deals efficiently with the multiple pronunciations found in the spoken regionalects. Compared with the traditional two-stage scheme, the new approach achieves a character error reduction rate of 34.1%. Furthermore, the new approach is shown to be more robust when dealing with a poorly uttered speech database.

From Text Mining to Audio Mining
王新民

The field of data mining has been growing at an amazing rate in the past ten years. Current trends in data mining include mining unstructured data such as text, audio, and video, mining data in distributed and heterogeneous databases, mining the World-Wide Web (WWW), etc. This talk will first give a brief introduction to multimedia data mining, and then survey some of the recent research in speech mining and music mining.

Personalized Document Clustering: A Collaborative-Filtering-Based Approach

魏志平

To manage the ever-increasing volume of documents, individuals and organizations frequently organize their documents into categories that facilitate document management and subsequent information access and browsing. However, document clustering is an intentional act that reflects individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, an effective document-clustering technique must consider individual preferences and needs in order to support personalization in document categorization. In this study, we design and implement a collaborative-filtering-based document-clustering (CFC) technique that incorporates an individual’s and his/her neighbors’ partial clusterings to support personalized document clustering. The empirical evaluation results suggest that using an individual’s partial clustering achieves a better personalized clustering result than the content-based document-clustering technique does. Moreover, using the collaborative-filtering approach to expand an individual’s partial clustering can further improve personalized clustering, as measured by cluster recall and precision.
