International Journal of Computational Linguistics & Chinese Language Processing
Vol. 2, No. 1, February 1997


Title:
Computational Tools and Resources for Linguistic Studies

Author:
Yu-Ling Una Hsu, Jing-Shin Chang and Keh-Yih Su

Abstract:
This paper presents several computational tools and available resources that facilitate linguistic studies. For each tool, we demonstrate why it is useful and how it can be used for research, with linguistic examples given for illustration. First, a search tool, Key Word in Context (KWIC), is introduced. This tool can automatically extract linguistically significant patterns from large corpora and help linguists discover syntagmatic generalizations. Second, Dynamic Clustering and Hierarchical Clustering are introduced for identifying natural clusters of words or phrases according to their distribution. Third, statistical measures of the degree of cohesion and correlation among linguistic units are presented. These measures can help linguists identify the boundaries of lexical units. Fourth, tools for aligning parallel texts at the word, sentence and structure levels are presented for linguists who do comparative studies of different languages. Fifth, we introduce Sequential Forward Selection (SFS) and Classification and Regression Tree (CART) for automatic rule ordering. Finally, some available electronic Chinese resources are described as a reference for interested readers.

Keyword:
extraction, clustering, cohesion, alignment, Chinese corpora, electronic dictionary
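
As an illustration of the KWIC idea summarized in the abstract above, the following is a minimal sketch and not the authors' tool. The function name `kwic`, the `width` parameter and the sample sentence are all assumptions made for this example, and the tokenizer assumes whitespace-delimited text (Chinese text would first need word segmentation).

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    Hypothetical helper: tokens are lowercase word runs, and `width`
    is the number of tokens kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Made-up sample text, used only to show the concordance layout.
sample = ("The tool can extract patterns. "
          "A linguist can discover patterns with the tool.")

for left, key, right in kwic(sample, "patterns", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} | {key} | {right}")
```

Aligning the keyword in a fixed column is what lets recurring left and right contexts stand out when the hit list is sorted or scanned by eye.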


Title:
Measuring Relationship among Dialects: DOC and Related Resources

Author:
Chin-Chuan Cheng

Abstract:
This paper synthesizes past studies on the measurement of dialect relationships. The phonological data of 17 Chinese dialects that were computerized in the late 1960s have been used to measure dialect distance. In addition, a file of over 6,400 lexical variants in 18 dialects was used to quantify dialect affinity. The paper first explains the nature, organization, and coding of these files. A series of steps then illustrates how the phonological file was processed to derive the information needed to calculate correlation coefficients. The coefficients are treated as indices of dialect affinity. The dialects are then grouped by applying the average linkage method of cluster analysis to the coefficients. The appropriateness of the correlation method for the data is then discussed. Recent work on the calculation of dialect mutual intelligibility is presented to indicate the future direction of research.

Keyword:
Chinese dialects, measurements of affinity, measurements of mutual intelligibility, comparative dialectology
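
The pipeline summarized in the abstract above (correlation coefficients as affinity indices, followed by average linkage clustering) can be sketched roughly as follows. This is a simplified illustration, not the paper's procedure: the dialect names and feature counts are invented toy values, and `pearson` and `average_linkage` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles.

    Assumes neither profile is constant (otherwise the denominator is 0).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Agglomerative clustering on correlation used as a similarity.

    The similarity of two clusters is the mean correlation over all
    cross-cluster pairs; the most similar pair is merged at each step.
    Returns the list of merges as (cluster, cluster, similarity).
    """
    sim = {frozenset((a, b)): pearson(profiles[a], profiles[b])
           for a, b in combinations(profiles, 2)}

    def avg(c1, c2):
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg(*p))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Toy feature counts for three hypothetical dialects (not real data).
toy = {
    "dialect_A": [1, 2, 3, 4],
    "dialect_B": [2, 4, 6, 9],
    "dialect_C": [4, 3, 2, 1],
}

for left, right, score in average_linkage(toy):
    print(left, "+", right, f"(mean r = {score:.3f})")
```

Reading the merges in order gives the grouping from the closest pair outward, which is the information a dendrogram of dialect affinity would display.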


Title:
MAT -- A Project to Collect Mandarin Speech Data Through Telephone Networks in Taiwan

Author:
Hsiao-Chuan Wang

Abstract:
A cooperative project called Polyphone was initiated by the Coordinating Committee on Speech Databases and Speech I/O Systems Assessment (COCOSDA) in 1992. Accordingly, a project to collect Mandarin speech data across Taiwan (MAT) was conducted by a group of researchers from several universities and research organizations in Taiwan. The purpose was to build a speech corpus for the development of Mandarin-based speech technology and products. The speech data were collected at eight recording stations through telephone networks. The speakers were chosen to reflect the distribution of gender, dialect, educational level, and residence in the population of Taiwan. A preliminary Mandarin speech database of 800 speakers has been produced. The final goal is a speech database of at least 5000 speakers.

Keyword:
Mandarin speech, Speech database, Speech I/O systems assessment, Telephone network


Title:
A Synchronous Chinese Language Corpus from Different Speech Communities: Construction and Applications

Author:
Benjamin K. T'sou, Hing-Lung Lin, Godfrey Liu, Terence Chan, Jerome Hu, Ching-hai Chew, and John K.P. Tse

Abstract:
Like English, Spanish and Arabic, Chinese is used by a large number of speakers in distinct speech communities which, despite sharing a common language, vary in interesting ways. A systematic study of such linguistic variation is invaluable for appreciating the diversity and richness of the underlying cultures. This paper describes Project LIVAC (Linguistic Variation in Chinese Communities), which focuses on the development of a Chinese corpus based on data taken concurrently at regular intervals from multiple Chinese speech communities. The resulting database and computerized concordance, drawn from a corpus of approximately 20 million words with uniform time reference points extending across two years, enable linguists and social scientists to undertake meaningful qualitative and quantitative comparative analysis of the development of linguistic and cultural variation. To facilitate these studies, a framework for integrating the corpus with specific corpus analysis applications is proposed. Based on this framework, a prototype retrieval system that supports longitudinal studies of word and concept distribution, as well as lexical and other linguistic variation, has been designed and implemented.


Title:
A Survey of Full-text Data Bases and Related Techniques for Chinese Ancient Documents in Academia Sinica

Author:
Hsieh Ching-Chun, Lin Shih

Abstract:
This paper surveys the full-text data bases and related text processing techniques for Chinese ancient documents developed at Academia Sinica over the past 12 years. Five institutes (namely the Institute of History and Philology, the Institute of Taiwan History, the Institute of Literature and Philosophy, the Institute of Information Science and the Institute of Modern History) and the Computing Center of Academia Sinica have actively participated in this long-range project since 1984. In addition, the Archival Library of National History participated in developing the database of the Ching Dynasty. Since 1995, collaboration projects with other universities, such as London University in England, Stanford University and Michigan University in the USA, Chinese University in Hong Kong, and Chung-Cheng University, Chung-San University and National Taiwan Normal University in Taiwan, have been launched to produce more digital texts. The on-line full-text data bases now total over 115 million characters, and data bases of more than 80 million further characters are forthcoming. This report also surveys some important techniques that were developed, including the structure of the full-text database, the handling of missing characters, the management of data entry jobs, and the development of a markup system. Finally, the status of some ongoing related research projects is summarized as a future perspective on the development of digital Chinese ancient documents.

Keyword:
Full-text Database, Markup, Full-text Search, HTML, CTP, Font Database, Content Index


Title:
Historical Corpora for Synchronic and Diachronic Linguistics Studies

Author:
Pei-chuan Wei, P.M. Thompson, Cheng-hui Liu, Chu-Ren Huang, Chaofen Sun

Abstract:
The Academia Sinica Ancient Chinese Corpus is designed for linguistic research. The corpus contains ancient texts selected for their usefulness in grammatical and lexical studies, together with an inspection program that offers keyword searching, statistics, and collocation functions. The corpus is divided into three subcorpora according to stages of grammatical development, so that both synchronic and diachronic studies can be performed on them. Their current sizes are as follows: a. Old Chinese subcorpus (from pre-Qin to Pre-Han): 5,128,068 characters. b. Middle Chinese subcorpus (from Late Han to the Six Dynasties): 8,101,662 characters. c. Early Mandarin Chinese subcorpus (from Tang to Ching): 4,406,381 characters. A large portion of the texts in the Old Chinese subcorpus (4,497,051 characters) has been textually classified and marked up according to source book, author, text genre, etc. A substantial part (520,794 characters) of the same subcorpus has also been segmented into words, which have in turn been tagged with parts of speech. The results of these two tasks form the basis of our Old Chinese Lexical Database.

Keyword:
corpus, lexical database, part-of-speech, mark-up, tagging, Old Chinese, Middle Chinese, Early Mandarin Chinese