ACLCLP

Application for Use of Sinica Chinese Core Vocabulary (version 1.0)

Sinica Chinese Core Vocabulary (version 1.0) consists of 1,121 Chinese words derived from the intersection of the 2,000 most frequently used words in the Sinica Balanced Corpus and the 2,000 most frequently used words in the Taiwan Mandarin Conversational Corpus. The Sinica Balanced Corpus contains mainly Chinese texts, approximately 4.7 million Chinese words after some minor modifications to the original data, whereas the Taiwan Mandarin Conversational Corpus contains free conversations and task- and topic-oriented dialogues, approximately 500K transcribed Chinese words.
Sinica Chinese Core Vocabulary was produced from the “Word List with Accumulated Word Frequency in Sinica Balanced Corpus 3.0”, released by the Chinese Knowledge and Information Processing Group (CKIP) via the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), and the “Chinese Spoken Wordlist”, released by Dr. Shu-Chuan Tseng. Words were segmented and POS-tagged by the CKIP automatic word segmentation and tagging system.
Sinica Chinese Core Vocabulary brings together the Chinese words most frequently used in both written and spoken language. It covers 57.6% of word tokens in the Sinica Balanced Corpus and 86.1% in the Taiwan Mandarin Conversational Corpus. For each word, it gives the part of speech, the frequency and ranking in both corpora, and the corresponding English gloss with Chinese examples and English translations. All Chinese characters are transcribed in Pinyin. Words written in identical characters but with different POS tags, as well as words that have multiple writing conventions, are regarded as different lexical units.
Users can also find a list of those top 2,000 words of the Sinica Balanced Corpus that do not appear in the core vocabulary. This list contains 879 words that are frequently used in written language only, covering 13.1% of word tokens in the Sinica Balanced Corpus. Another list contains those top 2,000 words of the Taiwan Mandarin Conversational Corpus that do not appear in the core vocabulary. These 699 conversation-only high-frequency words make up 7.6% of the Taiwan Mandarin Conversational Corpus. Please note that, owing to the scenario settings of the corpus, some proper nouns in the conversational corpus are corpus-specific and should not be regarded as high-frequency words in conversation. For this reason, 180 words were excluded from the final conversation-only list. In addition, a set of 1,235 basic Chinese characters, covering the core, text-only, and conversation-only vocabulary lists, is derived from these three word lists.
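The relationship between the three word lists can be illustrated with a minimal Python sketch. It is not the procedure actually used to build the resource: the function name, the (word, POS) pair representation, and the toy entries are assumptions made for illustration only.

```python
def split_core_vocabulary(written_top, spoken_top, excluded=frozenset()):
    """Split two top-N word lists into core, written-only, and
    conversation-only sets (hypothetical helper, for illustration).

    Each entry is a (word, pos) pair, so words written in identical
    characters but tagged with different POS count as different units.
    """
    written, spoken = set(written_top), set(spoken_top)
    core = written & spoken                           # in both lists
    written_only = written - core                     # written list only
    conversation_only = spoken - core - set(excluded) # drop corpus-specific items
    return core, written_only, conversation_only


# Toy entries invented for this example; they are not taken from the lists.
written_top = [("的", "DE"), ("是", "SHI"), ("研究", "Na"), ("研究", "VC")]
spoken_top = [("的", "DE"), ("是", "SHI"), ("對", "VH"), ("研究", "Na"),
              ("SpeakerA", "Nb")]
excluded = {("SpeakerA", "Nb")}  # stands in for a corpus-specific proper noun

core, written_only, conversation_only = split_core_vocabulary(
    written_top, spoken_top, excluded
)
print(len(core), len(written_only), len(conversation_only))  # prints: 3 1 1
```

If each top list holds exactly 2,000 lexical units, the reported sizes fit this reading: 2,000 - 1,121 = 879 written-only words, and 879 - 180 = 699 conversation-only words after the exclusion.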
Sinica Chinese Core Vocabulary is the result of several research projects funded by Academia Sinica, and the ACLCLP is authorized to release it. Applicants should apply by signing the license agreement and must comply with its terms.

Required documents:

An official statement from the applicant's affiliated institution certifying his/her status at this institution.
A written statement from the applicant or his/her affiliated institution affirming that the corpus will be used for research only, and not for any commercial purpose.
Three (3) original copies of the Licensing Agreement.

License fee (an institutional license covers 1-10 users):

Individuals (member): US$300
Individuals (non-member): US$320
Nonprofit Institutions (member): US$1,500
Nonprofit Institutions (non-member): US$1,600

Please complete the required documents listed above and send them to ACLCLP at the following address:

The Association for Computational Linguistics and Chinese Language Processing
1F., No. 34, Ln. 3, Sec. 1, Jiuzhuang St., Nankang Dist., Taipei City, 115022, Taiwan

Payment: please fill in the payment form.

Tel: 886-2-27881638, Fax: 886-2-26519386, E-mail: [email protected]