Application for Use of Sinica Balanced Corpus

The Sinica Balanced Corpus (Sinica Corpus) is the first balanced Chinese corpus with part-of-speech tagging. The current release (Sinica 4.0) is open to the research community through the WWW and contains ten million words. Each text in the corpus is classified and marked according to five criteria: genre, style, mode, topic, and source. The feature values of these classifications are organized in a hierarchy, so subcorpora can be defined by a specific set of attributes to serve different research purposes. Texts in the corpus are segmented according to the word segmentation standard proposed by the ROC Computational Linguistic Society, and each segmented word is tagged with its part of speech. Linguistic patterns and language structures can be extracted from the tagged corpus with a corpus inspection program that can filter and sort the data, generate statistics, and identify collocations.

Please complete the required documents listed below and send them to ACLCLP at the following address:

The Association for Computational Linguistics and Chinese Language Processing
℅ Institute of Information Science, Academia Sinica
128, Sec. 2, Academic Rd., Nankang, Taipei 115, Taiwan

Required documents:

The license fee:

Payment: Please fill in the payment form.

Address: 1F., No. 34, Ln. 3, Sec. 1, Jiuzhuang St., Nankang Dist., Taipei City, 115022, Taiwan
Tel: 886-2-27881638, Fax: 886-2-26519386