Author:
Neil Edward Barrett and Li-mei Chen
Abstract:
The English articles (definite the, indefinite a/an, and zero) can often be troublesome for English language learners to master, especially in longer texts. Thomas (1989) demonstrated that learners of English as a second language (L2) whose first language (L1) lacks an equivalent of an article system encounter more problems using articles. Ionin and Wexler (2004) found that such learners fluctuate between the semantic parameters of definiteness and specificity. This study examines English L2 article use among Taiwanese English learners to determine the potential factors influencing English article substitution and error patterns in their academic writing. This corpus-based analysis used natural data collected for the Academic Writing Textual Analysis (AWTA) corpus. A detailed online tagging system was constructed to examine article use, covering the semantic features (specificity and hearer knowledge) as well as other features of the English article. The results indicated that learners overused both the definite and indefinite articles but underused the zero article. The definite article was substituted for the indefinite article in specific environments. Although no significant difference existed between specific and non-specific semantic environments in zero article errors, a significant difference emerged between plural and mass/non-count nouns. These results suggest that, in regard to writing, learners need to focus on the semantic/pragmatic relationships of specificity and hearer (or reader) knowledge.
Keywords:
Definite Article, Indefinite Article, Zero Article, Hearer Knowledge
Author:
F. Y. August Chao, Siaw-Fong Chung
Abstract:
In this study, we utilize a quantitative method measuring the Multi-Level Semantic Relations based on 4549 Mandarin lexemes containing the radical mu4. The research is carried out by first extracting all dictionary definitions for all lexemes containing this radical. Then, we consider the different layers of definitions (e.g., the definitions of the keywords in a definition) and measure whether two different mu4 lexemes are related in meaning. It was found that both width (the number of lexemes covered) and depth (the number of levels to be calculated) contribute to the measurement of semantic relatedness. Some seemingly unrelated mu4 lexemes are found to be related when the depth of definitions increases. The study also compares two sets of results, one based on MI value and the other based on t-score. Our findings show that our measurement based on multi-level semantic relations produces better results than MI value does, as a collocation measurement like MI value is less suitable for analyzing semantically related dictionary entries.
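For readers unfamiliar with the two collocation measures mentioned above, the following is a minimal sketch of MI and t-score as they are commonly defined in corpus linguistics. The abstract does not state the exact formulas the authors used, so the definitions, function names, and counts below are assumptions for illustration only.

```python
import math

def mi_score(f_xy, f_x, f_y, n):
    """Mutual information (common corpus-linguistics form, assumed here).

    f_xy: co-occurrence count of the two items
    f_x, f_y: individual counts of each item
    n: corpus size in tokens
    """
    return math.log2(f_xy * n / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected co-occurrence, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

# Hypothetical counts, not taken from the study.
mi = mi_score(8, 16, 32, 1024)
t = t_score(8, 16, 32, 1024)
```

MI rewards pairs that co-occur more often than chance even when they are rare, whereas the t-score favors pairs with high absolute co-occurrence counts, which is one reason the two measures can rank the same pairs differently.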
Keywords:
Definition relation, Multi-Level Semantic Relation, Dictionary, Corpus, Mandarin radical mu4
Author:
Bor-Shen Lin and Yi-Cong Chen
Abstract:
With the evolution of human lives and the spread of information, new things emerge quickly and new terms are created every day. Therefore, it is important for natural language processing systems to extract new words as they appear over time. Because the areas of application are broad, however, the statistical characteristics of the training domain and the testing domain may be mismatched, which inevitably degrades the performance of word extraction. This paper proposes a scheme of word extraction in which histogram equalization is used for feature normalization. Through this scheme, the mismatch of the feature distributions due to different corpus sizes or changes of domain can be compensated for appropriately, such that unknown word extraction becomes more reliable and applicable to novel domains.
The scheme was initially evaluated on the corpora announced in SIGHAN2. Using four combined features with equalization, it achieved F-measures of 68.43% and 71.40% for word identification on the CKIP and the CUHK test sets, respectively, which correspond to IV/OOV recall rates of 66.72%/32.94% and 75.99%/58.39%. When applied to unknown word extraction for a novel domain, this scheme can identify such proper nouns as 海角七號 (Cape No. 7, the name of a film), 蠟筆小新 (Crayon Shinchan, the name of a cartoon figure), and 金融海嘯 (Financial Tsunami), which cannot be extracted reliably with rule-based approaches, although the approach appears less effective at identifying terms such as the names of people, places, or organizations, for which the semantic structure is prominent. This scheme is complementary to the outputs of two word segmentation systems and is promising if other rule-based approaches can be further integrated.
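As a rough illustration of histogram equalization for feature normalization, the sketch below maps each feature value from one domain onto the value at the same cumulative rank in a reference (training) distribution. The abstract does not describe the authors' exact procedure, so the function name `equalize`, the rank-matching rule, and the sample values are all assumptions.

```python
from bisect import bisect_right

def equalize(values, reference):
    """Map each value to the reference sample at the same empirical CDF rank.

    values: feature values from the new (test) domain
    reference: feature values from the reference (training) domain
    """
    sorted_vals = sorted(values)
    sorted_ref = sorted(reference)
    n, m = len(sorted_vals), len(sorted_ref)
    out = []
    for v in values:
        # Rank of v in its own domain: number of samples <= v (1..n).
        rank = bisect_right(sorted_vals, v)
        # Reference index at the same cumulative probability: ceil(rank*m/n) - 1.
        idx = -(-rank * m // n) - 1
        out.append(sorted_ref[idx])
    return out

# Hypothetical feature values on very different scales.
normalized = equalize([100, 300, 200], [0.1, 0.2, 0.3])
```

Because only the rank of each value matters, the mapping is unaffected by differences in scale between the two domains, which is the kind of distribution mismatch the scheme is meant to compensate for.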
Keywords:
Unknown Word Extraction, Word Identification, Machine Learning, Multilayer Perceptrons, Histogram Equalization
Author:
Chieh-Jen Wang and Hsin-Hsi Chen
Abstract:
Detecting intent shift is fundamental for learning users' behaviors and applying their experiences. In this paper, we propose a search-query-log based system to predict users' intent shifts. We begin with selecting sessions in search query logs for training, extracting features from the selected sessions, and clustering sessions of similar intent. The resulting intent clusters are used to predict intent shift in testing data. The experimental results show that the proposed model achieves an accuracy of 0.5099, which is significantly better than the baselines. Moreover, the miss rate and spurious rate of the model are 0.0954 and 0.0867, respectively.
Keywords:
Intent Shift Detection, Intent Analysis, Search Query Logs Analysis
Author:
Darren Hsin-Hung Lin and Shelley Ching-Yu Hsieh
Abstract:
This paper presents a corpus-driven linguistic approach to embodiment in modern patent language as a contribution to the growing needs in intellectual property rights. While there is work that appears to fill a niche in English for Specific Purposes (ESP), the present study suggests that a statistical retrieval approach is necessary for compiling a patent technical word list to expand learner vocabulary size. Since a significant percentage of technical vocabulary appears within the range of independent claim among claim lexis, this study examines its essential features to show how it is characterized with respect to the linguistic specificity of patent style. It is further demonstrated how the proposed approach to the term independent claim contained in the patent specification is reliable for patent application on an international level. For example, clausal types that specify how clauses are used in U.S. patent documents under co-occurrence relations are potentially useful for patent writing, while verb-noun collocations allow learners to grasp hidden semantic prosodic associations. In short, the research content and statistical investigations of our approach highlight the pedagogical value of Patent English for ESP teachers, applied linguists, and the development of interdisciplinary research.
Keywords:
Intellectual Property Rights, Patent Document Processing, Corpus, Systemic Functional Linguistics, Co-Occurrence