Creating and Working with Corpora of Linguistic Data

Taking the Web-as-Corpus approach (Kilgarriff & Grefenstette, 2003), this research project creates large-scale linguistic corpora by writing computer programs that automatically retrieve data from the Internet. For example, I have created an eight-million-word English-Chinese parallel corpus using bilingual articles from the Chinese division of the New York Times. Using posts from an online discussion forum, I have also created a seven-million-word corpus of China English. To make these corpus resources accessible to the wider research community, I have created a web-based interface that supports standard corpus analyses, including concordancing, collocations, distribution tables and charts, frequency lists, and keyword analysis. This is achieved with CQPweb (Hardie, 2012), a corpus analysis tool of the fourth and latest generation, built on the IMS Open Corpus Workbench and a MySQL relational database.
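To illustrate what concordancing means in practice, the following is a minimal keyword-in-context (KWIC) sketch in Python. It is not how CQPweb or the Corpus Workbench implement concordancing; the function name `kwic`, the `window` parameter, and the sample sentence are assumptions made purely for illustration.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, match, right context) for each hit of keyword.

    Illustrative only: a real concordancer works over an indexed corpus
    rather than scanning a token list.
    """
    keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text, tokenized naively on whitespace.
sample = "the corpus is large and the corpus is parallel"
hits = kwic(sample.split(), "corpus", window=2)
for left, match, right in hits:
    print(f"{left:>10} [{match}] {right}")
```

Each hit is shown with a fixed number of words on either side, which is the basic display a concordance view offers for inspecting how a word is used.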

Exploiting Corpus Resources in Investigating Theoretically Important Questions in Applied Linguistics

Using existing corpus data and linguistic corpora I have created, I have examined a wide range of topics within the field of second language acquisition and applied linguistics. For example, in a chapter in the edited volume Automatic Treatment and Analysis of Learner Corpus Data (John Benjamins), my co-author and I examined syntactic complexity in second language writing using large-scale learner corpus data. This study reveals that non-native and native students’ writings differ in terms of length of production unit, amount of subordination, amount of coordination, and degree of phrasal sophistication, and that the higher-proficiency non-native group approximates the native group significantly better than the lower-proficiency non-native group in the areas of length of production unit and degree of phrasal sophistication. In another study titled A corpus-based study of lexicogrammar in China English, with my co-author, I examined the lexis-grammar interface in China English and argued that there are certain lexicogrammatical features — new ditransitive verbs, verb-complementation and collocation patterns — that can be considered as concrete instantiations of structural nativization in China English.

Interdisciplinary Synergy between Second Language Acquisition and Intelligent Computer-assisted Language Learning

In my dissertation research, I explore the interdisciplinary synergy between SLA and intelligent computer-assisted language learning (ICALL). Adopting the theoretical framework of sociocultural theory, my dissertation proposes a hybrid pedagogical model that combines the principles of concept-based instruction (i.e., explanation, materialization, visualization, and verbalization) with the capabilities of an ICALL system to help intermediate-level American university students develop a conceptual understanding of the notoriously difficult ba-construction in Chinese. The ICALL system integrates the English-Chinese parallel corpus so that students can take a data-driven approach to language learning and explore authentic linguistic samples of the ba-construction for themselves.

Developing Computational Tools for Applied Linguistics Research

In this research project, I examine how computational linguistics, particularly natural language processing (NLP), can be fruitfully exploited to create computational tools that inform applied linguistics research. Working with the Python programming language and the Django web framework, I have created several web-based tools that automatically compute a comprehensive set of lexical and syntactic complexity measures for both English and Chinese. These tools combine state-of-the-art NLP techniques (e.g., part-of-speech tagging, syntactic parsing) with linguistic insights extracted from corpora (i.e., word frequency information).
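As a rough illustration of how frequency information can feed a lexical measure, the sketch below computes a type-token ratio and a simple frequency-based sophistication score in Python. It is not the implementation of the tools described above: the regex tokenizer, the function names, and the tiny `COMMON_WORDS` list are all assumptions standing in for a real tokenizer and a corpus-derived frequency list.

```python
import re

# Hypothetical stand-in for a corpus-derived high-frequency word list.
COMMON_WORDS = {"the", "on", "and", "too"}


def tokenize(text):
    """Lowercase and extract runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_measures(text, common_words=COMMON_WORDS):
    """Compute a few simple lexical complexity measures for English text."""
    tokens = tokenize(text)
    if not tokens:
        return {"tokens": 0, "types": 0, "ttr": 0.0, "sophistication": 0.0}
    types = set(tokens)
    # Tokens outside the high-frequency list count as "sophisticated".
    rare = [t for t in tokens if t not in common_words]
    return {
        "tokens": len(tokens),
        "types": len(types),
        "ttr": len(types) / len(tokens),           # type-token ratio
        "sophistication": len(rare) / len(tokens),  # share of less frequent tokens
    }


sample = "The cat sat on the mat and the dog sat too"
measures = lexical_measures(sample)
print(measures)
```

Measures of this kind are sensitive to text length and to the choice of frequency list, so a research tool would normally rely on length-corrected variants and a list drawn from a large reference corpus.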