Towards a Pre-Processing System for Casual English Annotated with Linguistic and Cultural Information

E. Clark, T. Roberts, and K. Araki (Japan)


Text Processing, Machine Translation, Educational Technology, Twitter


We present a preliminary version of a text processing system, CECS (Casual English Conversion System), whose purpose is to normalize into regular English the casual, error-ridden English that is frequently found in new media such as Twitter. CECS has two applications: as a pre-processing step on input to Machine Translation or Information Retrieval systems, and as a standalone system to aid non-native speakers' reading comprehension of informal written English. The educational aspect of CECS is enhanced by manually compiled annotations provided for each word or phrase converted by the system. The system currently relies on a manually compiled database and a fairly straightforward text-to-text replacement method, but future plans include the implementation of a web mining algorithm for wider knowledge acquisition. Preliminary experiments produced positive results, suggesting that the basic concept and implementation give the system considerable potential as a pre-processing tool, and that the main remaining task lies in expanding the database and adding web mining and word-sense disambiguation algorithms for automatic candidate selection.
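As an illustration only, the database-driven text-to-text replacement with annotation described above might be sketched as follows. This is not the CECS implementation: the database entries, the annotation wording, and the names `CASUAL_DB`, `TOKEN`, and `normalize` are all hypothetical, and the sketch ignores capitalization and multi-word phrases.

```python
import re

# Hypothetical database: casual form -> (regular form, annotation).
# The entries are invented for illustration, not taken from CECS.
CASUAL_DB = {
    "u": ("you", "Phonetic abbreviation common in text messages."),
    "gr8": ("great", "Number-for-sound substitution: '8' stands for 'eat'."),
    "gonna": ("going to", "Spoken contraction written as pronounced."),
}

# A token is a run of letters, digits, or apostrophes.
TOKEN = re.compile(r"[A-Za-z0-9']+")


def normalize(text, db=CASUAL_DB):
    """Replace known casual tokens and collect an annotation for each one."""
    notes = []

    def replace(match):
        word = match.group(0)
        entry = db.get(word.lower())
        if entry is None:
            return word  # unknown tokens pass through unchanged
        regular, annotation = entry
        notes.append((word, regular, annotation))
        return regular

    return TOKEN.sub(replace, text), notes


converted, notes = normalize("u are gonna have a gr8 time")
print(converted)
for casual, regular, annotation in notes:
    print(f"{casual} -> {regular}: {annotation}")
```

A lookup of this kind returns a single fixed candidate per casual form, which is where the planned word-sense disambiguation and automatic candidate selection would come in.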

