Automatic Extraction of the Multiple Semantic and Syntactic Categories of Words

D. Portnoy and P. Bock (USA)


Natural Language Processing, Data Mining


A single unsupervised algorithm called lexical context deconvolution (LCD) is proposed to discover the semantic categories (senses) of polysemous words (those with more than one meaning) and the syntactic categories (parts of speech) of ambiguous words (those with more than one part of speech), relying solely on training with raw, unannotated text. No dictionaries, part-of-speech lexicons, stop-word lists, or similar resources are required or used. Knowledge about semantic and syntactic categories is acquired by collecting statistics from the lexical contexts in which words are found. The method first finds compact clusters of semantically similar words, which are assumed to represent the different semantic categories present in the training texts. Discovering a given polysemous word’s semantic categories is then treated as a deconvolution problem: the target word’s coöccurrence feature vector is assumed to be a linear combination of the categories’ coöccurrence feature vectors. A word’s semantic categories are therefore discovered by finding the nonnegative least-squares solution to the system of linear equations formed by the word’s and the categories’ coöccurrence feature vectors. The syntactic categories of a word are found by changing the word’s feature vectors from coöccurrences to n-grams; every other part of the algorithm remains unchanged.
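
The deconvolution step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which nonnegative least-squares solver is used, so projected gradient descent stands in for it here, and the function name `nnls_projected_gradient`, the category names, and all feature values are hypothetical toy data chosen only for illustration.

```python
def nnls_projected_gradient(categories, target, iterations=5000):
    """Approximately minimise ||A x - b||^2 subject to x >= 0.

    categories: list of k category feature vectors, each of length d
                (the columns of A); assumed, not taken from the paper
    target:     the word's feature vector b, of length d
    Returns a list of k nonnegative mixing weights.
    """
    k = len(categories)
    d = len(target)
    # The squared Frobenius norm of A is an upper bound on the largest
    # eigenvalue of A^T A, so 1 / bound is a safe step size.
    bound = sum(v * v for vec in categories for v in vec)
    step = 1.0 / bound
    x = [0.0] * k
    for _ in range(iterations):
        # Residual r = A x - b.
        residual = [
            sum(x[j] * categories[j][i] for j in range(k)) - target[i]
            for i in range(d)
        ]
        # Gradient (up to a factor of 2): g_j = category_j . r
        gradient = [
            sum(categories[j][i] * residual[i] for i in range(d))
            for j in range(k)
        ]
        # Gradient step followed by projection onto the nonnegative orthant.
        x = [max(0.0, x[j] - step * gradient[j]) for j in range(k)]
    return x


# Hypothetical category feature vectors (toy coöccurrence counts).
finance = [4.0, 0.0, 1.0, 0.0]
river = [0.0, 3.0, 0.0, 2.0]
sport = [1.0, 0.0, 0.0, 5.0]

# A toy polysemous word built as 0.7 * finance + 0.3 * river.
bank = [0.7 * f + 0.3 * r for f, r in zip(finance, river)]

weights = nnls_projected_gradient([finance, river, sport], bank)
for name, w in zip(["finance", "river", "sport"], weights):
    print(f"{name}: {w:.3f}")
```

With these toy vectors the recovered weights are close to 0.7 for the first category, 0.3 for the second, and 0 for the third, which is the sense in which the word is "deconvolved" into its categories. Under the abstract's description, the same routine would apply to syntactic categories if the feature vectors were built from n-grams instead of coöccurrences.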
