Unsupervised Discovery of Language Structure in Audio Signals

J. Elliott (UK)

Keywords

Audio; language; unsupervised learning; Significant Activity Segments (SAS); detection

Abstract

Unlike traditional speech processing, the aim of this research is not, on receiving a signal, to identify where individual word boundaries begin and end, or to detect, using supervised techniques, the pattern set that comprises the signal's lexicon. The rationale underpinning this approach is therefore not to decipher the content of the audio signal, as this is a secondary task that assumes language content exists, but to identify what constitutes the physical structure of spoken language, in contrast to other structured phenomena. In essence, the goal is to develop an automated (artificially intelligent) intuitive 'ear' that can detect the rhythm and structure of language with accuracy equal to or better than that of the human ear. To achieve this, unsupervised learning techniques, which do not rely on prior knowledge of a specific system, underpin generic methods devised to classify unknown phenomena if encountered. Results show that amplitude frequency histograms, derived from vertical, horizontal and thresholded analysis, clearly distinguish speech, 'noise' and music through distinctive leptokurtic, platykurtic, and either 'tooth comb' or bimodal profiles respectively. Birds and apes demonstrate similar but coarser-grained versions of the leptokurtic distribution; however, dolphins and orcas produce profiles almost identical to those of humans, which indicates a similar complexity of sound pattern construction. Individually, the two visualisation methods described above (Significant Activity Segment (SAS) time series and amplitude frequency histograms) are reasonably robust in differentiating language from other signals. In particular, time series analysis of SAS can identify language-like communication within a transmission that also contains other structured phenomena, whether natural or artificial. Combining the two methods, however, produces a significantly more robust system, which is believed to provide an extremely useful automated first-pass filter for identifying and distinguishing intelligent, language-like audio communication without the intervention of supervised techniques.
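As an informal illustration of the histogram-based discrimination described above (not the author's implementation), the following Python sketch computes an amplitude frequency histogram and uses excess kurtosis as a crude proxy for the leptokurtic versus platykurtic profiles reported for speech and noise. The function names and thresholds are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import kurtosis

def amplitude_histogram(signal, n_bins=256):
    # Frequency-of-occurrence histogram of sample amplitudes
    # (a simple stand-in for the paper's vertical analysis).
    counts, edges = np.histogram(signal, bins=n_bins)
    return counts, edges

def classify_profile(signal):
    # Excess kurtosis of the amplitude distribution: strongly peaked
    # (leptokurtic) suggests speech-like structure, a flat distribution
    # (platykurtic) suggests noise; the thresholds below are illustrative.
    k = kurtosis(signal, fisher=True)  # 0 for a Gaussian
    if k > 1.0:
        return "speech-like (leptokurtic)"
    if k < -0.5:
        return "noise-like (platykurtic)"
    return "indeterminate (possibly music or mixed content)"

# Example with synthetic signals: a heavy-tailed (Laplacian) signal mimics
# the peaked amplitude profile of speech; uniform noise is flat.
rng = np.random.default_rng(0)
print(classify_profile(rng.laplace(size=16000)))    # speech-like (leptokurtic)
print(classify_profile(rng.uniform(-1, 1, 16000)))  # noise-like (platykurtic)
```

A bimodal or 'tooth comb' profile, as reported for music, would require inspecting the histogram shape itself rather than a single kurtosis statistic.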
