Efficient Similarity Matching for Categorical Sequence based on Dynamic Partition

K. Cai, C. Chen, and H. Lin (PRC)


Time series, sequence matching, categorical sequence, dynamic partition, longest common subsequence


Sequence similarity matching is one of the most important data mining techniques and has received much attention in recent years. However, most existing methods were designed for matching numerical data sequences, and they face many limitations when applied to categorical data sequences, which consist of a small number of discrete categorical values. In this paper, we study the intrinsic features of categorical data sequences. Based on the Longest Common Subsequence (LCS) similarity model, we propose a new matching scheme that is applicable to categorical data sequences. We measure the similarity between two sequences from four different angles, on the basis of the similarity between their subsequences. For the construction of subsequences, which are the basic units of sequence matching, we propose a dynamic partition method. In contrast to previous techniques, this method does not rely on a fixed-size window; instead, it obtains subsequences automatically, and the subsequences may differ in length. Finally, we implemented the proposed method and carried out experiments on synthetic data. The experimental results indicate that it is an effective scheme for similarity matching of categorical data sequences.
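
As a rough illustration of the LCS similarity model that the abstract builds on, the following minimal sketch computes the classic longest common subsequence length between two categorical sequences and normalizes it to a score in [0, 1]. It does not reproduce the paper's four-angle measure or its dynamic partition method, which the abstract does not specify. The function names `lcs_length` and `lcs_similarity`, and the choice to normalize by the longer sequence, are assumptions made here for illustration only.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only the previous row of the table.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching symbols extend the best subsequence of both prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one symbol from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Illustrative normalization (an assumption, not the paper's measure):
    LCS length divided by the length of the longer sequence."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))
```

For example, `lcs_similarity("AABBC", "ABBCC")` compares two sequences over the alphabet {A, B, C}; their longest common subsequence is "ABBC", so the score is 4/5.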
