Web Data Extraction using Clustering

H. Xu and J. Park (Canada)


Data extraction, wrapper, data record extraction.


The Web is increasingly becoming a very large information source. However, its information is structured visually, so data records and presentation patterns are easy for humans to recognize but not for computers. In this paper, we study an automatic wrapper generator, WDE (Web Data Extractor). The only input to the tool is a URL pointing to an HTML document. The goal of WDE is to extract data from structured or loosely structured repeating data structures, including product listings, readers’ comments, sports scoreboards, and forums. The tool analyzes an HTML document by assigning a match score to each node and identifies the repeating pattern by clustering the match scores with the k-means clustering algorithm. Experimental evaluation shows that it performs slightly better than the leading tool MDR [11].
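
The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes each DOM node has already been given a scalar match score in [0, 1]. The abstract does not say how WDE computes that score, how many clusters it uses, or how it initialises them, so the function names (`kmeans_1d`, `repeating_nodes`), the choice of k = 2, and the min/max initialisation are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def kmeans_1d(scores: List[float], k: int = 2,
              max_iter: int = 100) -> Tuple[List[float], List[int]]:
    """Plain Lloyd's k-means on scalar values; returns (centroids, labels)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    lo, hi = min(scores), max(scores)
    # Assumption: deterministic start, centroids spread evenly from min to max.
    if k == 1:
        centroids = [(lo + hi) / 2]
    else:
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(scores)
    for _ in range(max_iter):
        # Assignment step: each score joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: abs(s - centroids[c]))
                      for s in scores]
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for c in range(k):
            members = [s for s, lab in zip(scores, new_labels) if lab == c]
            new_centroids.append(sum(members) / len(members)
                                 if members else centroids[c])
        if new_labels == labels and new_centroids == centroids:
            break  # converged: nothing changed in this pass
        labels, centroids = new_labels, new_centroids
    return centroids, labels


def repeating_nodes(scores: List[float]) -> List[int]:
    """Indices of nodes in the cluster with the highest mean match score."""
    centroids, labels = kmeans_1d(scores, k=2)
    best = max(range(len(centroids)), key=lambda c: centroids[c])
    return [i for i, lab in enumerate(labels) if lab == best]
```

With hypothetical scores such as `[0.91, 0.88, 0.12, 0.93, 0.05, 0.90, 0.10]`, the high-scoring nodes (indices 0, 1, 3, and 5) fall into one cluster and would be treated as candidate data records, while the low-scoring nodes are set aside as non-repeating page content.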
