Fast and Scalable Pattern Mining for Media-Type Focused Crawling
Umbrich, Jürgen ; Karnstedt, Marcel ; Harth, Andreas
Identifiers
http://hdl.handle.net/10379/1121
https://doi.org/10.13025/21376
Repository DOI
Publication Date
2009
Type
Workshop paper
Citation
Jürgen Umbrich, Marcel Karnstedt, Andreas Harth "Fast and Scalable Pattern Mining for Media-Type Focused Crawling", KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning, in conjunction with LWA 2009, 2009.
Abstract
Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naive crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without the need for downloading the content of the resource. The idea is to use a learning approach on features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.
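As a rough illustration of predicting a media type from the URI alone, the sketch below splits each URI into tokens and trains a small multinomial naive Bayes classifier on them. It is not the authors' implementation: the tokenization rule, the choice of naive Bayes, the names `uri_features` and `NaiveBayesUriClassifier`, and the six toy training URIs are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def uri_features(uri):
    """Split a URI into lowercase tokens plus a file-extension feature."""
    parts = urlsplit(uri)
    text = (parts.netloc + parts.path + "?" + parts.query).lower()
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    last_segment = parts.path.rsplit("/", 1)[-1]
    if "." in last_segment:
        # Hypothetical extra feature: the suffix after the last dot.
        tokens.append("ext:" + last_segment.rsplit(".", 1)[-1].lower())
    return tokens


class NaiveBayesUriClassifier:
    """Multinomial naive Bayes over URI tokens with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, uri, media_type):
        self.class_counts[media_type] += 1
        for token in uri_features(uri):
            self.token_counts[media_type][token] += 1
            self.vocabulary.add(token)

    def predict(self, uri):
        """Return the most likely media type without fetching the resource."""
        tokens = uri_features(uri)
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(count / total)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch (not from the paper).
TRAINING = [
    ("http://example.org/media/song.mp3", "audio"),
    ("http://example.org/podcast/episode1.mp3", "audio"),
    ("http://example.org/video/clip.mpg", "video"),
    ("http://example.org/video/talk.avi", "video"),
    ("http://example.org/docs/index.html", "text"),
    ("http://example.org/about.html", "text"),
]

classifier = NaiveBayesUriClassifier()
for uri, media_type in TRAINING:
    classifier.train(uri, media_type)

if __name__ == "__main__":
    print(classifier.predict("http://example.com/files/track.mp3"))
```

A crawler could call `predict` on each newly discovered URI and enqueue only those whose predicted type matches the target, which avoids downloading the content just to inspect its type.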
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland