µRaptor: A DOM-based system with appetite for hCard elements

Muñoz, Emir
Costabello, Luca
Vandenbussche, Pierre-Yves
Muñoz, Emir, Costabello, Luca, & Vandenbussche, Pierre-Yves. (2014). µRaptor: a DOM-based system with appetite for hCard elements. Paper presented at the Proceedings of the Second International Conference on Linked Data for Information Extraction - Volume 1267, Riva del Garda, Italy.
This paper describes µRaptor, a DOM-based method to extract hCard microformats from HTML pages stripped of microformat markup. µRaptor extracts DOM sub-trees, converts them into rules, and uses these rules to extract hCard microformats. We also use co-occurring CSS classes to improve overall precision. Results on the training data show a precision of 0.96 and an F1 measure of 0.83 when only the most common tree patterns are considered. Furthermore, we propose adding constraint rules on the values of hCard elements to further improve the extraction.
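The rule-learning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-tag path signature, the `HCARD_PROPS` subset, and the names `PathCollector`, `learn_rules`, and `extract` are assumptions made for illustration, and the CSS co-occurrence step and value constraints are left out. The sketch counts tag paths that lead to hCard-annotated elements in marked-up pages, keeps the most common ones as rules, and applies them to a page without microformat markup.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed subset of hCard property class names (illustrative only).
HCARD_PROPS = {"fn", "tel", "email", "adr", "org", "url"}
# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects (tag path, class set, text) for every text-bearing element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, classes)
        self.nodes = []  # collected (path, classes, text) triples

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = frozenset((dict(attrs).get("class") or "").split())
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop back to the matching open tag.
        if tag not in [t for t, _ in self.stack]:
            return
        while self.stack.pop()[0] != tag:
            pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # Hypothetical pattern signature: the last three tags on the path.
            path = "/".join(t for t, _ in self.stack[-3:])
            self.nodes.append((path, self.stack[-1][1], text))


def collect(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rules(marked_up_pages, top_k=10):
    """Keep the top_k most frequent (path, property) pairs as path -> property rules."""
    counts = Counter()
    for page in marked_up_pages:
        for path, classes, _ in collect(page):
            for prop in classes & HCARD_PROPS:
                counts[(path, prop)] += 1
    rules = {}
    for (path, prop), _ in counts.most_common(top_k):
        rules.setdefault(path, prop)  # the most frequent property wins per path
    return rules


def extract(stripped_page, rules):
    """Label text nodes of a markup-free page whose tag path matches a rule."""
    return [(rules[path], text) for path, _, text in collect(stripped_page) if path in rules]


training_page = (
    '<div class="vcard"><span class="fn">Ann Lee</span>'
    '<a class="email" href="mailto:a@x.org">a@x.org</a></div>'
)
stripped_page = '<div><span>Bob Ray</span><a href="mailto:b@y.org">b@y.org</a></div>'

rules = learn_rules([training_page])
found = extract(stripped_page, rules)
```

A purely structural signature such as this one matches any element with the same tag path, which is why additional evidence such as co-occurring CSS classes or constraints on values is needed to keep precision high.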
Attribution-NonCommercial-NoDerivs 3.0 Ireland