Publication

µRaptor: A DOM-based system with appetite for hCard elements

Muñoz, Emir
Costabello, Luca
Vandenbussche, Pierre-Yves
Identifiers
http://hdl.handle.net/10379/6021
https://doi.org/10.13025/21411
Repository DOI
Publication Date
2014
Type
Workshop paper
Citation
Muñoz, Emir, Costabello, Luca, & Vandenbussche, Pierre-Yves. (2014). µRaptor: a DOM-based system with appetite for hCard elements. Paper presented at the Proceedings of the Second International Conference on Linked Data for Information Extraction - Volume 1267, Riva del Garda, Italy.
Abstract
This paper describes µRaptor, a DOM-based method to extract hCard microformats from HTML pages stripped of microformat markup. µRaptor extracts DOM sub-trees, converts them into rules, and uses these rules to extract hCard microformats. We also use co-occurring CSS classes to improve the overall precision. Results on the training data show a precision of 0.96 and an F1 measure of 0.83 when considering only the most common tree patterns. Furthermore, we propose the adoption of additional constraint rules on the values of hCard elements to further improve the extraction.
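The pipeline outlined in the abstract (DOM sub-trees turned into rules, then applied to pages without microformat markup) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the structural signature format, the helper names (`learn_rules`, `extract`, `signature`), the `top_k` cut-off, and the sample HTML are all assumptions made for illustration, and the sketch assumes well-formed, properly nested HTML. It omits the CSS co-occurrence and value-constraint steps.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, classes=(), parent=None):
        self.tag, self.classes, self.parent = tag, list(classes), parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a tiny DOM tree; assumes every start tag has a matching end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        node = Node(tag, classes, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def leaves(node):
    return [n for n in iter_nodes(node) if not n.children]


def signature(node):
    """Hypothetical structural signature of a sub-tree, e.g. 'div(span,span)'."""
    if not node.children:
        return node.tag
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def learn_rules(annotated_html):
    """Count (signature, leaf field names) pairs under elements marked 'vcard'."""
    rules = Counter()
    for node in iter_nodes(parse(annotated_html)):
        if "vcard" in node.classes:
            fields = tuple(l.classes[0] if l.classes else None for l in leaves(node))
            rules[(signature(node), fields)] += 1
    return rules


def extract(plain_html, rules, top_k=1):
    """Apply the top_k most common patterns to markup-free HTML."""
    patterns = dict(rule for rule, _ in rules.most_common(top_k))
    records = []
    for node in iter_nodes(parse(plain_html)):
        fields = patterns.get(signature(node))
        if fields:
            records.append({f: l.text for f, l in zip(fields, leaves(node)) if f})
    return records


# Illustrative data only: one annotated training snippet, one stripped page.
train = ('<div class="vcard"><span class="fn">Grace Hopper</span>'
         '<span class="org">US Navy</span></div>')
page = ('<ul><li><div><span>Ada Lovelace</span>'
        '<span>Analytical Engines Ltd</span></div></li><li><p>unrelated</p></li></ul>')
print(extract(page, learn_rules(train)))
```

Keeping only the `top_k` most frequent patterns mirrors the abstract's restriction to the most common tree patterns; a purely structural match like this one would over-generate on real pages, which is where the additional signals mentioned in the abstract would come in.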
Publisher
CEUR-WS.org
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland