µRaptor: A DOM-based system with appetite for hCard elements
Muñoz, Emir ; Costabello, Luca ; Vandenbussche, Pierre-Yves
Identifiers
http://hdl.handle.net/10379/6021
https://doi.org/10.13025/21411
Repository DOI
Publication Date
2014
Type
Workshop paper
Citation
Muñoz, Emir, Costabello, Luca, & Vandenbussche, Pierre-Yves. (2014). µRaptor: a DOM-based system with appetite for hCard elements. Paper presented at the Proceedings of the Second International Conference on Linked Data for Information Extraction - Volume 1267, Riva del Garda, Italy.
Abstract
This paper describes µRaptor, a DOM-based method for extracting hCard microformats from HTML pages that have been stripped of microformat markup. µRaptor extracts DOM sub-trees, converts them into rules, and applies those rules to extract hCard microformats. In addition, we use co-occurring CSS classes to improve the overall precision. Results on the training data show a precision of 0.96 and an F1 measure of 0.83 when only the most common tree patterns are considered. Furthermore, we propose adopting additional constraint rules on the values of hCard elements to further improve the extraction.
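The pipeline outlined in the abstract (collect DOM patterns from marked-up pages, turn the most common ones into rules, apply them to stripped pages) might be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: it reduces each sub-tree to a plain tag path, ignores the co-occurring CSS class signal and the proposed value constraints, and every name in it (`HCARD_PROPS`, `PathCollector`, `learn_rules`, `extract`, `top_k`) is hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: a small subset of hCard property class names, for illustration only.
HCARD_PROPS = {"fn", "tel", "email", "org", "adr"}
# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record (tag path, classes of innermost element, text) for each text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, classes)
        self.nodes = []  # collected (path, classes, text) triples

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = frozenset((dict(attrs).get("class") or "").split())
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(tag for tag, _ in self.stack)
            self.nodes.append((path, self.stack[-1][1], text))


def collect(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rules(training_pages, top_k=10):
    """Keep the top_k most frequent (tag path, hCard property) pairs as rules."""
    counts = Counter()
    for page in training_pages:
        for path, classes, _ in collect(page):
            for prop in classes & HCARD_PROPS:
                counts[(path, prop)] += 1
    rules = {}
    for (path, prop), _ in counts.most_common(top_k):
        rules.setdefault(path, prop)  # first (most frequent) property wins
    return rules


def extract(html, rules):
    """Label text nodes of a stripped page whose tag path matches a rule."""
    return [(rules[path], text) for path, _, text in collect(html) if path in rules]


# Usage: learn from one marked-up page, then apply to a page without hCard classes.
marked_up = '<div class="vcard"><h3 class="fn">Ann Lee</h3><p class="tel">555-0100</p></div>'
stripped = "<div><h3>Bob Ray</h3><p>555-0199</p></div>"
rules = learn_rules([marked_up])
print(extract(stripped, rules))  # [('fn', 'Bob Ray'), ('tel', '555-0199')]
```

A bare tag path is a much weaker pattern than a full sub-tree: two properties that share the same path (for example, two sibling `span` elements) cannot be told apart here, which is the kind of ambiguity that additional signals such as co-occurring CSS classes are meant to reduce.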
Publisher
CEUR-WS.org
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland