Building realistic ground truth datasets of personal identification information for entity matching
Aribilola, Ifeoluwapo ; Catena, Matteo ; Asghar, Mamoona ; Breslin, John ; Delbru, Renaud
Files
Building Realistic Ground Truth Datasets of PII for EM.pdf
Adobe PDF, 597.88 KB
Embargoed until 2026-08-10
Publication Date
2025-08-10
Type
conference paper
Citation
Aribilola, I., Catena, M., Asghar, M., Breslin, J., Delbru, R. (2025). Building Realistic Ground Truth Datasets of Personal Identification Information for Entity Matching. In: Coppens, B., Volckaert, B., Naessens, V., De Sutter, B. (eds) Availability, Reliability and Security. ARES 2025. Lecture Notes in Computer Science, vol 15997. Springer, Cham. https://doi.org/10.1007/978-3-032-00639-4_12
Abstract
Entity matching (EM) is essential for connecting data across sources, particularly in sensitive domains like human trafficking investigations. However, research faces a critical gap: the lack of realistic gold standard datasets containing personal identification information (PII). This paper introduces a methodology for creating gold standard datasets, demonstrated through the development of a representative dataset for PII. Our approach combines multiple EM techniques to identify candidate matches, followed by a systematic annotation and validation process. Notably, our findings demonstrate that different techniques identify largely non-overlapping sets of matches, validating the need for our multi-technique methodology. Our approach provides a reproducible template for creating gold standard datasets in domains where realistic evaluation resources are scarce.
Publisher
Springer
Publisher DOI
https://doi.org/10.1007/978-3-032-00639-4_12
Rights
CC BY-NC-ND