Publication

Building realistic ground truth datasets of personal identification information for entity matching

Aribilola, Ifeoluwapo
Catena, Matteo
Asghar, Mamoona
Breslin, John
Delbru, Renaud
Citation
Aribilola, I., Catena, M., Asghar, M., Breslin, J., Delbru, R. (2025). Building Realistic Ground Truth Datasets of Personal Identification Information for Entity Matching. In: Coppens, B., Volckaert, B., Naessens, V., De Sutter, B. (eds) Availability, Reliability and Security. ARES 2025. Lecture Notes in Computer Science, vol 15997. Springer, Cham. https://doi.org/10.1007/978-3-032-00639-4_12
Abstract
Entity matching (EM) is essential for connecting data across sources, particularly in sensitive domains like human trafficking investigations. However, research faces a critical gap: the lack of realistic gold standard datasets containing personal identifying information. This paper introduces a methodology for creating gold standard datasets, demonstrated through the development of a representative dataset for personal identification information (PII). Our approach combines multiple EM techniques to identify candidate matches, followed by a systematic annotation and validation process. Notably, our findings demonstrate that different techniques identify largely non-overlapping sets of matches, validating the need for our multi-technique methodology. Our approach provides a reproducible template for creating gold standard datasets in domains where realistic evaluation resources are scarce.
Publisher
Springer
Publisher DOI
https://doi.org/10.1007/978-3-032-00639-4_12
Rights
CC BY-NC-ND