Publication

Entity summarisation for entity-centric publish/subscribe systems

Pavlopoulou, Niki
Abstract
The Internet of Things (IoT) has led to physical devices generating entity-centric data (e.g. in smart buildings). To bridge the gap between the devices’ data and the users’ interests, Publish/Subscribe (Pub/Sub) systems are suitable middleware for dynamic, large-scale IoT applications because of their decoupling traits. However, the IoT poses challenges beyond dynamism, relating to both data and users. Specifically, data can be voluminous and heterogeneous due to integration or enrichment, as well as redundant or semantically similar due to the sensors’ spatial proximity. Existing approaches tackle semantic interoperability through ontologies and taxonomies, resulting in rigidity, poor scalability, and domain dependency. At the same time, users can create either representationally coupled queries that may be complex (e.g. SPARQL), regardless of their data knowledge and expertise, or simple queries that return redundant information, which can overwhelm them. Existing approaches either use complex queries or create high-level data abstractions that are either not usable or too complex for dynamic environments and suffer from representational coupling. This thesis addresses these problems and analyses two research questions involving the formulation of a new Pub/Sub scheme: the Entity-centric Publish/Subscribe Summarisation System, which involves user-friendly and contextually aware subscriptions as well as extractive and abstractive summarisation approaches for the publications. Its goal is to address usability, user expressibility, data expressiveness, user and data effectiveness, and system efficiency. Three approaches are proposed: PubSum, IoTSAX, and PoSSUM. PubSum is a dynamic diverse entity summarisation of heterogeneous Linked Data streams through windowing policies, embedding-based DBSCAN clustering, and geometric-based top-k ranking.
IoTSAX is a dynamic abstractive summarisation of heterogeneous numerical entity graph streams through enhanced Symbolic Aggregate approXimation (SAX) and approximate rule-based reasoning. PoSSUM is an extractive and abstractive diverse summarisation of heterogeneous numerical and Linked Data streams through novel partly-incremental conceptual clustering based on embedding models and variance, as well as contextual-based top-k ranking. As an example, doctors are not experts in query languages and are unaware of the content and representations of patient data in a system. The proposed system requires only a simple patient-centric subscription, which produces a summary as a notification. This summary is abstractive, in that it interprets the shape of real-time health sensor readings and provides a high-level inference, and extractive, in that it includes the most important and conceptually/contextually diverse information coming from external sources (e.g. personal information). The proposed system has been extensively evaluated with synthetic and real-world data from the domains of Healthcare and Smart Cities, achieving comparable results in correctness and system performance. Specifically, PubSum, involving DBpedia data, achieves up to 92% reduction of forwarded messages, 69.3% duplication reduction, and 0.95 redundancy-aware F-score compared to traditional Pub/Sub, at the expense of 4 times higher latency. Compared to the state-of-the-art diverse entity summarisation, it achieves 6 times lower latency and 3 times lower memory use, with throughput ranging from 833 to 1,005 events/second. IoTSAX, involving real-world heterogeneous data related to Healthcare and Smart Cities, achieves up to 0.87 reasoning F-score and 98% reduction of forwarded messages, and outperforms the original SAX in approximation error (2 to 3 times less) and in compression space-saving percentage when data redundancy occurs (from 71.75% to 94.99%), while maintaining similar or better latency and throughput.
The latency of IoTSAX is 2 to 3 times higher than that of traditional Pub/Sub, and its throughput ranges from 13.231 to 97.393 events/second. PoSSUM, involving real-world heterogeneous data, identifies a user desire for data diversity of up to 80% and achieves the best summary quality for more than half of the entities, as well as the best conceptual clustering F-score (from 0.69 to 0.83), compared to traditional Pub/Sub and the state-of-the-art diverse entity summarisation. It also achieves up to 0.95 redundancy-aware F-score and 99% message reduction compared to traditional Pub/Sub. Finally, it has lower clustering processing time, scoring time, and memory consumption, with comparable latency and throughput.
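For readers unfamiliar with the original SAX that IoTSAX is compared against, the following is a minimal sketch of the classic technique only (z-normalisation, Piecewise Aggregate Approximation, and mapping to symbols using equiprobable Gaussian breakpoints). It is not the enhanced variant proposed in the thesis, whose details are not given in this abstract. The function name `sax`, its parameters, and the sample readings are assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, segments, alphabet="abcd"):
    """Classic SAX sketch (hypothetical helper, not the thesis's IoTSAX)."""
    n = len(series)
    if not 1 <= segments <= n:
        raise ValueError("segments must be between 1 and len(series)")

    # 1. Z-normalise; a constant series has zero deviation, so map it to zeros.
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # 2. Piecewise Aggregate Approximation: mean of each (near) equal segment.
    paa = [
        mean(z[i * n // segments:(i + 1) * n // segments])
        for i in range(segments)
    ]

    # 3. Breakpoints that split the standard normal into equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # 4. Each segment mean becomes the symbol of the region it falls into.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


# Made-up readings: a low plateau followed by a high plateau.
print(sax([1, 1, 1, 1, 5, 5, 5, 5], segments=2, alphabet="ab"))
```

The resulting word is a compact, shape-preserving summary of the stream window; a summarisation system could forward the word in place of the raw readings, which is the kind of message and space reduction the figures above refer to.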
Publisher
NUI Galway
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland
CC BY-NC-ND 3.0 IE