Data Science Institute (Scholarly Articles)

Recent Submissions

  • Publication
    Intent classification by the use of automatically generated knowledge graphs
    (MDPI, 2023-05-12) Arcan, Mihael; Manjunath, Sampritha; Robin, Cécile; Verma, Ghanshyam; Pillai, Devishree; Sarkar, Simon; Dutta, Sourav; Assem, Haytham; McCrae, John P.; Buitelaar, Paul; Science Foundation Ireland
    Intent classification is an essential task for goal-oriented dialogue systems, automatically identifying customers' goals. Although intent classification performs well in general settings, domain-specific user goals can still present a challenge for this task. To address this challenge, we automatically generate knowledge graphs for targeted data sets to capture domain-specific knowledge and leverage embeddings trained on these knowledge graphs for the intent classification task. As existing knowledge graphs might not be suitable for a targeted domain of interest, our automatic generation of knowledge graphs can extract the semantic information of any domain, which can be incorporated within the classification process. We compare our results with state-of-the-art pre-trained sentence embeddings, and our evaluation on three data sets shows improvement in the intent classification task in terms of precision.
  • Publication
    Towards an integrative approach for making sense distinctions
    (Frontiers Media, 2022-02-07) McCrae, John P.; Fransen, Theodorus; Ahmadi, Sina; Buitelaar, Paul; Goswami, Koustava; Simon De Deyne; Enterprise Ireland; Irish Research Council; Horizon 2020; Science Foundation Ireland
    Word senses are the fundamental unit of description in lexicography, yet it is rarely the case that different dictionaries reach any agreement on the number and definition of senses in a language. With the recent rise in natural language processing and other computational approaches there is an increasing demand for quantitatively validated sense catalogues of words, yet no consensus methodology exists. In this paper, we look at four main approaches to making sense distinctions: formal, cognitive, distributional, and intercultural, and examine the strengths and weaknesses of each approach. We then consider how these may be combined into a single sound methodology. We illustrate this by examining two English words, "wing" and "fish," using existing resources for each of these four approaches and illustrate the weaknesses of each. We then look at the impact of such an integrated method and provide some future perspectives on the research that is necessary to reach a principled method for making sense distinctions.
  • Publication
    Knowledge graph driven approach to represent video streams for spatiotemporal event pattern matching in complex event processing
    (World Scientific Publishing, 2020) Yadav, Piyush; Salwala, Dhaval; Das, Dibya Prakash; Curry, Edward; Science Foundation Ireland
    Complex Event Processing (CEP) is an event processing paradigm for performing real-time analytics over streaming data and matching high-level event patterns. Presently, CEP is limited to processing structured data streams. Video streams are complicated due to their unstructured data model, which limits the ability of CEP systems to perform matching over them. This work introduces a graph-based structure for continuously evolving video streams, which enables a CEP system to query complex video event patterns. We propose the Video Event Knowledge Graph (VEKG), a graph-driven representation of video data. VEKG models video objects as nodes and their relationship interactions as edges over time and space. It creates a semantic knowledge representation of video data derived from the detection of high-level semantic concepts in the video using an ensemble of deep learning models. A CEP-based state optimization, VEKG-Time Aggregated Graph (VEKG-TAG), is proposed over the VEKG representation for faster event detection. VEKG-TAG is a spatiotemporal graph aggregation method that provides a summarized view of the VEKG graph over a given time length. We defined a set of nine event pattern rules for two domains (Activity Recognition and Traffic Management), which act as queries applied over VEKG graphs to discover complex event patterns. To show the efficacy of our approach, we performed extensive experiments over 801 video clips across 10 datasets. The proposed VEKG approach was compared with other state-of-the-art methods and was able to detect complex event patterns over videos with F-scores ranging from 0.44 to 0.90. In the given experiments, the optimized VEKG-TAG was able to reduce VEKG nodes and edges by 99% and 93%, respectively, with 5.19X faster search time, achieving sub-second median latency of 4-20 ms.
  • Publication
    Query-driven video event processing for the internet of multimedia things
    (VLDB Endowment, 2021-08) Yadav, Piyush; Salwala, Dhaval; Arruda Pontes, Felipe; Dhingra, Praneet; Curry, Edward; Proceedings of the VLDB Endowment; Science Foundation Ireland
    Advances in Deep Neural Network (DNN) techniques have revolutionized video analytics and unlocked the potential for querying and mining video event patterns. This paper details GNOSIS, an event processing platform that performs near-real-time video event detection in a distributed setting. GNOSIS follows a serverless approach where its components act as independent microservices and can be deployed at multiple nodes. GNOSIS uses a declarative query-driven approach where users can write customized queries for spatiotemporal video event reasoning. The system converts the incoming video streams into a continuously evolving graph stream using a pipeline of machine learning (ML) and DNN models and applies graph matching for video event pattern detection. GNOSIS can perform both stateful and stateless video event matching. To improve Quality of Service (QoS), recent work on GNOSIS incorporates optimization techniques such as adaptive scheduling, energy efficiency, and content-driven windows. This paper demonstrates Occupational Health and Safety query use cases to show the efficacy of GNOSIS.
  • Publication
    VID-WIN: Fast video event matching with query-aware windowing at the edge for the internet of multimedia things
    (Institute of Electrical and Electronics Engineers (IEEE), 2021-04-23) Yadav, Piyush; Salwala, Dhaval; Curry, Edward
    Efficient video processing is a critical component in many IoMT applications to detect events of interest. Presently, many window optimization techniques have been proposed in event processing with the underlying assumption that the incoming stream has a structured data model. Videos are highly complex due to the lack of any underlying structured data model. Video stream sources, such as CCTV cameras and smartphones, are resource-constrained edge nodes. At the same time, video content extraction is expensive and requires computationally intensive deep neural network (DNN) models that are primarily deployed at high-end (or cloud) nodes. This article presents VID-WIN, an adaptive 2-stage allied windowing approach to accelerate video event analytics in an edge-cloud paradigm. VID-WIN runs in parallel across edge and cloud nodes and performs query- and resource-aware optimization for state-based complex event matching. VID-WIN exploits the video content and DNN input knobs to accelerate the video inference process across nodes. This article proposes novel content-driven microbatch resizing, query-aware caching, and microbatch-based utility filtering strategies for video frames under resource-constrained edge nodes to improve the overall system throughput, latency, and network usage. Extensive evaluations are performed over five real-world data sets. The experimental results show that VID-WIN video event matching achieves ∼2.3× higher throughput with minimal latency and ~99% bandwidth reduction compared to other baselines while maintaining query-level accuracy and resource bounds.
  • Publication
    Toward distributed, global, deep learning using IoT devices
    (Institute of Electrical and Electronics Engineers (IEEE), 2021-07-20) Sudharsan, Bharath; Patel, Pankesh; Breslin, John; Ali, Muhammad Intizar; Mitra, Karan; Dustdar, Schahram; Rana, Omer; Jayaraman, Prem Prakash; Ranjan, Rajiv; Horizon 2020; Science Foundation Ireland; European Regional Development Fund
    Deep learning (DL) using large-scale, high-quality IoT datasets can be computationally expensive. Utilizing such datasets to produce a problem-solving model within a reasonable time frame requires a scalable distributed training platform/system. We present a novel approach that trains one DL model on the hardware of thousands of mid-sized IoT devices across the world, rather than on a GPU cluster available within a data center. We analyze the scalability and model convergence of the subsequently generated model and identify three bottlenecks: high computational cost, time-consuming dataset-loading I/O, and slow exchange of model gradients. To highlight research challenges for globally distributed DL training and classification, we consider a case study from the video data processing domain. A need for a two-step deep compression method, which increases the speed and scalability of the DL training process, is also outlined. Our initial experimental validation shows that the proposed method is able to improve the tolerance of the distributed training process to varying internet bandwidth, latency, and Quality of Service metrics.
  • Publication
    Synergy between embedding and protein functional association networks for drug label prediction using harmonic function
    (ACM and IEEE, 2020-10-16) Timilsina, Mohan; Mc Kernan, Declan Patrick; Yang, Haixuan; d’Aquin, Mathieu; Science Foundation Ireland
    Semi-Supervised Learning (SSL) is an approach to machine learning that makes use of unlabeled data for training together with a small amount of labeled data. In the context of molecular biology and pharmacology, one can take advantage of unlabeled data: for instance, to identify drugs and targets where a few genes are known to be associated with a specific target for drugs and are treated as labeled data. Labeling the genes requires laboratory verification and validation, a process that is usually very time consuming and expensive. Thus, it is useful to estimate the functional role of drugs from unlabeled data using computational methods. To develop such a model, we used openly available data resources to create two bipartite graphs: (i) drugs and genes, and (ii) genes and diseases. We constructed a genetic embedding graph from the two bipartite graphs using Tensor Factorization methods and integrated it with publicly available genetic interaction graphs. Our results show the usefulness of the integration by effectively predicting drug labels.
  • Publication
    Biological applications of knowledge graph embedding models
    (Oxford University Press (OUP), 2020-02-17) Mohamed, Sameh K.; Nounu, Aayah; Nováček, Vít; Horizon 2020; Science Foundation Ireland
    Complex biological systems are traditionally modelled as graphs of interconnected biological entities. These graphs, i.e. biological knowledge graphs, are then processed using graph exploratory approaches to perform different types of analytical and predictive tasks. Despite the high predictive accuracy of these approaches, they have limited scalability due to their dependency on time-consuming path exploratory procedures. In recent years, owing to the rapid advances of computational technologies, new approaches for modelling graphs and mining them with high accuracy and scalability have emerged. These approaches, i.e. knowledge graph embedding (KGE) models, operate by learning low-rank vector representations of graph nodes and edges that preserve the graph's inherent structure. These approaches were used to analyse knowledge graphs from different domains where they showed superior performance and accuracy compared to previous graph exploratory approaches. In this work, we study this class of models in the context of biological knowledge graphs and their different applications. We then show how KGE models can be a natural fit for representing complex biological knowledge modelled as graphs. We also discuss their predictive and analytical capabilities in different biology applications. In this regard, we present two example case studies that demonstrate the capabilities of KGE models: prediction of drug-target interactions and polypharmacy side effects. Finally, we analyse different practical considerations for KGEs, and we discuss possible opportunities and challenges related to adopting them for modelling biological systems.
  • Publication
    A decade of Semantic Web research through the lenses of a mixed methods approach
    (IOS Press, 2019-06-20) Kirrane, Sabrina; Sabou, Marta; Fernandez, Javier D.; Osborne, Francesco; Robin, Cécile; Buitelaar, Paul; Motta, Enrico; Polleres, Axel
    The identification of research topics and trends is an important scientometric activity, as it can help guide the direction of future research. In the Semantic Web area, topic and trend detection was initially performed primarily through qualitative, top-down style approaches that rely on expert knowledge. More recently, data-driven, bottom-up approaches have been proposed that offer a quantitative analysis of the evolution of a research domain. In this paper, we aim to provide a broader and more complete picture of Semantic Web topics and trends by adopting a mixed methods methodology, which allows for the combined use of both qualitative and quantitative approaches. Concretely, we build on a qualitative analysis of the main seminal papers, which adopts a top-down approach, and on quantitative results derived with three bottom-up data-driven approaches (Rexplore, Saffron, PoolParty) on a corpus of Semantic Web papers published between 2006 and 2015. In this process, we use the latter both for "fact-checking" on the former and to derive key findings in relation to the strengths and weaknesses of top-down and bottom-up approaches to research topic identification. Although we provide a detailed study on the past decade of Semantic Web research, the findings and the methodology are relevant not only for our community but also beyond the area of the Semantic Web to other research fields.
  • Publication
    Discovering protein drug targets using knowledge graph embeddings
    (Oxford University Press, 2019-08-01) Mohamed, Sameh K.; Nováček, Vít; Nounu, Aayah; Science Foundation Ireland; European Regional Development Fund
    Motivation: Computational approaches for predicting drug-target interactions (DTIs) can provide valuable insights into the drug mechanism of action. DTI predictions can help to quickly identify new promising (on-target) or unintended (off-target) effects of drugs. However, existing models face several challenges: many can only process a limited number of drugs and/or have poor proteome coverage, and current approaches also often suffer from high false positive prediction rates. Results: We propose a novel computational approach for predicting drug target proteins. The approach is based on formulating the problem as link prediction in knowledge graphs (robust, machine-readable representations of networked knowledge). We use biomedical knowledge bases to create a knowledge graph of entities connected to both drugs and their potential targets. We propose a specific knowledge graph embedding model, TriModel, to learn vector representations (i.e. embeddings) for all drugs and targets in the created knowledge graph. These representations are consequently used to infer candidate drug-target interactions based on their scores computed by the trained TriModel model. We have experimentally evaluated our method using computer simulations and compared it to five existing models. This has shown that our approach outperforms all previous ones in terms of both area under the ROC and precision-recall curves in standard benchmark tests. Availability: The data, predictions, and models are available at: drugtargets.insight-centre.org
  • Publication
    One size does not fit all: querying web polystores
    (IEEE, 2019-01-17) Khan, Yasar; Zimmermann, Antoine; Jha, Alokkumar; Gadepally, Vijay; d'Aquin, Mathieu; Sahay, Ratnesh
    Data retrieval systems are facing a paradigm shift due to the proliferation of specialized data storage engines (SQL, NoSQL, column stores, MapReduce, data stream, and graph) supported by varied data models (CSV, JSON, RDB, RDF, and XML). One immediate consequence of this paradigm shift is a data bottleneck over the web: web applications are unable to retrieve data at the intensity at which data are being generated from different facilities. Especially in the genomics and healthcare verticals, data are growing from petascale to exascale, and biomedical stakeholders expect seamless retrieval of these data over the web. In this paper, we argue that the bottleneck over the web can be reduced by minimizing the costly data conversion process and delegating query performance and processing loads to the specialized data storage engines over their native data models. We propose a web-based query federation mechanism, called PolyWeb, that unifies query answering over multiple native data models (CSV, RDB, and RDF). We emphasize two main challenges of query federation over native data models: 1) devising a method to select prospective data sources, with different underlying data models, that can satisfy a given query, and 2) query optimization, join, and execution over different data models. We demonstrate PolyWeb on a cancer genomics use case, where it is often the case that a description of biological and chemical entities (e.g., genes, diseases, drugs, and pathways) spans multiple data models and respective storage engines. In order to assess the benefits and limitations of evaluating queries over native data models, we compare PolyWeb with state-of-the-art query federation engines in terms of result completeness, source selection, and overall query execution time.
  • Publication
    LargeRDFBench: A billion triples benchmark for SPARQL endpoint federation
    (Elsevier, 2018-01-12) Saleem, Muhammad; Hasnain, Ali; Ngonga Ngomo, Axel-Cyrille
    Gathering information from the distributed Web of Data is commonly carried out by using SPARQL query federation approaches. However, the fitness of current SPARQL query federation approaches for real applications is difficult to evaluate with current benchmarks, as they are either synthetic, too small in size and complexity, or do not provide means for a fine-grained evaluation. We propose LargeRDFBench, a billion-triple benchmark for SPARQL query federation which encompasses real data as well as real queries pertaining to real biomedical use cases. We evaluate state-of-the-art SPARQL endpoint federation approaches on this benchmark with respect to their query runtime, triple pattern-wise source selection, number of endpoint requests, and result completeness and correctness. Our evaluation results suggest that the performance of current SPARQL query federation systems on simple queries (in terms of total triple patterns, query result set sizes, execution time, use of SPARQL features, etc.) does not reflect the systems' performance on more complex queries. Moreover, current federation systems seem unable to deal with real queries that involve processing large intermediate result sets or lead to large result sets.
  • Publication
    A random walk model for entity relatedness
    (Springer Verlag, 2018-10-31) Torres-Tramón, Pablo; Hayes, Conor; Science Foundation Ireland; European Regional Development Fund
    Semantic relatedness is a critical measure for a wide variety of applications nowadays. Numerous models, including path-based ones, have been proposed for this task with great success in many applications during the last few years. Among these applications, many require computing semantic relatedness between hundreds of pairs of items as part of their regular input. This scenario demands a computationally efficient model to process hundreds of queries in short time spans. Unfortunately, path-based models are computationally challenging, creating large bottlenecks when facing these circumstances. Current approaches for reducing this computation have focused on limiting the number of paths to consider between entities.
  • Publication
    MixedEmotions: An open-source toolbox for multi-modal emotion analysis
    (IEEE, 2018-01-25) Buitelaar, Paul; Wood, Ian D.; Negi, Sapna; Arcan, Mihael; McCrae, John P.; Abele, Andrejs; Robin, Cécile; Andryushechkin, Vladimir; Ziad, Housam; Sagha, Hesam; Schmitt, Maximilian; Schuller, Björn W.; Sánchez-Rada, J. Fernando; Iglesias, Carlos A.; Navarro, Carlos; Giefer, Andreas; Heise, Nicolaus; Masucci, Vincenzo; Danza, Francesco A.; Caterino, Ciro; Smrž, Pavel; Hradiš, Michal; Povolný, Filip; Klimeš, Marek; Matějka, Pavel; Tummarello, Giovanni
    Recently, there has been an increasing tendency to embed the functionality of recognizing emotions from user-generated content, to infer richer profiles about users or content that can be used for various automated systems such as call-center operations, recommendations, and assistive technologies. However, to date, adding this functionality has been a tedious, costly, and time-consuming effort, requiring one to look for different tools that suit one's needs and to provide different interfaces to use those tools. The MixedEmotions toolbox addresses the need for such functionalities by providing tools for text, audio, video, and linked data processing within an easily integrable plug-and-play platform. These functionalities include: (i) for text processing: emotion and sentiment recognition; (ii) for audio processing: emotion, age, and gender recognition; (iii) for video processing: face detection and tracking, emotion recognition, facial landmark localization, head pose estimation, face alignment, and body pose estimation; and (iv) for linked data: knowledge graph processing. Moreover, the MixedEmotions Toolbox is open-source and free. In this article, we present this toolbox in the context of the existing landscape and provide a range of detailed benchmarks on standardized test-beds showing its state-of-the-art performance. Furthermore, three real-world use cases show its effectiveness, namely emotion-driven smart TV, call center monitoring, and brand reputation analysis.
  • Publication
    The colloquial WordNet: Extending Princeton WordNet with neologisms
    (Springer International Publishing, 2017-05-27) McCrae, John P.; Wood, Ian D.; Hicks, Amanda; Science Foundation Ireland; National Institutes of Health
    Princeton WordNet is one of the most important resources for natural language processing, but has not been updated for over ten years and is not suitable for analyzing the fast moving language as used on social media. We propose an extension to WordNet, with new terms that have been found from Twitter and Reddit, and cover language usage that is emergent or vulgar. In addition to our methodology for extraction, we analyze new terms to provide information about how new words are entering the English language. Finally, we discuss publishing this resource both as linguistic linked open data and as part of the Global WordNet Association’s Interlingual Index.
  • Publication
    Privacy, security and policies: A review of problems and solutions with semantic web technologies
    (IOS Press, 2018) Kirrane, Sabrina; Villata, Serena; d’Aquin, Mathieu
    Semantic Web technologies aim to simplify the distribution, sharing and exploitation of information and knowledge, across multiple distributed actors on the Web. As with all technologies that manipulate information, there are privacy and security implications, and data policies (e.g., licenses and regulations) that may apply to both data and software artifacts. Additionally, Semantic Web technologies could contribute to the more intelligent and flexible handling of privacy, security and policy issues, through supporting information integration and sense-making. In order to better understand the scope of existing work on this topic we examine 78 articles from dedicated venues, including this special issue, the PrivOn workshop series, two SPOT workshops, as well as the broader literature that connects the Semantic Web research domain with issues relating to privacy, security and/or policies. Specifically, we classify each paper according to three taxonomies (one for each of the aforementioned areas), in order to identify common trends and research gaps. We conclude by summarising the strong focus on relevant topics in Semantic Web research (e.g. information collection, information processing, policies and access control), and by highlighting the need to further explore the identified research gaps.
  • Publication
    Facilitating scientometrics in learning analytics and educational data mining - The LAK dataset
    (IOS Press, 2016-11-06) Dietze, Stefan; Taibi, Davide; d’Aquin, Mathieu
    The Learning Analytics and Knowledge (LAK) Dataset represents an unprecedented corpus which exposes a near-complete collection of bibliographic resources for a specific research discipline, namely the connected areas of Learning Analytics and Educational Data Mining. Covering over five years of scientific literature from the most relevant conferences and journals, the dataset provides Linked Data about bibliographic metadata as well as the full text of the paper body. The latter was enabled through special licensing agreements with ACM for publications not yet available through open access. The dataset has been designed following established Linked Data patterns, reusing established vocabularies and providing links to established schemas and entity coreferences in related datasets. Given the temporal and topic coverage of the dataset, being a near-complete corpus of research publications of a particular discipline, it facilitates scientometric investigations, for instance about the evolution of a scientific field over time or correlations with other disciplines, as documented through its usage in a wide range of scientific studies and applications.
  • Publication
    Abstract A27: A linked data approach to discover HPV oncoproteins and RB1 induced mutation associations for the retinoblastoma research
    (American Association for Cancer Research, 2017-01) Jha, Alokkumar; Khan, Yasar; Rebholz-Schuhmann, Dietrich; Sahay, Ratnesh
    Background: Loss or gain in the tumor suppressor gene RB1 plays a significant role, as in the case of loss with low penetrance, where only 39% of eyes at risk develop retinoblastoma. This research covers multiple mutation types and their effects, and the identification of the major type of mutation involved in retinoblastoma due to HPV and RB1.
  • Publication
    A linked data visualiser for finite element biosimulations
    (World Scientific Publishing, 2016) Mehdi, Muntazir; Khan, Yasar; Jares, Joao; Freitas, Andre; Jha, Alok Kumar; Sakellarios, Antonis; Sahay, Ratnesh; Science Foundation Ireland; FP7 Information and Communication Technologies
    Biosimulation models are used to understand the multiple or different causative factors that cause impairment in human organs. The Finite Element Method (FEM) provides a mathematical framework to simulate dynamic biological systems, with applications ranging from human ear and cardiovascular to neurovascular research. Finite Element (FE) biosimulation experiments produce huge amounts of numerical data, and visualising and analysing this numerical biosimulation data is a strenuous task. In this paper, we present a Linked Data visualiser, called the SIFEM Visualiser, to help domain experts (experts in the field of ear mechanics) and clinical practitioners (otorhinolaryngologists) visualise, analyse, and compare biosimulation results from heterogeneous, complex, and high-volume numerical data. The SIFEM Visualiser builds on conceptualising different aspects of biosimulations. In addition to the visualiser, we also propose how biosimulation numerical data can be conceptualised such that it sustains the visualisation of large numerical data. The SIFEM Visualiser aims to help domain scientists and clinical practitioners explore and analyse Finite Element (FE) numerical data and simulation results obtained from different aspects of the inner ear (cochlear) model, such as biological, geometrical, mathematical, and physical models. We validate the SIFEM Visualiser through both qualitative and quantitative evaluation.
  • Publication
    Towards precision medicine: discovering novel gynecological cancer biomarkers and pathways using linked data
    (BioMed Central, 2017-09-19) Jha, Alokkumar; Khan, Yasar; Mehdi, Muntazir; Karim, Md Rezaul; Mehmood, Qaiser; Zappa, Achille; Rebholz-Schuhmann, Dietrich; Sahay, Ratnesh; Science Foundation Ireland
    Next Generation Sequencing (NGS) is playing a key role in therapeutic decision making for cancer prognosis and treatment. NGS technologies are producing massive amounts of sequencing data, often published by isolated and disparate sequencing facilities. Consequently, the process of sharing and aggregating multisite sequencing datasets is thwarted by issues such as the need to discover relevant data from different sources, to build scalable repositories, the automation of data linkage, the volume of the data, efficient querying mechanisms, and information-rich, intuitive visualisation.