Data Science Institute (Conference Papers)

Permanent URI for this collection

Browse

Recent Submissions

  • Publication
    Using information retrieval techniques to automatically repurpose existing dialogue datasets for safe chatbot development
    (ELRA Language Resource Association, 2024-05-21) Tunde, Oluwaseyi Ajayi; Negi, Gaurav; Arcan, Mihael; Buitelaar, Paul; Science Foundation Ireland; European Regional Development Fund
    There has been notable progress in the development of open-domain dialogue systems (chatbots) especially with the rapid advancement of the capabilities of Large Language Models. Chatbots excel at holding conversations in a manner that keeps a user interested and engaged. However, their responses can be unsafe, as they can respond in an offensive manner or offer harmful professional advice. As a way to mitigate this issue, recent work crowdsource datasets with exemplary responses or annotate dialogue safety datasets, which are relatively scarce compared to casual dialogues. Despite the quality of data obtained from crowdsourcing, it can be expensive and time consuming. This work proposes an effective pipeline, using information retrieval, to automatically repurpose existing dialogue datasets for safe chatbot development, as a way to address the aforementioned challenges. We select an existing dialogue dataset, revise its unsafe responses, as a way to obtain a dataset with safer responses to unsafe user inputs. We then fine-tune dialogue models on the original and revised datasets and generate responses to evaluate the safeness of the models
  • Publication
    Cross-lingual transfer and multilingual learning for detecting harmful behaviour in African under-resourced language dialogue
    (Association for Computational Linguistics, 2024-09-18) Ajayi, Tunde Oluwaseyi; Arcan, Mihael; Buitelaar, Paul; Science Foundation Ireland; European Regional Development Fund
    Most harmful dialogue detection models are de veloped for high-resourced languages. Consequently, users who speak under-resourced lan guages cannot fully benefit from these models in terms of usage, development, detection and mitigation of harmful dialogue utterances. Our work aims at detecting harmful utterances in under-resourced African languages. We lever age transfer learning using pretrained models trained with multilingual embeddings to de velop a cross-lingual model capable of detect ing harmful content across various African lan guages. We first fine-tune a harmful dialogue detection model on a selected African dialogue dataset. Additionally, we fine-tune a model on a combined dataset in some African lan guages to develop a multilingual harmful dia logue detection model. We then evaluate the cross-lingual model’s ability to generalise to an unseen African language by performing harm ful dialogue detection in an under-resourced language not present during pretraining or fine tuning. We evaluate our models on the test datasets. We show that our best performing models achieve impressive results in terms of F1 score. Finally, we discuss the results and limitations of our work.
  • Publication
    Enabling dataspaces using foundation models: Technical, legal and ethical considerations and future trends
    (IEEE, 2024-01-22) Timilsina, Mohan; Buosi, Samuele; Song, Ping; Yang, Yang; Haque, Rafiqul; Curry, Edward
    Foundation Models are pivotal in advancing arti ficial intelligence, driving notable progress across diverse areas. When merged with dataspace, these models enhance our capabil ity to develop algorithms that are powerful, predictive, and honor data sovereignty and quality. This paper highlights the potential benefits of a comprehensive repository of Foundation Models, contextualized within dataspace. Such an archive can streamline research, development, and education by offering a comparative analysis of various models and their applications. While serving as a consistent reference point for model assessment and fostering collaborative learning, the repository does face challenges like un biased evaluations, data privacy, and comprehensive information delivery. The paper also notes the importance of the repository being globally applicable, ethically constructed, and user-friendly. We delve into the nuances of integrating Foundation Models within dataspace, balancing the repository’s strengths against its limitations.
  • Publication
    Machine learning survival models for relapse prediction in a early stage lung cancer patient
    (IEEE, 2023-08-02) Timilsina, Mohan; Buosi, Samuele; Janik, Adrianna; Minervini, Pasquale; Costabello, Luca; Torrente, Maria; Provencio, Mariano; Calvo, Virginia; Camps, Carlos; Ortega, Ana L.; Garcia Campelo, M.Rosario; del Barco, Edel; Bosch-Barrera, Joaquim; Nováček, Vít; Massutí, Bartomeu
    Lung cancer is one of the leading health complica tions causing high mortality worldwide. The relapsing behavior of medically treated early-stage lung cancer makes this disease even more complicated. Thus predicting such relapse using a data-centric approach provides a complementary perspective for clinicians to understand the disease. In this preliminary work, we explored off-the-shelf survival models to predict the relapse of early-stage lung cancer patients. We analyzed the survival models on a cohort of 1348 early-stage non-small cell lung cancer (NSCLC) patients in different timestamps. Using the prediction explanation model SHAP (SHapley Additive exPlanations), we further explained the best-performing survival model’s predic tions. Our explainable predictive model is a potential tool for oncologists that address an unmet clinical need for post-treatment patient stratification based on the relapse hazard.
  • Publication
    Knowledge graphs, clinical trials, dataspace, and AI: Uniting for progressive healthcare innovation
    (IEEE, 2024-01-22) Timilsina, Mohan; Alsamhi, Saeed; Haque, Rafiqul; Judge, Conor; Curry, Edward
    Amidst prevailing healthcare challenges, a dynamic solution emerges, fusing knowledge graph technology, clinical tri als optimization, dataspace integration, and AI innovation. This unified approach tackles issues like limited patient insights, sub optimal trial designs, and imprecise treatments. By interlinking diverse data through knowledge graphs, this method illuminates disease trends, therapeutic efficacies, and patient prognoses. AI techniques, especially machine learning, contribute predictive power by unveiling hidden patterns for accurate diagnostics, prognostics, and personalized treatments. This multidisciplinary fusion transforms clinical trials, enhancing comprehensiveness and precision through real-world data analysis and subgroup identification. In reshaping healthcare, this proposition aims to accelerate treatment personalization, elevate therapeutic efficacy, and empower informed medical decisions, encompassing the essence of ’Advancing Healthcare through Innovation: Knowl edge Graphs, Clinical Trials, Dataspace, and AI’.
  • Publication
    An SRAM optimized approach for constant memory consumption and ultra-fast execution of ML classifiers on TinyML hardware
    (Institute of Electrical and Electronics Engineers, 2021-11-15) Sudharsan, Bharath; Yadav, Piyush; Breslin, John G.; Ali, Muhammad Intizar; Science Foundation Ireland; European Regional Development Fund
    With the introduction of ultra-low-power machine learning (TinyML), IoT devices are becoming smarter as they are driven by Machine Learning (ML) models. However, any increase in the training data results in a linear increase in the space complexity of the ML models. It is highly challenging to deploy such ML models on IoT devices with limited memory (TinyML hardware). To alleviate such memory issues, in this paper, we present an SRAM-optimized classifier porting, stitching, and efficient deployment approach. The proposed method enables large classifiers to be comfortably executed on microcontroller unit (MCU) based IoT devices and perform ultra-fast classifications while consuming 0 bytes of SRAM. We tested our SRAM optimized approach by utilizing it to port and execute 7 dataset-trained classifiers on 7 popular MCU boards, and report their inference time and memory (Flash and SRAM) consumption. It is apparent from the experimental results that; (i) the classifiers ported using our proposed approach are of varied sizes but have constant SRAM consumption. Thus, the approach enabled the deployment of larger ML classifier models even on tiny Atmega328P MCU-based Arduino Nano, which has only 8 kB SRAM; (ii) even the resource-constrained 8-bit MCUs performed faster unit inference (in less than a millisecond) than a NVIDIA Jetson Nano GPU and Raspberry Pi 4 CPU; (iii) the majority of models produced 1-4x times faster inference results in comparison with the models ported by the sklearn-porter, m2cgen, and emlearn libraries.
  • Publication
    CURED4NLG: A dataset for table-to-text generation
    (University of Galway, 2023) Pasricha, Nivranshu; Arcan, Mihael; Buitelaar, Paul; Science Foundation Ireland; European Regional Development Fund
    We introduce CURED4NLG, a dataset for the task of table-to-text generation focusing on the public health domain. The dataset consists of 280 pairs of tables and documents extracted from weekly epidemiological reports published by the World Health Organisation (WHO). The tables report the number of cases and deaths from COVID-19, while the documents describe global and regional updates in English text. Along with the releasing the dataset, we present outputs from three different baselines for the task of table-to-text generation. The first is based on a manually defined template and the other two on end-to-end transformer-based models. Our results suggest that end-to-end models can learn a templatelike structure of the reports to produce fluent sentences, but may contain many factual errors especially related to numerical values.
  • Publication
    Unsupervised deep language and dialect identification for short texts
    (International Committee on Computational Linguistics, 2020-12) Goswami, Koustava; Sarkar, Rajdeep; Chakravarthi, Bharathi Raja; Fransen, Theodorus; McCrae, John P.; Irish Research Council; Science Foundation Ireland
    Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects, is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model understands the sentence constructions of languages by applying attention to character relations which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods and our model has outperformed state-of-the-art LI and DI systems in supervised settings.
  • Publication
    Cross-lingual sentence embedding using multi-task learning
    (Association for Computational Linguistics, 2021-11-07) Goswami, Koustava; Dutta, Sourav; Assem, Haytham; Fransen, Theodorus; McCrae, John P.; Irish Research Council; Science Foundation Ireland
    Multilingual sentence embeddings capture rich semantic information not only for measuring similarity between texts but also for catering to a broad range of downstream cross-lingual NLP tasks. State-of-the-art multilingual sentence embedding models require large parallel corpora to learn efficiently, which confines the scope of these models. In this paper, we propose a novel sentence embedding framework based on an unsupervised loss function for generating effective multilingual sentence embeddings, eliminating the need for parallel corpora. We capture semantic similarity and relatedness between sentences using a multi-task loss function for training a dual encoder model mapping different languages onto the same vector space. We demonstrate the efficacy of an unsupervised as well as a weakly supervised variant of our framework on STS, BUCC and Tatoeba benchmark tasks. The proposed unsupervised sentence embedding framework outperforms even supervised state-of-the-art methods for certain under-resourced languages on the Tatoeba dataset and on a monolingual benchmark. Further, we show enhanced zero-shot learning capabilities for more than 30 languages, with the model being trained on only 13 languages. Our model can be extended to a wide range of languages from any language family, as it overcomes the requirement of parallel corpora for training.
  • Publication
    NUIG at TIAD 2021: Cross-lingual word embeddings for translation inference
    (2021-09-01) Ahmadi, Sina; Ojha, Atul Kr.; Banerjee, Shubhanker; McCrae, John P.; Horizon 2020
    Inducing new translation pairs across dictionaries is an important task that facilitates processing and maintaining lexicographical data. This paper describes our submissions to the Translation Inference Across Dictionaries (TIAD) shared task of 2021. Our systems mainly rely on the MUSE and VecMap cross-lingual word embedding mapping to create new translation pairs between English, French and Portuguese data. We also create two regression models based on the graph analysis features. Our systems perform above the baseline systems.
  • Publication
    Do city dashboards make sense? Conceptualising user experiences and challenges in using city dashboards. A case study
    (Association for Computing Machinery (ACM), 2021-06-09) Vornhagen, Heike; Zarrouk, Manel; Davis, Brian; Young, Karen; Science Foundation Ireland
    City dashboards present information about a city to a broad audience with some thought given as to how some of these audiences might understand the information. However, little research has looked at how ’citizens’ make sense of dashboards. Using two sample dashboards, we asked community activists from four different areas (Health, Environment, Transport and Agriculture) to explore the information displayed. Using grounded theory approaches, we looked at factors which support or hinder users sense-making. From further analysis of the data we identify four key challenges that need to be addressed to support users making sense of city dashboards: lack of support given for understanding the information and data presented, lack of possibilities for users to engage, lack of purpose, and a lack of governance. We recommend a series of design and development actions for city dashboards creators for each challenge area. The desire to give access to open data through dashboards requires a considerable investment of time and resources - an investment that is wasted if dashboards are not useful to their users (citizens) and, as a result, are not used.
  • Publication
    Understanding my city through dashboards. How hard can it be?
    (International Federation for Information Processing (IFIP) and eJournal of eDemocracy and Open Government (JeDEM), 2019-09-02) Vornhagen, Heike; Young, Karen; Zarrouk, Manel; Science Foundation Ireland; Horizon 2020
    This paper describes research into how current city dashboards support users' sensemaking processes. It uses criteria identified in previous research concerning visualisation and applies these to a number of city dashboards that are publicly available and hence considered to be seen as a potential communication tool. The paper briefly describes the context regarding dashboards and gives a broad overview of how visualisation design can affect sense-making processes. Finally it lists the initial results of a 'at a glance'-style review according to a number of sense-making criteria.
  • Publication
    Traffic prediction framework for OpenStreetMap using deep learning based complex event processing and open traffic cameras
    (Dagstuhl Research Online Publication Server (DROPS), 2020-09-25) Yadav, Piyush; Sarkar, Dipto; Salwala, Dhaval; Curry, Edward; Science Foundation Ireland
    Displaying near-real-time traffic information is a useful feature of digital navigation maps. However, most commercial providers rely on privacy-compromising measures such as deriving location information from cellphones to estimate traffic. The lack of an open-source traffic estimation method using open data platforms is a bottleneck for building sophisticated navigation services on top of OpenStreetMap (OSM). We propose a deep learning-based Complex Event Processing (CEP) method that relies on publicly available video camera streams for traffic estimation. The proposed framework performs near-real-time object detection and objects property extraction across camera clusters in parallel to derive multiple measures related to traffic with the results visualized on OpenStreetMap. The estimation of object properties (e.g. vehicle speed, count, direction) provides multidimensional data that can be leveraged to create metrics and visualization for congestion beyond commonly used density-based measures. Our approach couples both flow and count measures during interpolation by considering each vehicle as a sample point and their speed as weight. We demonstrate multidimensional traffic metrics (e.g. flow rate, congestion estimation) over OSM by processing 22 traffic cameras from London streets. The system achieves a near-real-time performance of 1.42 seconds median latency and an average F-score of 0.80.
  • Publication
    RCE-NN: a five-stage pipeline to execute neural networks (CNNs) on resource-constrained IoT edge devices
    (Association for Computing Machinery (ACM), 2020-10-06) Sudharsan, Bharath; Breslin, John G.; Ali, Muhammad Intizar; Science Foundation Ireland; European Regional Development Fund
    Microcontroller Units (MCUs) in edge devices are resource constrained due to their limited memory footprint, fewer computation cores, and low clock speeds. These limitations constrain one from deploying and executing machine learning models on MCUs. To fit, deploy and execute Convolutional Neural Networks (CNNs) for any IoT use-case on small MCUs, a complete design flow is required. Resource Constrained Edge - Neural Networks (RCE-NN) is the name given to our proposed design flow, with a five-stage pipeline that developers can follow for executing CNNs on MCUs. In this pipeline, the initial model architecture and training stage consists of four well-defined tasks on model size, workload, operations and quantization awareness, which maps the desired CNN as captured in an executable specification to a resource-constrained MCU's specification. The next quantization and conversion stage reduces model size, saves memory, and simplifies calculations without much impact on the accuracy. In the third stage, the quantized version of the model is translated into a c-byte array since the MCUs lack native file-system support. The translated c-byte array is fused with the main program of an IoT use-case and binaries are built using techniques from the fourth stage. Finally, the method presented in the last deployment stage is used to flash the built binaries onto MCUs, as this method allows the memory of the MCU to be fully utilized by the CNN and its operations. We evaluated RCE-NN using eight popular MCU boards. The results show that, when users realize all five pipeline stages, they can fit, deploy and execute multiple CNNs across multiple open-source MCU boards. The RCE-NN pipeline components quantize and compress the CNNs to 1/10th of their original size, enabling the CNNs to fit on MCUs with no or minimal loss in performance, both after quantization and compression, and also during runtime.
  • Publication
    Edge2Train: A framework to train machine learning models (SVMs) on resource-constrained IoT edge devices
    (Association for Computing Machinery (ACM), 2020-10-06) Sudharsan, Bharath; Breslin, John G.; Ali, Muhammad Intizar; Science Foundation Ireland; European Regional Development Fund
    In recent years, ML (Machine Learning) models that have been trained in data centers can often be deployed for use on edge devices. When the model deployed on these devices encounters unseen data patterns, it will either not know how to react to that specific scenario or result in a degradation of accuracy. To tackle this, in current scenarios, most edge devices log such unseen data in the cloud via the internet. Using this logged data, the initial ML model is then re-trained/upgraded in the data center and then sent to the edge device as an OTA (Over The Air) update. When applying such an online approach, the cost of edge devices increases due to the addition of wireless modules (4G or WiFi) and it also increases the cyber-security risks. Additionally, it also requires maintaining a continuous connection between edge devices and the cloud infrastructure leading to the requirement of high network bandwidth and traffic. Finally, such online devices are not self-contained ubiquitous systems. In this work, we provide Edge2Train, a framework which enables resource-scarce edge devices to re-train ML models locally and offline. Thus, edge devices can continuously improve themselves for better analytics results by managing to understand continuously evolving real-world data on the fly. In this work, we provide algorithms for Edge2Train along with its C++ implementations. Using these functions, on-board, offline SVM training, inference, and evaluation has been performed on five popular MCU boards. The results show that our Edge2Train-trained SVMs produce classification accuracy close to that of SVMs trained on high resource setups. It also performs unit inference for values with 64-dimensional features 3.5x times faster than CPUs, while consuming only 1/350th of the energy that CPUs consume.
  • Publication
    Enabling machine learning on the edge using SRAM conserving efficient neural networks execution approach
    (National University of Ireland Galway, 2021-09-13) Sudharsan, Bharath; Patel, Pankesh; Breslin, John G.; Ali, Muhammad Intizar; Science Foundation Ireland; European Regional Development Fund
    Edge analytics refers to the application of data analytics and Machine Learning (ML) algorithms on IoT devices. The concept of edge analytics is gaining popularity due to its ability to perform AI-based analytics at the device level, enabling autonomous decision-making, without depending on the cloud. However, the majority of Internet of Things (IoT) devices are embedded systems with a low-cost microcontroller unit (MCU) or a small CPU as its brain, which often are incapable of handling complex ML algorithms. In this paper, we propose an approach for the efficient execution of already deeply compressed, large neural networks (NNs) on tiny IoT devices. After optimizing NNs using state-of-the-art deep model compression methods, when the resultant models are executed by MCUs or small CPUs using the model execution sequence produced by our approach, higher levels of conserved SRAM can be achieved. During the evaluation for nine popular models, when comparing the default NN execution sequence with the sequence produced by our approach, we found that 1.61-38.06% less SRAM was used to produce inference results, the inference time was reduced by 0.28-4.9 ms, and energy consumption was reduced by 4-84 mJ. Despite achieving such high conserved levels of SRAM, our meth
  • Publication
    Ultra-fast machine learning classifier execution on IoT devices without SRAM consumption
    (Institute of Electrical and Electronics Engineers (IEEE), 2021-05-25) Sudharsan, Bharath; Patel, Pankesh; Breslin, John G.; Ali, Muhammad Intizar; Science Foundation Ireland; European Regional Development Fund
    With the introduction of edge analytics, IoT devices are becoming smart and ready for AI applications. A few modern ML frameworks are focusing on the generation of small-size ML models (often in kBs) that can directly be flashed and executed on tiny IoT devices, particularly the embedded systems. Edge analytics eliminates expensive device-to-cloud communications, thereby producing intelligent devices that can perform energy-efficient real-time offline analytics. Any increase in the training data results in a linear increase in the size and space complexity of the trained ML models, making them unable to be deployed on IoT devices with limited memory. To alleviate the memory issue, a few studies have focused on optimizing and fine-tuning existing ML algorithms to reduce their complexity and size. However, such optimization is usually dependent on the nature of IoT data being trained. In this paper, we presented an approach that protects model quality without requiring any alteration to the existing ML algorithms. We propose SRAM-optimized implementation and efficient deployment of widely used standard/stable ML-frameworks classifier versions (e.g., from Python scikit-learn). Our initial evaluation results have demonstrated that ours is the most resource-friendly approach, having a very limited memory footprint while executing large and complex ML models on MCU-based IoT devices, and can perform ultra-fast classifications while consuming 0 bytes of SRAM. When we tested our approach by executing it on a variety of MCU-based devices, the majority of models ported and executed produced 1-4x times faster inference results in comparison with the models ported by the sklearn-porter, m2cgen, and emlearn libraries.
  • Publication
    TinyML benchmark: Executing fully connected neural networks on commodity microcontrollers
    (National University of Ireland Galway, 2021-06-20) Sudharsan, Bharath; Salerno, Simone; Nguyen, Duc-Duy; Yahya, Muhammad; Wahid, Abdul; Yadav, Piyush; Breslin, John G.; Science Foundation Ireland; European Regional Development Fund
    Recent advancements in the field of ultra-low-power machine learning (TinyML) promises to unlock an entirely new class of edge applications. However, continued progress is restrained by the lack of benchmarking Machine Learning (ML) models on TinyML hardware, which is fundamental to this field reaching maturity. In this paper, we designed 3 types of fully connected Neural Networks (NNs), trained each NN using 10 datasets (produces 30 NNs), and present the benchmark by reporting the onboard model performance on 7 popular MCUboards (similar boards are used to design TinyML hardware). We open-sourced and made the complete benchmark results freely available online 1 to enable the TinyML community researchers and developers to systematically compare, evaluate, and improve various asp
  • Publication
    Uncovering semantic bias in neural network models using a knowledge graph
    (ACM, 2020-10-19) Nikolov, Andriy; d'Aquin, Mathieu
    While neural networks models have shown impressive performance in many NLP tasks, lack of interpretability is often seen as a disadvantage. Individual relevance scores assigned by post-hoc explanation methods are not sufficient to show deeper systematic preferences and potential biases of the model that apply consistently across examples. In this paper we apply rule mining using knowledge graphs in combination with neural network explanation methods to uncover such systematic preferences of trained neural models and capture them in the form of conjunctive rules. We test our approach in the context of text classification tasks and show that such rules are able to explain a substantial part of the model behaviour as well as indicate potential causes of misclassifications when the model is applied outside of the initial training context.
  • Publication
    Smart speaker design and implementation with biometric authentication and advanced voice interaction capability
    (CEUR-WS.org, 2019-12-05) Sudharsan, Bharath; Corcoran, Peter; Ali, Muhammad Intizar; Science Foundation Ireland; European Regional Development Fund
    Advancements in semiconductor technology have reduced dimensions and cost while improving the performance and capacity of chipsets. In addition, advancement in the AI frameworks and libraries brings possibilities to accommodate more AI at the resource-constrained edge of consumer IoT devices. Sensors are nowadays an integral part of our environment which provide continuous data streams to build intelligent applications. An example could be a smart home scenario with multiple interconnected devices. In such smart environments, for convenience and quick access to web-based service and personal information such as calendars, notes, emails, reminders, banking, etc, users link third-party skills or skills from the Amazon store to their smart speakers. Also, in current smart home scenarios, several smart home products such as smart security cameras, video doorbells, smart plugs, smart carbon monoxide monitors, and smart door locks, etc. are interlinked to a modern smart speaker via means of custom skill addition. Since smart speakers are linked to such services and devices via the smart speaker user’s account. They can be used by anyone with physical access to the smart speaker via voice commands. If done so, the data privacy, home security and other aspects of the user get compromised. Recently launched, Tensor Cam’s AI Camera, Toshiba’s Symbio, Facebook’s Portal are camera-enabled smart speakers with AI functionalities. Although they are camera-enabled, yet they do not have an authentication scheme in addition to calling out the wake-word. This paper provides an overview of cybersecurity risks faced by smart speaker users due to lack of authentication scheme and discusses the development of a state-of-the-art camera-enabled, microphone arraybased modern Alexa smart speaker prototype to address these risks. Keywords: Alexa Voice Service, Snowboy hotword detection, Smart speaker design, Microphone array, ReSpeaker, Voice algorithms, Open CV, Smart speaker authentication.