On-device learning, optimization, efficient deployment and execution of machine learning algorithms on resource-constrained IoT hardware

Sudharsan, Bharath
Edge analytics refers to the application of data analytics and Machine Learning (ML) algorithms on IoT devices. The concept of edge analytics is gaining popularity due to its ability to perform AI-based analytics at the device level, enabling autonomous decision-making without depending on the cloud. However, the majority of Internet of Things (IoT) devices are embedded systems with a low-cost microcontroller unit (MCU) or a small CPU as their brain, which is often incapable of handling complex ML algorithms. This thesis aims to improve the intelligence of such resource-constrained IoT devices by providing novel algorithms, frameworks, and strategies to: create self-learning ML-based IoT devices; efficiently deploy and execute a range of Neural Networks (NNs) and also non-NN ML algorithms on IoT devices; and enable communication-efficient distributed ML using IoT devices. The memory footprint (SRAM, Flash, and EEPROM) of MCU-based devices is often very limited, restricting onboard ML model training on large trainsets with high feature dimensions. To cope with these memory constraints, current edge analytics approaches train high-quality ML models on cloud GPUs (using large volumes of historical data), then deploy deeply optimized versions of the resultant models on edge devices for inference. Such approaches are inefficient in concept drift situations, where the data generated at the device level varies frequently and trained models cannot adapt when previously unseen data arrives. The First Contribution of this thesis aims to solve this challenge. We provide the Train++ algorithm and the ML-MCU framework, which train ML models locally at the device level (on MCUs and small CPUs) using all n samples of high-dimensional data.
Train++ and ML-MCU transform even the most resource-constrained MCU-based IoT edge devices into intelligent devices that can locally build their own knowledge base on the fly using live data, thus creating smart, self-learning, and autonomous problem-solving devices. As part of the first contribution, to perform online learning (OL) in non-ideal real-world settings, we designed Imbal-OL, an OL plugin that inspects the supplied data stream and balances the class sizes before passing the stream on for learning by Train++, ML-MCU, or other algorithms. The hardware resources of IoT devices are orders of magnitude smaller than those required for the standalone execution of a large, high-quality NN. Currently, to alleviate the various critical issues caused by the poor hardware specifications of IoT devices, NNs are optimized before deployment using methods such as pruning, quantization, sparsification, and model architecture tuning. Even after applying state-of-the-art optimization methods, there are numerous cases where the deeply compressed/optimized models still exceed a device's memory capacity by a margin of just a few bytes, and users cannot optimize further since the model is already compressed to its maximum. The Second Contribution of this thesis aims to solve this challenge. We propose an approach for the efficient execution of already deeply compressed, large NNs on tiny IoT devices. After NNs are optimized using state-of-the-art deep model compression methods, executing the resultant models on MCUs or small CPUs following the model execution sequence produced by our approach conserves substantially more SRAM. As part of the second contribution, we provide an SRAM-optimized ML classifier (non-NN) porting, stitching, and efficient deployment approach. The proposed method enables large classifiers to be comfortably executed on MCU-based IoT devices and to perform ultra-fast classifications while consuming 0 bytes of SRAM.
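The zero-SRAM classifier idea can be sketched as follows: a trained decision tree emitted as pure branching code stores its "model" entirely in program flash as instructions and literal constants, so inference needs no SRAM for model storage. The tree and its thresholds below are hypothetical, for illustration only; they are not the output of the thesis's actual porting tool.

```c
/* A tiny trained decision tree ported to plain if/else code. All
   thresholds are compile-time literals baked into flash, so classifying
   a sample allocates nothing in SRAM beyond the call stack.
   (Hypothetical fault-detection tree; illustrative values only.) */
static int classify(float temperature, float vibration) {
    if (vibration < 0.7f) {
        if (temperature < 40.0f)
            return 0; /* normal */
        return 1;     /* overheating */
    }
    if (temperature < 35.0f)
        return 2;     /* loose mount */
    return 3;         /* failing bearing */
}
```

Since the branches compile to a handful of compare-and-jump instructions, such a ported classifier is also extremely fast, which is consistent with the ultra-fast classification claim above.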
Training a problem-solving ML model on large datasets is computationally expensive and requires a scalable distributed training platform to complete training within a reasonable time frame. In this scenario, communicating model updates among workers has always been a bottleneck. The impact on the quality of the resultant models is greater when distributed training is performed on devices with low hardware specifications and in uncertain real-world IoT networks where congestion, latency, and bandwidth issues are common. The Third Contribution of this thesis aims to solve this challenge. We provide Globe2Train (G2T), a framework with two components, G2T-Cloud (G2T-C) and G2T-Device (G2T-D), that can efficiently connect multiple IoT devices and train them collectively to produce the target ML models at very high speed. The G2T framework components jointly eliminate staleness and improve training scalability and speed by tolerating real-world network uncertainties and by reducing the communication-to-computation ratio. As part of the third contribution, we provide ElastiQuant, an elastic quantization strategy that further reduces the impact of the limitations present in distributed IoT training scenarios.
NUI Galway
Attribution-NonCommercial-NoDerivs 3.0 Ireland