Detecting seen/unseen objects with reducing response time for multimedia event processing

Aslam, Asra
The enormous growth of multimedia content in the domain of the Internet of Things (IoT) leads to the challenge of processing multimedia streams in real time. Thus, the Internet of Multimedia Things (IoMT) is an emerging concept in the field of smart cities. Conventional event-based systems mainly focus on structured events such as energy consumption events, RFID tag readings, and packet loss events. Existing real-time image processing systems are domain-specific. Multiple applications within smart cities (such as traffic management, security, and supervision activities) may require the processing of numerous seen and unseen concepts (an unbounded vocabulary). Deep neural network-based techniques are effective for image recognition, but the need to train classifiers for unseen concepts may increase the overall response time of multimedia-based event processing models. It is not practical to have trained classifiers or annotated training data available for every one of a large number of unseen concepts. In this thesis, I address the problem of training classifiers online for unseen concepts so that user queries can be answered by processing multimedia events with minimum response time and maximum accuracy. The contributions of this thesis are manifold. I propose an IoMT-based multimedia event processing model and optimize it for various scenarios of unseen concepts using hyperparameter tuning, transfer learning, and Large Scale Detection through Adaptation (LSDA). I primarily consider You Only Look Once (YOLO), Single Shot MultiBox Detector (SSD), and RetinaNet in my models, specifically for the case study of object detection. The results indicate that the proposed multimedia event processing models achieve an accuracy of 66.34% within 2 hours using a classifier division and selection approach, 84.28% within 1 hour using hyperparameter tuning, and 95.14% within 30 minutes of response time using domain adaptation-based optimization.
My final contribution is "UnseenNet", the first fast detector designed to train unseen classes using only image-level labels (i.e., no bounding box annotations). My evaluations demonstrate that UnseenNet outperforms the baseline approaches and reduces training time from more than 5.5 hours to less than 5 minutes.
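The query-driven seen/unseen flow summarized above can be sketched in a few lines. This is an illustrative toy, not the thesis implementation: all names (`ClassifierStore`, `train_online`, `handle_query`) are hypothetical, and `train_online` stands in for the actual on-demand training (e.g., fine-tuning a pretrained detector with image-level labels).

```python
# Illustrative sketch of on-demand classifier training for unseen concepts:
# a query for an already-trained ("seen") concept is answered immediately,
# while a query for an unseen concept triggers training, after which the
# new classifier is cached for subsequent queries.

class ClassifierStore:
    """Holds classifiers for concepts that are already 'seen' (trained)."""
    def __init__(self, seen_concepts):
        self._classifiers = {c: f"model:{c}" for c in seen_concepts}

    def get(self, concept):
        return self._classifiers.get(concept)

    def add(self, concept, model):
        self._classifiers[concept] = model

def train_online(concept):
    """Stand-in for on-demand training of an unseen concept."""
    return f"model:{concept}"

def handle_query(store, concept):
    """Return (model, was_unseen) for the queried concept."""
    model = store.get(concept)
    if model is not None:
        return model, False           # seen concept: answer immediately
    model = train_online(concept)     # unseen concept: train on demand
    store.add(concept, model)         # cache so it is 'seen' next time
    return model, True

store = ClassifierStore(["car", "person"])
print(handle_query(store, "car"))      # seen concept, no training needed
print(handle_query(store, "scooter"))  # unseen concept, trained on demand
print(handle_query(store, "scooter"))  # now cached as seen
```

The design point is that response time for an unseen concept is dominated by `train_online`, which is exactly the cost the thesis's optimizations (hyperparameter tuning, transfer learning, LSDA, UnseenNet) aim to reduce.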
NUI Galway
Attribution-NonCommercial-NoDerivs 3.0 Ireland