MuRelSGG: Multimodal relationship prediction for neurosymbolic scene graph generation
Khan, Muhammad Junaid; Siddiqui, Adil Masood; Khan, Hamid Saeed; Akram, Faisal; Khan, Muhammad Jaleed
Publication Date
2025-03-14
Type
journal article
Citation
Khan, M. Junaid, Siddiqui, A. Masood, Khan, H. Saeed, Akram, F., & Khan, M. Jaleed. (2025). MuRelSGG: Multimodal Relationship Prediction for Neurosymbolic Scene Graph Generation. IEEE Access, 13, 47042-47054. https://doi.org/10.1109/ACCESS.2025.3551267
Abstract
Neurosymbolic Scene Graph Generation (SGG) is a promising approach that jointly leverages the perception capabilities of deep neural networks and the reasoning capabilities of symbolic techniques for scene understanding and visual reasoning. SGG systematically captures the semantic components of images, including objects and their relationships, enabling structured representations of visual data. However, existing SGG methods exhibit constrained accuracy and limited expressiveness, particularly in long-tail relationship prediction. To address these limitations, we present MuRelSGG, a novel neurosymbolic SGG framework that integrates a Transformer-based multimodal relationship prediction pipeline with commonsense knowledge graph (CSKG) enrichment. This synergistic combination encapsulates global context, long-range dependencies, and complex object interactions to enhance relationship prediction in SGG. The proposed neurosymbolic architecture begins with object detection via Faster R-CNN, followed by a cascade of Multi-Head Attention Transformers (M-HAT) and Vision Transformers (ViT) for relationship prediction. Subsequently, CSKG enrichment refines and augments visual relationships, improving both accuracy and expressiveness. We conduct extensive evaluations on both the Visual Genome (VG) and GQA datasets to assess performance and generalizability. MuRelSGG achieves substantial gains in recall rates (VG: R@100 = 43.2, mR@100 = 14.9; GQA: R@100 = 42.1), outperforming state-of-the-art SGG techniques. Ablation studies confirm the critical contributions of M-HAT, ViT, linguistic features, CSKG enrichment, and embedding similarity thresholds, demonstrating the effectiveness of structured knowledge integration for long-tail relationship prediction. These findings underscore the potential of combining deep learning architectures with structured knowledge bases to advance visual scene representation and reasoning.
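The recall figures above use the standard SGG evaluation metrics R@K and mR@K. The following Python sketch, which is illustrative and not taken from the MuRelSGG code base, shows how these metrics are commonly computed for a single image: R@K is the fraction of ground-truth subject-predicate-object triplets recovered among the top-K ranked predictions, while mR@K averages that recall over predicate classes so rare, long-tail predicates count as much as frequent ones. The triplet encoding, label ids, and function names are assumptions made for the example.

from collections import defaultdict
from typing import List, Tuple

# A relationship triplet: (subject label, predicate label, object label).
Triplet = Tuple[int, int, int]

def recall_at_k(gt: List[Triplet], ranked_preds: List[Triplet], k: int = 100) -> float:
    """R@K: fraction of ground-truth triplets found in the top-k ranked predictions."""
    top_k = set(ranked_preds[:k])
    hits = sum(1 for t in gt if t in top_k)
    return hits / max(len(gt), 1)

def mean_recall_at_k(gt: List[Triplet], ranked_preds: List[Triplet], k: int = 100) -> float:
    """mR@K: recall@k computed per predicate class, then averaged, so that
    frequent (head) predicates cannot mask failures on long-tail ones."""
    top_k = set(ranked_preds[:k])
    per_predicate = defaultdict(lambda: [0, 0])  # predicate id -> [hits, total]
    for subj, pred, obj in gt:
        per_predicate[pred][1] += 1
        if (subj, pred, obj) in top_k:
            per_predicate[pred][0] += 1
    recalls = [hits / total for hits, total in per_predicate.values()]
    return sum(recalls) / max(len(recalls), 1)

# Toy usage with hypothetical label ids: predicate 7 is a rare (tail) class.
gt = [(0, 3, 1), (2, 3, 4), (2, 7, 0)]
ranked = [(0, 3, 1), (2, 3, 4), (1, 3, 2), (2, 7, 0)]
print(recall_at_k(gt, ranked, k=3))       # 2/3, the tail triplet is missed
print(mean_recall_at_k(gt, ranked, k=3))  # (2/2 + 0/1) / 2 = 0.5

In practice the scores are averaged over all test images, with predictions ranked by the model's triplet confidence; dataset-level figures such as R@100 = 43.2 on Visual Genome are obtained that way.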
Funder
Publisher
Institute of Electrical and Electronics Engineers
Publisher DOI
https://doi.org/10.1109/ACCESS.2025.3551267
Rights
Attribution 4.0 International