Publication

Balanced-simplified spatiotemporal memory attention for image captioning

Al-Qatf, Majjed
Yahya, Ali Abdullah
Hawbani, Ammar
Alsamhi, Saeed Hamood
Jiang, Junming
Curry, Edward
Citation
Al-Qatf, M., Yahya, A. A., Hawbani, A., Alsamhi, S. H., Jiang, J., & Curry, E. (2025). Balanced-Simplified Spatiotemporal Memory Attention for Image Captioning. IEEE Access, 13, 63671-63689. https://doi.org/10.1109/ACCESS.2025.3552217
Abstract
Visual attention and memory-based attention methods have been effectively utilized in image captioning to focus on the most relevant areas of an image during the language generation process. Nevertheless, they face significant challenges, as they are solely guided by the hidden state of the LSTM, leading to attention focused on less relevant areas at various time steps. Furthermore, many approaches apply a uniform focus across all visual information within an image, lacking a mechanism for adjustment of focus intensity. Additionally, the complexity of memory-based attention methods highlights the necessity of developing a simplified memory-based attention mechanism for more efficient and effective image captioning. To address these challenges, a novel attention method for image captioning, named Balanced-Simplified Spatiotemporal Memory Attention (BS-STMA), is proposed. The proposed attention mechanism captures spatiotemporal relationships by combining the advantages of LSTM and visual attention in a simple and effective manner. The combination of LSTM memory and the attention mechanism significantly enhances the model’s capacity to retain, convey, and utilize relevant visual information throughout the captioning process. Additionally, an Intensity Balancing Controller (IBC) is introduced and integrated into BS-STMA to enhance its efficiency. IBC allows for adjustments of attention intensity, enabling the model to capture visual information more accurately over time. Extensive experiments on the MSCOCO dataset demonstrate that the method significantly improves image captioning performance, surpassing recent approaches in various evaluation metrics by effectively capturing spatiotemporal relationships.
Publisher
Institute of Electrical and Electronics Engineers
Publisher DOI
https://doi.org/10.1109/ACCESS.2025.3552217
Rights
Attribution 4.0 International