Improving Human Pose Estimation with Integrated Dual Self-Attention Mechanism in High-Resolution Network

S Prabha Kumaresan https://orcid.org/0000-0002-0969-7428
Lee Fong Yee https://orcid.org/0000-0003-2795-7745
Naveen Palanichamy
Elham Annan

Keywords

Computer Vision, Human Pose Estimation, Convolutional Neural Network, High-Resolution Network, Dual Self-Attention.

Abstract

Human Pose Estimation (HPE) is a computer vision (CV) task that has garnered significant attention due to its diverse applications. Deep convolutional neural networks (CNNs) are candidate solutions to this problem, but they still face several critical issues. Many existing models apply serial convolutions with pooling, producing low-resolution outputs that are suboptimal for the precise localisation HPE requires. They also tend to prioritise local feature learning and overlook crucial contextual relationships between keypoints. This work addresses these challenges by proposing a novel approach for enhancing HPE. Firstly, the paper evaluates the high-resolution network (HRNet) and its advantages over other CNN architectures. Secondly, it introduces a dual self-attention (DSA) mechanism designed to enhance the model’s global awareness, thereby enriching feature maps with contextual information. Thirdly, it integrates the DSA mechanism into HRNet to form DSA-HRNet. The model was evaluated on the COCO val2017 validation set, showing improvements of 2.3% in mean average precision (mAP), 3% in AP at 50 (AP50), and 2.7% in AP at 75 (AP75). Finally, a series of experiments investigates the effectiveness of the DSA mechanism within the HRNet framework, indicating that this work offers a streamlined and effective solution for improving HPE.
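
The abstract does not spell out the internals of the DSA mechanism, so the following is only a minimal sketch of the general idea of dual self-attention on a feature map, not the authors' implementation. It assumes a parameter-free variant: one branch attends across spatial positions, the other across channels, and both are added back to the input. The function names, the identity query/key/value projections, and the fusion weights `alpha` and `beta` are illustrative assumptions; a trained model would normally use learned projections and learned weights.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    # Plain (n x k) @ (k x m) matrix product on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def position_attention(x):
    # x is a C x N feature map (C channels, N flattened spatial positions).
    # Each position is re-expressed as a weighted sum over all positions.
    # Assumption: identity projections instead of learned ones.
    attn = [softmax(r) for r in matmul(transpose(x), x)]  # N x N
    return matmul(x, transpose(attn))                      # C x N

def channel_attention(x):
    # Each channel is re-expressed as a weighted sum over all channels.
    attn = [softmax(r) for r in matmul(x, transpose(x))]  # C x C
    return matmul(attn, x)                                 # C x N

def dual_self_attention(x, alpha=1.0, beta=1.0):
    # Residual fusion of both branches; alpha and beta are hypothetical
    # stand-ins for weights that would be learned in practice.
    pos = position_attention(x)
    chan = channel_attention(x)
    return [[xv + alpha * p + beta * c for xv, p, c in zip(xr, pr, cr)]
            for xr, pr, cr in zip(x, pos, chan)]

if __name__ == "__main__":
    features = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]  # toy 2-channel, 3-position map
    for row in dual_self_attention(features):
        print(row)
```

The output has the same shape as the input, which is what would allow a block of this kind to be placed inside a multi-resolution backbone such as HRNet without changing the surrounding layers. Where exactly DSA-HRNet places the block is not stated in the abstract.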
