Investigation of Deep Learning Optimization Algorithms in Scene Text Detection
International Journal of Industrial Electronics Control and Optimization
Article 2, Volume 6, Issue 3, December 2023, Pages 171-182 | Full Text (604.14 K)
Article Type: Research Article
DOI: 10.22111/ieco.2023.45650.1480
Authors
Zobeir Raisi* 1; John Zelek 2
1 University of Waterloo, Waterloo, Canada; Chabahar Maritime University, Chabahar, Iran
2 University of Waterloo, Waterloo, Canada
Abstract
Scene text detection frameworks rely heavily on optimization methods for their successful operation. Choosing an appropriate optimizer is essential to the performance of recent scene text detection models. However, recent deep learning methods often employ various optimization algorithms and loss functions without explicitly justifying their selections. This paper presents a segmentation-based text detection pipeline capable of handling arbitrarily shaped text instances in wild images. We explore the effectiveness of well-known deep-learning optimizers in enhancing the pipeline's capabilities. Additionally, we introduce a novel Segmentation-based Attention Module (SAM) that enables the model to capture long-range dependencies of multi-scale feature maps and focus more accurately on regions likely to contain text instances. The performance of the proposed architecture is extensively evaluated through ablation experiments exploring the impact of different optimization algorithms and the introduced SAM block. Furthermore, we compare the final model against state-of-the-art scene text detection techniques on three publicly available benchmark datasets, namely ICDAR15, MSRA-TD500, and Total-Text. Our experimental results demonstrate that the focal loss combined with the Stochastic Gradient Descent (SGD) + Momentum optimizer and a poly learning-rate policy achieves more robust and generalized detection performance than other optimization strategies. Moreover, our architecture, empowered by the proposed SAM block, significantly enhances overall detection performance, achieving competitive H-mean detection scores while maintaining superior efficiency in terms of Frames Per Second (FPS) compared to recent techniques. Our findings highlight the importance of selecting appropriate optimization strategies and demonstrate the effectiveness of the proposed Segmentation-based Attention Module in scene text detection tasks.
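As a rough illustration of the training recipe the abstract identifies as best-performing (focal loss with SGD + Momentum and a poly learning-rate policy), the minimal PyTorch sketch below wires these pieces together. The hyperparameters (base learning rate, momentum, weight decay, poly power, alpha, gamma) and the single-layer stand-in model are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss over a text/non-text probability map (illustrative defaults)."""
    probs = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)      # prob of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)  # class-balancing weight
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()


# SGD + Momentum with a "poly" learning-rate policy:
#   lr(it) = base_lr * (1 - it / max_iter) ** power
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for the segmentation head
optimizer = torch.optim.SGD(model.parameters(), lr=0.007, momentum=0.9, weight_decay=1e-4)
max_iter, power = 1200, 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1.0 - it / max_iter) ** power)

for it in range(max_iter):
    images = torch.randn(2, 3, 64, 64)                    # dummy image batch
    targets = (torch.rand(2, 1, 64, 64) > 0.9).float()    # sparse dummy text mask
    loss = focal_loss(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                       # decay lr per iteration
```

The focal loss down-weights easy background pixels, which matters here because text usually covers a small fraction of each image, while the poly schedule decays the learning rate smoothly to zero over training.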
Keywords
Deep learning; Scene text detection; Optimization; Loss function
Article views: 495 | Full-text downloads: 391