A NOVEL DEEP MODEL WITH STRUCTURE OPTIMIZATION FOR SCENE UNDERSTANDING

Hengyi Zheng,∗ Chuan Sun,∗∗ and Hongpeng Yin∗∗∗

References

[1] C. Li, H. Gao, Y. Yang, X. Qu, and W. Yuan, Segmentation method of high-resolution remote sensing image for fast target recognition, International Journal of Robotics and Automation, 34(3), 2019, 216–224.
[2] F. Bonin-Font, A. Ortiz, and G. Oliver, Visual navigation for mobile robots: A survey, Journal of Intelligent and Robotic Systems, 53(3), 2008, 263–296.
[3] W. Yuan, Z. Cao, Y. Zhang, M. Tan, et al., A robot pose estimation approach based on object tracking in monitoring scenes, International Journal of Robotics and Automation, 32(3), 2017, 256–265.
[4] J. Chu, X. P. Liu, C. Jiao, J. Miao, and L. Wang, Multiview reconstruction of annular outdoor scenes from binocular video using global relaxation iteration, International Journal of Robotics and Automation, 26(3), 2011, 272.
[5] J. D. Lin, X. Y. Wu, Y. Chai, and H. P. Yin, Structure optimization of convolutional neural networks: A survey, Acta Automatica Sinica, 46(1), 2020, 24–37.
[6] W. Lu, H. Sun, J. Chu, X. Huang, and J. Yu, A novel approach for video text detection and recognition based on a corner response feature map and transferred deep convolutional neural network, IEEE Access, 6, 2018, 40198–40211.
[7] S. Wang, L. Lan, X. Zhang, G. Dong, and Z. Luo, Cascade semantic fusion for image captioning, IEEE Access, 7, 2019, 66680–66688.
[8] A. Farhadi, M. Hejrati, M. A. Sadeghi, et al., Every picture tells a story: Generating sentences from images, Proc. ECCV, Heraklion, Greece, 2010, 15–29.
[9] M. Hodosh, P. Young, and J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, Proc. IJCAI, Buenos Aires, Argentina, 2015, 4188–4192.
[10] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, Corpus-guided sentence generation of natural images, Proc. EMNLP, Edinburgh, UK, 2011.
[11] R. Kiros, R. Salakhutdinov, and R. S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, arXiv preprint arXiv:1411.2539, 2014.
[12] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, Proc. IEEE CVPR, Boston, MA, 2015, 3156–3164.
[13] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, Proc. IEEE CVPR, Boston, MA, 2015, 2625–2634.
[14] Q. Wu, C. Shen, L. Liu, A. Dick, and A. Van Den Hengel, What value do explicit high level concepts have in vision to language problems? Proc. IEEE CVPR, Las Vegas, NV, 2016, 203–212.
[15] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, Variational autoencoder for deep learning of images, labels and captions, Proc. NIPS, Barcelona, Spain, 2016, 2352–2360.
[16] J. Cheng, P.-S. Wang, G. Li, Q.-H. Hu, and H.-Q. Lu, Recent advances in efficient computation of deep convolutional neural networks, Frontiers of Information Technology & Electronic Engineering, 19(1), 2018, 64–77.
[17] Y. He, X. Zhang, and J. Sun, Channel pruning for accelerating very deep neural networks, Proc. IEEE ICCV, Venice, Italy, 2017, 1389–1397.
[18] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, Pruning filters for efficient ConvNets, arXiv preprint arXiv:1608.08710, 2016.
[19] S. Anwar and W. Sung, Coarse pruning of convolutional neural networks with random masks, 2016.
[20] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, Learning separable filters, Proc. IEEE CVPR, Portland, OR, 2013, 2754–2761.
[21] M. Jaderberg, A. Vedaldi, and A. Zisserman, Speeding up convolutional neural networks with low rank expansions, arXiv preprint arXiv:1405.3866, 2014.
[22] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, Proc. AAAI, San Francisco, CA, 2017, 4278–4284.
[23] F. Chollet, Xception: Deep learning with depthwise separable convolutions, Proc. IEEE CVPR, Honolulu, HI, 2017, 1251–1258.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Proc. NIPS, Lake Tahoe, NV, 2012, 1106–1114.
[25] T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A. R. Mohamed, G. Dahl, and B. Ramabhadran, Deep convolutional neural networks for large-scale speech tasks, Neural Networks, 64, 2015, 39–48.
[26] C. Szegedy, W. Liu, Y. Jia, et al., Going deeper with convolutions, Proc. IEEE CVPR, Boston, MA, 2015, 1–9.
[27] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proc. IEEE CVPR, Las Vegas, NV, 2016, 770–778.
[28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision, Proc. IEEE CVPR, Las Vegas, NV, 2016, 2818–2826.
[29] T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, 2013.
[30] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, Proc. IEEE CVPR, Boston, MA, 2015, 3156–3164.
[31] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, Proc. ICML, Lille, France, 2015, 2048–2057.
