View Knowledge Transfer Network for Multi-view Action Recognition
Liang Zixi, Yin Ming (Guangdong University of Technology)
(1. School of Automation, Guangdong University of Technology, Guangzhou, Guangdong 510006; 2. School of Automation, Guangdong University of Technology, Guangzhou, Guangdong 510006)
(Image and Vision Computing, 21 December 2021)
Yin Ming, Professor. Supported by the National Undergraduate Innovation and Entrepreneurship Training Program ().
Author biography: Liang Zixi (b. 2001), male, from Guangzhou, Guangdong; Automation major, class of 2019; research interest: pattern recognition.

...two similar values. A classifier-based fusion method may break the reliability of the initial predictions, i.e., cases (i)-(iii), and return unreliable predictions (e.g., ŷ_only-r or ŷ_only-d), which negatively affects the final result. For cases (ii)-(iv), the late fusion result may degenerate into a single-view prediction. In contrast, the proposed SSN guarantees the fusion performance without breaking the initial predictions, because it aims to learn the weight of each initial result. Moreover, the weights of the initial predictions are shared in the SSN, which nicely resolves the drawback of classifier-based fusion methods, as illustrated by the sketch below.
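For concreteness, the following is a minimal sketch of such a shared-weight late-fusion layer. The module name, layer sizes, and the use of PyTorch are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SiameseScaleFusion(nn.Module):
    """Shared-weight ("siamese") late fusion: the same scorer is applied to
    every view's initial prediction, and the resulting scales weight the sum.
    A sketch only; the softmax over scales is an assumption."""

    def __init__(self, num_classes: int):
        super().__init__()
        # One scorer shared by all views (the "siamese" part).
        self.scorer = nn.Linear(num_classes, 1)

    def forward(self, preds: list) -> torch.Tensor:
        # preds: list of per-view class-probability vectors, each of shape (B, C).
        scales = torch.cat([self.scorer(p) for p in preds], dim=1)  # (B, V)
        scales = torch.softmax(scales, dim=1)                       # per-view weights
        stacked = torch.stack(preds, dim=1)                         # (B, V, C)
        return (scales.unsqueeze(-1) * stacked).sum(dim=1)          # fused (B, C)

# Usage: fuse = SiameseScaleFusion(num_classes=30)
#        y_fused = fuse([p_rgb, p_depth])   # p_* are initial softmax predictions
```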
IV. Experimental Results

(1) Datasets
In this section, three human action datasets are used to evaluate our model, as summarized in Table 1.
UWA3D Multiview Activity (UWA) [26]. UWA was collected with a Kinect sensor that captures human activities from multiple views. Specifically, 10 subjects performed 30 actions continuously, without breaks or pauses. The dataset is challenging due to the varying viewpoints, self-occlusion, and the high similarity among activities [36].
Berkeley Multimodal Human Action Database (MHAD) [23].
MHAD is a comprehensive multimodal human action dataset that includes RGB, depth, skeleton, acceleration, and audio modalities. It contains 11 actions performed by 12 subjects, each repeated 5 times, yielding 660 action sequences in total. In our experiments, similar to the work of [36], we use 244 samples as training data and 283 samples for testing.
Depth-included Human Action dataset (DHA) [14]. DHA is a multimodal dataset with three modalities, containing 23 categories performed by 21 subjects. In total, 483 video clips are used for training and testing. Each action comes with RGB images, human masks, and depth data.
Dataset     Data Modalities   Subjects   Feature Dimension   Train Samples   Test Samples   Categories
UWA [26]    RGB+Depth         10         6144+110            254             253            30
MHAD [23]   RGB+Depth         12         6144+110            244             283            11
DHA [14]    RGB+Depth         21         6144+110            240             243            23

Table 1. Summary of the datasets used in our experiments.

(2) Implementation
Our experiments use two views, namely RGB and depth information, where TSN [37] is used to extract the RGB features and WDMM [2] is used for the depth information. For each RGB sample x_r, we first divide it into five parts and randomly select one snippet from each. To extract features effectively, an ImageNet-pretrained ResNet-101 is adopted. Next, three snippets are randomly selected from each video sample, fed into the trained ResNet-101, and merged into the final sample (sketched below). In other words, in our experiments each RGB sample is represented by a 6144-dimensional vector.
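As a rough illustration of this RGB pipeline (three snippets, each encoded by an ImageNet-pretrained ResNet-101 into 2048 dimensions and concatenated into 3 x 2048 = 6144 dimensions), the sketch below uses torchvision; the snippet sampling and preprocessing details are assumptions.

```python
import torch
from torchvision import models, transforms

# ImageNet-pretrained ResNet-101 used as a frozen feature extractor
# (the final fc layer is replaced by identity, leaving 2048-d features).
backbone = models.resnet101(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def rgb_feature(snippets):
    """snippets: three RGB frames, (H, W, 3) uint8 arrays sampled from a video.
    Returns one 6144-d vector (3 snippets x 2048-d ResNet-101 features)."""
    with torch.no_grad():
        feats = [backbone(preprocess(s).unsqueeze(0)).squeeze(0) for s in snippets]
    return torch.cat(feats)  # shape (6144,)
```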
The depth features are extracted by WDMM over three projection views. Next, HOG [4] and LBP [24] descriptors are extracted and aggregated with VLAD [10], followed by PCA [38] for dimensionality reduction; each depth input is thus represented by a 110-dimensional vector (a rough sketch of this post-processing follows this paragraph). Our model is optimized with the AdamW optimizer with a learning rate of 1x10^-5. For each experiment, the network is trained for 1x10^4 epochs. To improve the generality of the model, the parameter α = 1x10 is used when training the SSN to balance the weights of C_SSN(·,·), C_d(·), C_r(·), G_d(·), and G_r(·). The batch size is set to 32.
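The following is a minimal sketch of the HOG/LBP-plus-PCA part of that depth post-processing, using scikit-image and scikit-learn; the VLAD aggregation step, the WDMM map computation, and the exact descriptor parameters are omitted or assumed here.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.decomposition import PCA

def depth_descriptor(depth_map):
    """depth_map: one 2-D WDMM projection (float array).
    Returns a raw HOG + LBP-histogram descriptor for that projection."""
    h = hog(depth_map, orientations=9, pixels_per_cell=(16, 16),
            cells_per_block=(2, 2))
    lbp = local_binary_pattern(depth_map, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([h, lbp_hist])

# descriptors: (num_samples, raw_dim) matrix built from all training clips
# (in the paper the per-clip descriptors are first aggregated with VLAD [10]).
def reduce_to_110(descriptors):
    pca = PCA(n_components=110)          # final 110-d depth representation
    return pca.fit_transform(descriptors), pca
```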
(3) Results
The experimental results are shown in Table 2, where RGB means that only RGB information is available, and similarly for depth; RGB+D means that both views are available. For the single-view methods, i.e., NN [27], LSR, and SVM [28], all views are concatenated into one long feature as the input. As the table shows, our proposed model achieves the best performance.
When all views are available, our method achieves state-of-the-art results compared with the competing methods. Specifically, our model improves the classification score by 6.33% on UWA on average, by 0.39% on UCB, and by 0.97% on DHA.
When some views are missing, the existing action recognition methods degrade to a certain extent, and some of them even fail to work. Compared with the current state-of-the-art methods, on UWA the gain is 2.51% when only RGB is used and 11.89% when only depth information is used. On UCB, the gains are 0.49% and 5.58%, respectively, when only RGB or only depth is used.
Similarly for DHA, we obtain gains of 7.45% and 3.55% when only RGB or only depth is used, respectively. In other words, our method clearly outperforms the others even when little information is available, and it effectively avoids degradation. This demonstrates that our method is effective and superior.

Method          Year   UWA                              UCB                              DHA
                       RGB    Depth  RGB+D             RGB    Depth  RGB+D             RGB    Depth  RGB+D
LSR             -      67.59  45.45  68.77             96.46  47.63  97.17             65.02  82.30  77.30
NN [27]         1986   -      -      73.70             -      -      88.56             -      -      -
SVM [28]        2002   69.77  34.92  72.72             96.09  45.39  98.80             66.11  78.92  83.47
VLAD [5]        2017   71.54  -      -                 97.17  -      -                 67.65  -      -
TSN [37]        2016   71.01  -      -                 97.31  -      -                 67.65  -      -
WDMM [2]        2018   -      46.58  -                 -      66.41  -                 -      81.05  -
AMGL [21]       2016   71.54  35.96  68.53             97.11  29.56  94.70             59.05  67.33  74.89
MLAM [20]       2017   67.19  33.61  66.64             96.10  41.25  98.46             67.51  72.63  76.13
AMUSE [22]      2019   -      -      70.32             -      -      97.23             -      -      78.12
GMVHAR [36]     2019   73.53  50.35  76.28             98.23  68.32  98.94             69.72  83.48  88.76
CVCA [17]       2020   -      -      77.08             -      -      98.94             -      -      89.31
VKTNet (ours)   2021   76.04 (1.04)  62.24 (1.70)  83.41 (1.36)   98.72 (0.29)  73.90 (1.19)  99.33 (0.19)   77.17 (1.28)  87.03 (1.27)  90.28 (1.27)

Table 2. Classification results on the UWA, UCB, and DHA datasets ("-" means the result is missing in the original work).

(4) Ablation Study
In this section, we verify the effectiveness of each part of the network. Specifically, the generative adversarial framework, the late fusion module, and the view knowledge transfer module are tested in turn.

1. Generative adversarial framework
First, the effectiveness of the generative adversarial framework is evaluated. Without view generation, missing views are filled with zeros. The results are shown in Table 3. The network with the generative adversarial module maintains its performance when some views are missing, especially when the view information is not rich.
Moreover, the fake views generated in the adversarial manner can effectively supplement the samples in the feature subspace, leading to better learning performance. Table 3 evaluates this in detail.

Situation            without GAN      with GAN
UWA    RGB           74.48 (0.90)     76.04 (1.04)
       Depth         57.99 (2.17)     62.24 (1.70)
       RGB+D         80.21 (0.52)     83.41 (1.36)
UCB    RGB           98.96 (0.23)     98.72 (0.29)
       Depth         70.90 (1.94)     73.90 (1.19)
       RGB+D         99.22 (0.39)     99.33 (0.19)
DHA    RGB           77.26 (1.83)     77.17 (0.77)
       Depth         84.20 (1.68)     87.03 (1.25)
       RGB+D         89.93 (1.97)     90.28 (1.13)

Table 3. Evaluation of VKTNet with and without the generative adversarial framework.
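As background for this ablation, the view generation described in the abstract is a conditional GAN [18] that reproduces one view's latent representation conditioned on the other view. Below is a minimal sketch of such a cross-view generator and discriminator pair; the layer sizes, feature dimensions, and training details are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossViewGenerator(nn.Module):
    """Generates a fake depth-view feature conditioned on the RGB feature
    (swap the dimensions for the depth-to-RGB direction)."""
    def __init__(self, src_dim=6144, dst_dim=110, noise_dim=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(src_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, dst_dim),
        )

    def forward(self, src_feat):
        z = torch.randn(src_feat.size(0), self.noise_dim, device=src_feat.device)
        return self.net(torch.cat([src_feat, z], dim=1))

class CrossViewDiscriminator(nn.Module):
    """Scores (condition, candidate) pairs: real target-view features vs. generated ones."""
    def __init__(self, src_dim=6144, dst_dim=110):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim + dst_dim, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, src_feat, dst_feat):
        return self.net(torch.cat([src_feat, dst_feat], dim=1))

# At test time, a missing depth view can be replaced by G_d(x_rgb)
# instead of zero filling, which is what Table 3 compares.
```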
2. Late fusion module
Next, we test the network performance of the late fusion module; specifically, the effectiveness of the SSN is verified here. Our network is compared with a single FC network (denoted Dense-1), a double FC network (Dense-2), a single view-correlation discovery network (VCDN) [36] (VCDN-1), a double VCDN (VCDN-2), and the addition strategy (A+B) [13].
In Table 4, the results show that feeding the initial predictions into some kind of feed-forward network degrades the model performance (e.g., Dense-1/2 perform worse than A+B). Moreover, since the initial predictions already contain relatively complete information, using a correlation matrix [36] may introduce a large amount of redundant information, which lowers the model performance. Finally, the nonlinear SSN module (SSN-2) is observed to perform worse than the linear SSN module (SSN-1), which also verifies that the final fusion result inevitably degrades when corrupted information is propagated. Therefore, the final SSN module adopts a linear structure.
Method        UWA             UCB             DHA
Dense-1       75.78 (2.81)    96.22 (1.26)    85.94 (1.35)
Dense-2       77.34 (1.36)    98.44 (0.68)    82.81 (1.57)
VCDN-1        74.48 (2.38)    94.66 (1.19)    81.77 (1.19)
VCDN-2        75.77 (0.78)    98.05 (1.03)    83.59 (2.07)
A+B           77.86 (1.36)    98.44 (0.68)    83.92 (1.75)
VCDN-SSN-2    79.17 (0.45)    98.70 (0.60)    84.37 (2.71)
SSN-1         83.41 (1.36)    99.33 (0.19)    90.28 (1.13)
SSN-2         80.90 (0.30)    99.22 (0.00)    89.06 (0.90)

Table 4. Evaluation of VKTNet with different late fusion modules.

3. Feature subspace learning
In this part, the effectiveness of the subspace learning module is evaluated. To study the influence of different loss functions on the feature subspace, the training losses are

J_r = α CE(ŷ_r, y) + (1 − β) CE(ỹ_r, y) + (1 − α) TL(z_d, z_r),
J_d = α CE(ŷ_d, y) + (1 − β) CE(ỹ_d, y) + (1 − α) TL(z_r, z_d),   (5)

where TL(·) denotes the triplet loss and α is the penalty parameter balancing the two losses.
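For illustration, a per-view objective of this form (cross-entropy terms plus a triplet loss on the latent codes) might be written as follows in PyTorch; the margin value, the helper names, and how anchors, positives, and negatives are mined are assumptions.

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()
tl = nn.TripletMarginLoss(margin=1.0)   # margin is an assumed value

def view_loss(logits_own, logits_gen, y, z_anchor, z_pos, z_neg,
              alpha=0.5, beta=0.5):
    """Per-view objective in the spirit of Eq. (5): cross-entropy on the view's
    own prediction, cross-entropy on the prediction obtained through the
    generated (transferred) view, and a triplet loss pulling the two views'
    latent codes of the same sample together."""
    return (alpha * ce(logits_own, y)
            + (1 - beta) * ce(logits_gen, y)
            + (1 - alpha) * tl(z_anchor, z_pos, z_neg))
```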
The experimental results are shown in Fig. 4, where X-R means that only RGB information is available in dataset X, and similarly for depth (X-D); X-B means that both views are available in dataset X. The vertical error bars represent the standard deviation. The results show that when the information is not rich (e.g., UWA-D in Fig. 4), the cross-entropy loss outperforms the triplet loss. In the other cases, the two loss functions do not differ much in their effect on performance. In addition, using t-SNE [35], the feature subspaces learned by our VKTNet for the different views of the UCB dataset are visualized in Fig. 5; all visualizations are obtained after 5000 rounds of training. The results show that the subspace learning module based on the cross-entropy loss preserves the discriminative information better.

Figure 4. Evaluation of VKTNet with different subspace learning losses (panels: UWA, UCB, DHA).
Figure 5. t-SNE [35] visualization of the feature subspaces learned for different views of the UCB dataset (real samples and generated samples).
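A visualization like Fig. 5 can be produced with the standard scikit-learn t-SNE [35] implementation; the perplexity and the way real and generated features are stacked below are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_view_subspace(real_feats, fake_feats, labels, title):
    """real_feats/fake_feats: (N, d) latent codes of one view; labels: (N,) class ids."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(
        np.vstack([real_feats, fake_feats]))
    n = len(real_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c=labels, marker="o", s=8, label="Real Sample")
    plt.scatter(emb[n:, 0], emb[n:, 1], c=labels, marker="^", s=8, label="Generated Sample")
    plt.legend(); plt.title(title); plt.show()
```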
V. Conclusion
To better exploit the correlation among multiple views, this paper proposes a View Knowledge Transfer Network (VKTNet) for multi-view action recognition, which extracts high-level semantic features to bridge the semantic gap between two different views. In addition, to effectively fuse the decision results of each view, a simple yet effective Siamese Scale Network (SSN) is proposed instead of simply using a classifier. Experimental results show that our model improves multi-view action recognition performance even when some views are missing at the test stage, especially when the initial predictions are poor.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments
This work was supported in part by the National Natural Science Foundation of China and the Guangdong Basic and Applied Basic Research Foundation (No. 2020A).

References:
[1] T. Akilan, Q.M. Jonathan Wu, Amin Safaei, Wei Jiang, A late fusion approach for harnessing multi-CNN model high-level features, 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2017, pp. 566-571.
[2] Reza Azad, Maryam Asadi-Aghbolaghi, Shohreh Kasaei, Sergio Escalera, Dynamic 3D hand gesture recognition by learning weighted depth motion maps, IEEE Trans. Circuits Syst. Video Technol. 29 (6) (2018) 1729-1740.
[3] Rui Chen, Jiajun Chen, Zixi Liang, Huaien Gao, Shan Lin, Darklight networks for action recognition in the dark, Proceedings of CVPR, 2021, pp. 846-852.
[4] Navneet Dalal, Bill Triggs, Histograms of oriented gradients for human detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, IEEE, 2005, pp. 886-893.
[5] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell, ActionVLAD: learning spatio-temporal aggregation for action classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 971-980.
[6] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, Yoshua Bengio, Generative adversarial nets, Proceedings of NeurIPS, 2014, pp. 2672-2680.
[7] Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, Jianguo Zhang, Jointly learning heterogeneous features for RGB-D activity recognition, Proceedings of CVPR, 2015, pp. 5344-5352.
[8] Yan Huang, Wei Wang, Liang Wang, Unconstrained multimodal multi-label learning, IEEE Trans. Multimedia 17 (11) (2015) 1923-1935.
[9] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-image translation with conditional adversarial networks, Proceedings of CVPR, 2017, pp. 1125-1134.
[10] Herve Jegou, Matthijs Douze, Cordelia Schmid, Patrick Perez, Aggregating local descriptors into a compact image representation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 3304-3311.
[11] Karani Kardas, Nihan Kesim Cicekli, SVAS: surveillance video analysis system, Expert Syst. Appl. 89 (2017) 343-361.
[12] Yu Kong, Yun Fu, Bilinear heterogeneous information machine for RGB-D action recognition, Proceedings of CVPR, 2015, pp. 1054-1062.
[13] Ludmila I. Kuncheva, A theoretical study on six classifier fusion strategies, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2) (2002) 281-286.
[14] Yan-Ching Lin, Min-Chun Hu, Wen-Huang Cheng, Yung-Huan Hsieh, Hong-Ming Chen, Human action recognition and retrieval using sole depth information, Proceedings of the 20th ACM International Conference on Multimedia, 2012, pp. 1053-1056.
[15] Li Liu, Ling Shao, Learning discriminative representations from RGB-D video data, Proceedings of IJCAI, 2013, pp. 1493-1500.
[16] Mengyuan Liu, Junsong Yuan, Recognizing human actions as the evolution of pose estimation maps, Proceedings of CVPR, 2018, pp. 1159-1168.
[17] Yunyu Liu, Lichen Wang, Yue Bai, Can Qin, Zhengming Ding, Yun Fu, Generative view-correlation adaptation for semi-supervised multi-view learning, Proceedings of ECCV, Springer, 2020, pp. 318-334.
[18] Mehdi Mirza, Simon Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784, 2014.
[19] Bingbing Ni, Gang Wang, Pierre Moulin, RGBD-HuDaAct: a color-depth video database for human daily activity recognition, Proceedings of ICCV (Workshops), IEEE, 2011, pp. 1147-1153.
[20] Feiping Nie, Guohao Cai, Xuelong Li, Multi-view clustering and semi-supervised classification with adaptive neighbours, Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[21] Feiping Nie, Jing Li, Xuelong Li, et al., Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification, Proceedings of IJCAI, 2016, pp. 1881-1887.
[22] Feiping Nie, Lai Tian, Rong Wang, Xuelong Li, Multiview semi-supervised learning model for image classification, IEEE Trans. Knowl. Data Eng. 32 (12) (2019) 2389-2400.
[23] Ferda Ofli, Rizwan Chaudhry, Gregorij Kurillo, Rene Vidal, Ruzena Bajcsy, Berkeley MHAD: a comprehensive multimodal human action database, 2013 IEEE Workshop on Applications of Computer Vision (WACV), IEEE, 2013, pp. 53-60.
[24] Timo Ojala, Matti Pietikainen, Topi Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971-987.
[25] Alessandro Ortis, Giovanni M. Farinella, Valeria D'Amico, Luca Addesso, Giovanni Torrisi, Sebastiano Battiato, Organizing egocentric videos of daily living activities, Pattern Recogn. 72 (2017) 207-218.
[26] Hossein Rahmani, Arif Mahmood, Du Huynh, Ajmal Mian, Histogram of oriented principal components for cross-view action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 38 (12) (2016) 2430-2443.
[27] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, Learning representations by back-propagating errors, Nature 323 (6088) (1986) 533-536.
[28] Bernhard Scholkopf, Alexander J. Smola, Francis Bach, et al., Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
[29] Florian Schroff, Dmitry Kalenichenko, James Philbin, FaceNet: a unified embedding for face recognition and clustering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.
[30] Shruti Vyas, Yogesh S. Rawat, Mubarak Shah, Multi-view action recognition using cross-view video prediction, Proceedings of ECCV, vol. 12372, 2020.
[31] Cees G.M. Snoek, Marcel Worring, Arnold W.M. Smeulders, Early versus late fusion in semantic video analysis, Proceedings of ACM MM, 2005, pp. 399-402.
[32] Kihyuk Sohn, Wenling Shang, Honglak Lee, Improved multimodal deep learning with variation of information, Adv. Neural Inf. Proces. Syst. 27 (2014) 2141-2149.
[33] Yan Song, Shi Liu, Jinhui Tang, Describing trajectory of surface patch for human action recognition on RGB and depth videos, IEEE Signal Processing Letters 22 (4) (2014) 426-429.
[34] Nitish Srivastava, Ruslan Salakhutdinov, et al., Multimodal learning with deep Boltzmann machines, J. Mach. Learn. Res. 15 (1) (2014) 2949-2980.
[35] Laurens van der Maaten, Geoffrey Hinton, Visualizing data using t-SNE, J. Mach. Learn. Res. 9 (11) (2008).
[36] Lichen Wang, Zhengming Ding, Zhiqiang Tao, Yunyu Liu, Yun Fu, Generative multi-view human action recognition, Proceedings of ICCV, 2019, pp. 6212-6221.
[37] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool, Temporal segment networks: towards good practices for deep action recognition, European Conference on Computer Vision, Springer, 2016, pp. 20-36.
[38] Svante Wold, Kim Esbensen, Paul Geladi, Principal component analysis, Chemom. Intell. Lab. Syst. 2 (1-3) (1987) 37-52.
[39] Sijie Yan, Yuanjun Xiong, Dahua Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, Proceedings of AAAI, 2018, pp. 7444-7452.
[40] Jianfei Yang, Han Zou, Hao Jiang, Lihua Xie, CareFi: sedentary behavior monitoring system via commodity WiFi infrastructures, IEEE Trans. Veh. Technol. 67 (8) (2018) 7620-7629.
[41] Ming Yin, Weitian Huang, Junbin Gao, Shared generative latent representation learning for multi-view clustering, Proceedings of AAAI, vol. 4, 2020, pp. 6688-6695.
[42] Qiyue Yin, Shu Wu, Liang Wang, Unified subspace learning for incomplete and unlabeled multi-view data, Pattern Recogn. 67 (2017) 313-327.
[43] Changqing Zhang, Yajie Cui, Zongbo Han, Joey Tianyi Zhou, Huazhu Fu, Qinghua Hu, Deep partial multi-view learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[44] Zhengyou Zhang, Microsoft Kinect sensor and its effect, IEEE Multimedia 19 (2012) 4-10.
[45] Rui Zhao, Wanru Xu, Hui Su, Qiang Ji, Bayesian hierarchical dynamic model for human action recognition, Proceedings of CVPR, 2019, pp. 7733-7742.
[46] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, Proceedings of ICCV, 2017, pp. 2223-2232.
Abstract: As many data in practical applications occur or can be captured in multiple views, multi-view action recognition has received much attention recently, since it utilizes the complementary and heterogeneous information in different views to promote downstream tasks. However, most existing methods assume that the multi-view data are complete, which may not always hold in real-world applications. To this end, this paper proposes a novel View Knowledge Transfer Network (VKTNet) to handle multi-view action recognition, even when some views are incomplete. Specifically, view knowledge transfer is performed using a conditional generative adversarial network (cGAN) to reproduce each view's latent representation, conditioned on the other view's information. As such, high-level semantic features are effectively extracted to bridge the semantic gap between two different views. In addition, in order to efficiently fuse the decision results achieved by each view, a Siamese Scale Network (SSN) is proposed instead of simply using a classifier. Experimental results on three public datasets show that our model achieves superior performance compared with other methods when all views are available, and it also avoids performance degradation when some views are missing.