Computer Vision and Image Understanding 222 (2022) 103484

Efficient dual attention SlowFast networks for video action recognition

Dafeng Wei a, Ye Tian a, Liqing Wei b, Hong Zhong b, Siqian Chen b, Shiliang Pu b,∗, Hongtao Lu a,∗

a Department of Computer Science and Engineering, MOE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
b Hikvision Research Institute, 555 Qianmo Road, Binjiang District, Hangzhou 310051, China

∗ Corresponding authors. E-mail addresses: pushiliang.hri@hikvision.com (S. Pu), htlu@sjtu.edu.cn (H. Lu).

A R T I C L E  I N F O

MSC: 68T45
Keywords: Efficient video action recognition; Dual attention mechanism; Efficient SlowFast networks
https://doi.org/10.1016/j.cviu.2022.103484
Received 20 November 2020; Received in revised form 5 April 2022; Accepted 14 June 2022; Available online 21 June 2022

A B S T R A C T

Video data mainly differ in the temporal dimension compared with static image data. Various video action recognition networks choose two-stream models to learn spatial and temporal information separately and fuse them to further improve performance. We propose a cross-modality dual attention fusion module named CMDA to explicitly exchange spatial–temporal information between the two pathways of two-stream SlowFast networks. Besides, considering the computational complexity of these heavy models and the low accuracy of existing lightweight models, we propose several two-stream efficient SlowFast networks based on well-designed efficient 2D networks, such as GhostNet, ShuffleNetV2 and so on. Experiments demonstrate that our proposed fusion module CMDA improves the performance of SlowFast, and our efficient two-stream models achieve a consistent increase in accuracy with little overhead in FLOPs. Our code and pre-trained models will be made available at https://github.com/weidafeng/Efficient-SlowFast.

1. Introduction

Video action recognition has a wide range of applications in various fields such as video surveillance (Khan et al., 2020), automatic driving, and so on. The performance of action recognition has been dramatically improved recently thanks to the advance of deep convolutional neural networks (Feichtenhofer et al., 2019; Simonyan and Zisserman, 2014; Wang et al., 2016; Christoph and Pinz, 2016; Feichtenhofer et al., 2016) and well-annotated large-scale datasets (Carreira and Zisserman, 2017; Kay et al., 2017; Carreira et al., 2018; Soomro et al., 2012; Materzynska et al., 2019). However, the drive to improve accuracy often comes at the cost of high computation and storage overhead.

Considering that video data mainly differ from image data in the temporal dimension, recent state-of-the-art works focus on designing appropriate architectures to learn the temporal and spatial features responding to movements and appearance in videos, respectively (Wang et al., 2016; Christoph and Pinz, 2016). SlowFast (Feichtenhofer et al., 2019) innovatively proposed a two-stream model to capture semantic information with the slow sparse-frame pathway and to capture temporal speed as well as rapidly changing motion with the fast dense-frame pathway. They also implement a simple convolution layer to fuse the motion information from the Fast pathway to the Slow one after each stage.

Although this idea works very well, it could be very challenging for neural networks to adaptively learn specific temporal and spatial information in the Fast and Slow pathways, respectively, without other explicit instrumental guidance. We calculate and visualize the salience map using the method of Grad-CAM, as illustrated in Fig. 1(a), and observe that the Slow pathway commonly focuses on unimportant areas by mistake, while the Fast pathway almost always gives equal attention to each pixel. This failure to focus on the area critical for recognizing the action class may cause performance degradation.

Besides, nearly all 3D CNN architectures, including SlowFast, are heavyweight, requiring tens or even hundreds of billions of floating-point operations (FLOPs), which places high demands on the hardware. Kopuklu et al. (2019) first converted the well-known 2D resource-efficient architectures such as SqueezeNet (Iandola et al., 2016), ShuffleNet (Zhang et al., 2018), MobileNet (Howard et al., 2017), MobileNetV2 (Sandler et al., 2018), and ShuffleNetV2 (Ma et al., 2018) to 3D versions along the temporal axis by extending the network inputs, features, and filter kernels into space and time. This conversion significantly reduces the parameters and calculation cost. However, these simple lightweight conversions come at the expense of a large decrease in accuracy.

To address these issues, firstly, we propose a cross-modality dual attention fusion module (CMDA) equipped with spatial–temporal and channel-wise attention mechanisms to explicitly transfer information not only from the Fast pathway to the Slow one as in SlowFast, but also from Slow to Fast. Fig. 1(b) demonstrates that our proposed CMDA gives both the Slow and Fast pathways the ability to learn when and where to focus.
Fig. 1. Salient feature maps of SlowFast visualized by Grad-CAM. In figure (a), the Slow path of SlowFast fails to focus on the critical part of the action, while the Fast path considers all parts indiscriminately. In figure (b), after being replaced with our proposed cross-modality dual attention fusion module, both the Slow and the Fast pathways have the power of attention to focus on the keyframe and the essential area. Examples are from the Kinetics-400 dataset.
Secondly, we construct several two-stream efficient 3D action recognition models based on GhostNet (Han et al., 2020), ShuffleNet, MobileNetV2, and ShuffleNetV2, following the idea of SlowFast and equipped with our proposed CMDA fusion module, to achieve better performance and real-time video action recognition.

To compare with previous research and for systematic studies, we use the Kinetics-400 dataset (Carreira and Zisserman, 2017) and 20BN-Jester-v1 (Materzynska et al., 2019). We also consider several different levels of complexity to make a fair comparison.
2. Related work

2.1. Two-stream methods for action recognition
The main difference between a static image and a dynamic video is the temporal dimension. Many works focused on this point and proposed several effective two-stream strategies to extract and aggregate spatial and temporal features using optical flow or just RGB frames (Simonyan and Zisserman, 2014; Wang et al., 2016; Christoph and Pinz, 2016). Inspired by the concept of two different kinds of primate eye cells (P and M) (Hubel and Wiesel, 1965) operating at different temporal frequencies, SlowFast (Feichtenhofer et al., 2019) originally proposed a method using different temporal speeds in the two pathways, expecting the heavy Slow path to focus on spatial semantic information and the lightweight Fast path to focus on temporal motion information.

This two-stream SlowFast CNN captures temporal speed information as well as spatial features and achieves significant gains on various tasks. However, without explicit guidance, it could be very challenging for the Slow path to know where to look and for the Fast path to focus on the keyframes. This leaves great potential for performance improvement.
2.2. Fusion module across two modalities
Spatiotemporal fusion is essential for the two-stream method because it controls how spatial and temporal signals are extracted at each layer, and how these signals are transported and interact with each other (Zhou et al., 2020). Several works use LSTM (Hochreiter and Schmidhuber, 1997) to fuse and discover long-range temporal relationships, such as Ng et al. (2015) and Li et al. (2016). Overall, existing approaches include late fusion, early fusion, and finer-grained fusion in each stage (Feichtenhofer et al., 2016; Simonyan and Zisserman, 2014; Wang et al., 2016).

Instead of simply using one or two convolutional layers as in some works (including SlowFast), more recent state-of-the-art networks like YOWO (Köpüklü et al., 2019) and Non-Local (Wang et al., 2018) introduce attention mechanisms to better merge the different types of information and further improve performance. Attention mechanisms, including spatial–temporal and channel-wise ones, aim to strengthen the most meaningful areas and channels, respectively, and weaken the others at the same time. Our work proposes a well-designed cross-modality fusion module equipped with 3D spatial–temporal self-attention (Vaswani et al., 2017) and efficient channel-wise attention (Wang et al., 2020), which gives SlowFast explicit guidance on when and where to pay close attention.
2.3. Efficient action recognition
Due to the difficulty of the action recognition task, researchers have built heavy and complex networks to improve accuracy, which incurs high computing and storage overhead (Girdhar et al., 2019; Hussein et al., 2019). Recently, more and more works take efficiency into consideration, for example by replacing 3D convolutions with (2+1)D ones (Tran et al., 2018), shifting features to mimic temporal information (Lin et al., 2019), learning to construct motion flow to avoid the calculation of optical flow (Sun et al., 2018), and designing efficient 3D backbones (Kopuklu et al., 2019), etc. Feichtenhofer (2020) progressively expanded a tiny 2D image classification architecture along multiple network axes (space, time, width and depth) to explore the computation/accuracy trade-off for video recognition. Lin et al. (2019) introduced a temporal shift module to extend ResNet to capture temporal information using memory shifting operations. Kopuklu et al. (2019) directly converted several classical 2D image classification networks to 3D along the temporal axis. However, the accuracy drops significantly along with the reduction in parameters and FLOPs compared to heavy networks. The main reason may lie in the insufficient temporal representation and learning ability of these lightweight simple networks.

Our work follows the concept of SlowFast, and we propose several efficient two-stream 3D networks based on the lightweight GhostNet, ShuffleNet, MobileNetV2, and ShuffleNetV2. Experiments demonstrate the consistent improvement in accuracy and performance of our models over the SOTA methods.
3. Methods
In this section, we will first present our fusion module named CMDA and then introduce our efficient SlowFast models in detail. Formally, we denote the feature map of the Slow pathway as [N, C, T, H, W] and the feature map of the Fast pathway as [N, βC, αT, H, W], where N is the batch size, C is the number of convolution channels, and T, H, W are the temporal length, spatial height and width, respectively. α and β are the speed and channel ratios following the definition of SlowFast.
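For concreteness, these shape conventions can be sketched as follows (the variable names and the sample values of α and β are our assumptions for illustration, not values taken from the paper's code):

import torch

# Assumed shape conventions following the SlowFast definitions:
# alpha is the speed ratio, beta the channel ratio.
N, C, T, H, W = 2, 64, 8, 56, 56
alpha, beta = 4, 1 / 8

slow = torch.randn(N, C, T, H, W)                      # Slow pathway features
fast = torch.randn(N, int(beta * C), alpha * T, H, W)  # Fast pathway features
print(slow.shape, fast.shape)
# torch.Size([2, 64, 8, 56, 56]) torch.Size([2, 8, 32, 56, 56])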
3.1. Cross-modality dual attention fusion module (CMDA)
SlowFast built a solid baseline on various action datasets based on the concept of Slow and Fast pathways with different temporal frequencies. They tried three simple unidirectional methods to flow information from the Fast pathway to the Slow one, using either transposing, downsampling, or a single convolution layer. Their experiments found that using Slow and Fast pathways with various types of lateral connections throughout the network hierarchy is consistently better than the Slow
and Fast only baselines. But such unidirectional methods may not exchange sufficient information; for example, the lightweight Fast path cannot get any instrumental spatial guidance learned from the Slow path. Also, the information fused from the Fast pathway may be oblique or indirect for the Slow pathway after merely one Conv-layer transformation. Fig. 1 illustrates some examples of these shortcomings.

In general, there are several design principles that the fusion module needs to adhere to: (1) sufficient and explicit information exchange; (2) fast computation and resource efficiency. To satisfy (1), our proposed cross-modality dual attention fusion module contains two kinds of attention mechanisms, which we introduce in detail below. Then a delicate procedure is designed to execute the heavy calculations with fewer parameters to meet (2). Fig. 2 shows the overview of our proposed CMDA.

Fig. 2. Overview of our proposed cross-modality dual attention fusion model (CMDA). We exchange information between the Slow pathway and the Fast pathway through spatial–temporal and efficient channel-wise attention modules. To further reduce parameters and computational overhead, we carefully designed the pipeline to perform heavy computation on a smaller feature map. SA and CA mean spatial–temporal attention and efficient channel-wise attention, respectively.

3.1.1. Attention mechanisms

Humans exploit a sequence of glimpses and selectively focus on salient parts and areas to capture temporal and visual structure better. Besides, an attention mechanism not only tells when and where to focus but also improves the representation of interests. We implement a 3D self-attention module to calculate the spatial–temporal attention map in the lateral path from Slow to Fast, and an efficient 3D channel attention module to assign importance weights to different channels, i.e. Eqs. (1) and (2). The details of these two equations are explained below.

SA(Q, K, V) = softmax(QKᵀ)V    (1)

Y[N, C, T, H, W] = SA(Q, K, V) + X[N, C, T, H, W]    (2)

Spatial–temporal attention. Different from the 2D static-image version which only calculates spatial attention, we expand self-attention to 3D along the temporal axis to calculate spatial–temporal attention. Given the input feature X[N, C, T, H, W], we calculate three matrices named Q, K, V through 1 ∗ 1 ∗ 1 convolutions and apply dot products utilizing the concept of self-attention of Vaswani et al. (2017), as in Eq. (1). As shown in Fig. 3, the main difference from the self-attention in Vaswani et al. (2017) is that we consider the temporal dimension and therefore extract both spatial and temporal attention. To further reduce the computation of this block, we follow the bottleneck design of He et al. (2016) and reduce the feature maps of the Q and K pathways with a tunable reduction parameter. We also introduce a residual connection to guarantee the learning ability of the model. Therefore, we define the output Y[N, C, T, H, W] of this spatial–temporal attention module using Eq. (2), where + denotes element-wise sum, i.e. a residual path to preserve the static background information. Note that the feature size is maintained throughout the block, so this module can be directly plugged into many other spatial–temporal networks to further boost performance with the help of the attention mechanism.

Fig. 3. Spatial–temporal attention module. ⊗ denotes matrix multiplication, and ⊕ denotes element-wise sum.
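As a concrete reference point, here is a minimal PyTorch sketch of such a 3D spatial–temporal self-attention block. The module and parameter names, the default reduction of 8, and the softmax scaling are our assumptions, not the paper's released implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalAttention3D(nn.Module):
    """Sketch of 3D self-attention over all T*H*W positions (Eqs. (1)-(2)).

    `reduction` mimics the bottleneck that shrinks the Q/K channels.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = max(channels // reduction, 1)
        self.q = nn.Conv3d(channels, inner, kernel_size=1)  # 1*1*1 convs
        self.k = nn.Conv3d(channels, inner, kernel_size=1)
        self.v = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # [N, L, C/r], L = T*H*W
        k = self.k(x).flatten(2)                        # [N, C/r, L]
        v = self.v(x).flatten(2).transpose(1, 2)        # [N, L, C]
        attn = F.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # [N, L, L]
        out = (attn @ v).transpose(1, 2).reshape(n, c, t, h, w)
        return out + x              # residual path, Eq. (2); shape preserved

Note the L × L attention map over all spatial–temporal positions is exactly why the paper computes this block on a reduced feature map.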
Efficient channel attention. The performance of channel attention has also been widely verified in various works (Köpüklü et al., 2019; Hu et al., 2018). To lower the model complexity, ECA (Wang et al., 2020) investigates a 1D Conv with an adaptive kernel size to replace the FC layers in the channel attention module, aiming at capturing local cross-channel interaction efficiently. We reconstruct it into a 3D version to support our video action scenarios. Specifically, as shown in Fig. 4, we remove the spatial and temporal dimensions through 3D channel-wise global average pooling (Eq. (3)) and keep channel information only. Then we squeeze and transpose the channel-wise feature map and perform a 1D Conv and a sigmoid function on the squeezed feature to calculate the channel attention. The kernel size of the 1D Conv controls how many neighbors to consider, which is both more efficient and more effective than the fully connected layers of SENet (Hu et al., 2018).

g(X[N, C, T, H, W]) = (1 / (T ∗ H ∗ W)) ∑_{t,h,w} X[:, :, t, h, w]    (3)

Fig. 4. Efficient channel attention module. ◦ denotes Hadamard product.
Moreover, unlike SENet, we add a residual path to preserve the static background information rather than suppressing it completely. We also maintain the size of the input feature map, so the module can easily be attached to each convolution block to learn and capture channel attention.
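A minimal sketch of this 3D efficient channel attention, under our own naming and with a fixed kernel size of 3 (the original ECA uses an adaptive kernel size), could look as follows:

import torch
import torch.nn as nn

class EfficientChannelAttention3D(nn.Module):
    """Sketch of ECA (Wang et al., 2020) lifted to 3D, as described above.

    The 1D conv's kernel size controls how many channel neighbors
    interact; the residual add preserves background information.
    """
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # Eq. (3): 3D channel-wise global average pooling -> [N, C]
        y = x.mean(dim=(2, 3, 4))
        # 1D conv across the channel axis, then a sigmoid gate
        y = torch.sigmoid(self.conv(y.unsqueeze(1))).squeeze(1)  # [N, C]
        # Hadamard product plus a residual path (unlike plain SENet)
        return x * y.view(n, c, 1, 1, 1) + x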
3.1.2. Fusion methods

As illustrated in Fig. 2, to fuse the information between the Slow and the Fast pathways, we should match the feature sizes first, i.e. the C and T dimensions in [N, C, T, H, W]. Fig. 5 shows the difference between the original SlowFast networks and our proposed cross-modal dual attention networks equipped with CMDA fusion modules. We exchange information between the Slow and the Fast pathways, while the original networks merely flow features unidirectionally from Fast to Slow.

Fusion from Fast to Slow. In this direction, the Fast pathway has α times more frames than the Slow pathway, so we could simply take every α-th frame along the temporal dimension of the Fast pathway without any parameters or computational overhead. This simple operation has good interpretability, i.e. sequential frame extraction, because SlowFast maintains high temporal resolution throughout the Fast pathway. However, such equal-interval sampling may miss the frames that are crucial for identifying the action class, and we cannot guarantee that keyframes will be obtained every time. Therefore, we instead use 3D temporal max-pooling with kernel_size = [α, 1, 1] and stride = [α, 1, 1]. This operation only considers the temporal dimension and takes no account of the spatial features, which also does not break the interpretability of temporal downsampling. In this way, we obtain a spatial–temporal salience map in a broader sense, where each pixel of the resulting feature map is the most salient one among the α frames. Then an efficient channel-wise attention map is calculated to further improve the performance. Note that we do not change the channel dimension, because the Fast pathway is designed with low channel capacity, i.e. β times (β < 1) the channels of the Slow pathway to keep it lightweight, so we can directly concatenate the lateral fusion path to the next stage of the original Slow pathway with minor parameter overhead.

Fusion from Slow to Fast. Different from the temporal dimension, the channel dimension has no interpretable downsampling sense. We follow the common method and use a single 1 ∗ 1 ∗ 1 Conv layer to reduce and match the channel size. After that, we calculate the spatial–temporal attention map immediately, before further operations, i.e. we conduct the operations with higher computational cost while the feature map is smaller by optimizing the execution order. Then we upsample the feature map along the temporal axis with scale_factor = [α, 1, 1]. To reduce computation and maintain interpretability, we use the nearest-neighbor sampling method, so our fusion module is both effective and efficient. Finally, we concatenate the fusion features with the original Fast pathway.
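To make the two lateral directions concrete, below is a minimal PyTorch sketch of both size-matching operations. The tensor sizes, α = 4, β = 1/4, and all variable names are our illustrative assumptions rather than the authors' released code, and the attention modules described above are omitted for brevity:

import torch
import torch.nn as nn
import torch.nn.functional as F

alpha, beta = 4, 0.25                      # assumed speed/channel ratios
slow = torch.randn(2, 64, 8, 56, 56)       # [N, C, T, H, W]
fast = torch.randn(2, 16, 32, 56, 56)      # [N, beta*C, alpha*T, H, W]

# Fast -> Slow: parameter-free temporal max-pooling keeps the most
# salient value in every window of alpha frames, leaving space intact.
f2s = F.max_pool3d(fast, kernel_size=(alpha, 1, 1), stride=(alpha, 1, 1))
slow_fused = torch.cat([slow, f2s], dim=1)            # [2, 80, 8, 56, 56]

# Slow -> Fast: a 1*1*1 conv matches the channel size first (so costlier
# steps run on the smaller map), then nearest-neighbor upsampling along
# time matches the Fast frame rate.
reduce_ch = nn.Conv3d(64, int(64 * beta), kernel_size=1)
s2f = F.interpolate(reduce_ch(slow), scale_factor=(alpha, 1, 1),
                    mode='nearest')
fast_fused = torch.cat([fast, s2f], dim=1)            # [2, 32, 32, 56, 56]
print(slow_fused.shape, fast_fused.shape)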
3.2. Efficient 3D SlowFast networks

There have been great advances recently in building resource-efficient 2D CNN architectures considering memory and power budgets, such as GhostNet, MobileNet, ShuffleNet, MobileNetV2, and ShuffleNetV2. Kopuklu et al. created 3D versions of these well-known 2D efficient architectures with low computational complexity, but inevitably, these models cannot achieve competitive performance compared to large models. We attribute the problem to the difficulty of the action recognition task, the simplicity of the models, and the implicit utilization of temporal information. Following the success of the SlowFast two-stream networks, we designed several efficient 3D SlowFast architectures.

In general, our efficient SlowFast networks can be described as a single-stream architecture that operates at two different framerates and exchanges information bidirectionally after each stage using our proposed CMDA fusion module. The Slow path with more channels operates at a lower temporal speed to focus more on spatial and semantic information, while the lightweight Fast path with a consistent temporal dimension throughout the network is driven to pursue dynamic motion information. We build our SlowFast two-stream version based on Kopuklu et al. (2019) but remove the temporal pooling until the global pooling layer before classification.

Specifically, taking GhostNet as an example, this brilliant work proposed a novel Ghost Module to generate more features from cheap 1 ∗ 1 depth-wise convolutions, which can fully reveal information underlying the intrinsic features. We convert it to a 3D version carefully and keep the temporal dimension throughout the network. GhostNet contains 5 stages, and we fuse information using our CMDA after each stage except the last one. After the fusion layer, the input shape only grows in the first layer of the next stage, which keeps the network lightweight. Considering the low channel capacity of the Fast pathway, we ensure all original layers have a channel number divisible by 4.
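To illustrate the overall construction, below is a toy PyTorch skeleton. Everything in it — the stage widths, α = 4, β = 1/4, plain conv stages standing in for ghost/shuffle/inverted-residual blocks, and a CMDA reduced to the parameter-free exchange without its attention modules — is our simplification for illustration, not the authors' architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

def stage3d(cin, cout):
    # Stand-in for one 3D stage of a lightweight backbone;
    # halves H and W, keeps the temporal dimension T.
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=(1, 2, 2), padding=1),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class EfficientSlowFastSketch(nn.Module):
    """Toy two-stream skeleton with bidirectional fusion after each stage."""
    def __init__(self, num_classes=400, alpha=4, beta=0.25):
        super().__init__()
        self.alpha = alpha
        widths = [16, 32, 64]
        s_in = [3, 16 + 4, 32 + 8]   # slow input grows by beta*C after concat
        f_in = [3, 4 + 4, 8 + 8]     # fast input doubles after concat
        self.slow = nn.ModuleList(stage3d(i, o) for i, o in zip(s_in, widths))
        self.fast = nn.ModuleList(
            stage3d(i, int(o * beta)) for i, o in zip(f_in, widths))
        self.s2f = nn.ModuleList(
            nn.Conv3d(o, int(o * beta), 1) for o in widths[:-1])
        self.head = nn.Linear(64 + 16, num_classes)

    def forward(self, slow, fast):
        for i, (s_stage, f_stage) in enumerate(zip(self.slow, self.fast)):
            slow, fast = s_stage(slow), f_stage(fast)
            if i < len(self.s2f):    # bidirectional lateral fusion
                f2s = F.max_pool3d(fast, (self.alpha, 1, 1), (self.alpha, 1, 1))
                s2f = F.interpolate(self.s2f[i](slow),
                                    scale_factor=(self.alpha, 1, 1),
                                    mode='nearest')
                slow = torch.cat([slow, f2s], 1)
                fast = torch.cat([fast, s2f], 1)
        x = torch.cat([slow.mean((2, 3, 4)), fast.mean((2, 3, 4))], 1)
        return self.head(x)

model = EfficientSlowFastSketch()
logits = model(torch.randn(1, 3, 8, 64, 64), torch.randn(1, 3, 32, 64, 64))
print(logits.shape)  # torch.Size([1, 400])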
4. Experiments

In this section, we first explain the datasets and experimental settings. Then we design two series of experiments to evaluate our proposed CMDA fusion module and our several efficient 3D variants, respectively.
4.1. Datasets

We select the widely used Kinetics-400 (Carreira and Zisserman, 2017) and 20BN-Jester-v1 (Materzynska et al., 2019) datasets to evaluate our approaches.

The Kinetics-400 dataset contains 400 human action classes. The video clips are around 10 s long and trimmed from raw YouTube videos. Note that there are in total 224,919 training clips and 18,525 validation clips in our experiments because some YouTube video links have expired. This dataset contains complex scenes and object contents in the format of video clips and has a large temporal dependency.

The 20BN-Jester-v1 dataset is a large collection of densely-labeled video clips that show humans performing pre-defined hand gestures in front of a laptop camera or webcam. There are in total 148,092 gesture
videos under 27 classes. The training set contains 118,562 videos, while the validation set contains 14,787 videos. Due to the lack of ground-truth labels, we do not use the test set, following the mainstream approaches. Note that only 5 of the 27 classes (i.e. Drumming Fingers, Thumb Up, Thumb Down, Stop Sign, Shaking Hand) can be described as static gestures and recognized from a single frame. All other categories require distinguishing fine-grained motion details (such as Rolling Hand Backward, Rolling Hand Forward, Swiping Down, Swiping Up and so on), which means that the 20BN-Jester-v1 dataset is suitable for inspecting the ability of the networks to capture motion patterns (Materzynska et al., 2019; Kopuklu et al., 2019).

4.2. Training details

All our models are trained from scratch without using any pre-training. To validate our proposed fusion module CMDA, we follow the recipe in SlowFast to pre-process, train and test. For the Kinetics-400 dataset, we use a spatial size of 224 ∗ 224, which is randomly cropped from a scaled video whose shorter side is randomly sampled in [256, 320] pixels. For the temporal domain, we randomly sample a clip (of α ∗ T ∗ τ frames) from the full-length video, and the inputs to the Slow and Fast pathways are respectively T and α ∗ T frames. A typical setting of temporal stride τ = 16 and speed ratio α = 4 is used in our experiments, i.e. the input clips of the Fast pathway contain 32 frames, and those of the Slow pathway contain 8 frames.

To train our efficient SlowFast networks, we follow the recipe in Kopuklu et al. (2019), which uses 112 ∗ 112 pixels in the spatial domain and 16 frames in the temporal domain. We therefore randomly sample in [125, 160] pixels for the shorter side and scale to 112 ∗ 112 in the spatial domain. The Fast and Slow pathways contain 16 and 4 frames respectively to make a fair comparison. One thing to note is that, considering the characteristics of gestures (for example, Swiping Left and Swiping Right could be confused after flipping), we do not perform horizontal flipping augmentation on the 20BN-Jester-v1 dataset. Instead, we perform color jittering augmentation including contrast, brightness, and saturation to simulate the diversity of indoor light. We use a momentum of 0.9 and a weight decay of 0.0001. We adopt dropout after the global pooling layer, with a dropout ratio of 0.5 for Kinetics-400 and 0.2 for the 20BN-Jester-v1 dataset.

At test time, we sample 10 clips per video with uniform temporal spacing, and each clip takes 3 crops (left, middle, right) of 224 ∗ 224 to cover the spatial dimensions, as in Feichtenhofer et al. (2019) and Wang et al. (2018). For the efficient SlowFast networks, we crop 112 ∗ 112 in the spatial domain. We combine the predictions with average pooling.
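A hedged sketch of this 10-clip × 3-crop inference protocol follows; the function name and the single-tensor model interface are our assumptions (a SlowFast model would additionally split each view into its Slow and Fast inputs):

import torch

@torch.no_grad()
def evaluate_video(model, video, num_clips=10, clip_len=32, crop=224):
    """`video` is [C, T, H, W]; its height is assumed to already equal
    `crop` after resizing the shorter side. Averages the 30 views."""
    c, t, h, w = video.shape
    starts = torch.linspace(0, max(t - clip_len, 0), num_clips).long()
    offsets = (0, (w - crop) // 2, w - crop)        # left, middle, right
    scores = []
    for s in starts:
        clip = video[:, int(s):int(s) + clip_len]
        for o in offsets:
            view = clip[:, :, :, o:o + crop].unsqueeze(0)
            scores.append(model(view).softmax(dim=-1))
    return torch.stack(scores).mean(dim=0)          # averaged prediction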
Fig. 5. Comparison of the pipeline of the original SlowFast networks (the first row) and our efficient SlowFast networks (the second row) equipped with the CMDA fusion module. ⊕ denotes concatenation. We can easily substitute various basic stages from any lightweight network, such as the inverted residual module of MobileNetV2, the ghost module of GhostNet and so on.

4.3. Fusion module results

As described in Fig. 5, we only substitute the original unidirectional fusion module with our proposed bidirectional CMDA module. Table 1 compares these two models along several dimensions.

Table 1. Comparison with the SOTA SlowFast on the Kinetics-400 validation set. Our dual-attention modification equipped with the CMDA outperforms the counterpart SlowFast networks with higher accuracy, and uses fewer parameters and calculations due to the well-designed processes.

Model      #Params    #FLOPs    Top-1 Acc    Top-5 Acc
SlowFast   34.57 M    50.31 G   77.0         92.6
Ours       33.94 M    49.64 G   77.4         93.0

Our proposed CMDA fusion module achieves better accuracy. Notably, our results are achieved at lower FLOPs and with fewer parameters than the original SlowFast networks, which is mainly due to the efficient design of CMDA. Even though we fuse information bidirectionally rather than unidirectionally, our method selects salient information by spatial–temporal and channel-wise attention, which makes it possible to concatenate fewer features. Furthermore, we perform the more expensive operations in stages with fewer parameters by optimizing the process in our fusion module. Besides, we calculate and plot the salient feature maps using Grad-CAM in Fig. 1. As expected, with the help of attention mechanisms and the explicit bidirectional information exchange paths, the Slow pathway can focus better on the important contextual semantic area, and the Fast path has also been endowed with the ability of sense perception. Benefiting from the enhancement of both pathways, our network unsurprisingly achieves better performance.

4.4. Light-weight SlowFast results

We convert several classical efficient networks to 3D following the concept of Kopuklu et al. (2019), except that we keep the temporal resolution throughout the network. Then we construct their two-stream counterparts following the design philosophy of the SlowFast networks. Table 2 makes a comprehensive comparison on Kinetics-400, where our SlowFast networks equipped with the CMDA outperform the recent lightweight SOTA approaches. Compared with the baseline networks, our models have slightly more parameters and FLOPs, mainly because we maintain the temporal dimension throughout the network until the last fully connected layer.
We also conduct experiments on the 20BN-Jester-v1 dataset and compare runtime performance with the baseline methods. Most of the state-of-the-art methods are heavy backbone architectures and employ optical flow to achieve higher accuracy, which is computationally expensive to run in real time. In contrast, as illustrated in Table 3, our efficient models enjoy a much higher FPS while maintaining competitive accuracy. Considering that the acting body sits in the middle of the frames, which is much easier to focus on and recognize, the role of the attention mechanism is limited here and the improvement is not as significant as on Kinetics-400. Note that MobileNetV2 has the lowest runtime performance compared with ShuffleNet and ShuffleNetV2. The main reason for this observation is that the CUDNN library is well optimized for standard convolutions, while MobileNetV2 uses a large number of depth-wise convolutions (Kopuklu et al., 2019; Xie et al., 2017).
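For reference, a rough way to probe such forward-pass FPS in PyTorch is sketched below; this is our illustration, and the paper's exact measurement protocol on the Tesla P40 may differ:

import time
import torch

@torch.no_grad()
def measure_fps(model, clip_len=16, size=112, n_runs=50, device='cuda'):
    """Times the forward pass only, as in the Table 3 caption."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, clip_len, size, size, device=device)
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()
    clips_per_s = n_runs / (time.time() - start)
    return clips_per_s * clip_len       # frames per second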
Table 2. Comparison with the state-of-the-art light-weight action recognition methods on Kinetics-400. Our approaches achieve consistent and remarkable improvements based on all kinds of classical efficient networks. The number of floating point operations (FLOPs) is calculated for 16 frames with a spatial resolution of 112 ∗ 112. Detailed analysis can be found in Section 4.4.

Model (a)                    Ours   #Params   #FLOPs   Top-1 Acc   Top-5 Acc
3D ShuffleNetV2 0.25X               0.62 M    42 M     24.12       49.42
SlowFastShuffleNetV2 0.25X   ✓      0.71 M    66 M     28.79       54.77
3D ShuffleNetV2 1.0X                1.71 M    119 M    38.54       65.00
SlowFastShuffleNetV2 1.0X    ✓      1.95 M    234 M    47.26       73.13
3D ShuffleNetV2 2.0X                6.23 M    360 M    48.00       73.96
SlowFastShuffleNetV2 2.0X    ✓      7.06 M    802 M    54.22       78.02
3D ShuffleNet 2.0X G3               4.37 M    386 M    51.06       76.40
SlowFastShuffleNet 2.0X G3   ✓      6.41 M    710 M    53.84       77.38
3D ShuffleNet 2.0X G1               4.19 M    406 M    50.19       74.44
SlowFastShuffleNet 2.0X G1   ✓      5.70 M    530 M    54.99       78.45
SlowFastGhostNet 1.0X        ✓      4.65 M    36 M     46.03       72.66
3D MobileNetV2 1.0X                 2.87 M    445 M    38.54       66.05
SlowFastMobileNetV2 1.0X     ✓      10.78 M   1621 M   48.12       73.68

(a) Since the baseline network (Kopuklu et al., 2019) only provides results on the Kinetics-600 dataset, we train and report results on the Kinetics-400 dataset following the official code and recipe.

Table 3. Accuracy and speed comparisons on the 20BN-Jester-v1 dataset. The frames per second (FPS) considers the forward process with a spatial resolution of 112 ∗ 112, and is measured on a single NVIDIA Tesla P40 GPU.

Model                        Ours   Top-1 Acc   FPS
3D ShuffleNet 2.0X G3               93.54       139
SlowFastShuffleNet 2.0X G3   ✓      93.72       204
3D ShuffleNetV2 2.0X                93.71       123
SlowFastShuffleNetV2 2.0X    ✓      94.17       196
3D MobileNetV2 1.0X                 94.59       126
SlowFastMobileNetV2 1.0X     ✓      94.62       115

Fig. 6. Activation maps for our proposed SlowFastMobileNetV2 backbone. We visualize different examples from the 20BN-Jester-v1 dataset. The Slow pathway focuses more precisely on fine spatial information, such as the fingers in the 3rd frame of the first example, and the Fast pathway is also endowed with the ability of spatial attention. We only show the first 8 frames of the Fast pathway due to space constraints. More explanation can be found in Section 4.5.
4.5. Visualization results

To better understand the behavior of our proposed cross-modality dual attention models, we visualize the activation maps based on the Grad-CAM method in Fig. 6. The Slow pathway is designed to focus on fine-grained spatial information and the Fast pathway on long-range temporal motion information. As expected, in the Slow pathway of the first example, our network performs very well not only at the hand level but also at the finger level. And with the help of motion information fused from the Fast pathway, the Slow part in the second example knows to focus on the curved arc area that the fingers swiped across.

From the visualization, we can also discover that the Fast pathway has the ability to pay attention to the core and contextual views, which is transferred from the Slow pathway and essential for video classification. These visual results verify that our proposed CMDA can indeed fuse spatial and temporal information from the Slow and Fast pathways, letting them benefit each other to further improve the performance of video action classification.
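For readers who want to reproduce such maps, here is a minimal Grad-CAM sketch for a 3D CNN; the function name and hook-based implementation are our illustration, not the authors' tooling:

import torch

def grad_cam_3d(model, target_layer, clip, class_idx):
    """`target_layer` is any module inside `model` whose output is
    [N, C, T, H, W]; returns a normalized [T, H, W] salience volume."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    model.zero_grad()
    score = model(clip)[0, class_idx]    # score of the target class
    score.backward()
    h1.remove(); h2.remove()
    a, g = feats[0], grads[0]
    weights = g.mean(dim=(2, 3, 4), keepdim=True)   # GAP over T, H, W
    cam = torch.relu((weights * a).sum(dim=1))[0]   # [T, H, W]
    return cam / (cam.max() + 1e-8)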
5. Conclusion

We have proposed a new fusion strategy equipped with both spatial–temporal attention and channel-wise attention mechanisms to explicitly exchange information between the Slow and the Fast pathways of the SlowFast networks. Following the concept of the SlowFast networks, we developed several efficient two-stream action recognition models based on the well-designed GhostNet, ShuffleNet, ShuffleNetV2 and MobileNetV2. The salience maps plotted by Grad-CAM verify that the proposed fusion module successfully empowers both pathways with the attention ability. Experiments demonstrate that our proposed fusion module consistently improves the performance of the SlowFast networks while maintaining a low computation cost.

CRediT authorship contribution statement

Dafeng Wei: Conceptualization, Methodology, Writing – original draft. Ye Tian: Visualization, Investigation. Liqing Wei: Software, Data curation. Hong Zhong: Supervision, Formal analysis. Siqian Chen: Resources. Shiliang Pu: Project administration. Hongtao Lu: Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments

This paper is supported by NSFC (No. 62176155, 62066002) and the Shanghai Municipal Science and Technology Major Project, China, under grant no. 2021SHZDZX0102.
References
Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A., 2018. A short note about Kinetics-600. arXiv preprint arXiv:1808.01340.
Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308.
Christoph, R., Pinz, F.A., 2016. Spatiotemporal residual networks for video action recognition. Adv. Neural Inf. Process. Syst. 3468–3476.
Feichtenhofer, C., 2020. X3D: Expanding architectures for efficient video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR.
Feichtenhofer, C., Fan, H., Malik, J., He, K., 2019. SlowFast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6202–6211.
Feichtenhofer, C., Pinz, A., Zisserman, A., 2016. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1933–1941.
Girdhar, R., Carreira, J., Doersch, C., Zisserman, A., 2019. Video action transformer network. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, pp. 244–253.
Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., Xu, C., 2020. GhostNet: More features from cheap operations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1580–1589.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141.
Hubel, D.H., Wiesel, T.N., 1965. Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. J. Neurophysiol. 28 (2), 229–289.
Hussein, N., Gavves, E., Smeulders, A.W.M., 2019. Timeception for complex action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, pp. 254–263.
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K., 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al., 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Khan, M.A., Javed, K., Khan, S.A., Saba, T., Habib, U., Khan, J.A., Abbasi, A.A., 2020. Human action recognition using fusion of multiview and deep features: an application to video surveillance. Multimedia Tools Appl. 1–27.
Kopuklu, O., Kose, N., Gunduz, A., Rigoll, G., 2019. Resource efficient 3D convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision Workshops.
Köpüklü, O., Wei, X., Rigoll, G., 2019. You only watch once: A unified CNN architecture for real-time spatiotemporal action localization. arXiv preprint arXiv:1911.06644.
Li, Y., Lan, C., Xing, J., Zeng, W., Yuan, C., Liu, J., 2016. Online human action detection using joint classification-regression recurrent neural networks. In: European Conference on Computer Vision.
Lin, J., Gan, C., Han, S., 2019. TSM: Temporal shift module for efficient video understanding. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 7083–7093.
Ma, N., Zhang, X., Zheng, H.-T., Sun, J., 2018. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 116–131.
Materzynska, J., Berger, G., Bax, I., Memisevic, R., 2019. The Jester dataset: A large-scale video dataset of human gestures. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop. ICCVW, pp. 2874–2882.
Ng, Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G., 2015. Beyond short snippets: Deep networks for video classification.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C., 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4510–4520.
Simonyan, K., Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576.
Soomro, K., Zamir, A.R., Shah, M., 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Sun, S., Kuang, Z., Sheng, L., Ouyang, W., Zhang, W., 2018. Optical flow guided feature: A fast and robust motion representation for video action recognition. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, pp. 1390–1399.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M., 2018. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6450–6459.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008.
Wang, X., Girshick, R., Gupta, A., He, K., 2018. Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794–7803.
Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q., 2020. ECA-Net: Efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11534–11542.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L., 2016. Temporal segment networks: Towards good practices for deep action recognition. In: European Conference on Computer Vision. Springer, pp. 20–36.
Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, pp. 5987–5995.
Zhang, X., Zhou, X., Lin, M., Sun, J., 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6848–6856.
Zhou, Y., Sun, X., Luo, C., Zha, Z.-J., Zeng, W., 2020. Spatiotemporal fusion in 3D CNNs: A probabilistic view. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9829–9838.