Computer Vision and Image Understanding 222 (2022) 103484

Efficient dual attention SlowFast networks for video action recognition

Dafeng Wei a, Ye Tian a, Liqing Wei b, Hong Zhong b, Siqian Chen b, Shiliang Pu b,∗, Hongtao Lu a,∗

a Department of Computer Science and Engineering, MOE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
b Hikvision Research Institute, 555 Qianmo Road, Binjiang District, Hangzhou 310051, China

∗ Corresponding authors. E-mail addresses: pushiliang.hri@hikvision.com (S. Pu), htlu@sjtu.edu.cn (H. Lu).

A R T I C L E  I N F O

MSC: 68T45
Keywords: Efficient video action recognition; Dual attention mechanism; Efficient SlowFast networks
https://doi.org/10.1016/j.cviu.2022.103484
Received 20 November 2020; Received in revised form 5 April 2022; Accepted 14 June 2022; Available online 21 June 2022

A B S T R A C T

Video data mainly differ in the temporal dimension compared with static image data. Various video action recognition networks choose two-stream models to learn spatial and temporal information separately and fuse them to further improve performance. We propose a cross-modality dual attention fusion module named CMDA to explicitly exchange spatial–temporal information between the two pathways of two-stream SlowFast networks. Besides, considering the computational complexity of these heavy models and the low accuracy of existing lightweight models, we propose several two-stream efficient SlowFast networks based on well-designed efficient 2D networks, such as GhostNet, ShuffleNetV2 and so on. Experiments demonstrate that our proposed fusion module CMDA improves the performance of SlowFast, and our efficient two-stream models achieve a consistent increase in accuracy with little overhead in FLOPs. Our code and pre-trained models will be made available at https://github.com/weidafeng/Efficient-SlowFast.

1. Introduction

Video action recognition has a wide range of applications in various fields such as video surveillance (Khan et al., 2020), automatic driving, and so on. The performance of action recognition has been dramatically improved recently thanks to the advance of deep convolutional neural networks (Feichtenhofer et al., 2019; Simonyan and Zisserman, 2014; Wang et al., 2016; Christoph and Pinz, 2016; Feichtenhofer et al., 2016) and well-annotated large-scale datasets (Carreira and Zisserman, 2017; Kay et al., 2017; Carreira et al., 2018; Soomro et al., 2012; Materzynska et al., 2019). However, the drive to improve accuracy often comes at the cost of high computation and storage overhead.

Considering that video data mainly differ from image data in the temporal dimension, recent state-of-the-art works focus on designing appropriate architectures to learn the temporal and spatial features responding to movements and appearance in videos, respectively (Wang et al., 2016; Christoph and Pinz, 2016). SlowFast (Feichtenhofer et al., 2019) innovatively proposed a two-stream model to capture semantic information with the slow sparse-frame pathway and to capture temporal speed as well as rapidly changing motion with the fast dense-frame pathway. They also implement a simple convolution layer to fuse the motion information from the Fast pathway to the Slow one after each stage.

Although this idea works very well, it could be very challenging for neural networks to adaptively learn specific temporal and spatial information in the Fast and Slow pathways, respectively, without other explicit instrumental guidance. We calculate and visualize the salience map using the method of Grad-CAM, as illustrated in Fig. 1(a), and observe that the Slow pathway commonly focuses on unimportant areas by mistake, while the Fast pathway almost always gives equal attention to each pixel. This failure to focus on the area critical for recognizing the action class may cause performance degradation.

Besides, nearly all 3D CNN architectures, including SlowFast, are heavyweight, requiring tens or even hundreds of billions of floating-point operations (FLOPs), which places high demands on the hardware. Kopuklu et al. (2019) first converted the well-known 2D resource-efficient architectures such as SqueezeNet (Iandola et al., 2016), ShuffleNet (Zhang et al., 2018), MobileNet (Howard et al., 2017), MobileNetV2 (Sandler et al., 2018), and ShuffleNetV2 (Ma et al., 2018) to 3D versions along the temporal axis by extending the network inputs, features, and filter kernels into space and time. This conversion significantly reduces the parameters and calculation cost. However, these simple lightweight conversions come at the expense of a large decrease in accuracy.

To address these issues, firstly, we propose a cross-modality dual attention fusion module (CMDA) equipped with spatial–temporal and channel-wise attention mechanisms to explicitly transfer information not only from the Fast pathway to the Slow one as in SlowFast, but also from Slow to Fast. Fig. 1(b) demonstrates that our proposed CMDA gives both the Slow and Fast pathways the ability to learn when and where to focus.
Fig. 1. Salient feature maps of SlowFast visualized by Grad-CAM. In figure (a), the Slow path of SlowFast fails to focus on the critical part of the action, while the Fast path considers all parts indiscriminately. In figure (b), after being replaced with our proposed cross-modality dual attention fusion module, both the Slow and the Fast pathways have the power of attention to focus on the keyframe and the essential area. Examples are from the Kinetics-400 dataset.
Secondly, we construct several two-stream efficient 3D action recognition models based on GhostNet (Han et al., 2020), ShuffleNet, MobileNetV2, and ShuffleNetV2, following the idea of SlowFast and equipped with our proposed CMDA fusion module, to achieve better performance and real-time video action recognition.

To compare with previous research and for systematic studies, we use the Kinetics-400 dataset (Carreira and Zisserman, 2017) and 20BN-Jester-v1 (Materzynska et al., 2019). We also consider several different levels of complexity to make a fair comparison.
2. Related work

2.1. Two-stream methods for action recognition
The main difference between a static image and a dynamic video is the temporal dimension. Many works focused on this point and proposed several effective two-stream strategies to extract and aggregate spatial and temporal features using optical flow or just RGB frames (Simonyan and Zisserman, 2014; Wang et al., 2016; Christoph and Pinz, 2016). Inspired by the concept of two different kinds of primate eye cells (P and M) (Hubel and Wiesel, 1965) operating at different temporal frequencies, SlowFast (Feichtenhofer et al., 2019) originally proposed a method using different temporal speeds in the two pathways, expecting the heavy Slow path to focus on spatial semantic information and the lightweight Fast path to focus on temporal motion information.

This two-stream SlowFast CNN captures temporal speed information as well as spatial features and achieves significant gains on various tasks. However, without explicit guidance, it could be very challenging for the Slow path to know where to look and for the Fast path to focus on the keyframes. This leaves great potential for performance improvement.
2.2. Fusion module across two modalities
Spatiotemporal fusion is essential for the two-stream method because it controls how spatial and temporal signals are extracted at each layer, and how these signals are transported and interact with each other (Zhou et al., 2020). Several works use LSTM (Hochreiter and Schmidhuber, 1997) to fuse and discover long-range temporal relationships, such as Ng et al. (2015) and Li et al. (2016). Overall, existing approaches include late fusion, early fusion, and finer-grained fusion in each stage (Feichtenhofer et al., 2016; Simonyan and Zisserman, 2014; Wang et al., 2016).

Instead of simply using one or two convolutional layers as in some works (including SlowFast), more recent state-of-the-art networks like YOWO (Köpüklü et al., 2019) and Non-Local (Wang et al., 2018) introduce attention mechanisms to better merge the different types of information and further improve performance. Attention mechanisms, including spatial–temporal and channel-wise ones, aim to strengthen the most meaningful areas and channels, respectively, and weaken the others at the same time. Our work proposes a well-designed cross-modality fusion module equipped with 3D spatial–temporal self-attention (Vaswani et al., 2017) and efficient channel-wise attention (Wang et al., 2020), which gives SlowFast explicit guidance on when and where to pay close attention.
2.3. Efficient action recognition
Due to the difficulty of the action recognition task, researchers have built heavy and complex networks to improve accuracy, which incurs high computing and storage overhead (Girdhar et al., 2019; Hussein et al., 2019). Recently, more and more works take efficiency into consideration, for example by replacing 3D convolutions with (2+1)D ones (Tran et al., 2018), shifting features to mimic temporal information (Lin et al., 2019), learning to construct motion flow to avoid the calculation of optical flow (Sun et al., 2018), and designing efficient 3D backbones (Kopuklu et al., 2019), etc. Feichtenhofer (2020) progressively expanded a tiny 2D image classification architecture along multiple network axes (space, time, width and depth) to explore the computation/accuracy trade-off for video recognition. Lin et al. (2019) introduced a temporal shift module to extend ResNet to capture temporal information using memory shifting operations. Kopuklu et al. (2019) directly converted several classical 2D image classification networks to 3D along the temporal axis. However, the accuracy drops significantly along with the reduction in parameters and FLOPs compared to heavy networks. The main reason may lie in the insufficient temporal representation and learning ability of these lightweight simple networks.

Our work follows the concept of SlowFast, and we propose several efficient two-stream 3D networks based on the lightweight GhostNet, ShuffleNet, MobileNetV2, and ShuffleNetV2. Experiments demonstrate the consistent improvement in accuracy and performance of our models over the SOTA methods.
3. Methods
In this section, we will first present our fusion module named CMDA and then introduce our efficient SlowFast models in detail. Formally, we denote the feature map of the Slow pathway as [N, C, T, H, W] and the feature map of the Fast pathway as [N, βC, αT, H, W], where N is the batch size, C is the number of convolution channels, and T, H, W are the temporal length, spatial height and width, respectively. α and β are the speed and channel ratios following the definition of SlowFast.
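For concreteness, these shape conventions can be sketched as follows (the variable names and the sample values of α and β are our assumptions for illustration, not values taken from the paper's code):

import torch

# Assumed shape conventions following the SlowFast definitions:
# alpha is the speed ratio, beta the channel ratio.
N, C, T, H, W = 2, 64, 8, 56, 56
alpha, beta = 4, 1 / 8

slow = torch.randn(N, C, T, H, W)                      # Slow pathway features
fast = torch.randn(N, int(beta * C), alpha * T, H, W)  # Fast pathway features
print(slow.shape, fast.shape)
# torch.Size([2, 64, 8, 56, 56]) torch.Size([2, 8, 32, 56, 56])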
3.1. Cross-modality dual attention fusion module (CMDA)
SlowFast built a solid baseline on various action datasets based on the concept of Slow and Fast pathways with different temporal frequencies. They tried three simple unidirectional methods to flow information from the Fast pathway to the Slow one, using either transposing, downsampling, or a single convolution layer. Their experiments found that using Slow and Fast pathways with various types of lateral connections throughout the network hierarchy is consistently better than the Slow
and Fast only baselines. But such unidirectional methods may not exchange sufficient information; for example, the lightweight Fast path cannot get any instrumental spatial guidance learned from the Slow path. Also, the information fused from the Fast pathway may be oblique or indirect for the Slow pathway after merely one Conv-layer transformation. Fig. 1 illustrates some examples of these shortcomings.

In general, there are several design principles that the fusion module needs to adhere to: (1) sufficient and explicit information exchange; (2) fast computation and resource efficiency. To satisfy (1), our proposed cross-modality dual attention fusion module contains two kinds of attention mechanisms, which we introduce in detail below. Then a delicate procedure is designed to execute the heavy calculations with fewer parameters to meet (2). Fig. 2 shows the overview of our proposed CMDA.

Fig. 2. Overview of our proposed cross-modality dual attention fusion model (CMDA). We exchange information between the Slow pathway and the Fast pathway through spatial–temporal and efficient channel-wise attention modules. To further reduce parameters and computational overhead, we carefully designed the pipeline to perform heavy computation on a smaller feature map. SA and CA mean spatial–temporal attention and efficient channel-wise attention, respectively.

3.1.1. Attention mechanisms

Humans exploit a sequence of glimpses and selectively focus on salient parts and areas to capture temporal and visual structure better. Besides, an attention mechanism not only tells when and where to focus but also improves the representation of interests. We implement a 3D self-attention module to calculate the spatial–temporal attention map in the lateral path from Slow to Fast, and an efficient 3D channel attention module to assign importance weights to different channels, i.e. Eqs. (1) and (2). The details of these two equations are explained below.

SA(Q, K, V) = softmax(QKᵀ)V    (1)

Y[N, C, T, H, W] = SA(Q, K, V) + X[N, C, T, H, W]    (2)

Spatial–temporal attention. Different from the 2D static-image version which only calculates spatial attention, we expand self-attention to 3D along the temporal axis to calculate spatial–temporal attention. Given the input feature X[N, C, T, H, W], we calculate three matrices named Q, K, V through 1 ∗ 1 ∗ 1 convolutions and apply dot products utilizing the concept of self-attention of Vaswani et al. (2017), as in Eq. (1). As shown in Fig. 3, the main difference from the self-attention in Vaswani et al. (2017) is that we consider the temporal dimension and therefore extract both spatial and temporal attention. To further reduce the computation of this block, we follow the bottleneck design of He et al. (2016) and reduce the feature maps of the Q and K pathways with a tunable reduction parameter. We also introduce a residual connection to guarantee the learning ability of the model. Therefore, we define the output Y[N, C, T, H, W] of this spatial–temporal attention module using Eq. (2), where + denotes element-wise sum, i.e. a residual path to preserve the static background information. Note that the feature size is maintained throughout the block, so this module can be directly plugged into many other spatial–temporal networks to further boost performance with the help of the attention mechanism.

Fig. 3. Spatial–temporal attention module. ⊗ denotes matrix multiplication, and ⊕ denotes element-wise sum.
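As a concrete reference point, here is a minimal PyTorch sketch of such a 3D spatial–temporal self-attention block. The module and parameter names, the default reduction of 8, and the softmax scaling are our assumptions, not the paper's released implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalAttention3D(nn.Module):
    """Sketch of 3D self-attention over all T*H*W positions (Eqs. (1)-(2)).

    `reduction` mimics the bottleneck that shrinks the Q/K channels.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = max(channels // reduction, 1)
        self.q = nn.Conv3d(channels, inner, kernel_size=1)  # 1*1*1 convs
        self.k = nn.Conv3d(channels, inner, kernel_size=1)
        self.v = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # [N, L, C/r], L = T*H*W
        k = self.k(x).flatten(2)                        # [N, C/r, L]
        v = self.v(x).flatten(2).transpose(1, 2)        # [N, L, C]
        attn = F.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # [N, L, L]
        out = (attn @ v).transpose(1, 2).reshape(n, c, t, h, w)
        return out + x              # residual path, Eq. (2); shape preserved

Note the L × L attention map over all spatial–temporal positions is exactly why the paper computes this block on a reduced feature map.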
Efficient channel attention. The performance of channel attention has also been widely verified in various works (Köpüklü et al., 2019; Hu et al., 2018). To lower the model complexity, ECA (Wang et al., 2020) investigates a 1D Conv with an adaptive kernel size to replace the FC layers in the channel attention module, aiming at capturing local cross-channel interaction efficiently. We reconstruct it into a 3D version to support our video action scenarios. Specifically, as shown in Fig. 4, we remove the spatial and temporal dimensions through 3D channel-wise global average pooling (Eq. (3)) and keep channel information only. Then we squeeze and transpose the channel-wise feature map and perform a 1D Conv and a sigmoid function on the squeezed feature to calculate the channel attention. The kernel size of the 1D Conv controls how many neighbors to consider, which is both more efficient and more effective than the fully connected layers of SENet (Hu et al., 2018).

g(X[N, C, T, H, W]) = (1 / (T ∗ H ∗ W)) ∑_{t,h,w} X[:, :, t, h, w]    (3)

Fig. 4. Efficient channel attention module. ◦ denotes Hadamard product.
Moreover, unlike SENet, we add a residual path to preserve the static background information rather than suppressing it completely. We also maintain the size of the input feature map, so the module can easily be attached to each convolution block to learn and capture channel attention.
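A minimal sketch of this 3D efficient channel attention, under our own naming and with a fixed kernel size of 3 (the original ECA uses an adaptive kernel size), could look as follows:

import torch
import torch.nn as nn

class EfficientChannelAttention3D(nn.Module):
    """Sketch of ECA (Wang et al., 2020) lifted to 3D, as described above.

    The 1D conv's kernel size controls how many channel neighbors
    interact; the residual add preserves background information.
    """
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # Eq. (3): 3D channel-wise global average pooling -> [N, C]
        y = x.mean(dim=(2, 3, 4))
        # 1D conv across the channel axis, then a sigmoid gate
        y = torch.sigmoid(self.conv(y.unsqueeze(1))).squeeze(1)  # [N, C]
        # Hadamard product plus a residual path (unlike plain SENet)
        return x * y.view(n, c, 1, 1, 1) + x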
3.1.2. Fusion methods

As illustrated in Fig. 2, to fuse the information between the Slow and the Fast pathways, we should match the feature sizes first, i.e. the C and T dimensions in [N, C, T, H, W]. Fig. 5 shows the difference between the original SlowFast networks and our proposed cross-modal dual attention networks equipped with CMDA fusion modules. We exchange information between the Slow and the Fast pathways, while the original networks merely flow features unidirectionally from Fast to Slow.

Fusion from Fast to Slow. In this direction, the Fast pathway has α times more frames than the Slow pathway, so we could simply take every α-th frame along the temporal dimension of the Fast pathway without any parameters or computational overhead. This simple operation has good interpretability, i.e. sequential frame extraction, because SlowFast maintains high temporal resolution throughout the Fast pathway. However, such equal-interval sampling may miss the frames that are crucial for identifying the action class, and we cannot guarantee that keyframes will be obtained every time. Therefore, we instead use 3D temporal max-pooling with kernel_size = [α, 1, 1] and stride = [α, 1, 1]. This operation only considers the temporal dimension and takes no account of the spatial features, which also does not break the interpretability of temporal downsampling. In this way, we obtain a spatial–temporal salience map in a broader sense, where each pixel of the resulting feature map is the most salient one among the α frames. Then an efficient channel-wise attention map is calculated to further improve the performance. Note that we do not change the channel dimension, because the Fast pathway is designed with low channel capacity, i.e. β times (β < 1) the channels of the Slow pathway to keep it lightweight, so we can directly concatenate the lateral fusion path to the next stage of the original Slow pathway with minor parameter overhead.

Fusion from Slow to Fast. Different from the temporal dimension, the channel dimension has no interpretable downsampling sense. We follow the common method and use a single 1 ∗ 1 ∗ 1 Conv layer to reduce and match the channel size. After that, we calculate the spatial–temporal attention map immediately, before further operations, i.e. we conduct the operations with higher computational cost while the feature map is smaller by optimizing the execution order. Then we upsample the feature map along the temporal axis with scale_factor = [α, 1, 1]. To reduce computation and maintain interpretability, we use the nearest-neighbor sampling method, so our fusion module is both effective and efficient. Finally, we concatenate the fusion features with the original Fast pathway.
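To make the two lateral directions concrete, below is a minimal PyTorch sketch of both size-matching operations. The tensor sizes, α = 4, β = 1/4, and all variable names are our illustrative assumptions rather than the authors' released code, and the attention modules described above are omitted for brevity:

import torch
import torch.nn as nn
import torch.nn.functional as F

alpha, beta = 4, 0.25                      # assumed speed/channel ratios
slow = torch.randn(2, 64, 8, 56, 56)       # [N, C, T, H, W]
fast = torch.randn(2, 16, 32, 56, 56)      # [N, beta*C, alpha*T, H, W]

# Fast -> Slow: parameter-free temporal max-pooling keeps the most
# salient value in every window of alpha frames, leaving space intact.
f2s = F.max_pool3d(fast, kernel_size=(alpha, 1, 1), stride=(alpha, 1, 1))
slow_fused = torch.cat([slow, f2s], dim=1)            # [2, 80, 8, 56, 56]

# Slow -> Fast: a 1*1*1 conv matches the channel size first (so costlier
# steps run on the smaller map), then nearest-neighbor upsampling along
# time matches the Fast frame rate.
reduce_ch = nn.Conv3d(64, int(64 * beta), kernel_size=1)
s2f = F.interpolate(reduce_ch(slow), scale_factor=(alpha, 1, 1),
                    mode='nearest')
fast_fused = torch.cat([fast, s2f], dim=1)            # [2, 32, 32, 56, 56]
print(slow_fused.shape, fast_fused.shape)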
3.2. Efficient 3D SlowFast networks

There have been great advances recently in building resource-efficient 2D CNN architectures considering memory and power budgets, such as GhostNet, MobileNet, ShuffleNet, MobileNetV2, and ShuffleNetV2. Kopuklu et al. created 3D versions of these well-known 2D efficient architectures with low computational complexity, but inevitably, these models cannot achieve competitive performance compared to large models. We attribute the problem to the difficulty of the action recognition task, the simplicity of the models, and the implicit utilization of temporal information. Following the success of the SlowFast two-stream networks, we designed several efficient 3D SlowFast architectures.

In general, our efficient SlowFast networks can be described as a single-stream architecture that operates at two different framerates and exchanges information bidirectionally after each stage using our proposed CMDA fusion module. The Slow path with more channels operates at a lower temporal speed to focus more on spatial and semantic information, while the lightweight Fast path with a consistent temporal dimension throughout the network is driven to pursue dynamic motion information. We build our SlowFast two-stream version based on Kopuklu et al. (2019) but remove the temporal pooling until the global pooling layer before classification.

Specifically, taking GhostNet as an example, this brilliant work proposed a novel Ghost Module to generate more features from cheap 1 ∗ 1 depth-wise convolutions, which can fully reveal information underlying the intrinsic features. We convert it to a 3D version carefully and keep the temporal dimension throughout the network. GhostNet contains 5 stages, and we fuse information using our CMDA after each stage except the last one. After the fusion layer, the input shape only grows in the first layer of the next stage, which keeps the network lightweight. Considering the low channel capacity of the Fast pathway, we ensure all original layers have a channel number divisible by 4.
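To illustrate the overall construction, below is a toy PyTorch skeleton. Everything in it — the stage widths, α = 4, β = 1/4, plain conv stages standing in for ghost/shuffle/inverted-residual blocks, and a CMDA reduced to the parameter-free exchange without its attention modules — is our simplification for illustration, not the authors' architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

def stage3d(cin, cout):
    # Stand-in for one 3D stage of a lightweight backbone;
    # halves H and W, keeps the temporal dimension T.
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=(1, 2, 2), padding=1),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class EfficientSlowFastSketch(nn.Module):
    """Toy two-stream skeleton with bidirectional fusion after each stage."""
    def __init__(self, num_classes=400, alpha=4, beta=0.25):
        super().__init__()
        self.alpha = alpha
        widths = [16, 32, 64]
        s_in = [3, 16 + 4, 32 + 8]   # slow input grows by beta*C after concat
        f_in = [3, 4 + 4, 8 + 8]     # fast input doubles after concat
        self.slow = nn.ModuleList(stage3d(i, o) for i, o in zip(s_in, widths))
        self.fast = nn.ModuleList(
            stage3d(i, int(o * beta)) for i, o in zip(f_in, widths))
        self.s2f = nn.ModuleList(
            nn.Conv3d(o, int(o * beta), 1) for o in widths[:-1])
        self.head = nn.Linear(64 + 16, num_classes)

    def forward(self, slow, fast):
        for i, (s_stage, f_stage) in enumerate(zip(self.slow, self.fast)):
            slow, fast = s_stage(slow), f_stage(fast)
            if i < len(self.s2f):    # bidirectional lateral fusion
                f2s = F.max_pool3d(fast, (self.alpha, 1, 1), (self.alpha, 1, 1))
                s2f = F.interpolate(self.s2f[i](slow),
                                    scale_factor=(self.alpha, 1, 1),
                                    mode='nearest')
                slow = torch.cat([slow, f2s], 1)
                fast = torch.cat([fast, s2f], 1)
        x = torch.cat([slow.mean((2, 3, 4)), fast.mean((2, 3, 4))], 1)
        return self.head(x)

model = EfficientSlowFastSketch()
logits = model(torch.randn(1, 3, 8, 64, 64), torch.randn(1, 3, 32, 64, 64))
print(logits.shape)  # torch.Size([1, 400])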
4. Experiments

In this section, we first explain the datasets and experimental settings. Then we design two series of experiments to evaluate our proposed CMDA fusion module and our several efficient 3D variants, respectively.
4.1. Datasets

We select the widely used Kinetics-400 (Carreira and Zisserman, 2017) and 20BN-Jester-v1 (Materzynska et al., 2019) datasets to evaluate our approaches.

The Kinetics-400 dataset contains 400 human action classes. The video clips are around 10 s long and trimmed from raw YouTube videos. Note that there are in total 224,919 training clips and 18,525 validation clips in our experiments because some YouTube video links have expired. This dataset contains complex scenes and object contents in the format of video clips and has a large temporal dependency.

The 20BN-Jester-v1 dataset is a large collection of densely-labeled video clips that show humans performing pre-defined hand gestures in front of a laptop camera or webcam. There are in total 148,092 gesture
videos under 27 classes. The training set contains 118,562 videos, while the validation set contains 14,787 videos. Due to the lack of ground-truth labels, we do not use the test set, following the mainstream approaches. Note that only 5 of the 27 classes (i.e. Drumming Fingers, Thumb Up, Thumb Down, Stop Sign, Shaking Hand) can be described as static gestures and recognized from a single frame. All other categories require distinguishing fine-grained motion details (such as Rolling Hand Backward, Rolling Hand Forward, Swiping Down, Swiping Up and so on), which means that the 20BN-Jester-v1 dataset is suitable for inspecting the ability of the networks to capture motion patterns (Materzynska et al., 2019; Kopuklu et al., 2019).

4.2. Training details

All our models are trained from scratch without using any pre-training. To validate our proposed fusion module CMDA, we follow the recipe in SlowFast to pre-process, train and test. For the Kinetics-400 dataset, we use a spatial size of 224 ∗ 224, which is randomly cropped from a scaled video whose shorter side is randomly sampled in [256, 320] pixels. For the temporal domain, we randomly sample a clip (of α ∗ T ∗ τ frames) from the full-length video, and the inputs to the Slow and Fast pathways are respectively T and α ∗ T frames. A typical setting of temporal stride τ = 16 and speed ratio α = 4 is used in our experiments, i.e. the input clips of the Fast pathway contain 32 frames, and those of the Slow pathway contain 8 frames.

To train our efficient SlowFast networks, we follow the recipe in Kopuklu et al. (2019), which uses 112 ∗ 112 pixels in the spatial domain and 16 frames in the temporal domain. We therefore randomly sample in [125, 160] pixels for the shorter side and scale to 112 ∗ 112 in the spatial domain. The Fast and Slow pathways contain 16 and 4 frames respectively to make a fair comparison. One thing to note is that, considering the characteristics of gestures (for example, Swiping Left and Swiping Right could be confused after flipping), we do not perform horizontal flipping augmentation on the 20BN-Jester-v1 dataset. Instead, we perform color jittering augmentation including contrast, brightness, and saturation to simulate the diversity of indoor light. We use a momentum of 0.9 and a weight decay of 0.0001. We adopt dropout after the global pooling layer, with a dropout ratio of 0.5 for Kinetics-400 and 0.2 for the 20BN-Jester-v1 dataset.

At test time, we sample 10 clips per video with uniform temporal spacing, and each clip takes 3 crops (left, middle, right) of 224 ∗ 224 to cover the spatial dimensions, as in Feichtenhofer et al. (2019) and Wang et al. (2018). For the efficient SlowFast networks, we crop 112 ∗ 112 in the spatial domain. We combine the predictions with average pooling.
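A hedged sketch of this 10-clip × 3-crop inference protocol follows; the function name and the single-tensor model interface are our assumptions (a SlowFast model would additionally split each view into its Slow and Fast inputs):

import torch

@torch.no_grad()
def evaluate_video(model, video, num_clips=10, clip_len=32, crop=224):
    """`video` is [C, T, H, W]; its height is assumed to already equal
    `crop` after resizing the shorter side. Averages the 30 views."""
    c, t, h, w = video.shape
    starts = torch.linspace(0, max(t - clip_len, 0), num_clips).long()
    offsets = (0, (w - crop) // 2, w - crop)        # left, middle, right
    scores = []
    for s in starts:
        clip = video[:, int(s):int(s) + clip_len]
        for o in offsets:
            view = clip[:, :, :, o:o + crop].unsqueeze(0)
            scores.append(model(view).softmax(dim=-1))
    return torch.stack(scores).mean(dim=0)          # averaged prediction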
Fig. 5. Comparison of the pipeline of the original SlowFast networks (the first row) and our efficient SlowFast networks (the second row) equipped with the CMDA fusion module. ⊕ denotes concatenation. We can easily substitute various basic stages from any lightweight network, such as the inverted residual module of MobileNetV2, the ghost module of GhostNet and so on.

4.3. Fusion module results

As described in Fig. 5, we only substitute the original unidirectional fusion module with our proposed bidirectional CMDA module. Table 1 compares these two models along several dimensions.

Table 1. Comparison with the SOTA SlowFast on the Kinetics-400 validation set. Our dual-attention modification equipped with the CMDA outperforms the counterpart SlowFast networks with higher accuracy, and uses fewer parameters and calculations due to the well-designed processes.

Model      #Params    #FLOPs    Top-1 Acc    Top-5 Acc
SlowFast   34.57 M    50.31 G   77.0         92.6
Ours       33.94 M    49.64 G   77.4         93.0

Our proposed CMDA fusion module achieves better accuracy. Notably, our results are achieved at lower FLOPs and with fewer parameters than the original SlowFast networks, which is mainly due to the efficient design of CMDA. Even though we fuse information bidirectionally rather than unidirectionally, our method selects salient information by spatial–temporal and channel-wise attention, which makes it possible to concatenate fewer features. Furthermore, we perform the more expensive operations in stages with fewer parameters by optimizing the process in our fusion module. Besides, we calculate and plot the salient feature maps using Grad-CAM in Fig. 1. As expected, with the help of attention mechanisms and the explicit bidirectional information exchange paths, the Slow pathway can focus better on the important contextual semantic area, and the Fast path has also been endowed with the ability of sense perception. Benefiting from the enhancement of both pathways, our network unsurprisingly achieves better performance.

4.4. Light-weight SlowFast results

We convert several classical efficient networks to 3D following the concept of Kopuklu et al. (2019), except that we keep the temporal resolution throughout the network. Then we construct their two-stream counterparts following the design philosophy of the SlowFast networks. Table 2 makes a comprehensive comparison on Kinetics-400, where our SlowFast networks equipped with the CMDA outperform the recent lightweight SOTA approaches. Compared with the baseline networks, our models have slightly more parameters and FLOPs, mainly because we maintain the temporal dimension throughout the network until the last fully connected layer.
We also conduct experiments on the 20BN-Jester-v1 dataset and compare runtime performance with the baseline methods. Most of the state-of-the-art methods are heavy backbone architectures and employ optical flow to achieve higher accuracy, which is computationally expensive to run in real time. In contrast, as illustrated in Table 3, our efficient models enjoy a much higher FPS while maintaining competitive accuracy. Considering that the acting body sits in the middle of the frames, which is much easier to focus on and recognize, the role of the attention mechanism is limited here and the improvement is not as significant as on Kinetics-400. Note that MobileNetV2 has the lowest runtime performance compared with ShuffleNet and ShuffleNetV2. The main reason for this observation is that the CUDNN library is well optimized for standard convolutions, while MobileNetV2 uses a large number of depth-wise convolutions (Kopuklu et al., 2019; Xie et al., 2017).
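For reference, a rough way to probe such forward-pass FPS in PyTorch is sketched below; this is our illustration, and the paper's exact measurement protocol on the Tesla P40 may differ:

import time
import torch

@torch.no_grad()
def measure_fps(model, clip_len=16, size=112, n_runs=50, device='cuda'):
    """Times the forward pass only, as in the Table 3 caption."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, clip_len, size, size, device=device)
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()
    clips_per_s = n_runs / (time.time() - start)
    return clips_per_s * clip_len       # frames per second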
Table 2. Comparison with the state-of-the-art light-weight action recognition methods on Kinetics-400. Our approaches achieve consistent and remarkable improvements based on all kinds of classical efficient networks. The number of floating point operations (FLOPs) is calculated for 16 frames with a spatial resolution of 112 ∗ 112. Detailed analysis can be found in Section 4.4.

Model (a)                    Ours   #Params   #FLOPs   Top-1 Acc   Top-5 Acc
3D ShuffleNetV2 0.25X               0.62 M    42 M     24.12       49.42
SlowFastShuffleNetV2 0.25X   ✓      0.71 M    66 M     28.79       54.77
3D ShuffleNetV2 1.0X                1.71 M    119 M    38.54       65.00
SlowFastShuffleNetV2 1.0X    ✓      1.95 M    234 M    47.26       73.13
3D ShuffleNetV2 2.0X                6.23 M    360 M    48.00       73.96
SlowFastShuffleNetV2 2.0X    ✓      7.06 M    802 M    54.22       78.02
3D ShuffleNet 2.0X G3               4.37 M    386 M    51.06       76.40
SlowFastShuffleNet 2.0X G3   ✓      6.41 M    710 M    53.84       77.38
3D ShuffleNet 2.0X G1               4.19 M    406 M    50.19       74.44
SlowFastShuffleNet 2.0X G1   ✓      5.70 M    530 M    54.99       78.45
SlowFastGhostNet 1.0X        ✓      4.65 M    36 M     46.03       72.66
3D MobileNetV2 1.0X                 2.87 M    445 M    38.54       66.05
SlowFastMobileNetV2 1.0X     ✓      10.78 M   1621 M   48.12       73.68

(a) Since the baseline network (Kopuklu et al., 2019) only provides results on the Kinetics-600 dataset, we train and report results on the Kinetics-400 dataset following the official code and recipe.

Table 3. Accuracy and speed comparisons on the 20BN-Jester-v1 dataset. The frames per second (FPS) considers the forward process with a spatial resolution of 112 ∗ 112, and is measured on a single NVIDIA Tesla P40 GPU.

Model                        Ours   Top-1 Acc   FPS
3D ShuffleNet 2.0X G3               93.54       139
SlowFastShuffleNet 2.0X G3   ✓      93.72       204
3D ShuffleNetV2 2.0X                93.71       123
SlowFastShuffleNetV2 2.0X    ✓      94.17       196
3D MobileNetV2 1.0X                 94.59       126
SlowFastMobileNetV2 1.0X     ✓      94.62       115

Fig. 6. Activation maps for our proposed SlowFastMobileNetV2 backbone. We visualize different examples from the 20BN-Jester-v1 dataset. The Slow pathway focuses more precisely on fine spatial information, such as the fingers in the 3rd frame of the first example, and the Fast pathway is also endowed with the ability of spatial attention. We only show the first 8 frames of the Fast pathway due to space constraints. More explanation can be found in Section 4.5.
4.5. Visualization results

To better understand the behavior of our proposed cross-modality dual attention models, we visualize the activation maps based on the Grad-CAM method in Fig. 6. The Slow pathway is designed to focus on fine-grained spatial information and the Fast pathway on long-range temporal motion information. As expected, in the Slow pathway of the first example, our network performs very well not only at the hand level but also at the finger level. And with the help of motion information fused from the Fast pathway, the Slow part in the second example knows to focus on the curved arc area that the fingers swiped across.

From the visualization, we can also discover that the Fast pathway has the ability to pay attention to the core and contextual views, which is transferred from the Slow pathway and essential for video classification. These visual results verify that our proposed CMDA can indeed fuse spatial and temporal information from the Slow and Fast pathways, letting them benefit each other to further improve the performance of video action classification.
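For readers who want to reproduce such maps, here is a minimal Grad-CAM sketch for a 3D CNN; the function name and hook-based implementation are our illustration, not the authors' tooling:

import torch

def grad_cam_3d(model, target_layer, clip, class_idx):
    """`target_layer` is any module inside `model` whose output is
    [N, C, T, H, W]; returns a normalized [T, H, W] salience volume."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    model.zero_grad()
    score = model(clip)[0, class_idx]    # score of the target class
    score.backward()
    h1.remove(); h2.remove()
    a, g = feats[0], grads[0]
    weights = g.mean(dim=(2, 3, 4), keepdim=True)   # GAP over T, H, W
    cam = torch.relu((weights * a).sum(dim=1))[0]   # [T, H, W]
    return cam / (cam.max() + 1e-8)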
5. Conclusion

We have proposed a new fusion strategy equipped with both spatial–temporal attention and channel-wise attention mechanisms to explicitly exchange information between the Slow and the Fast pathways of the SlowFast networks. Following the concept of the SlowFast networks, we developed several efficient two-stream action recognition models based on the well-designed GhostNet, ShuffleNet, ShuffleNetV2 and MobileNetV2. The salience maps plotted by Grad-CAM verify that the proposed fusion module successfully empowers both pathways with the attention ability. Experiments demonstrate that our proposed fusion module consistently improves the performance of the SlowFast networks while maintaining a low computation cost.

CRediT authorship contribution statement

Dafeng Wei: Conceptualization, Methodology, Writing – original draft. Ye Tian: Visualization, Investigation. Liqing Wei: Software, Data curation. Hong Zhong: Supervision, Formal analysis. Siqian Chen: Resources. Shiliang Pu: Project administration. Hongtao Lu: Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments

This paper is supported by NSFC (No. 62176155, 62066002) and the Shanghai Municipal Science and Technology Major Project, China, under grant no. 2021SHZDZX0102.
References
Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A., 2018. A short note about Kinetics-600. arXiv preprint arXiv:1808.01340.
Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308.
Christoph, R., Pinz, F.A., 2016. Spatiotemporal residual networks for video action recognition. Adv. Neural Inf. Process. Syst. 3468–3476.
Feichtenhofer, C., 2020. X3D: Expanding architectures for efficient video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR.
Feichtenhofer, C., Fan, H., Malik, J., He, K., 2019. SlowFast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6202–6211.
Feichtenhofer, C., Pinz, A., Zisserman, A., 2016. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1933–1941.
Girdhar, R., Carreira, J., Doersch, C., Zisserman, A., 2019. Video action transformer network. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, pp. 244–253.
Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., Xu, C., 2020. GhostNet: More features from cheap operations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1580–1589.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141.
Hubel, D.H., Wiesel, T.N., 1965. Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. J. Neurophysiol. 28 (2), 229–289.
Hussein, N., Gavves, E., Smeulders, A.W.M., 2019. Timeception for complex action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, pp. 254–263.
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K., 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al., 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Khan, M.A., Javed, K., Khan, S.A., Saba, T., Habib, U., Khan, J.A., Abbasi, A.A., 2020. Human action recognition using fusion of multiview and deep features: an application to video surveillance. Multimedia Tools Appl. 1–27.
Kopuklu, O., Kose, N., Gunduz, A., Rigoll, G., 2019. Resource efficient 3D convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision Workshops.
Köpüklü, O., Wei, X., Rigoll, G., 2019. You only watch once: A unified CNN architecture for real-time spatiotemporal action localization. arXiv preprint arXiv:1911.06644.
Li, Y., Lan, C., Xing, J., Zeng, W., Yuan, C., Liu, J., 2016. Online human action detection using joint classification-regression recurrent neural networks. In: European Conference on Computer Vision.
Lin, J., Gan, C., Han, S., 2019. TSM: Temporal shift module for efficient video understanding. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 7083–7093.
Ma, N., Zhang, X., Zheng, H.-T., Sun, J., 2018. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 116–131.
Materzynska, J., Berger, G., Bax, I., Memisevic, R., 2019. The Jester dataset: A large-scale video dataset of human gestures. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop. ICCVW, pp. 2874–2882.
Ng, Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G., 2015. Beyond short snippets: Deep networks for video classification.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C., 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4510–4520.
Simonyan, K., Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576.
Soomro, K., Zamir, A.R., Shah, M., 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Sun, S., Kuang, Z., Sheng, L., Ouyang, W., Zhang, W., 2018. Optical flow guided feature: A fast and robust motion representation for video action recognition. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, pp. 1390–1399.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M., 2018. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6450–6459.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008.
Wang, X., Girshick, R., Gupta, A., He, K., 2018. Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794–7803.
Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q., 2020. ECA-Net: Efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11534–11542.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L., 2016. Temporal segment networks: Towards good practices for deep action recognition. In: European Conference on Computer Vision. Springer, pp. 20–36.
Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, pp. 5987–5995.
Zhang, X., Zhou, X., Lin, M., Sun, J., 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6848–6856.
Zhou, Y., Sun, X., Luo, C., Zha, Z.-J., Zeng, W., 2020. Spatiotemporal fusion in 3D CNNs: A probabilistic view. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9829–9838.