
GTM: Gray Temporal Model for Video Recognition

Yanping Zhang, Yongxin Yu (corresponding authors)
Abstract

Data input modality plays an important role in video action recognition. Normally, there are three types of input: RGB, flow stream and compressed data. In this paper, we propose a new input modality: the gray stream. Specifically, taking 3 stacked consecutive gray images as input, which has the same size as RGB, not only skips the conversion from decoded video data to RGB, but also improves the spatio-temporal modeling ability at zero extra computation and zero extra parameters. Meanwhile, we propose a 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC), which captures temporal relationships at the channel-feature level within a controllable computation budget (set by the parameters G and R). Finally, we confirm its effectiveness and efficiency on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB-51 and UCF-101, and achieve impressive results.

1 Introduction

In the real world, huge amounts of video data are generated every minute. As of May 2019, more than 500 hours of video were uploaded to YouTube every minute (Hale 2019). Advances in edge computing and next-generation communication technology have made it possible to analyze these videos in real time, so video-based tasks are receiving more attention and becoming more important.

For video action recognition, deep learning (Krizhevsky, Sutskever, and Hinton 2012) has become the standard and we have witnessed great advancements. Most methods use three types of input: RGB, optical flow and compressed data. Karpathy et al. (Karpathy et al. 2014) proposed to apply a single 2D CNN model on each RGB frame independently and explored several fusion methods to learn spatio-temporal features. Simonyan et al. (Simonyan and Zisserman 2014a) first proposed the two-stream networks, which take an RGB input and an optical flow (Brox et al. 2004) input, respectively. Wu et al. (Wu et al. 2018) proposed to apply deep learning directly in the compressed domain for action recognition. This raises a question: is there another input modality for action recognition?

Figure 1: Example of 3 consecutive RGB images vs. gray images. First row: RGB. Second row: gray.

The RGB format is widely used in image-based deep learning methods. It is straightforward and has a large number of ready-made models, such as VGG (Simonyan and Zisserman 2014b), Inception (Szegedy et al. 2015) and ResNet (He et al. 2016). However, the RGB format may not be entirely suitable for video tasks. Restricted by storage and bandwidth, video files and streams are stored or transmitted in a compressed format, such as MPEG-4 or H.264 (Wiegand et al. 2003). After decompression, we obtain YUV data directly, where Y is the luminance component and U, V are the two chrominance components. YUV420 is the most widely used format and involves a subsampling process, so the three signals (one luma and two chroma) do not carry equal amounts of data. Figure 2 shows the process of decoding a video and then converting it to RGB. During the conversion from YUV420 to RGB, the data size is doubled, which requires extra computation and more storage.
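As a quick check of this factor (our own back-of-the-envelope calculation, assuming 8 bits per sample): a frame of H×W pixels needs H·W + 2·(H/2)·(W/2) = 1.5·H·W samples in YUV420, but 3·H·W samples in RGB, so converting to RGB doubles the amount of data to be stored and processed.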

The flow stream (Farnebäck 2003; Zach, Pock, and Bischof 2007) has proven to be a good representation of the short-term motion between adjacent frames. Zhao et al. (Zhao and Snoek 2019) show that the more accurate the optical flow, the more the model improves. However, the computation of optical flow is time-consuming and storage-demanding, which makes it impractical for real-world deployment.

Action recognition in the compressed domain has a long tradition (Tom and Babu 2013; Ma and Song 2019). The compressed data itself contains a significant number of useful clues for classification, including the Motion Vector, Residual, Quantization Parameter, Macro Block Size, MB in bits and QP Gradient. Most deep learning methods (Wang, Lu, and Deng 2019; Battash et al. 2020) only use the Motion Vector and Residual so far, and many of them are not trained and evaluated on general large-scale datasets such as Kinetics (Kay et al. 2017). So the deep learning approaches in the compressed domain are promising but far from fully explored.

Figure 2: The process of decoding a video and then converting it to RGB.

To address this issue, we investigated several video-based inputs and found that taking 3 stacked consecutive gray images as input, which we call the gray stream, can improve the modeling ability at zero extra computation and zero extra parameters. In Figure 1, we visualize 3 consecutive RGB vs. gray images. The gray stream contains not only the local spatial appearance information represented by individual frames but also the local temporal dependency among these successive frames.

Given the new input modality, we reconsider current models: 3D CNN models involve a huge amount of computation, while 2D models lack temporal modeling capabilities. Inspired by this observation, we propose a 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC), which can be easily inserted into 2D CNNs in a plug-and-play manner to improve their temporal modeling ability. The 1D-ICSC learns to capture different temporal relationships for different channels within a controllable computation budget. To summarize, the main contributions of our method are three-fold:

  • We propose a new input modality (gray stream) for video action recognition and demonstrate its efficiency.

  • A simple yet effective 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC) is proposed, which greatly improves the temporal modeling ability of 2D CNNs.

  • We evaluate the proposed method on several public benchmark datasets, including Something-Something (Goyal et al. 2017), Kinetics-400 (Kay et al. 2017), UCF-101 (Soomro, Zamir, and Shah 2012) and HMDB-51 (Kuehne et al. 2011), and achieve impressive results.

2 Related Works

Figure 3: Illustration of our gray temporal model with a ResNet-50 backbone. One input video is divided into T segments. For each segment, we randomly sample 3 consecutive gray images. The ResNet block is replaced by a Spatio-Temporal block with 1D-ICSC inside.

There are several trends in video action recognition. The first one concerns the evolution of networks, from 2D CNNs and LSTMs to 3D CNNs. Karpathy et al. (Karpathy et al. 2014) proposed to apply a 2D CNN model on the large-scale Sports-1M dataset and marked the beginning of deep learning methods. Ng et al. (Yue-Hei Ng et al. 2015) feed the feature maps from CNNs into an LSTM network and aggregate frame-level CNN features to model temporal relations. In these approaches, the feature extraction of each frame is isolated and only late fusion of high-level features is performed, so no clear improvement is obtained. Tran et al. (Tran et al. 2015) first proposed a deep 3D network, termed C3D, which performs 3D convolutions on adjacent frames to jointly model spatial and temporal features. However, with a tremendous number of parameters to optimize and a lack of high-quality large-scale datasets, the performance of C3D remained unsatisfactory. The situation changed when Carreira et al. (Carreira and Zisserman 2017) proposed I3D, which achieved very competitive performance with the help of the high-quality large-scale Kinetics (Kay et al. 2017) dataset and pushed video action recognition to the next level. In recent work, Feichtenhofer introduced X3D (Feichtenhofer 2020), which progressively expands a tiny 2D image classification architecture along multiple network axes, such as temporal duration, spatial resolution and width. X3D learned from the history of image classification models and pushed 3D models to an extreme.

The second line mainly focuses on improving feature representation. Simonyan et al. (Simonyan and Zisserman 2014a) proposed the two-stream approach and set a trend. Following this trend, many excellent works (Feichtenhofer, Pinz, and Zisserman 2016; Wang et al. 2016) emerged and dominated the video recognition domain from 2014 to 2017. Because pre-computing optical flow is computationally expensive and storage-demanding, many works seek other substitutes. Kantorov et al. (Kantorov and Laptev 2014) proposed the use of sparse MPEG flow instead of dense optical flow, which improved the speed of feature extraction by two orders of magnitude with a minor reduction in accuracy.

The last line focuses on computational efficiency and real-world deployment. ECO (Zolfaghari, Singh, and Brox 2018), TSM (Lin, Gan, and Han 2019), STM (Jiang et al. 2019) and TEA (Li et al. 2020) are excellent examples. Lin et al. proposed the temporal shift module (TSM), which shifts part of the channels along the temporal dimension and thus facilitates information exchange among neighboring frames. It builds temporal modeling into 2D CNNs at zero computation and zero parameters.

Among these approaches, SlowFast (Feichtenhofer et al. 2019) attempted to replace the RGB input with a gray-scale input in its fast pathway, and found that the gray-scale version is nearly as good as the RGB variant while reducing FLOPs by 5%. StNet (He et al. 2019) samples T temporal segments, each of which consists of N (N=5) consecutive RGB frames. These N frames are stacked in the channel dimension, so the network input is a tensor of size T×3N×H×W, called a super-image. The super-image contains both spatial information and local temporal dependency. TDN (Wang et al. 2021) generalizes the idea of RGB difference to devise an efficient temporal difference module for motion modeling. These works made remarkable progress in both input modalities and network architectures.

Different from previous works, our proposed approach focuses on video-based modalities and efficient 1D spatio-temporal modeling, making it more suitable for video-based tasks and more practical for real-world deployment.

3 Approach

In this section, we introduce the technical details of our approach. First, we discuss several video-based modalities, such as YCbCr. Afterward, we present the 1D-ICSC, which can be embedded into 2D CNNs in a plug-and-play manner.

3.1 Gray Stream

H.264/AVC (Wiegand et al. 2003), as well as the previous standards MPEG-1 (for Standardization/International Electrotechnical Commission et al. 1993) and MPEG-2 (Union-Telecommun 1994), uses the video color space YCbCr (in this paper we use the terms YCbCr and YUV interchangeably, although strictly speaking they are not exactly the same). It separates a color representation into three components called Y, Cb, and Cr. Component Y is called luma and represents brightness. The two chroma components Cb and Cr represent the extent to which the color deviates from gray toward blue and red, respectively.

Figure 4: Example of one RGB image and its corresponding gray image and YUV components. The UV components are resized to the same size as Y for better visual perception.

Because the human visual system is more sensitive to luma than to chroma, subsampling is performed in which all the luma (Y) information is preserved and the chroma information (Cb, Cr) is reduced by a factor of 2 in both horizontal and vertical directions. This is called 4:2:0 sampling with 8 bits of precision per sample. The subsampling process is lossy but does not affect the perceived quality. In Figure 4 we visualize the three components of YCbCr. It can be seen that a single Y component is enough for a human to recognize what is going on. According to the ITU-R Recommendation BT.601 standard (BT et al. 2011), the conversion from YCbCr to RGB is:

R = [Y + 1.402\times(C_r - 128)]_{0}^{255}
G = [Y - 0.344\times(C_b - 128) - 0.714\times(C_r - 128)]_{0}^{255}
B = [Y + 1.772\times(C_b - 128)]_{0}^{255}        (1)

[\,\cdot\,]_{0}^{255} denotes clamping a value to the 8-bit range of 0 to 255. We can see that an RGB frame is a transformation of YCbCr, which requires extra computation. In the image domain, another widely used technique is to convert an RGB image to gray-scale. According to (BT et al. 2011), the conversion from RGB to gray is computed as:

L = 0.299\times R + 0.587\times G + 0.114\times B        (2)

We visualize the gray-scale image in Figure 4(b). We can see that the Y component image (luma) and the gray-scale image are visually similar. An intuitive idea is to replace the network input with gray-scale images, so we choose gray-scale as another candidate. In total we obtain 4 modalities: Y, U, V and Gray. All of them have only one channel, while RGB has three channels (red, green, blue). For each modality, there are two ways to construct the data. The first is to use only one frame, which requires modifying the network input. The other is to stack 3 consecutive frames to form 3 channels; this has the same size as RGB and does not require any modification of the network. For simplicity, we call this the gray stream (for all 4 modalities). In the first ablation study, we show its superiority over RGB and Flow.
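To make the gray-stream construction concrete, here is a minimal NumPy sketch (our own illustration, not released code; the function names are ours) that converts frames to luma via equation (2) and stacks 3 consecutive gray frames into an input of the same shape as a single RGB frame:

```python
import numpy as np

def to_gray(rgb):
    """Rec. 601 luma from an HxWx3 uint8 RGB frame, as in equation (2)."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return (rgb.astype(np.float32) @ weights).clip(0, 255).astype(np.uint8)

def gray_stream_segment(frames, start, interval=1):
    """Stack 3 consecutive gray frames into an HxWx3 array,
    i.e. the same shape as one RGB frame."""
    grays = [to_gray(frames[start + i * interval]) for i in range(3)]
    return np.stack(grays, axis=-1)

# Example: 8 dummy frames of size 240x320; one segment starting at frame 2.
frames = np.random.randint(0, 256, size=(8, 240, 320, 3), dtype=np.uint8)
segment = gray_stream_segment(frames, start=2)
print(segment.shape)  # (240, 320, 3)
```

In the pipeline described here, the Y plane would be taken directly from the decoded YUV data rather than recomputed from RGB; the conversion above is shown only to make the relation to equation (2) explicit.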

3.2 Spatio-Temporal Block

In order to keep the framework effective yet lightweight, we choose TSN (Wang et al. 2016) with a ResNet-50 (He et al. 2016) backbone. Since a raw 2D network cannot effectively capture temporal dynamics, as evidenced by previous works (Zhou et al. 2018; Lin, Gan, and Han 2019), we design a spatio-temporal module to tackle this problem. Figure 5 (b) shows our spatio-temporal block embedded with 1D-ICSC.

1D-ICSC.

Channel-wise temporal modeling has been explored previously (Lin, Gan, and Han 2019; Jiang et al. 2019; Li et al. 2020), where motion information is modeled at the channel level instead of the raw pixel level. Different from previous works, we propose the 1D-ICSC to capture temporal relationships and introduce two factors (G and R) to control the computation budget.

Figure 5: (a): Original ResNet block. (b): Spatio-Temporal block. The 1D-ICSC could be easily inserted into the ResNet block to construct a Spatio-Temporal block.

As illustrated in Figure 5 (b), the input spatio-temporal feature has shape \mathbf{F}\in\mathbb{R}^{N\times T\times C\times H\times W}, where N is the batch size, T and C denote the temporal dimension and feature channels, respectively, and H and W correspond to the spatial shape. We first reshape \mathbf{F}\rightarrow\mathbf{F}^{*}\in\mathbb{R}^{NHW\times C\times T} and then apply the channel-wise 1D convolution as in equation (3).

\mathbf{F}^{*}_{o}=\mathbf{K}*\mathbf{F}^{*}        (3)

\mathbf{K} is a 1D convolutional layer with kernel size 3 and \mathbf{F}^{*}_{o}\in\mathbb{R}^{NHW\times C\times T}. Next we reshape \mathbf{F}^{*}_{o} back to the original input shape (i.e. [N,T,C,H,W]) and model local spatial information via the original ResNet block. Different from previous works, we specially initialize the parameters of \mathbf{K} so that \mathbf{F}^{*}_{o}=\mathbf{F}^{*} at the initial stage; hence the name 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC). We make no assumptions about how channels move and interact, but instead let the kernel weights learn this during training. Experiments show that the identity parameter initialization strategy brings a performance improvement.
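To make this concrete, below is a minimal PyTorch sketch of a 1D-ICSC module; the class name, the grouped identity initialization, and the reshape details are our own reconstruction from the description above, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ICSC1D(nn.Module):
    """1D Identity Channel-wise Spatio-temporal Convolution (sketch).

    A kernel-size-3 1D convolution along the temporal axis, grouped over
    channels (G groups, weight shape C x C/G x 3), initialized to the
    identity so that the output equals the input at the start of training.
    """
    def __init__(self, channels, groups=1):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=1, groups=groups, bias=False)
        nn.init.zeros_(self.conv.weight)
        in_per_group = channels // groups
        for out_c in range(channels):
            # Center tap of the channel's own input is 1, everything else 0.
            self.conv.weight.data[out_c, out_c % in_per_group, 1] = 1.0

    def forward(self, x):
        # x: [N, T, C, H, W]  ->  [N*H*W, C, T]
        n, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        y = self.conv(y)  # channel-wise temporal convolution, equation (3)
        # back to [N, T, C, H, W]
        return y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2).contiguous()
```

At initialization, forwarding any tensor through this module returns it unchanged, which matches the identity constraint F*_o = F*.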

Further, we introduce two factors G and R to control the computation cost. G is the number of groups of \mathbf{K}, so the weight shape of \mathbf{K} is C\times\frac{C}{G}\times 3. R controls how many spatio-temporal blocks are added. Theoretically, more spatio-temporal blocks bring higher performance, but also increase the parameters and FLOPs. Without loss of generality, we add a spatio-temporal block whenever BlockNumber % R = 0 in each layer. Table 1 shows the GFLOPs as G and R vary: as G and R increase, the FLOPs decrease.

R \ G    1        2        4        8
1        107.26   70.11    51.54    42.25
2        72.73    52.85    42.90    37.93
3        57.93    45.45    39.20    36.08
4        53.00    42.98    37.97    35.47
Table 1: GFLOPs for different combinations of G and R. The larger G and R, the smaller the FLOPs. The network input is 8×3×224×224.

3.3 GTM Network

After introducing the gray stream and the 1D-ICSC, we are ready to describe how to integrate them into an existing network architecture and build the gray temporal model (GTM) network. As shown in Figure 3, 2D ResNet-50 is utilized as the backbone. First we divide one video into T segments. For each segment, we randomly sample 3 consecutive gray images, so the input of the network is N×T×3×224×224, the same size as the RGB input. From layer2 to layer5, the 1D-ICSC is inserted at the beginning of each ResNet block to build a Spatio-Temporal block, controlled by the parameters G and R. Simple temporal pooling is applied to average the action predictions over the entire video. Note that the whole framework is simple and straightforward, and does not require any modification of the original blocks.
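A minimal sketch of this assembly is given below, reusing the ICSC1D sketch from Section 3.2. It is our own reconstruction under stated assumptions: torchvision's layer1–layer4 are taken to correspond to the paper's layer2–layer5, frames are folded into the batch dimension as in TSM-style 2D backbones, and the wrapper class and builder function names are ours:

```python
import torch.nn as nn
from torchvision.models import resnet50

class STBlock(nn.Module):
    """Spatio-Temporal block: 1D-ICSC in front of the original ResNet block."""
    def __init__(self, block, channels, num_segments, groups):
        super().__init__()
        self.icsc = ICSC1D(channels, groups=groups)  # sketch from Sec. 3.2
        self.block = block
        self.t = num_segments

    def forward(self, x):
        # The 2D backbone sees frames folded into the batch: [N*T, C, H, W].
        nt, c, h, w = x.shape
        y = x.reshape(nt // self.t, self.t, c, h, w)   # -> [N, T, C, H, W]
        y = self.icsc(y).reshape(nt, c, h, w)          # temporal mixing, fold back
        return self.block(y)                           # original spatial block

def build_gtm(num_segments=8, groups=2, r=4, num_classes=400):
    net = resnet50(weights="IMAGENET1K_V1")
    for layer in (net.layer1, net.layer2, net.layer3, net.layer4):
        for idx in range(len(layer)):
            if idx % r == 0:                           # "BlockNumber % R == 0"
                blk = layer[idx]
                channels = blk.conv1.in_channels       # block input channels
                layer[idx] = STBlock(blk, channels, num_segments, groups)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net
```

The model then takes an input of shape [N*T, 3, 224, 224] and produces [N*T, num_classes] scores, which are averaged over the T segments (temporal pooling) to obtain the video-level prediction.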

4 Experiments

4.1 Dataset & Implementations

Datasets.

We evaluate our approach on two large-scale action recognition datasets, Something-Something (Goyal et al. 2017) and Kinetics-400 (Kay et al. 2017), and two other small-scale datasets, HMDB-51 (Kuehne et al. 2011) and UCF-101 (Soomro, Zamir, and Shah 2012). The Something-Something V2 dataset is a large collection of humans performing actions with everyday objects. Kinetics-400 is a large-scale YouTube video dataset which we downloaded from CVDF (CVDF 2021), including 238,796 training videos, 19,877 validation videos and 38,671 test videos. We use FFmpeg (ffmpeg 2021) to extract the YUV data and save it in HDF5 (Folk et al. 2011). For Kinetics, we resize the video height to 240 without changing its aspect ratio to speed up training.

Models.

We choose TSM (Lin, Gan, and Han 2019) as our baseline. To have an apples-to-apples comparison with TSM, we use the same backbone (ResNet-50), and the models are pre-trained on ImageNet (Russakovsky et al. 2015) unless stated otherwise.

Training.

Most of the experimental settings are the same as TSM (Lin, Gan, and Han 2019) and STM (Jiang et al. 2019). Given an input video, we first divide it into T segments, then we randomly sample 1 or 3 consecutive frames from each segment. During training, random scaling and corner cropping are utilized for data augmentation, and the cropped region is resized to 224 × 224 for each frame. Therefore, the input size of the network is N × T × C × 224 × 224, where N is the batch size, T is the number of segments, and C is the input channel number. Horizontal flipping is applied except for the Something-Something dataset.

We train our model with 2 Tesla V100 (16G) GPUs. Limited by GPU memory, we set T = 8 and use a relatively small batch size of 32. For the Kinetics and Something-Something v1 & v2 datasets, the initial learning rate is 0.005; it is reduced by a factor of 10 at epochs 30, 40 and 45, and training stops at 50 epochs. The dropout rate is 0.5. Stochastic gradient descent (SGD) is utilized as the optimizer, with momentum 0.9 and weight decay 1e-4. All the batch normalization layers (Ioffe and Szegedy 2015) are enabled during training.
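For reference, the optimizer and learning-rate schedule described above can be written in a few lines of PyTorch; this is our own sketch of the stated hyper-parameters (the stand-in model and the empty epoch body are placeholders), not the authors' training script:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in; in practice the GTM network (T = 8, batch size 32)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=1e-4)
# Reduce the learning rate by 10x at epochs 30, 40 and 45; train for 50 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 40, 45], gamma=0.1)

for epoch in range(50):
    # ... one training epoch (forward, loss with dropout 0.5, backward, step) ...
    scheduler.step()
```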

Inference.

Two evaluation protocols are considered to trade off accuracy and speed: 1) Efficient Protocol: 1-clip and center-crop, where only a center crop of 224 × 224 from a single clip is used; 2) Accuracy Protocol: 10-clip and 3-crop, where three crops (left, middle, right) of 224 × 224 and 10 clips are used for testing. The final prediction is the averaged score over all clips. By default we use the Efficient Protocol for all tests; the Accuracy Protocol is only employed for Kinetics.

4.2 Ablation Study

Modality Input UCF-101 HMDB-51 STH-V1
RGB 8*3*H*W 83.1 49.9 18.2
Flow 8*10*H*W 86.6 57.3 36.9
1-Y 8*1*H*W 83.2 47.8 -
3-Y 8*3*H*W 87.7 55.9 -
1-U 8*1*H/2*W/2 54.9 26.7 -
3-U 8*3*H/2*W/2 68.5 37.3 -
1-V 8*1*H/2*W/2 58.2 30.7 -
3-V 8*3*H/2*W/2 70.1 38.9 -
1-Gray 8*1*H*W 83.0 48.0 17.3
3-Gray 8*3*H*W 87.8 55.5 38.9
Table 2: Comparison of modalities. Input (segments * channels * height * width) denotes the input shape of one video. 1-x means a single image; 3-x means 3 consecutive images.

In this section, we first conduct several ablation experiments to verify the effectiveness of the different components of our proposed method. The ablation experiments are performed on Something-Something v1, UCF-101 split 1 and HMDB-51 split 1. Top-1 accuracy is reported.

Modalities.

First we compare several modalities, including RGB, gray-scale, YUV and optical flow. We use denseflow (Wang et al. 2020) to extract optical flow with the Farnebäck algorithm (Farnebäck 2003) because of its efficiency. Here we use TSN with a ResNet-50 backbone. Videos are divided into 8 segments. The results are shown in Table 2.

First, on STH-V1, 3-Gray gains over 20% compared with RGB (38.9% vs. 18.2%) and is also 2% higher than Flow. On the UCF-101 dataset, the 3-Gray modality achieves the best top-1 accuracy (87.8%) and 3-Y also obtains a comparable result (87.7%); both surpass Flow (86.6%) and RGB (83.1%). Second, on all three datasets, 1-Gray and 1-Y achieve performance similar to RGB. This is consistent with SlowFast (Feichtenhofer et al. 2019) and indicates that, for action recognition, a single Y or Gray channel contains information equivalent to RGB. Third, the U and V modalities give inferior results; we argue that U and V contain less information and have lower resolution (1/2 of Y).

In summary, compared with RGB, our gray stream can improve accuracy by a large margin without any extra parameters and FLOPs, or any optical flow pre-calculation. This indicates that video tasks are not exactly the same as image tasks.

Modality Interval UCF-101 STH-V1
3-Gray 1 87.8 38.9
3-Gray 2 87.1 38.4
3-Gray 3 87.3 38.3
3-Gray 4 86.3 38.5
3-Gray 5 86.6 38.0
Table 3: Comparison of different Sampling Intervals.

Sampling Intervals.

Our gray stream needs 3 video frames to form one segment input. An intuitive question arises: do different sampling intervals affect the results? Here we compare different sampling intervals. The results are shown in Table 3. Intervals 1 to 3 achieve similar results; from interval 4 onward, there is a performance reduction. We argue that for a 2D backbone network, spatial modeling plays an important role, and a large sampling interval hurts the spatial modeling ability. For simplicity, we use interval 1 as the default (which means 3 consecutive frames).

Modalities Backbone UCF-101 HMDB-51
3-Y ResNet-18 84.6 49.0
3-Y ResNet-34 86.5 53.5
3-Y ResNet-50 87.7 55.9
3-U ResNet-18 64.9 31.6
3-U ResNet-34 67.7 36.1
3-U ResNet-50 68.5 37.3
3-V ResNet-18 66.5 34.9
3-V ResNet-34 69.6 37.8
3-V ResNet-50 70.1 38.9
3-Gray ResNet-18 84.1 50.7
3-Gray ResNet-34 86.3 53.7
3-Gray ResNet-50 87.8 55.5
Table 4: Comparison of different backbones for 4 modalities.

Backbone Choice.

Because U and V images have half the resolution of Y images, it is not necessary to use the relatively heavy ResNet-50 backbone for them. Here we compare 3 backbones, ResNet-18, ResNet-34 and ResNet-50, for the 4 modalities. The results are shown in Table 4. For all modalities, ResNet-50 consistently achieves the best results. For 3-U and 3-V, ResNet-50 provides only a slight performance boost (around 1%) compared to ResNet-34. So for 3-Y and 3-Gray, we use ResNet-50 as the default backbone, while for 3-U and 3-V we use ResNet-34 as the default.

Modality Temporal UCF-101 STH-V1
RGB None 83.1 18.2
RGB Fixed 83.2 45.6
RGB 3D-Shift 83.8 45.8
RGB 3D-Identity 84.6 45.9
RGB 1D-Shift 85.0 45.9
RGB 1D-ICSC 85.3 46.1
3-Gray None 87.8 38.9
3-Gray Fixed 87.5 48.8
3-Gray 3D-Shift 87.2 48.7
3-Gray 3D-Identity 87.9 48.9
3-Gray 1D-Shift 87.8 48.6
3-Gray 1D-ICSC 88.0 49.3
Table 5: Comparison of different temporal modeling methods. In the second column, “None” means no temporal modeling is used.

1D-ICSC.

Here we compare different temporal modeling methods, including Fixed, 1D convolution and 3D convolution. Fixed means 1/8 of the channels are shifted forward and 1/8 backward, the same as TSM (Lin, Gan, and Han 2019). For 1D and 3D convolution, there are two parameter initialization strategies: Identity and Shift. Identity convolutions are initialized so that the output of equation (3) equals the input. Shift convolutions are initialized to behave like Fixed (1/8 of the channels forward and 1/8 backward). The kernel size of the 3D convolution is 3×3×3.
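For illustration, a Shift-style initialization of the channel-wise 1D kernel can be written as follows; this is our own sketch of the Shift baseline (the assignment of forward vs. backward taps to the channel groups is an assumption), not the authors' code:

```python
import torch
import torch.nn as nn

def shift_init_(conv: nn.Conv1d, shift_ratio: float = 1.0 / 8):
    """Initialize a channel-wise (groups == channels) kernel-size-3 Conv1d so it
    mimics TSM-style shifting: 1/8 of the channels take the next frame, 1/8 take
    the previous frame, and the rest keep the current frame (identity)."""
    c = conv.out_channels
    fold = int(c * shift_ratio)
    with torch.no_grad():
        conv.weight.zero_()
        # Kernel taps act on [t-1, t, t+1] -> weight indices [0, 1, 2].
        conv.weight[:fold, 0, 2] = 1.0           # copy feature from t+1
        conv.weight[fold:2 * fold, 0, 0] = 1.0   # copy feature from t-1
        conv.weight[2 * fold:, 0, 1] = 1.0       # identity for the rest

conv = nn.Conv1d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)
shift_init_(conv)
```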

The results are shown in Table 5. First, we notice that Identity initialization achieves better results than Shift, for both 1D and 3D convolutions. Second, the 1D-ICSC achieves the best result, even surpassing the 3D convolution. This indicates that a proper temporal convolution is essential for temporal modeling, even though the 3D convolution involves many more parameters. For STH-V1 with RGB, 1D-ICSC significantly increases accuracy from 18.2% to 46.1%; for 3-Gray, it also gives an increase of about 10%.

G & R.

We test different G and R parameters on the STH-V1 dataset. The results are shown in Table 6. Smaller G and R give better results. This is consistent with Table 1, as smaller G and R involve more parameters and FLOPs.

R \ G    1      2      4      8
2        50.0   49.6   49.5   49.2
4        49.3   49.1   49.1   49.0
Table 6: Top-1 accuracy under different G and R parameters on the STH-V1 dataset. The 3-Gray modality is used.
Method Backbone Pre-train Frames Param. GFLOPs Top-1 Top-5
I3D-RGB (Carreira and Zisserman 2017) 3D Inception V1 ImageNet 64×N/A 12.7M 108×N/A 71.1 89.3
I3D-Flow (Carreira and Zisserman 2017) 3D Inception V1 ImageNet 64×N/A 12.7M 108×N/A 63.4 84.9
2-Stream I3D (Carreira and Zisserman 2017) 3D Inception V1 ImageNet 128×N/A 25M 216×N/A 74.2 91.3
ECO-RGBEn (Zolfaghari, Singh, and Brox 2018) BNIncep+3D Res18 Scratch 92 47.5M 267 70.0 89.4
NL I3D-RGB (Wang et al. 2018) 3D ResNet50 ImageNet 128 35.3M 282 67.3 -
NL I3D-RGB (Wang et al. 2018) 3D ResNet50 ImageNet 128×3×10 35.3M 282×30 76.5 92.6
SlowFast 8×8 (Feichtenhofer et al. 2019) 3D ResNet50 Scratch (8+64)×3×10 - 65.7×30 77.0 92.6
TSN-RGB (Wang et al. 2016) BN-Inception ImageNet 25×10 10.7M 53×10 69.1 88.7
TSN-RGB (Wang et al. 2016) ResNet-50 ImageNet 8 24.3M 33 66.8 -
R(2+1)D-RGB (Tran et al. 2018) ResNet-34 Scratch 32×10 63.8M 152×10 72.0 90.0
R(2+1)D-Flow (Tran et al. 2018) ResNet-34 Scratch 32×10 63.8M 152×10 67.5 87.2
R(2+1)D 2-Stream (Tran et al. 2018) ResNet-34 Scratch 64×10 127.6M 304×10 73.9 90.9
TSM (Lin, Gan, and Han 2019) ResNet-50 ImageNet 8 24.3M 33 70.6 -
TSM (Lin, Gan, and Han 2019) ResNet-50 ImageNet 8×3×10 24.3M 33×30 74.1 91.2
STM-RGB (Jiang et al. 2019) ResNet-50 ImageNet 16×3×10 24M 67×30 73.7 91.6
StNet (He et al. 2019) ResNet-50 ImageNet 25 33M 189 69.9 -
TEA (Li et al. 2020) Res2Net-50 ImageNet 8 24.5M 35×1 72.5 90.4
TEA (Li et al. 2020) Res2Net-50 ImageNet 8×3×10 24.5M 35×30 75.0 91.8
TDN (Wang et al. 2021) ResNet-50 ImageNet 8×3×10 - 36×30 76.6 92.8
GTM (RGB) ResNet-50 ImageNet 8 28M 43 70.8 89.5
GTM (3-Y) ResNet-50 ImageNet 8 28M 43 70.4 89.5
GTM (3-Gray) ResNet-50 ImageNet 8 28M 43 70.4 89.6
GTM (RGB + 3-Gray) ResNet-50 ImageNet 16 49M 86 73.4 91.2
GTM (RGB + 3-Gray) ResNet-50 ImageNet 16×3×10 49M 86×30 75.2 92.1
Table 7: Comparison of our GTM network with other state-of-the-art methods on the Kinetics-400 validation set.

4.3 Comparisons with the State-of-the-arts

In this section, we compare our proposed GTM network with existing state-of-the-art action recognition methods. In these experiments, we set G=2 and R=4 to strike a balance between accuracy and FLOPs unless stated otherwise.

Results on Kinetics-400.

We evaluate the GTM network against recent state-of-the-art 2D/3D convolution-based solutions. The comprehensive statistics, including the classification results, inference protocols, parameters, and the corresponding GFLOPs, are shown in Table 7. The first part of the table contains methods based on 3D CNNs or a mixture of 2D and 3D CNNs; the second part contains methods based on 2D CNNs. For a fair comparison, we mainly list architectures with a ResNet-50 backbone. Under the Efficient Protocol, our RGB method gets 70.8%, surpassing StNet (He et al. 2019) and ECO (Zolfaghari, Singh, and Brox 2018). It is worth noting that on the Kinetics dataset, the flow modality usually gets inferior results to RGB, yet our 3-Y and 3-Gray still achieve 70.4%, which is comparable with RGB (70.8%). This shows the robustness of our proposed gray stream modality. Both also surpass the flow modality methods by a large margin, such as I3D-Flow (63.4%) and R(2+1)D-Flow (67.5%). Further, a simple average of the RGB and 3-Gray predictions brings the top-1 accuracy to 73.4%, showing that our gray stream is complementary to RGB.

Results on STH V2.

The Something-Something V2 dataset is more temporal-related than Kinetics. The comparison results are listed in Table 8. Our 3-Y achieves 61.7% top-1 accuracy, which outperforms TSM (Lin, Gan, and Han 2019) by 2.6% and also improves over our RGB result by 2.8%. This indicates that the gray stream brings more improvement on temporal-related datasets than on scene-related ones. Averaging RGB and 3-Y increases the top-1 accuracy to 63.6%, and combining with 3-V further increases it to 64.5%. The superior performance demonstrates the effectiveness of our proposed approach.

Method GFLOPs Top-1 Top-5
TRN Multiscale8f (Zhou et al. 2018) 33 48.8 77.6
TRN 2-Stream8f (Zhou et al. 2018) - 55.5 83.1
TSM8f (Lin, Gan, and Han 2019) 33×6 59.1 85.6
STM8f (Jiang et al. 2019) 33×30 62.3 88.8
Dynamic16f (Wu et al. 2020) 48 58.2 85.2
TEINet8f (Liu et al. 2020) 33 61.3 -
ACTION-Net8f(Wang et al. 2021) 35 62.5 87.3
TDN8f (Wang et al. 2021) 36 64.0 88.8
GTM (RGB) 43 58.9 85.0
GTM (3-Y) 43 61.7 87.4
GTM (3-V) 30 51.6 80.1
GTM (RGB + 3-Y) 86 63.6 88.6
GTM (RGB + 3-Y + 3-V) 116 64.5 89.2
Table 8: Comparison with the state-of-the-art methods on Something-Something V2 validation set.

Results on UCF-101 & HMDB-51.

UCF-101 and HMDB-51 are comparatively small-scale datasets with a long history, but they are still worth studying to trace the development of action recognition. We list the main results in Table 9. On HMDB-51 with RGB, our method achieves 56.5%, compared with 49.8% for I3D. On UCF-101 with RGB, our method achieves 87.4%. Comparing 3-Y with Flow on both datasets, it surpasses Two-Stream and 3D-Fused by a large margin (around 4%).

UCF-101 HMDB-51
Architecture RGB Flow R+F RGB Flow R+F
LSTM* 81.0 - - 36.0 - -
3D-ConvNet* 51.6 - - 24.3 - -
Two-Stream* 83.6 85.6 91.2 43.2 56.3 58.3
3D-Fused* 83.2 85.8 89.3 49.2 55.5 56.8
I3D* 84.5 90.6 93.4 49.8 61.9 66.4
Architecture RGB 3-Y R+3Y RGB 3-Y R+3Y
GTM(Ours) 87.4 89.2 91.4 56.5 60.2 62.0
Table 9: Comparison on UCF-101 and HMDB-51 (split 1 of both). All models are pre-trained on ImageNet except 3D-ConvNet. * denotes results cited from I3D (Carreira and Zisserman 2017). R+F means the average result of RGB and Flow; R+3Y means the average result of RGB and 3-Y. Here we set G=1 and R=2 for better accuracy.

5 Conclusion

In this paper, we proposed a new input modality, the gray stream, for action recognition. It skips the conversion from decoded video data to RGB and improves the spatio-temporal modeling ability at zero extra computation and zero extra parameters. Experiments show its superiority over RGB and Flow on various datasets, including Kinetics-400, Something-Something, UCF-101 and HMDB-51. Furthermore, we proposed a 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC), which is simple yet effective in improving the spatio-temporal modeling ability. Future work may integrate the gray stream and RGB into a unified framework. We hope our analysis provides insights into video-based approaches for action recognition.

References

  • Battash et al. (2020) Battash, B.; Barad, H.; Tang, H.; and Bleiweiss, A. 2020. Mimic the raw domain: Accelerating action recognition in the compressed domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 684–685.
  • Brox et al. (2004) Brox, T.; Bruhn, A.; Papenberg, N.; and Weickert, J. 2004. High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision, 25–36. Springer.
  • BT et al. (2011) BT, R. I.-R.; et al. 2011. Studio encoding parameters of digital television for standard 4: 3 and wide-screen 16: 9 aspect ratios.
  • Carreira and Zisserman (2017) Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.
  • CVDF (2021) CVDF. 2021. CVDF. https://github.com/cvdfoundation/kinetics-dataset/.
  • Farnebäck (2003) Farnebäck, G. 2003. Two-frame motion estimation based on polynomial expansion. In Scandinavian conference on Image analysis, 363–370. Springer.
  • Feichtenhofer (2020) Feichtenhofer, C. 2020. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 203–213.
  • Feichtenhofer et al. (2019) Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6202–6211.
  • Feichtenhofer, Pinz, and Zisserman (2016) Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1933–1941.
  • ffmpeg (2021) ffmpeg. 2021. ffmpeg. https://github.com/ffmpeg/.
  • Folk et al. (2011) Folk, M.; Heber, G.; Koziol, Q.; Pourmal, E.; and Robinson, D. 2011. An overview of the HDF5 technology suite and its applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, 36–47.
  • for Standardization/International Electrotechnical Commission et al. (1993) for Standardization/International Electrotechnical Commission, I. O.; et al. 1993. Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s. ISO/IEC 11172.
  • Goyal et al. (2017) Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. 2017. The” something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, 5842–5850.
  • Hale (2019) Hale, J. 2019. More than 500 hours of content are now being uploaded to youtube every minute. Santa Monica, CA: Tubefilter.
  • He et al. (2019) He, D.; Zhou, Z.; Gan, C.; Li, F.; Liu, X.; Li, Y.; Wang, L.; and Wen, S. 2019. Stnet: Local and global spatial-temporal modeling for action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 8401–8408.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–456. PMLR.
  • Jiang et al. (2019) Jiang, B.; Wang, M.; Gan, W.; Wu, W.; and Yan, J. 2019. Stm: Spatiotemporal and motion encoding for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2000–2009.
  • Kantorov and Laptev (2014) Kantorov, V.; and Laptev, I. 2014. Efficient feature extraction, encoding and classification for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2593–2600.
  • Karpathy et al. (2014) Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; and Fei-Fei, L. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 1725–1732.
  • Kay et al. (2017) Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
  • Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25: 1097–1105.
  • Kuehne et al. (2011) Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, T. 2011. HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision, 2556–2563. IEEE.
  • Li et al. (2020) Li, Y.; Ji, B.; Shi, X.; Zhang, J.; Kang, B.; and Wang, L. 2020. Tea: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 909–918.
  • Lin, Gan, and Han (2019) Lin, J.; Gan, C.; and Han, S. 2019. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7083–7093.
  • Liu et al. (2020) Liu, Z.; Luo, D.; Wang, Y.; Wang, L.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; and Lu, T. 2020. Teinet: Towards an efficient architecture for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 11669–11676.
  • Ma and Song (2019) Ma, M.; and Song, H. 2019. Effective moving object detection in H. 264/AVC compressed domain for video surveillance. Multimedia Tools and Applications, 78(24): 35195–35209.
  • Russakovsky et al. (2015) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3): 211–252.
  • Simonyan and Zisserman (2014a) Simonyan, K.; and Zisserman, A. 2014a. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199.
  • Simonyan and Zisserman (2014b) Simonyan, K.; and Zisserman, A. 2014b. Very deep convolutional networks for large-scale image recognition. ICLR.
  • Soomro, Zamir, and Shah (2012) Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
  • Szegedy et al. (2015) Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1–9.
  • Tom and Babu (2013) Tom, M.; and Babu, R. V. 2013. Fast moving-object detection in H. 264/AVC compressed domain for video surveillance. In 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 1–4. IEEE.
  • Tran et al. (2015) Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, 4489–4497.
  • Tran et al. (2018) Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; and Paluri, M. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 6450–6459.
  • Union-Telecommun (1994) Union-Telecommun, I. T. 1994. Generic coding of moving pictures and associated audio information-Part 2: Video. Int. Standards Org./Int. Electrotech. Comm.(ISO/IEC) JTC 1, Rec. H. 262 and ISO/IEC 13 818-2 (MPEG-2 Video).
  • Wang et al. (2021) Wang, L.; Tong, Z.; Ji, B.; and Wu, G. 2021. TDN: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1895–1904.
  • Wang et al. (2016) Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, 20–36. Springer.
  • Wang et al. (2020) Wang, S.; Li, Z.; Zhao, Y.; Xiong, Y.; Wang, L.; and Lin, D. 2020. denseflow. https://github.com/open-mmlab/denseflow.
  • Wang, Lu, and Deng (2019) Wang, S.; Lu, H.; and Deng, Z. 2019. Fast object detection in compressed video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7104–7113.
  • Wang et al. (2018) Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7794–7803.
  • Wiegand et al. (2003) Wiegand, T.; Sullivan, G. J.; Bjontegaard, G.; and Luthra, A. 2003. Overview of the H. 264/AVC video coding standard. IEEE Transactions on circuits and systems for video technology, 13(7): 560–576.
  • Wu et al. (2018) Wu, C.-Y.; Zaheer, M.; Hu, H.; Manmatha, R.; Smola, A. J.; and Krähenbühl, P. 2018. Compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6026–6035.
  • Wu et al. (2020) Wu, W.; He, D.; Tan, X.; Chen, S.; Yang, Y.; and Wen, S. 2020. Dynamic inference: A new approach toward efficient video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 676–677.
  • Yue-Hei Ng et al. (2015) Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; and Toderici, G. 2015. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4694–4702.
  • Zach, Pock, and Bischof (2007) Zach, C.; Pock, T.; and Bischof, H. 2007. A duality based approach for realtime tv-l 1 optical flow. In Joint pattern recognition symposium, 214–223. Springer.
  • Zhao and Snoek (2019) Zhao, J.; and Snoek, C. G. 2019. Dance with flow: Two-in-one stream action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9935–9944.
  • Zhou et al. (2018) Zhou, B.; Andonian, A.; Oliva, A.; and Torralba, A. 2018. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), 803–818.
  • Zolfaghari, Singh, and Brox (2018) Zolfaghari, M.; Singh, K.; and Brox, T. 2018. Eco: Efficient convolutional network for online video understanding. In Proceedings of the European conference on computer vision (ECCV), 695–712.