1 Introduction
Collaborative intelligence (CI) [1] has emerged as a promising strategy to bring AI “to the edge.” In a typical CI system (Fig. 1), a deep neural network (DNN) is split into two parts: the edge submodel, deployed on the edge device near the sensor, and the cloud submodel, deployed in the cloud. Intermediate features produced by the edge submodel are transferred from the edge device to the cloud. It has been shown that such a strategy may provide better energy efficiency [2, 3], lower latency [2, 3, 4], and lower bitrates over the communication channel [5, 6] compared to more traditional cloud-based analytics, where the input signal is sent directly to the cloud. These potential benefits make CI attractive for applications in areas such as intelligent sensing [7] and video coding for machines [8, 9]. In particular, compression of intermediate features has become an important research problem, with a number of recent developments [10, 11, 12, 13, 14] for the case when the input to the edge submodel is a still image.
When the input to the edge submodel is video, its output is a sequence of feature tensors produced from successive frames in the input video. This sequence of feature tensors needs to be compressed prior to transmission and then decoded in the cloud for further processing. Since motion plays such an important role in video processing and compression, we are motivated to examine whether any similar relationship exists in the latent space among the feature tensors. Our theoretical and experimental results show that, indeed, motion from the input video is approximately preserved in the channels of the feature tensor. An illustration of this is presented in Fig. 2, where the estimated input-space motion field is shown on the left, and the estimated motion fields in several feature tensor channels are shown on the right. These findings suggest that methods for motion estimation, compensation, and analysis that have been developed for conventional video processing and compression may provide a solid starting point for equivalent operations in the latent space.
The paper is organized as follows. In Section 2, we analyze the actions of typical operations found in deep convolutional neural networks on optical flow in the input signal, and show that these operations tend to preserve the optical flow, at least approximately, up to an appropriate scale. In Section 3, we provide empirical support for the theoretical analysis from Section 2. Finally, Section 4 concludes the paper.

2 Latent space motion analysis
The basic problem studied in this paper is illustrated in Fig. 3. Consider two images (video frames) input to the edge submodel of a CI system. It is common to represent their relationship via a motion model. The question we seek to answer here is, “what is the relationship between the corresponding feature tensors produced by the edge submodel?” To answer this question, we will look at the processing pipeline between the input image and a given channel of a feature tensor. In most deep models for computer vision applications, this processing pipeline consists of a sequence of basic operations: convolutions, pointwise nonlinearities, and pooling. We will show that each of these operations tends to preserve motion, at least approximately, in a certain sense, and from this we will conclude that (approximate) input motion may be observed in individual channels of a feature tensor.
Motion model. Optical flow is a frequently used motion model in computer vision and video processing. In a “2D+t” model, $f(x, y, t)$ denotes pixel intensity at time $t$, at spatial position $(x, y)$. Under a constant-intensity assumption, optical flow satisfies the following partial differential equation [15]:

\[ \frac{\partial f}{\partial x} v_x + \frac{\partial f}{\partial y} v_y + \frac{\partial f}{\partial t} = 0, \tag{1} \]
where $\mathbf{v} = (v_x, v_y)$ represents the motion vector. For notational simplicity, in the analysis below we will use a “1D+t” model, which captures all the main ideas but keeps the equations shorter. In a “1D+t” model, $f(x, t)$ denotes intensity at position $x$ at time $t$, and the optical flow equation is

\[ \frac{\partial f}{\partial x} v + \frac{\partial f}{\partial t} = 0, \tag{2} \]

with $v$ representing the motion. We will analyze the effect of basic operations (convolutions, pointwise nonlinearities, and pooling) on (2), to gain insight into the relationship between input-space motion and latent-space motion.
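As a quick illustration (not part of the paper's own experiments), the 1D+t flow equation (2) can be checked numerically: for a signal translating at a constant velocity $v$, the finite-difference residual of (2) is small relative to the size of the temporal derivative. The signal, velocity, and step sizes below are illustrative choices.

```python
import numpy as np

# Numeric check of the 1D+t optical flow equation (2): for a signal
# translating at constant velocity v, f(x, t) = g(x - v*t), the residual
# f_x * v + f_t should vanish up to discretization error.
v = 1.5                                   # true motion (pixels per frame)
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
dt = 0.01
g = lambda z: np.exp(-z**2)               # smooth translating profile

f0 = g(x - v * 0.0)                       # frame at t = 0
f1 = g(x - v * dt)                        # frame at t = dt
f_x = np.gradient(f0, dx)                 # spatial derivative (central diff.)
f_t = (f1 - f0) / dt                      # temporal derivative (forward diff.)

residual = f_x * v + f_t                  # left-hand side of (2)
print(np.max(np.abs(residual)))           # small compared to max |f_t|
```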
Convolution. Let $h(x)$ be a (spatial) filter kernel; then the optical flow after convolution is a solution to the following equation:

\[ \frac{\partial (h * f)}{\partial x} \tilde{v} + \frac{\partial (h * f)}{\partial t} = 0, \tag{3} \]

where $\tilde{v}$ is the motion after the convolution. Since convolution and differentiation are linear operations, we have

\[ \left( h * \frac{\partial f}{\partial x} \right) \tilde{v} + h * \frac{\partial f}{\partial t} = 0. \tag{4} \]

Hence, the solution $v$ from (2) is also a solution of (4): for motion that is constant over the support of $h$, the left-hand side of (4) with $\tilde{v} = v$ equals $h * \left( \frac{\partial f}{\partial x} v + \frac{\partial f}{\partial t} \right) = 0$. However, (4) could also have other solutions, besides those that satisfy (2).
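The motion-preserving behavior of convolution can also be seen through shift-equivariance: filtering commutes with translation, so an input that moves by $d$ pixels yields a filtered output that moves by the same $d$. A short sketch (illustrative, not the paper's code; circular convolution is used for simplicity):

```python
import numpy as np

# Convolution commutes with translation, so a signal moving by d pixels
# yields a filtered signal moving by the same d pixels.
rng = np.random.default_rng(0)
f = rng.standard_normal(256)                      # arbitrary 1D "frame"
h = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0    # example smoothing kernel

d = 3                                             # integer motion in pixels
f_shifted = np.roll(f, d)                         # frame at the next instant

# Filter both frames with circular convolution (periodic boundary).
conv = lambda sig: np.real(np.fft.ifft(np.fft.fft(sig) * np.fft.fft(h, len(sig))))
g0, g1 = conv(f), conv(f_shifted)

# The filtered frames are related by the same shift d: g1 == roll(g0, d).
print(np.max(np.abs(g1 - np.roll(g0, d))))        # ~ 0 (numerical precision)
```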
Pointwise nonlinearity. Nonlinear activations such as sigmoid, ReLU, etc., are commonly applied in a pointwise fashion to the output of convolutions in deep models. Let $\sigma(\cdot)$ denote such a pointwise nonlinearity; then the optical flow after this nonlinearity is a solution to the following equation:

\[ \frac{\partial \sigma(f)}{\partial x} \tilde{v} + \frac{\partial \sigma(f)}{\partial t} = 0, \tag{5} \]

where $\tilde{v}$ is the motion after the pointwise nonlinearity. By the chain rule of differentiation, the above equation can be rewritten as

\[ \sigma'(f) \left( \frac{\partial f}{\partial x} \tilde{v} + \frac{\partial f}{\partial t} \right) = 0. \tag{6} \]

Hence, again, the solution $v$ from (2) is also a solution of (6). It should be noted that (6) may have solutions other than those of (2). For example, in regions where the inputs to a ReLU are negative, the corresponding outputs are zero, so $\sigma'(f) = 0$. Hence, in those regions, (6) is satisfied for arbitrary $\tilde{v}$. Nonetheless, the solution from (2) is still one of those arbitrary solutions.
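The same shift-equivariance argument applies here: because a pointwise nonlinearity acts sample-by-sample, it commutes with translation exactly. A minimal sketch (illustrative names and sizes):

```python
import numpy as np

# A pointwise nonlinearity acts sample-by-sample, so it commutes with
# translation and therefore preserves the input motion.
rng = np.random.default_rng(1)
f = rng.standard_normal(128)          # arbitrary 1D "frame"
relu = lambda z: np.maximum(z, 0.0)   # example pointwise nonlinearity

d = 5                                 # integer motion in samples
lhs = relu(np.roll(f, d))             # nonlinearity applied to the moved frame
rhs = np.roll(relu(f), d)             # moved output of the nonlinearity
print(np.max(np.abs(lhs - rhs)))      # exactly 0: the two operations commute
```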
Pooling. There are various forms of pooling, such as max-pooling, mean-pooling, learned pooling (via strided convolutions), etc. All of these can be decomposed into a sequence of two operations: a spatial operation (local maximum or convolution) followed by a scale change (downsampling). Spatial convolution operations can be analyzed as above, and the conclusion is that motion before such an operation is also a solution to the optical flow equation after such an operation. Hence, we will focus here on the local maximum operation and the scale change.
Local maximum. Consider the maximum of $f$ over a local spatial region $[x_0 - \Delta, x_0 + \Delta]$, at a given time $t$. We can approximate $f$ as a locally-linear function whose slope is the spatial derivative of $f$ at $x_0$, $\frac{\partial f}{\partial x}(x_0, t)$. If the derivative is positive, the maximum is at $x_0 + \Delta$; if it is negative, it is at $x_0 - \Delta$. In the special case when the derivative is zero, any point in $[x_0 - \Delta, x_0 + \Delta]$, including the endpoints, is a maximum. From the Taylor series expansion of $f$ around $x_0$ up to and including the first-order term,

\[ f(x, t) \approx f(x_0, t) + (x - x_0)\frac{\partial f}{\partial x}(x_0, t) \tag{7} \]

for $x \in [x_0 - \Delta, x_0 + \Delta]$. With such a linear approximation, the local maximum of $f$ over $[x_0 - \Delta, x_0 + \Delta]$ occurs either at $x_0 - \Delta$ or at $x_0 + \Delta$, depending on the sign of $\frac{\partial f}{\partial x}(x_0, t)$; if the derivative is zero, every point in the interval is a local maximum. Hence, the local maximum of $f$ can be approximated as

\[ \max_{x \in [x_0 - \Delta,\, x_0 + \Delta]} f(x, t) \approx f(x_0, t) + \Delta \left| \frac{\partial f}{\partial x}(x_0, t) \right|. \tag{8} \]
Let (8) define $g(x, t) = f(x, t) + \Delta \left| \frac{\partial f}{\partial x}(x, t) \right|$, the function that takes on the local spatial maximum values of $f$ over windows of size $2\Delta$. The optical flow after such a local maximum operation is described by

\[ \frac{\partial g}{\partial x} \tilde{v} + \frac{\partial g}{\partial t} = 0, \tag{9} \]

where $\tilde{v}$ represents the motion after the local spatial maximum operation. Using (8) in (9), after some manipulation we obtain the following equation:

\[ \left( \frac{\partial f}{\partial x} + \Delta \operatorname{sgn}\!\left(\frac{\partial f}{\partial x}\right) \frac{\partial^2 f}{\partial x^2} \right) \tilde{v} + \frac{\partial f}{\partial t} + \Delta \operatorname{sgn}\!\left(\frac{\partial f}{\partial x}\right) \frac{\partial^2 f}{\partial x \, \partial t} = 0. \tag{10} \]

Note that if $v$ satisfies the original optical flow equation (2), it will also satisfy (10): for spatially-constant motion, differentiating (2) with respect to $x$ gives $\frac{\partial^2 f}{\partial x^2} v + \frac{\partial^2 f}{\partial x \, \partial t} = 0$, so both groups of terms in (10) vanish with $\tilde{v} = v$. Hence, pre-max motion $v$ is also one possible solution for the post-max motion $\tilde{v}$.
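The locally-linear approximation (8) that underpins this argument can be checked numerically: for a smooth signal and a small window half-width, the true windowed maximum and the approximation differ only at second order in the window size. The test function below is an illustrative choice.

```python
import numpy as np

# Numeric check of the locally-linear approximation (8): the maximum of a
# smooth f over [x0 - D, x0 + D] is close to f(x0) + D * |f'(x0)| when the
# window half-width D is small (error is second order in D).
f = lambda z: np.sin(z)       # smooth test signal
fp = lambda z: np.cos(z)      # its exact spatial derivative
D = 0.05                      # window half-width

errs = []
for x0 in np.linspace(0.3, 2.5, 12):
    window = np.linspace(x0 - D, x0 + D, 1001)
    true_max = np.max(f(window))                  # exact windowed maximum
    approx = f(x0) + D * np.abs(fp(x0))           # approximation (8)
    errs.append(abs(true_max - approx))
print(max(errs))              # below D**2 = 0.0025 for this smooth example
```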
Scale change. Finally, consider a change of spatial scale by a factor $s > 0$, such that the new signal is $f_s(x, t) = f(sx, t)$. The optical flow equation is now

\[ \frac{\partial f_s}{\partial x} \tilde{v} + \frac{\partial f_s}{\partial t} = 0. \tag{11} \]

Since $\frac{\partial f_s}{\partial x}(x, t) = s \frac{\partial f}{\partial x}(sx, t)$ and $\frac{\partial f_s}{\partial t}(x, t) = \frac{\partial f}{\partial t}(sx, t)$, we conclude that $\tilde{v} = v / s$, where $v$ is the solution to the pre-scaling optical flow equation (2). Hence, as expected, downscaling the signal spatially by a factor of $s$ ($s > 1$) reduces the motion by a factor of $s$.
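This rescaling of motion is easy to observe discretely: a smooth signal that moves by $d$ samples, once downsampled by a factor $s$, moves by $d/s$ samples. A small sketch (illustrative signal and parameters):

```python
import numpy as np

# Spatial downsampling by a factor s rescales motion by 1/s: a signal
# moving by d = 4 samples, downsampled by s = 2, moves by d/s = 2 samples.
x = np.arange(512)
f = np.exp(-((x - 200.0) / 20.0) ** 2)      # smooth bump centered at 200
d, s = 4, 2

f0_down = f[::s]                            # downsampled frame at t = 0
f1_down = np.roll(f, d)[::s]                # downsampled frame after motion d

# The downsampled frames differ by a shift of d/s samples.
print(np.max(np.abs(f1_down - np.roll(f0_down, d // s))))   # ~ 0
```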
Combining the results of the above analyses, we conclude that convolutions, pointwise nonlinearities, and local maximum operations tend to be motion-preserving operations, in the sense that pre-operation motion is also a solution to the post-operation optical flow equation, at least approximately. The operation with the most obvious impact on motion is the scale change. Hence, when looking at latent-space motion at some layer in a deep model, we should expect to find motion similar to the input motion, but scaled down by a factor of $2^K$, where $K$ is the number of pooling operations (over $2 \times 2$ windows) between the input and the layer of interest. Specifically, if $\mathbf{v} = (v_x, v_y)$ is the motion vector at some position in the input frame, then at the corresponding spatial location in all the channels of the feature tensor we can expect to find the vector

\[ \tilde{\mathbf{v}} = \left( \frac{v_x}{2^K}, \frac{v_y}{2^K} \right). \tag{12} \]
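The combined rule (12) can be sketched end-to-end in 1D: after $K$ stride-2 pooling stages, an input translation of $2^K$ samples appears as a one-sample translation in the pooled output. The pooling operator and signal below are illustrative stand-ins for a real network's layers.

```python
import numpy as np

# Sketch of (12) in 1D: after K stride-2 pooling stages, an input
# translation of 2**K samples becomes a one-sample translation.
def mean_pool(sig):
    """2-to-1 mean pooling: average over non-overlapping windows of 2."""
    return sig.reshape(-1, 2).mean(axis=1)

def features(sig, k):
    """Apply k stages of stride-2 mean pooling (toy 'edge submodel')."""
    for _ in range(k):
        sig = mean_pool(sig)
    return sig

x = np.arange(1024)
f = np.exp(-((x - 400.0) / 30.0) ** 2)        # smooth input signal
K = 2
d = 2 ** K                                    # input-space motion (4 samples)

g0 = features(f, K)                           # features of original frame
g1 = features(np.roll(f, d), K)               # features of moved frame
print(np.max(np.abs(g1 - np.roll(g0, 1))))    # ~ 0: latent motion is d / 2**K
```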
In Section 3, we will verify these conclusions experimentally.
3 Experiments
An illustration of the correspondence between the input-space motion and latent-space motion was shown in Fig. 2. This example was produced using a pair of frames from a video of a moving car. The motion vectors were estimated using an exhaustive block-matching search at each pixel, which sought to minimize the sum of squared differences (SSD). In the input frames, a block around each pixel and a search range matched to the input resolution were used; in the corresponding lower-resolution feature tensor channels, a proportionally smaller block size and search range were used. Although the estimated motion vector fields are somewhat noisy, the similarity between the input-space motion and latent-space motion is evident.
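The exhaustive SSD block-matching search described above can be sketched as follows. The block size, search range, and toy frames are illustrative choices, not the values used in the experiment.

```python
import numpy as np

# Exhaustive block-matching motion estimation: for a block around a pixel,
# find the displacement within a search range that minimizes the sum of
# squared differences (SSD) against the previous frame.
def block_match(prev, curr, y, x, block=4, search=3):
    """Return (dy, dx) such that curr[y, x] best matches prev[y-dy, x-dx]."""
    b = block // 2
    ref = curr[y - b:y + b, x - b:x + b]
    best, best_ssd = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = prev[y - b - dy:y + b - dy, x - b - dx:x + b - dx]
            ssd = np.sum((ref - cand) ** 2)
            if ssd < best_ssd:
                best, best_ssd = (dy, dx), ssd
    return best

# Toy check: a bright square translated by (2, 1) is recovered exactly.
prev = np.zeros((32, 32)); prev[10:14, 10:14] = 1.0
curr = np.roll(np.roll(prev, 2, axis=0), 1, axis=1)
print(block_match(prev, curr, 14, 13))        # -> (2, 1)
```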
To examine the relationship between input-space and latent-space motion more closely, we performed several experiments with synthetic input-space motion. In this case, the exact input-space motion is known, so relationship (12) can be tested more reliably. Fig. 4 shows examples of various transformations (translation, rotation, stretching, shearing) applied to an input image of a dog. The second column displays several channels from the actual tensor produced by the transformed image, and the third column shows the corresponding channels produced by motion-compensating the tensor of the original image via (12). The last column shows the difference between the actual and predicted tensor channels. Note that regions that cannot be predicted, such as regions “entering the frame,” were excluded from the difference computation. As seen in Fig. 4, the model (12) works reasonably well, and the differences between the actual and predicted tensors are low.
For quantitative evaluation, experiments were conducted on several layers of ResNet-34 [17] and DenseNet-121 [18]. The Normalized Root Mean Square Error (NRMSE) [19] was used for this purpose:

\[ \text{NRMSE} = \frac{1}{R} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2}, \tag{13} \]

where $y_i$ is the actual tensor value produced from the transformed input, $\hat{y}_i$ is the tensor value predicted using our motion model (12), $N$ is the number of elements in the feature tensor, and $R$ is the dynamic range. Again, regions that cannot be predicted were excluded from the NRMSE computation. Fig. 5 shows NRMSE computed across a range of parameters for several transformations, at various layers of the two DNNs.
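The metric (13) is straightforward to compute; the sketch below takes the dynamic range $R$ to be the max-minus-min of the actual tensor, which is one common convention (an assumption here, since the text does not spell out how $R$ is measured).

```python
import numpy as np

# NRMSE as in (13): RMSE between actual and predicted tensors, normalized
# by the dynamic range R (taken here as max - min of the actual tensor).
def nrmse(actual, predicted):
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    dynamic_range = actual.max() - actual.min()
    return rmse / dynamic_range

a = np.array([0.0, 1.0, 2.0, 3.0])    # "actual" tensor values
p = np.array([0.1, 1.1, 1.9, 3.1])    # "predicted" tensor values
print(nrmse(a, p))                    # 0.1 / 3 = 0.0333...
```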
As seen in Fig. 5, NRMSE goes up to about 0.04 for reasonable ranges of transformation parameters. How good is this? To answer this question, we set out to find the typical values of NRMSE found in conventional motion-compensated frame prediction. In a recent study [20], the quality of frames predicted by conventional motion estimation and motion compensation (MEMC) in High Efficiency Video Coding (HEVC) [21] was compared against a DNN developed for frame prediction. From Table III in [20], the luminance Peak Signal to Noise Ratio (PSNR) of frames predicted unidirectionally by the DNN and by conventional HEVC MEMC was in the range 27–41 dB over several HEVC test sequences. NRMSE can be computed from PSNR as
\[ \text{NRMSE} = 10^{-\text{PSNR}/20}, \tag{14} \]

so the PSNR range of 27–41 dB corresponds to the NRMSE range of 0.009–0.044. These levels of NRMSE are indicative of how much the motion models used in video coding deviate from the true motion. As seen in Fig. 5, the model (12) produces NRMSE in the same range, so the accuracy of (12) is comparable to the accuracy of common motion models used in video coding. Another illustration of this is presented in Fig. 6, which shows the histogram of NRMSE computed across a range of affine transformation parameters. Hence, (12) represents a good starting point for the development of latent-space motion compensation.
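The conversion in (14) follows because $\text{PSNR} = 20 \log_{10}(R/\text{RMSE})$ and $\text{NRMSE} = \text{RMSE}/R$; it can be sanity-checked at the endpoints of the quoted range:

```python
# Check of (14): NRMSE = 10 ** (-PSNR / 20), since
# PSNR = 20 * log10(R / RMSE) and NRMSE = RMSE / R.
psnr_to_nrmse = lambda psnr: 10.0 ** (-psnr / 20.0)
print(round(psnr_to_nrmse(27.0), 3))   # 0.045
print(round(psnr_to_nrmse(41.0), 3))   # 0.009
```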
4 Conclusions
Using the concept of optical flow, in this paper we analyzed the motion induced in the latent space of a deep model by motion in the input space, and showed that motion tends to be approximately preserved in the channels of intermediate feature tensors. These findings suggest that motion estimation, compensation, and analysis methods developed for conventional video signals should provide a good starting point for latent-space motion processing, such as motion-compensated prediction and compression, tracking, action recognition, and other applications.
References
 [1] I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in Proc. IEEE ICASSP, 2021, to appear.
 [2] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in Proc. 22nd ACM Int. Conf. Arch. Support Programming Languages and Operating Syst., 2017, pp. 615–629.
 [3] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Trans. Mobile Computing, 2019, Early Access.
 [4] M. Ulhaq and I. V. Bajić, “Shared mobile-cloud inference for collaborative intelligence,” arXiv:2002.00157, 2019, NeurIPS’19 demonstration.

 [5] H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” in Proc. IEEE ICIP, Oct. 2018, pp. 3743–3747.
 [6] H. Choi and I. V. Bajić, “Near-lossless deep feature compression for collaborative intelligence,” in Proc. IEEE MMSP, Aug. 2018, pp. 1–6.
 [7] Z. Chen, K. Fan, S. Wang, L. Duan, W. Lin, and A. C. Kot, “Toward intelligent sensing: Intermediate deep feature compression,” IEEE Trans. Image Processing, vol. 29, pp. 2230–2243, 2019.
 [8] ISO/IEC, “Draft call for evidence for video coding for machines,” ISO/IEC JTC 1/SC 29/WG 11 W19508, Jul. 2020.
 [9] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Transactions on Image Processing, vol. 29, pp. 8680–8695, 2020.
 [10] S. R. Alvar and I. V. Bajić, “Multi-task learning with compressible features for collaborative intelligence,” in Proc. IEEE ICIP, Sep. 2019, pp. 1705–1709.
 [11] H. Choi, R. A. Cohen, and I. V. Bajić, “Back-and-forth prediction for deep tensor compression,” in Proc. IEEE ICASSP, 2020, pp. 4467–4471.
 [12] S. R. Alvar and I. V. Bajić, “Bit allocation for multi-task collaborative intelligence,” in Proc. IEEE ICASSP, May 2020, pp. 4342–4346.
 [13] R. A. Cohen, H. Choi, and I. V. Bajić, “Lightweight compression of neural network feature tensors for collaborative intelligence,” in Proc. IEEE ICME, Jul. 2020, pp. 1–6.
 [14] S. R. Alvar and I. V. Bajić, “Pareto-optimal bit allocation for collaborative intelligence,” arXiv:2009.12430, Sep. 2020.
 [15] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1, pp. 185–203, 1981.
 [16] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications, Prentice-Hall, 2002.
 [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016, pp. 770–778.
 [18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE CVPR, 2017, pp. 2261–2269.
 [19] Wikipedia contributors, “Root-mean-square deviation — Wikipedia, the free encyclopedia,” 2020, [Online] Available: https://en.wikipedia.org/wiki/Root-mean-square_deviation.
 [20] H. Choi and I. V. Bajić, “Deep frame prediction for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1843–1855, Jul. 2020.
 [21] ITU-T, “High efficiency video coding,” Recommendation ITU-T H.265, Nov. 2019.