

Physics Constrained Flow Neural Network for Millisecond-timescale Predictions in Data Communication Networks

Xiangle Cheng1, Shihan Xiao1, Yingxue Zhang2, Zhitang Chen3 and Pascal Poupart4
1Huawei 2012 Network Technology Lab
2Huawei Technologies Canada
3Huawei Noah’s Ark Lab
4University of Waterloo
{chengxiangle1, xiaoshihan, yingxue.zhang, chenzhitang2}@huawei.com, [email protected]
Abstract

Machine learning is gaining momentum in various recent models for the dynamic analysis of information flows in data communication networks. These preliminary models often rely on off-the-shelf learning techniques to make predictions based on historical statistics while disregarding the physics governing the generative process of these flows. Instead, this paper introduces Flow Neural Network (FlowNN), which improves short-timescale predictions with physical knowledge. This is implemented by embedding the logic of network congestion control and hop-by-hop forwarding into the data encoding layer. A self-supervised learning strategy with stop-gradient is also designed to improve the transferability of the learned physical logic. On millisecond-timescale flow prediction tasks, FlowNN decreases the loss by 17%–71% in comparison to state-of-the-art baselines on both synthetic and real-world networking datasets, which shows the strength of this new approach. Code will be made available.

1 Introduction

Data communication networks provide the majority of the data transmission services that support today’s internet applications and present huge social value. This paper proposes a dynamic analysis model for flows of information from sources to destinations in data communication networks. As depicted in Fig. 1, the source of a flow is a node (computer, phone, router/switch, etc.) in the network from which packets start their travel, and the destination is where they end (information is encapsulated in packets, which can be seen as particles travelling through the network). Throughout the lifespan of a flow (e.g., a 10-minute phone call or a 2-hour online video), packets consistently travel along the assigned routing path connecting the source and destination. Depending on the real-time congestion conditions of the nodes on the routing path, packets may experience different buffering or retransmission delays, resulting in varying transmission rates and service delays.

Figure 1: Packet flows in data communication networks. It is better to predict the bottom-right $1\,Gbps$ by referring only to the upper-left $1\,Gbps$ and excluding the information of the $0\,Gbps$ at the bottom left.

A good understanding of the behavioral patterns of the underlying packets and flows plays a crucial role in network planning, traffic management, and optimizing Quality of Service (QoS, e.g., transmission rate, delay, etc.). However, the highly nonlinear dynamics of the networking system often create challenges for fine-grained analysis. In particular, at the millisecond timescale, traffic traces are highly skewed and it is difficult to recognize obvious patterns.

Therefore, recent works often rely on off-the-shelf machine learning models to make predictions based on the data statistics of a flow. The remarkable advances achieved in long-timescale prediction tasks (e.g., at intervals of seconds or minutes Marcus et al. (2020); Mozo et al. (2018)) are a testament to this minimalist principle. A key limitation of these machine learning models is that they disregard the physics governing the generative process of these packet flows.

In practice, the dynamics of spatio-temporal flows are governed by physical laws or models regulating the networking system. For instance, in networks with a TCP/IP protocol suite, the real-time transmission rates of different packet sources are typically modulated dynamically by congestion control models to prevent congestive collapse Nathan et al. (2019). The conservation law describes the macroscopic fluid dynamics of data flows during hop-by-hop transmissions Apice et al. (2008). The physics establishes an inductive bias Peter et al. (2018) when predicting from history. That is, the future is not entirely random or noisy even when observed at a millisecond granularity, but can be predicted from physical models or laws.

As illustrated in Fig. 1, assume $1\,MB$ of data is sent out from the source node $n_0$ (i.e., Sender1) during the first time interval $t=1\,ms$. After a certain delay (e.g., $1\,ms$) due to link propagation and packet processing, this first $1\,MB$ of data arrives at the next routing node $R_1$ at $t=2\,ms$. Hence, at $t=1\,ms$, the sending rate at the source node $n_0$ and the receiving rate at router $R_1$ are $x_{r,n_0}^{t_1}=1\,Gbps$ and $x_{r,R_1}^{t_1}=0\,Gbps$, respectively. At $t=2\,ms$, the associated rate at $R_1$ becomes $x_{r,R_1}^{t_2}=1\,Gbps$, while $x_{r,n_0}^{t_2}$ is again determined by the congestion control model according to the acknowledged ACK information. This forwarding process continues hop by hop until the packets reach the destination (Receiver1).

The above hop-by-hop data transmission establishes a special connection between space and time. We can observe in Fig. 1 that $x_{r,R_1}^{t_2}=1\,Gbps$ is physically determined by $x_{r,n_0}^{t_1}=1\,Gbps$, the history information of its predecessor node (i.e., source $n_0$), rather than by its own history record $x_{r,R_1}^{t_1}=0\,Gbps$. Therefore, such time-resolved flow data not only tells us who is related to whom, but also when and in which order relations occur. This provides a specific relational bias for modelling the spatio-temporal evolution of such flow data.
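To make this space-time shift concrete, here is a toy numerical sketch (the numbers are illustrative, not measurements from the paper): with a fixed 1 ms per-hop delay, the receiving-rate series at $R_1$ is simply the source’s sending-rate series shifted by one step.

```python
# Toy illustration of the hop-by-hop shift: with a 1 ms per-hop delay,
# the rate observed at R1 at step t equals the rate n0 sent at step t-1.
send_rate_n0 = [1, 2, 1, 3, 2]           # Gbps per 1 ms interval at the source
recv_rate_R1 = [0] + send_rate_n0[:-1]   # one-step shift from the hop delay
print(recv_rate_R1)                      # [0, 1, 2, 1, 3]
```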

In this paper, we investigate how to embed such essential physical bias in a learning model to improve the short-timescale (millisecond-level) flow predictions. Accordingly, we propose Flow Neural Network (FlowNN), the first customized learning model for the data analysis of networking flows, while respecting packet routing structures and network mechanisms.

Our contributions can be summarized as follows: (1) We develop FlowNN, a prediction model that exploits the data correlations physically induced by the congestion control model and the flow conservation rule during hop-by-hop data forwarding; (2) we design a self-supervised learning strategy with stop-gradient that learns the feature representation and the physical bias simultaneously at each gradient update step; (3) we show on realistic packet datasets that FlowNN achieves consistently better results, outperforming the incumbent baselines by reducing the loss by 17%–71% on short-timescale prediction problems.

2 Preliminaries

Figure 2: (a) Data of a packet flow in the form of a time-series tensor. The blue curve characterizes the data correlations propagated along the space-time dimensions; (b) a sample of the data rates of a flow and the observed rate difference between two neighboring routing nodes. The sequences $\mathcal{S}$ in the source time window $T_1$ are highly correlated with the sequences $\mathcal{T}$ in the target window $T_2$ through the effect of flow conservation.

Packet Flow. An $s$-$d$ flow is a sequence of packets encapsulating the message bits and being forwarded from source $s$ along an assigned routing path $P\in\mathbb{Z}^L$ to destination $d$. Let $x_{f,n}^t\in\mathbb{R}$ be the value of feature $f$ measured at node $n\in P$ during the time interval $t$. For instance, $x_{f,n}^t$ can be the average packet travelling delay or the transmission/receiving rate of a flow. Let us denote the time-series tensor of a flow as $X^t=\{x_n^t\}_{n=1,\cdots,L}\in\mathbb{R}^{L\times N_f}$. With telemetry monitors installed at each node, we can collect running traces of $X^t$ as shown in Fig. 2(a).

Packet Flow Prediction Problem. Given the time-series tensor $X^t$ of a flow in the form of Fig. 2(a), predict its future service qualities $x_{f,n}^{t+1}$, such as the next-time-step delay or transmission/receiving rate.

Congestion Control. A congestion control model can be regarded as mapping a locally-perceived history of feedback from the receiver, which reflects past traffic and network conditions, to the next choice of sending rate Nathan et al. (2019). Such built-in controls in today’s networking systems can avoid congestive collapse arising from overloading.

Flow Conservation. For a packet flow with a given routing path, we have the following conservation constraint (packets may be lost when congestion occurs; with congestion control models and buffering mechanisms, the loss probability is quite small ($<0.1\%$) in current commercial networking systems and can be tolerated by the statistical data learning process):

$$\sum_t x_{r,n}^t = \sum_t x_{r,n+1}^t = b_n, \quad \forall (n,n+1)\subseteq P \qquad (1)$$

where $b_n$ is the cumulative number of flow bits the source sends out.

Eq. 1 analytically constrains the sending and receiving bit rates at each node along the routing path. Consequently, without congestion, the bit rates observed at each routing node will be exactly the same. If, however, the routing nodes experience different congestion due to link intersection, the rates observed at two neighboring routing nodes will alternately become larger and smaller than one another over the flow lifespan. As shown in Fig. 2(b), this explicitly forms a set of paired source-target time windows within the time series of two neighboring nodes. During the source window $T_1$, a node (e.g., node 1) presents larger rates than its successor node (node 2). Then, in the target window $T_2$, node 1 presents smaller rates than node 2. Finally, the cumulative amount of forwarded flow bits is conserved among the routing nodes. This property arises from the routing structure and packet buffering mechanism in communication networks, and is independent of network and flow configurations.
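To illustrate the window pairing, the following minimal sketch (our own illustration, not the authors’ released code) splits a rate-difference sequence between two neighboring nodes into paired $\langle T_1, T_2\rangle$ windows at the sign changes of the difference:

```python
import numpy as np

def split_windows(rate_diff):
    """Split a rate-difference sequence x_{r,n} - x_{r,n+1} into paired
    <source, target> windows; a pair closes when the difference returns
    from negative to non-negative (a zero-crossing, cf. Fig. 2(b))."""
    pairs, t1, t2 = [], [], []
    for i, diff in enumerate(rate_diff):
        if diff >= 0:
            if t2:  # a negative run just ended: close the current pair
                pairs.append((np.array(t1), np.array(t2)))
                t1, t2 = [], []
            t1.append(i)   # source window T1: node n sends faster
        else:
            t2.append(i)   # target window T2: node n+1 catches up
    if t1 or t2:
        pairs.append((np.array(t1), np.array(t2)))
    return pairs

# toy example: node 1 first leads node 2, then node 2 catches up, twice
diff = [0.4, 0.2, 0.1, -0.2, -0.3, 0.3, 0.1, -0.4]
for T1, T2 in split_windows(diff):
    print("T1:", T1, "T2:", T2)   # T1: [0 1 2] T2: [3 4], then T1: [5 6] T2: [7]
```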

Accurately modelling the flow dynamics requires attention to the above domain-specific physics and properties, which are absent in most (if not all) existing learning models.

Next, we introduce FlowNN to embed the physics as a learning bias to improve data analysis.

Remark 1: We are not resorting to learning a strict conservation constraint with FlowNN. Instead, we introduce the paired source-target windows as a soft constraint or relational inductive bias Peter et al. (2018) to improve predictions by referring only to the spatio-temporal data filtered by each paired window. For example, in Fig. 1, $x_{r,n_0}^{t_1}=1\,Gbps$ is helpful for predicting $x_{r,R_1}^{t_2}=1\,Gbps$, while $x_{r,R_1}^{t_1}=0\,Gbps$ is noisy information and should thus be excluded from such a prediction.

3 Flow Neural Network

Table 1: Notation of key symbols
$L, N_f, d, N$ | length of a routing path $P$, number of features, number of hidden dimensions, and number of recurrent iterations in FlowNN, respectively
$x_{f,n}^t \in \mathbb{R}$ | value of feature $f$ at node $n \in P$; $x_{r,n}^t$ specifically denotes the transmit/receive rate
$x_n^t \in \mathbb{R}^{N_f}$ | the set of all $N_f$ features at node $n$ and time $t$
$X^t \in \mathbb{R}^{L \times N_f}$ | time-series tensor of a flow at $t$
$\hat{h}_n^t \in \mathbb{R}^d$ | hidden vector of node $n$ at $t$
$f_{L_1}$ | neural net for the initial embedding of $x_n^t$
$f_{L_2}$ | compound neural net with a multi-layer perceptron (MLP) and a recurrent net that aggregates all $h_n^t$ along dimension $n$ and then dimension $t$
$f_{L_3}$ | compound neural net with an MLP that aggregates the states of two neighboring nodes at aligned $t$, and a Seq2Seq net that predicts the target window from its correlated source window
Figure 3: FlowNN architecture.

Our architecture (Fig. 3) consists of three learnable neural network functions, denoted $f_{L_1}$, $f_{L_2}$ and $f_{L_3}$, respectively. FlowNN builds upon the initial embedding layer $L_1$ to further impose the learning bias from the working mechanisms of the physical system.

Concretely, for the time-series tensor of each flow, $X^t=\{x_n^t\}_{n=1,\cdots,L}\in\mathbb{R}^{L\times N_f}$, $f_{L_1}$ uses an MLP to project each $x_n^t\in\mathbb{R}^{N_f}$ into the latent space $\mathbb{R}^d$:

$$h_n^\tau = f_{L_1}(x_n^\tau), \quad \forall \tau=t_0,\cdots,t; \; n=1,\cdots,L \qquad (2)$$

Then, the PathAggregator $f_{L_2}$ uses an MLP to aggregate $H^\tau=\{h_n^\tau\}_{n=1,\cdots,L}\in\mathbb{R}^{L\times d}$ from layer $L_1$ along dimension $n$, and a recurrent net (e.g., a GRU) to aggregate over time $\tau$:

$$\hat{h}_1^\tau = f_{L_2}(H^\tau) = \mathrm{GRU}(\mathrm{MLP}(H^\tau)) \in \mathbb{R}^d \qquad (3)$$

The output of the PathAggregator, $\hat{h}_1^\tau$, aggregates the locally-perceived traffic and network conditions along the routing path, which can be mapped to the next sending rate of the source node. It physically plays the role of a simulator imitating the decision process of existing congestion control models.
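For illustration, a minimal PyTorch sketch of the embedding layer $f_{L_1}$ (Eq. 2) together with the PathAggregator $f_{L_2}$ (Eq. 3) could look as follows; the layer sizes and the single-layer GRU are our own assumptions, not the paper’s released implementation:

```python
import torch
import torch.nn as nn

class PathAggregator(nn.Module):
    """Sketch of f_L2 (Eq. 3): an MLP aggregates the node states along the
    path dimension, then a GRU aggregates the result over time."""
    def __init__(self, num_nodes, d):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(num_nodes * d, d), nn.ReLU())
        self.gru = nn.GRU(d, d, batch_first=True)

    def forward(self, H):
        # H: (batch, time, num_nodes, d) hidden states from f_L1
        b, T, L, d = H.shape
        z = self.mlp(H.reshape(b, T, L * d))   # spatial aggregation per step
        out, _ = self.gru(z)                   # temporal aggregation
        return out                             # (batch, time, d): path state

# toy usage, with f_L1 as a per-node linear embedding (Eq. 2)
Nf, d, L, T = 3, 16, 5, 20
f_L1 = nn.Linear(Nf, d)
X = torch.randn(1, T, L, Nf)           # time-series tensor of one flow
H = f_L1(X)                            # h_n^tau for all nodes and steps
path_state = PathAggregator(L, d)(H)   # \hat{h}_1^tau
print(path_state.shape)                # torch.Size([1, 20, 16])
```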

With the aggregated path state $\hat{h}_1^t$ and the local hidden states of the individual nodes, $\{h_n^t\}_{n=2,\cdots,L}$ from $f_{L_1}$, the induction layer $L_3$ is then applied to estimate the new local states of each node under the new sending state (i.e., $\hat{h}_1^\tau$). Specifically, for each pair $\langle n,n+1\rangle$, $n=1,\cdots,L-1$, of two neighboring nodes (where $\langle\cdot\rangle$ denotes the paired set of the two included elements), we split the associated time series $\hat{h}_n^\tau$ into a set of correlation windows $\langle T_1^i,T_2^i\rangle_{i=1,2,\cdots}$ such that the boundaries of each time window correspond to timestamps where the rate difference between the two neighboring nodes is 0, i.e., $x_{r,n}^\tau - x_{r,n+1}^\tau = 0$ (see Fig. 2(b)). For each correlated window $\langle T_1^i,T_2^i\rangle$, we encode the hidden states of the sequences in $T_1^i$ and $T_2^i$ as follows:

$$\hat{h}_{n+1}^\tau = \mathrm{MLP}\big(\{\hat{h}_1^\tau \,\|\, h_{n+1}^\tau\}\big) \in \mathbb{R}^d, \quad \forall \tau \in T_1^i \qquad (4)$$
$$\hat{h}_{n+1}^\tau = \mathrm{Seq2Seq}\big(h_{n+1}^{\tau-1} \,|\, \{\hat{h}_1^t\}_{t\in T_1^i}\big) \in \mathbb{R}^d, \quad \forall \tau \in T_2^i \qquad (5)$$

where Seq2Seq is a sequence-to-sequence net Sutskever et al. (2014) that predicts the target sequences $\{\hat{h}_{n+1}^\tau\}_{\tau\in T_2^i}$ conditioned on the states of the source sequences $\{\hat{h}_1^t\}_{t\in T_1^i}$. In the experiments, we used GRU-type encoders and decoders to implement the Seq2Seq net. Here, $\|$ and $|$ are the concatenation and conditional operators, respectively.

Eq. 4 introduces the path-level messages (i.e., $\hat{h}_1^\tau$) into the locally-perceived states of node $n+1$ for all $\tau\in T_1^i$. Eq. 5 predicts the states of the sequences in $T_2^i$ at node $n+1$, conditioned on the states of the history window $T_1^i$.
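A rough PyTorch sketch of the induction computations in Eqs. 4 and 5 follows; the GRU encoder/decoder shapes and the step-by-step decoding over $T_2$ are our own assumptions about one plausible realization:

```python
import torch
import torch.nn as nn

class Induction(nn.Module):
    """Sketch of f_L3: Eq. 4 fuses the path state with the successor node's
    local state over the source window T1; Eq. 5 rolls a GRU decoder over
    the target window T2, conditioned on a summary of T1."""
    def __init__(self, d):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRUCell(d, d)

    def forward(self, h_path, h_next, T1, T2):
        # h_path: (T, d) path states; h_next: (T, d) states of node n+1
        src = self.mlp(torch.cat([h_path[T1], h_next[T1]], dim=-1))  # Eq. 4
        _, ctx = self.encoder(src.unsqueeze(0))   # summarize T1 -> (1, 1, d)
        state, outs = ctx.squeeze(0), []          # decoder state: (1, d)
        for tau in T2:                            # Eq. 5: predict over T2
            state = self.decoder(h_next[tau - 1].unsqueeze(0), state)
            outs.append(state.squeeze(0))
        return src, torch.stack(outs)             # states on T1 and on T2

# toy usage with one correlation window pair
d, T = 16, 10
ind = Induction(d)
h_path, h_next = torch.randn(T, d), torch.randn(T, d)
out_T1, out_T2 = ind(h_path, h_next, torch.arange(0, 4), torch.arange(4, 8))
print(out_T1.shape, out_T2.shape)   # torch.Size([4, 16]) torch.Size([4, 16])
```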

The steps in Eqs. 3–5 explicitly embed the natural data connections of Fig. 2(a) arising from congestion control and hop-by-hop data transmission. As shown in Fig. 3, the predicted outputs $\{\hat{h}_n^t\}_{n=1,\cdots,L}$ can be recurrently fed into the same PathAggregator $L_2$ and Induction $L_3$ to propagate messages, following the rules of Eqs. 3–5, over farther distances. The number $N$ of recurrent iterations is a tunable hyperparameter.

4 Self-Supervised Training Strategy

Figure 4: Self-supervised FlowNN training architecture.

The self-supervised FlowNN training architecture (Fig. 4) has two branches working on a shared initial embedding layer (encoder) $f_{L_1}$. The bottom branch predicts the states at time $t$ based on the information up to time $t-1$, using $f_p$ (i.e., PathAggregator $L_2$ and Induction $L_3$) to impose the physical bias upon the outputs of $f_{L_1}$. The top branch embeds the ground-truth features at time $t$ with an MLP projection $f_m$ that transforms the outputs of $f_{L_1}$. Denote the output vectors of the two branches as $\mathcal{Z}=(f_m(h_n^t))_{n=1,\cdots,L}$ and $\hat{\mathcal{Z}}=\{\hat{h}_n^t\}_{n=1,\cdots,L}$. Similar to Grill et al. (2020); Chen and He (2021), we define the following symmetrized loss to minimize their negative cosine similarity:

$$\mathcal{L} = -\frac{\mathrm{cos}(\mathrm{stopgrad}(\hat{\mathcal{Z}}),\mathcal{Z})}{2} - \frac{\mathrm{cos}(\hat{\mathcal{Z}},\mathrm{stopgrad}(\mathcal{Z}))}{2} \qquad (6)$$

where the $\mathrm{stopgrad}(\hat{\mathcal{Z}})$ operation detaches $\hat{\mathcal{Z}}$ from the gradient computation, so that $f_p$ receives no gradient from the first term of the loss $\mathcal{L}$ (and vice versa for $\mathcal{Z}$).

The training strategy in Fig. 4 improves Contrastive Predictive Coding Oord et al. (2019) with a stop-gradient operation. As annotated in Fig. 4, the output of the bottom pipeline is the predicted hidden vectors (the information at $t$ is NOT included when predicting for time $t$ with $f_p$). If we first remove the projector $f_m$, the upper pipeline produces the true encoded hidden vectors from $f_{L_1}$ with the information at $t$ included (these can thus be treated as the ground truth of the hidden vectors in latent space). The loss is constructed to make the predicted hidden vectors agree with the true encoded hidden vectors. The pretraining process uses FlowNN to learn the physical bias arising from the built-in mechanisms of networks. Here, $f_m$ is added to approximate the expectation over all possible inputs from each given input sample. This makes the physical bias learned by FlowNN stable across different inputs, which precisely explains the good transferability of the pretrained FlowNN in the cross-task tests (Table 3) and Out-Of-Distribution tests (Table 4). Refer to Appendix C for more details.
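For concreteness, the symmetrized loss in Eq. 6 can be sketched in PyTorch as follows (a minimal sketch; flattening the node dimension into the batch is our own simplification):

```python
import torch
import torch.nn.functional as F

def symmetrized_loss(z_hat, z):
    """Sketch of Eq. 6: symmetrized negative cosine similarity, with a
    stop-gradient (detach) on the opposite branch in each term."""
    def neg_cos(p, target):
        # target.detach(): no gradient flows through this branch
        return -F.cosine_similarity(p, target.detach(), dim=-1).mean()
    return 0.5 * neg_cos(z_hat, z) + 0.5 * neg_cos(z, z_hat)

# toy usage: z_hat from the predictive branch f_p, z from the projector f_m
z_hat = torch.randn(8, 16, requires_grad=True)   # predicted hidden vectors
z = torch.randn(8, 16, requires_grad=True)       # encoded "ground truth"
symmetrized_loss(z_hat, z).backward()            # both branches get gradients
```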

5 Experiments

We perform experiments to predict the end-to-end packet transmission delay and the sending/receiving rates $x_{r,n}^{t+1}$ along the routing path of each individual flow. Predictions of these two features provide critical information for the transmission optimization of each service.

Datasets. Four publicly available datasets, WIDE, NUMFabric Nagaraj et al. (2016), NSF and GM50 UPC (2020), are used. WIDE consists of realistic traffic traces collected from real-world network environments WIDE (2021); the other datasets consist of synthetic traffic widely used in recent networking research Nagaraj et al. (2016); Lei et al. (2019). We sampled 50 flows at a timescale of $1\,ms$ for a total length of $30000\,ms$. The raw features included in the time-series tensor of each flow are the sending/receiving rates, as well as the associated aggregated rates of all background flows at each routing node (at each routing node of a flow, all other flows traversing the same node are in the background of the given flow). Each flow has a routing path with 5 nodes in WIDE and NUMFabric, and variable-length routing paths ranging from 2 to 7 nodes in NSF and GM50. In all experiments, we split the time series of 40 flows into training, validation and test sets by ratio 6:2:2; the remaining 10 flows are retained to test the performance of the model on unseen flows after training.

Baselines. We compared FlowNN with the following closely related baselines: 1) Auto-Regressive Integrated Moving Average (ARIMA) Wang et al. (2006); 2) GRU Chung et al. (2014); 3) multivariate GRU (m-GRU); 4) STHGCN Marcus et al. (2020); and 5) STGCN Yu et al. (2018). In particular, GRU encodes and predicts each time-series sequence independently, without reference to information from other spatial nodes. In contrast, m-GRU processes the joint information from all routing nodes when predicting. STHGCN is a recent model for networking traffic analysis that uses graph convolutional networks Kipf and Welling (2016) and GRUs to predict the spatial states of all nodes at each time step. Finally, STGCN is a widely used model for transportation traffic analysis built upon spatio-temporal graph convolution.

The compared baselines treat the time-series tensor of a flow as multivariate input (GRU, m-GRU) or graph-type input (STHGCN, STGCN), which are the dominant approaches to modelling spatio-temporal relationships in existing studies. Note that FlowNN reduces to the m-GRU model if we remove the induction layer $L_3$ and set the number of iterations to $N=1$ in Fig. 3; therefore, we did not perform further ablation studies in the following experiments. We conducted a grid search over the hyper-parameters for all baselines, and $N=2$ recurrent iterations were chosen. Experiments were conducted on a single Nvidia P40 24GB GPU. More details about the model implementations and hyper-parameter search procedures are given in Appendix E.

Evaluation metrics. We used the following metrics to evaluate the quality of the predictions $\hat{y}$ against the ground truth $y$.

Mean squared error: $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ (7)

Relative squared error: $RSE = \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$ (8)

Correlation coefficient: $Corr = \frac{\sum_{i=1}^{n}(y_i-\bar{y})(\hat{y}_i-\hat{\bar{y}})}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(\hat{y}_i-\hat{\bar{y}})^2}}$ (9)

where $\bar{y}$ and $\hat{\bar{y}}$ are the mean values of $y$ and $\hat{y}$.
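The three metrics can be computed directly; a minimal NumPy sketch of Eqs. 7–9:

```python
import numpy as np

def metrics(y, y_hat):
    """MSE, RSE and correlation coefficient of Eqs. 7-9."""
    mse = np.mean((y - y_hat) ** 2)
    rse = np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    dy, dyh = y - y.mean(), y_hat - y_hat.mean()
    corr = np.sum(dy * dyh) / (np.sqrt(np.sum(dy ** 2)) * np.sqrt(np.sum(dyh ** 2)))
    return mse, rse, corr

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(metrics(y, y_hat))   # approx (0.025, 0.02, 0.991)
```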

Table 2: Test performance comparisons
Dataset | Model | MSE (seen) | RSE (seen) | Corr (seen) | MSE (unseen) | RSE (unseen) | Corr (unseen)
NUMFabric | Naive | 0.3567 | 0.6175 | 0.8091 | – | – | –
NUMFabric | ARIMA | 0.2724 | 0.5199 | 0.8543 | – | – | –
NUMFabric | GRU | 0.0872±0.0013 | 0.3021±0.0018 | 0.9533±0.0006 | 0.0895±0.0033 | 0.3092±0.0038 | 0.9510±0.0013
NUMFabric | m-GRU | 0.0306±0.0004 | 0.1789±0.0013 | 0.9840±0.0002 | 0.0311±0.0010 | 0.1823±0.0025 | 0.9833±0.0005
NUMFabric | STHGCN | 0.0405±0.0006 | 0.2060±0.0013 | 0.9785±0.0003 | 0.0417±0.0014 | 0.2109±0.0026 | 0.9775±0.0006
NUMFabric | STGCN | 0.0457±0.0007 | 0.2186±0.0018 | 0.9758±0.0006 | 0.0453±0.0012 | 0.2200±0.0039 | 0.9755±0.0018
NUMFabric | FlowNN | 0.0254±0.0003 | 0.1632±0.0010 | 0.9867±0.0002 | 0.0261±0.0009 | 0.1672±0.0021 | 0.9860±0.0004
WIDE | Naive | 0.8384 | 0.9420 | 0.5533 | – | – | –
WIDE | ARIMA | 0.4879 | 0.6422 | 0.7668 | – | – | –
WIDE | GRU | 0.5857±0.0074 | 0.7764±0.0057 | 0.6325±0.0071 | 0.5508±0.0113 | 0.7539±0.0088 | 0.6585±0.0103
WIDE | m-GRU | 0.4609±0.0061 | 0.6887±0.0059 | 0.7271±0.0055 | 0.4490±0.0087 | 0.6807±0.0086 | 0.7345±0.0080
WIDE | STHGCN | 0.4878±0.0063 | 0.7085±0.0056 | 0.7079±0.0055 | 0.4710±0.0096 | 0.6971±0.0084 | 0.7191±0.0084
WIDE | STGCN | 0.4967±0.0021 | 0.7151±0.0013 | 0.7073±0.0014 | 0.4695±0.0048 | 0.6961±0.0032 | 0.7258±0.0034
WIDE | FlowNN | 0.4123±0.0054 | 0.6514±0.0058 | 0.7599±0.0050 | 0.4007±0.0078 | 0.6430±0.0085 | 0.7674±0.0073
NSF | Naive | 0.2244 | 0.5006 | 0.8742 | – | – | –
NSF | ARIMA | 0.2240 | 0.4852 | 0.8745 | – | – | –
NSF | GRU | 0.1168±0.0030 | 0.3576±0.0046 | 0.9341±0.0018 | 0.1140±0.0047 | 0.3814±0.0052 | 0.9253±0.0021
NSF | m-GRU | 0.0794±0.0022 | 0.2948±0.0041 | 0.9556±0.0013 | 0.0888±0.0040 | 0.3366±0.0050 | 0.9423±0.0018
NSF | STHGCN | 0.0938±0.0024 | 0.3204±0.0042 | 0.9474±0.0014 | 0.1302±0.0065 | 0.4075±0.0071 | 0.9157±0.0028
NSF | STGCN | 0.1214±0.0016 | 0.3643±0.0027 | 0.9333±0.0013 | 0.1222±0.0034 | 0.3944±0.0055 | 0.9213±0.0035
NSF | FlowNN | 0.0791±0.0021 | 0.2942±0.0039 | 0.9558±0.0012 | 0.0851±0.0035 | 0.3296±0.0043 | 0.9446±0.0015
GM50 | Naive | 1.3387 | 1.1929 | 0.2806 | – | – | –
GM50 | ARIMA | 0.5383 | 0.7687 | 0.6410 | – | – | –
GM50 | GRU | 0.6381±0.0151 | 0.8179±0.0113 | 0.5756±0.0164 | 0.8921±0.0116 | 0.9494±0.0064 | 0.3148±0.0274
GM50 | m-GRU | 0.5628±0.0143 | 0.7682±0.0116 | 0.6405±0.0134 | 0.8168±0.0110 | 0.9084±0.0064 | 0.4184±0.0147
GM50 | STHGCN | 0.5853±0.0149 | 0.7834±0.0117 | 0.6218±0.0143 | 0.8427±0.0123 | 0.9227±0.0076 | 0.3866±0.0175
GM50 | STGCN | 0.5877±0.0021 | 0.7850±0.0017 | 0.6206±0.0023 | 0.8329±0.0051 | 0.9175±0.0020 | 0.3997±0.0040
GM50 | FlowNN | 0.5429±0.0138 | 0.7544±0.0114 | 0.6567±0.0125 | 0.7958±0.0106 | 0.8967±0.0061 | 0.4431±0.0131

5.0.1 Linear evaluation on rate prediction task.

We first evaluate the learned latent representations $\{\hat{h}_n^t\}_{n=1,\cdots,L}$ on the rate prediction task by finetuning an MLP readout layer on top of the pretrained FlowNN model. During the finetuning process, the pretrained FlowNN model is further optimized according to the ground-truth labels. By contrast, the compared baselines are trained directly together with their MLP readout layers.

Table 2 reports the test performance of next-step ($1\,ms$) predictions. Here, the Naive approach simply predicts the next-step value to be the same as the current value. The results of Naive and ARIMA provide evidence that traditional statistical time-series prediction models fail to provide reasonable performance on short-timescale prediction tasks. FlowNN outperforms all baselines, achieving an MSE decrease of up to 71% (GRU), 17% (m-GRU), 36.8% (STHGCN), and 44.4% (STGCN).

m-GRU is similar in function to the PathAggregator in FlowNN, which also integrates the effect of congestion control in communication networks. This explains its superiority over the other baselines in this task (although it does not perform well in the following cross-task learning test (Table 3) and model-generality experiments (Table 4)). More details and intuition are provided in Appendix B to interpret why the recurrent and graph convolutional operators in these baselines fail to capture the data correlations manifested in the time-series tensors.

5.0.2 Transfer to end-to-end packet delay prediction task.

Table 3: Test performance of end-to-end packet transmission delay on unseen flows
Dataset | Model* | MSE | RSE | Corr
NUMFabric | 1 | 0.1051±0.0046 | 0.3204±0.0046 | 0.9473±0.0012
NUMFabric | 2 | 0.0386±0.0037 | 0.1942±0.0057 | 0.9811±0.0012
NUMFabric | 3 | 0.0512±0.0036 | 0.2236±0.0053 | 0.9747±0.0013
NUMFabric | 4 | 0.0346±0.0022 | 0.1841±0.0107 | 0.9831±0.0079
NUMFabric | 5 | 0.0336±0.0032 | 0.1811±0.0057 | 0.9836±0.0011
WIDE | 1 | 0.4002±0.0594 | 1.0371±0.0147 | 0.0355±0.0086
WIDE | 2 | 0.3977±0.0768 | 1.0340±0.0675 | 0.3968±0.0292
WIDE | 3 | 0.3877±0.0558 | 1.0209±0.0270 | 0.2013±0.0230
WIDE | 4 | 0.4335±0.0609 | 1.0817±0.0406 | 0.2148±0.0058
WIDE | 5 | 0.3168±0.0405 | 0.9228±0.0735 | 0.4571±0.0369
*: 1: GRU, 2: m-GRU, 3: STHGCN, 4: STGCN, 5: FlowNN

It is important to guarantee that the provided end-to-end transmission delays meet user agreements, such as a delay below 50 ms for online gaming. In real-world networks, many factors may influence the delay, including dynamic traffic loads, packet congestion, queueing, etc. It is difficult to build an accurate delay model, even for human experts Xiao et al. (2018); Rusek et al. (2019). As explained under Congestion Control in Section 2, the dynamics of past traffic and networking conditions influence the next sending actions and the overall transmission delay. This makes it possible to predict the delay from the generating behaviors of the flow rates.

In this task, we take the same FlowNN model pretrained on the next-step prediction task and finetune a new MLP readout to predict the next-time-step transmission delay. Table 3 shows the test performance of the different models. Although pretrained on packet transmission rates, FlowNN still achieves the best results on the task of predicting a physical feature that was never included in the pretraining process. This shows the transferability of the self-supervised pretrained FlowNN across different learning tasks.

5.0.3 Generality test on Out-Of-Distribution dataset.

Table 4: Out-of-distribution tests
Model* | MSE | RSE | Corr
1 | 0.5864±0.0123 | 0.7779±0.0085 | 0.6284±0.0108
2 | 0.5232±0.0111 | 0.7347±0.0076 | 0.6785±0.0085
3 | 0.5003±0.0107 | 0.7185±0.0083 | 0.6955±0.0088
4 | 0.4858±0.0050 | 0.7081±0.0033 | 0.7100±0.0035
5 | 0.3959±0.0078 | 0.6391±0.0084 | 0.7692±0.0072
*: 1: GRU, 2: m-GRU, 3: STHGCN, 4: STGCN, 5: FlowNN

We further test the generality of a FlowNN model trained on one dataset, say NUMFabric, when it is used in an environment different from the one it was trained in. Specifically, we froze the FlowNN model trained for the next-step prediction task on NUMFabric and finetuned a new MLP readout to test its prediction performance on WIDE. For a fair comparison, we also finetuned the readout layers of the frozen baselines trained on NUMFabric to test their performance on WIDE. From the results in Table 4, we observe that FlowNN remains the best performer. We conjecture that the source of this improvement is the physical logic properly encoded in the FlowNN design, which holds for all flows and environments that work according to the same principles. By contrast, m-GRU degrades remarkably and becomes worse than the two graph models. This shows the good generality and robustness of FlowNN, as well as the advantage of the physics modelling in our approach.

6 Related Work

Traditional analytical models. The past decades have seen numerous packet analysis models proposed to mathematically model network dynamics Gebali (2015). For example, extensive studies use the Poisson model to characterize traffic by assuming that the arrival pattern between two successive packets follows a Poisson process. Considering the heavy-tailed distribution and burstiness of data-center traffic, recent work in Benson et al. (2010) models the packet arrival pattern as a log-normal process. To capture temporal patterns and make predictions accordingly, ARIMA is exploited in Ergenc (2019) to model time-series packet traffic. These analytical models rely on strong simplifying assumptions, work at coarse timescales (e.g., hours), and assume relatively smooth traffic patterns. However, many tasks, such as packet forwarding/queueing configuration and real-time networking control, require flow statistics at a subsecond timescale, which is beyond the capability of these traditional models.

Neural network based learning models. With data-driven learning, packet analysis can be done by extracting the spatio-temporal patterns within the time-series data. A wide range of existing neural networks fit this task, including Convolutional Neural Nets (CNNs, Mozo et al. (2018)), Graph Neural Nets (GNNs, Rusek et al. (2019)), Recurrent Neural Nets (RNNs), as well as their variants and combinations (e.g., STHGCN Marcus et al. (2020), STGCN Yu et al. (2018), and Xiao et al. (2018); Bernardez et al. (2021)). Designs tailored to data-specific properties enable the success of these models in their dedicated domains, such as the convolutional operation that captures spatially local correlations in CNNs and the aggregation operation that captures correlations between adjacent links in GNNs. As systems with clear structure and working mechanisms, communication networks exhibit their own specific data properties, which are difficult for the incumbent models to capture without modification. Moreover, existing work only targets coarse-grained timescales above minutes or even hours. Models at a sub-second granularity, such as FlowNN, require a non-trivial combination of spatio-temporal data trends, system structure and working mechanisms.

Similar to existing learning models, FlowNN uses a finetuning stage to adapt a pretrained model to various downstream tasks or environments. Nevertheless, the self-supervised learning strategy used by FlowNN is able to embed the physical bias that is universally transferable, provided that the new environment follows the same working mechanisms. The patterns manifested by the data itself may vary in different environments, but the underlying physical working mechanisms typically remain invariant as long as the new environment follows the same principles, e.g., TCP/IP.

7 Conclusion

In this paper, we developed a customized neural network, FlowNN, with an associated self-supervised training strategy to improve short-timescale prediction of information flows in communication networks. This study pioneers the design of a customized neural network and learning strategy that combine network structure and working mechanisms. We reported state-of-the-art performance for multiple practical networking applications, which demonstrates the strength of our approach. As a new ‘playground’ for both the networking and deep learning communities, research on network intelligence is still in its infancy. We hope this work will inspire more innovation in this field.

Appendix A Sending Packets of Data with Congestion Control

Figure 5: Encapsulation format of TCP/IP packet.
Figure 6: Data transmissions between two end hosts.

In a communication network with the Transmission Control Protocol/Internet Protocol (TCP/IP) suite, users’ messages are encapsulated into a set of small IP packets. Figs. 5 and 6 illustrate the IP packet format and the sending process of IP packets.

A.0.1 Congestion control

When a TCP connection (or flow) is established between two end hosts, the sender streams data packets to its receiver and constantly receives acknowledgements (ACKs) from the receiver as feedback on the packet transmission quality. When the data sending rate overwhelms the link bandwidth, congestion occurs and packets are queued in buffers or dropped at the congested nodes. Congestion increases the delay until packets are received and thus degrades the overall transmission rates.

With a congestion control model, the sender maintains a dynamic congestion window that limits the maximum number of bytes that can be sent out in response to each ACK feedback. For example, the CUBIC model Ha et al. (2016) uses the following growth function to determine the real-time congestion window $W(t)$:

$$W(t) = C(t-K)^3 + W_{max}, \quad K = \sqrt[3]{\frac{W_{max}\,\beta}{C}} \qquad (10)$$

where $W_{max}$ is the window size after the last congestion event (i.e., when packet loss/dropping is detected from the ACK feedback); $C$ and $\beta$ are fixed constants; and $t$ is the elapsed time since the last window reduction.

Eq. 10 is adjusted every Round Trip Time (RTT) between the sender and receiver, and each RTT is normally a subsecond value. This quick adjustment avoids persistent congestion, but also causes the sending rate of a flow to vary frequently. The cumulative loading volume (or sending rates) of packets in the history determines the probability of packet loss. This connection makes it possible to approximate the sending rules of Eq. 10 from the historical traces of the sending rates of all flows, and explains the necessity of the PathAggregator in FlowNN to approximate the decision process of the congestion control model.
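As an illustration of Eq. 10, the sketch below evaluates the CUBIC window growth; the constants $C$ and $\beta$ here are illustrative choices, not values prescribed by the paper:

```python
def cubic_window(t, w_max, C=0.4, beta=0.7):
    """Congestion window W(t) of Eq. 10, with t the elapsed time (in
    seconds) since the last window reduction."""
    K = (w_max * beta / C) ** (1.0 / 3.0)
    return C * (t - K) ** 3 + w_max

w_max = 100.0   # window size at the last congestion event
for t in [0.0, 2.0, 4.0, 6.0]:
    # the window grows fast, plateaus near w_max, then probes beyond it
    print(f"t={t:.0f}s  W(t)={cubic_window(t, w_max):.1f}")
```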

Once the allowed sending bytes/rates are determined by the congestion control model, the sender will encapsulate the allowed number of data bytes into packets and send them out sequentially. As shown in Fig. 6, all packets will be forwarded hop-by-hop along a routing path until reaching the final destination.

Appendix B Relational Inductive Bias of Different Models

Many approaches in machine learning use a relational inductive bias Peter et al. (2018) to impose constraints on the relationships and interactions among entities in a learning process. Relational bias can be understood in terms of the bias-variance tradeoff Geman et al. (1992): a proper bias helps solutions generalize in a desirable way, whereas an improper relational bias leads to suboptimal performance. The elementary building blocks and their compositions within FlowNN and the compared baselines induce different relational inductive biases.

B.0.1 Recurrent and convolutional layers in baselines

The compared baselines use the popular recurrent and/or convolutional layers. For example, with convolutional layers, STHGCN and STGCN impose spatial locality as the inductive bias. Locality indicates correlations between entities in close proximity with one another in the input signal’s coordinate space, isolated from distant entities Peter et al. (2018). Similarly, the recurrent units in GRUs incorporate a bias for temporal locality in the sequence. Therefore, with these building blocks, all these baselines make predictions from all entities in both spatially and temporally close proximity in the time-series tensor. This forms the message propagation routine depicted in Fig. 7(a): messages are first aggregated along the spatial direction (i.e., vertically) and then further aggregated along the time domain (i.e., horizontally).

B.0.2 PathAggregator and Induction layers in FlowNN

With the PathAggregator, FlowNN relates (predicts) the state of the source node, $\hat{h}_1^t$, to all the historical information $\{\hat{h}_n^\tau\}_{n=1,\cdots,L}$, $\tau<t$. However, to predict the states of the spatial nodes $n=2,\cdots,L$, the Induction layer only relates the spatial information of $\hat{h}_{n-1}^\tau$ within its correlated time window. This filters out the irrelevant entities at space-time coordinates that would otherwise be included as undesired noise. The blue curve in Fig. 7(b) shows the routine that propagates the transmission information along the space-time coordinates.

Fig. 7 compares the relational biases represented by the different models. FlowNN attends to the physical inductive bias that matches the natural generating behavior of the time-series tensor; none of the baselines incorporate this inductive bias.

Figure 7: (a) Relational bias in baselines; (b) relational bias in FlowNN

Appendix C Stop-Gradient for FlowNN

Figure 8: Self-supervised FlowNN training architecture.

Let’s consider a loss function of the following form:

$$\mathcal{L}(\eta,\theta) = \mathbb{E}_X\Big[\big\|\mathcal{F}^{(\eta,\theta)}(X) - \mathbb{E}_X\big[f_{L_1}^{\eta}(X)\big]\big\|_2^2\Big] \qquad (11)$$

where $\mathcal{F}^{(\eta,\theta)}$ is a FlowNN with parameters $(\eta,\theta)$ for layer $f_{L_1}^{\eta}$ and layer $f_p^{\theta}$ (i.e., $L_2$ and $L_3$) in Fig. 8, and $X$ is a sample of the input time-series tensor.

Eq. 11 is equivalent to the negative cosine similarity loss in the main text if all vectors are $\ell_2$-normalized. With this formulation, the training process updates FlowNN at each step so that the physical logic embedded by FlowNN remains stable for all possible inputs $X$. This is achieved by solving the following problem w.r.t. $\eta$ and $\theta$:

$$\min_{\eta,\theta} \mathcal{L}(\eta,\theta) \qquad (12)$$

The variable $\theta$ is learned in FlowNN to make predictions based on the physical logic. The variable $\eta$ is optimized so that $f_{L_1}$ represents the data trending information manifested within the input $X$. Formally, the problem in Eq. 12 can be solved by alternately solving the following two subproblems:

$$\eta^t \leftarrow \arg\min_{\eta} \mathcal{L}(\eta, \theta^{t-1}) \qquad (13)$$
$$\theta^t \leftarrow \arg\min_{\theta} \mathcal{L}(\eta^t, \theta) \qquad (14)$$

where $t$ is the index of alternation and $\leftarrow$ denotes assignment.

Subproblem (13) attempts to optimize the representation towards the input under a constant $f_p^{\theta^{t-1}}$. The stop-gradient operation in Fig. 8 is a natural implementation, since the gradient does not back-propagate to $\theta^{t-1}$ (a constant) in this subproblem. By the definition in Eq. 11, $\eta^t$ is expected to minimize the average representation over all possible inputs $X$. In practice, it is unrealistic to actually compute the expectation $\mathbb{E}_X$. Similar to Chen and He (2021), we instead apply a neural network $f_m$ (an MLP projection in the experiments) in Fig. 8 to predict the expectation. With the second term of the symmetrized loss defined in the main text, subproblem (14) is solved to optimize the physical bias learned by FlowNN. The two subproblems can therefore be solved by a one-step gradient update through this training structure.

Although the above analysis applies the stop-gradient operation similarly to Chen and He (2021); Grill et al. (2020), we emphasize that the self-supervised training strategy for FlowNN is not a standard application of Chen and He (2021). Instead, we combine the advances of Contrastive Predictive Coding (CPC) Oord et al. (2019) with a stop-gradient operation. More importantly, the goal of the designed loss is to improve the prediction quality by comparing the predicted hidden vectors with the ground-truth labels in latent space. The projector $f_m$ receives as input the encoded information at $t$, which is not seen by layer $f_p^{\theta}$ when it generates the predicted hidden vectors at $t$. Moreover, the projector $f_m$ is an MLP, while $f_p^{\theta}$ contains GRU-type computations. Consequently, $f_m$ and $f_p^{\theta}$ produce different outputs even when $f_{L_1}^{\eta}$ becomes a constant mapping, which implies that a constant mapping of $f_{L_1}^{\eta}$ does not minimize the similarity loss. Therefore, there is no collapse problem in the self-supervised learning architecture of Fig. 8. This is significantly different from the collapse-avoidance solution and the goal of the Siamese design in Chen and He (2021), which aims at extracting similar features from two augmented views.

Appendix D Extra Experimental Results

Figure 9: Prediction test performance on unseen flows for different numbers of steps ahead: (a) test MSE on NUMFabric; (b) test MSE on WIDE; (c) test RSE on NUMFabric; (d) test RSE on WIDE; (e) test Corr on NUMFabric; (f) test Corr on WIDE.

D.0.1 Multi-step-ahead rate predictions

Fig. 9 shows the test performance of the different models on multi-step-ahead prediction tasks. Following the practice of second-granularity traffic forecasting in Mozo et al. (2018), we aim to predict the average value over the next multiple steps, i.e., $\mathbb{E}[x^{t+1},\cdots,x^{t+\Delta}]$, where $\Delta$ is the prediction horizon. This information is crucial to many short-term network planning and control tasks in communication networks. In this task, FlowNN is consistently best across the different prediction horizons. Note that for any $\Delta\geq 2$, the averaging operation changes the degree of traffic burstiness, so prediction is not uniformly harder or easier as $\Delta$ grows; it is therefore reasonable that the 2-step-ahead prediction is better than the one-step-ahead case in Fig. 9.
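Concretely, the averaged multi-step-ahead target can be built from a rate series as in this small sketch (our own illustration):

```python
import numpy as np

def multi_step_targets(x, delta):
    """Averaged multi-step targets E[x^{t+1},...,x^{t+delta}]
    for each time step t of a 1-D rate series x."""
    return np.array([x[t + 1 : t + 1 + delta].mean()
                     for t in range(len(x) - delta)])

x = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 5.0])
print(multi_step_targets(x, delta=2))   # [2.5 3.  5.  5.5]
```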

D.0.2 Efficacy of different recurrent iterations $N$

Table 5 presents the test results on unseen flows of the NUMFabric dataset. $N=2$ achieves the best results, and the other values of $N$ do not degrade the performance noticeably. This shows that an extra iteration can further improve FlowNN. Although FlowNN is robust across different values of $N$, an improperly large $N$ leads to much higher computational complexity; consequently, a smaller $N$ is preferred.

Table 5: Efficacy of different iteration counts $N$ within FlowNN
N | MSE | RSE | Corr
1 | 0.0274±0.0010 | 0.1711±0.0020 | 0.9853±0.0004
2 | 0.0261±0.0009 | 0.1672±0.0021 | 0.9860±0.0004
3 | 0.0273±0.0009 | 0.1710±0.0021 | 0.9854±0.0004
4 | 0.0286±0.0010 | 0.1747±0.0022 | 0.9849±0.0004
5 | 0.0280±0.0009 | 0.1731±0.0022 | 0.9849±0.0004

D.0.3 Computational complexity

The computational complexity of FlowNN depends on the length of the routing paths. In the above experiments, we used several network topologies (fat tree, NSF, and GM50) with routing path lengths ranging from 2 to 7. Evaluated under all these topologies, the maximum computation time reported for FlowNN is about 0.3 ms on a single Nvidia P40 24GB GPU. This speed is sufficient to support per-millisecond prediction tasks. Moreover, many networking tasks only require predicting the average of the next few steps in practice, which further relaxes the requirement on computation speed when deploying FlowNN. The computation time depends on both the model complexity and the computing devices; we leave the evaluation on larger topologies to future work.

Appendix E Reproducibility

E.0.1 Details of datasets

NUMFabric is a synthetic dataset collected with the NS-3 based simulator (https://bitbucket.org/knagaraj/numfabric/src/master/) according to the configurations in Nagaraj et al. (2016). The original simulator is configured with a 3-layer fat-tree network topology. We used the same simulator, replacing the network topology with the NSFNet topology and the Germany 50-node topology from http://www.knowledgedefinednetworking.org/, to generate the synthetic datasets NSF and GM50, respectively. Instead of synthetic traffic patterns, the WIDE dataset is collected from the realistic packet traces at https://mawi.wide.ad.jp/mawi/.

For all datasets, we collected the real-time sending rates of 50 random flows, as well as the aggregated sending rates along the routing path, at a time interval of $1\,ms$ with a total running length of $30000\,ms$. The sending rates are calculated by counting the packets received/transmitted every millisecond for the corresponding flows at each node. For the first 40 flows, we split the time length of $30000\,ms$ into training, validation and test sets by ratio 6:2:2, so only historical data were exposed during training; the evaluation and test processes were both conducted on future data. Finally, we also tested the learnt model on the remaining 10 flows (unseen flows) that are never seen during training and evaluation.
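A minimal sketch of the chronological 6:2:2 split described above (our own illustration):

```python
import numpy as np

def temporal_split(series, ratios=(0.6, 0.2, 0.2)):
    """Split one flow's time series chronologically into train/val/test,
    so that only historical data is exposed during training."""
    n = len(series)
    i = int(n * ratios[0])
    j = i + int(n * ratios[1])
    return series[:i], series[i:j], series[j:]

x = np.arange(30000)                      # one flow sampled every 1 ms
train, val, test = temporal_split(x)
print(len(train), len(val), len(test))    # 18000 6000 6000
```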

E.0.2 Details of experiments

For all baselines, we conducted a grid search over the hyper-parameters. The hidden size was chosen among {16, 32, 64, 128, 256} for all datasets. For all applied GRUs, the number of GRU layers was chosen among {1, 2, 3, 4}. For FlowNN, the number $N$ of iterations was chosen among {1, 2, 3, 4, 5}. We used the library from https://www.dgl.ai/ for the implementation of the GCN in STHGCN. For STGCN, we followed the implementation released at https://github.com/VeritasYin/STGCN_IJCAI-18.

The pseudocode implementation of FlowNN is described in Algorithm 1.

Algorithm 1 FlowNN pseudocode

Input: $\{x_n^\tau\}_{\tau=t_0,\cdots,t-1;\,n=1,\cdots,L}$   # note: $x_{r,n}^\tau \in x_n^\tau$
Output: $\{\hat{h}_n^\tau\}_{\tau=t;\,n=1,\cdots,L}$   # predicted representation

$h_n^\tau = f_{L_1}(x_n^\tau)$, $\forall \tau = t_0,\cdots,t-1$; $n = 1,\cdots,L$
for all $N$ recurrent iterations do
    # PathAggregator
    $\hat{h}_1^\tau = f_{L_2}(\{h_n^\tau\}_{n=1,\cdots,L})$, $\forall \tau = t_0,\cdots,t-1$
    for all neighboring pairs $\langle n,n+1\rangle_{n=1,\cdots,L-1}$ do
        $\langle T_1^i,T_2^i\rangle_{i=1,2,\cdots}$ = split($\{x_{r,n}^\tau - x_{r,n+1}^\tau\}_{\tau=t_0,\cdots,t-1}$)
        $\{\hat{h}_{n+1}^\tau\}_{\tau=t_1,\cdots,t}$ = Induction($\langle T_1^i,T_2^i\rangle_{i=1,2,\cdots}$, $\{\hat{h}_1^\tau, h_{n+1}^\tau\}_{\tau=t_0,\cdots,t-1}$)
    end for
end for

Def split():
  Input: $\{x_{r,n}^\tau - x_{r,n+1}^\tau\}_{\tau=t_0,\cdots,t-1}$
  set correlation window index $i=0$, $T_1^i = T_2^i = [\,]$
  for all $\tau = t_0,\cdots,t-1$ do
      if $x_{r,n}^{\tau-1} - x_{r,n+1}^{\tau-1} < 0$ and $x_{r,n}^\tau - x_{r,n+1}^\tau \geq 0$ then
          # the rate difference crosses from negative back to non-negative:
          # close the current window pair and start a new one
          $i = i+1$, $T_1^i = T_2^i = [\,]$
      end if
      if $x_{r,n}^\tau - x_{r,n+1}^\tau \geq 0$ then
          append $\tau$ to $T_1^i$
      else
          append $\tau$ to $T_2^i$
      end if
  end for
  return $\langle T_1^i, T_2^i\rangle_{i=1,2,\cdots}$

Def Induction():
  Input: $\langle T_1^i,T_2^i\rangle_{i=1,2,\cdots}$, $\{\hat{h}_1^\tau, h_{n+1}^\tau\}_{\tau=t_0,\cdots,t-1}$
  $\hat{h}_n^t = f_p(\hat{h}_1^t)$, $\forall t \in \{T_1^i, T_2^i\}$
  for each $i = 1,2,\cdots$ do
      $\hat{h}_{n+1}^\tau = \mathrm{MLP}(\{\hat{h}_n^\tau \,\|\, h_{n+1}^\tau\})$, $\forall \tau \in T_1^i$
      $\hat{h}_{n+1}^\tau = \mathrm{Seq2Seq}(h_{n+1}^{\tau-1} \,|\, \{\hat{h}_n^t\}_{t\in T_1^i})$, $\forall \tau \in T_2^i$
  end for
  return $\{\hat{h}_{n+1}^\tau\}_{\tau=t_1,\cdots,t}$

Appendix F Real-world Deployment

Path-wise data measurements: As explained in the definition of Packet Flow in the Preliminaries section of the main text, the features analyzed by our model are $x_{f,n}^t\in\mathbb{R}$, which can be measured by telemetry monitors installed at each node. In the experiments, we collected the average packet transmission/receiving rates of each monitored flow at the nodes on its routing path to construct $x_{f,n}^t$, and predicted the corresponding transmission/receiving rates that a node will experience, as well as the end-to-end transmission delay. Such predictions can be performed for each individual flow or for aggregated flows that share the same routing path (including the source and destination nodes). Many telemetry frameworks support either per-flow or aggregated flow measurements, such as INT Tan and et al. (2021) and Sketch Yan and et al. (2020).

Real-world deployment: In a real-world networking environment, the FlowNN model can be deployed at the end host/network interface controller (NIC) or at a central controller. The real-time measurements can be collected through the northbound interface or ACK signals. Considering the deployment cost, the FlowNN model can be deployed to serve only selected flows that we particularly want to optimize, such as networking flows from important clients; other flows that only require best-effort service can be treated as background flows. This matches the commercial practice of network operators.

References

  • Apice et al. [2008] C. D’Apice, R. Manzo, and B. Piccoli. A fluid dynamic model for telecommunication networks with sources and destinations. SIAM J. Appl. Math, 68:981–1003, 2008.
  • Benson et al. [2010] Theophilus Benson, Aditya Akella, and David A. Maltz. Network traffic characteristics of data centers in the wild. IMC, pages 1–14, 2010.
  • Bernardez et al. [2021] G. Bernardez, J. Suarez-Varela, A. Lopez, B. Wu, S. Xiao, X. Cheng, P. Barlet-Ros, and A. Cabellos-Aparicio. Is machine learning ready for traffic engineering optimization? International Conference on Network Protocols (ICNP), 2021.
  • Chen and He [2021] X. Chen and K. He. Exploring simple siamese representation learning. CVPR, 2021.
  • Chung et al. [2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS Deep Learning Workshop, 2014.
  • Ergenc [2019] Doganalp Ergenc. On network traffic forecasting using autoregressive models. arXiv preprint arXiv: 1912.12220v1, 2019.
  • Gebali [2015] F. Gebali. Modeling network traffic, pages 445–492. Springer International Publishing, Cham, 2015.
  • Geman et al. [1992] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.
  • Grill et al. [2020] J.-B. Grill, F. Strub, F. Altche, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020.
  • Ha et al. [2016] S. Ha, I. Rhee, and L. Xu. Cubic: a new tcp-friendly high-speed tcp variant. SIGOPS-OSR, 2016.
  • Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. NeurIPS, 2016.
  • Lei et al. [2019] Kai Lei, Meng Qin, Bo Bai, Gong Zhang, and Min Yang. Gcn-gan: A non-linear temporal link prediction model for weighted dynamic networks. IEEE INFOCOMM, 2019.
  • Marcus et al. [2020] K Marcus, Z. Min, Z Chengzhi, Y. Hanling, and P. Lujia. Spatio-temporal hybrid graph convolutional network for traffic forecasting in telecommunication networks. arXiv:2009.09849v1, 2020.
  • Mozo et al. [2018] A. Mozo, B. Ordozgoiti, and S. Gómez-Canaval. Forecasting short-term data center network traffic load with convolutional neural networks. PLoS ONE, 13(2):1–31, 2018.
  • Nagaraj et al. [2016] Kanthi Nagaraj, Dinesh Bharadia, Hongzi Mao, Sandeep Chinchali, Mohammad Alizadeh, and Sachin Katti. Numfabric: Fast and flexible bandwidth allocation in datacenters. ACM SIGCOMM, pages 188–201, 2016.
  • Nathan et al. [2019] J. Nathan, H. R. Noga, G. P. Brighten, S. Michael, and T. Aviv. A deep reinforcement learning perspective on internet congestion control. Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Oord et al. [2019] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748v2, 2019.
  • Peter et al. [2018] W. B. Peter, B. H. Jessica, B. Victor, S. Alvaro, Z. Vinicius, and M. Mateusz. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • Rusek et al. [2019] Krzysztof Rusek, José Suarez-Varela, Paul Almasan, Pere Barlet-Ros, and Albert Cabellos-Aparicio. Routenet: leveraging graph neural networks for network modeling and optimization in sdn. arXiv preprint arXiv:1910.01508, pages 1–12, 2019.
  • Sutskever et al. [2014] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. NeurIPS, pages 3104–3112, 2014.
  • Tan and et al. [2021] Li Tan and et al. In-band network telemetry: a survey. Computer Networks, 186(26), 2021.
  • UPC [2020] UPC. http://www.knowledgedefinednetworking.org/, 2020.
  • Wang et al. [2006] Xiaozhe Wang, Kate Smith-Miles, and Rob J Hyndman. Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13(3):335–364, 2006.
  • WIDE [2021] WIDE. https://mawi.wide.ad.jp/mawi/, 2021.
  • Xiao et al. [2018] Shihan Xiao, Dongdong He, and Zhibo Gong. Deep-q: Traffic-driven qos inference using deep generative network. Proceedings of the 2018 Workshop on Network Meets AI & ML, pages 67–73, 2018.
  • Yan and et al. [2020] Y Yan and et al. Priority-aware per-flow measurement using cuckoo sketch. IEEE IFIP Networking Conference (Networking), pages 622–624, 2020.
  • Yu et al. [2018] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. IJCAI, 2018.