
A Matrix Factorization Based Network Embedding Method for DNS Analysis

Meng Qin, Independent Researcher
Abstract

In this paper, I explore the potential of network embedding (a.k.a. graph representation learning) to characterize DNS entities in passive network traffic logs. I propose an MF-DNS-E (Matrix-Factorization-based DNS Embedding) method to represent DNS entities (e.g., domain names and IP addresses), where a random-walk-based matrix factorization objective is applied to learn the corresponding low-dimensional embeddings.

Index Terms:
Matrix Factorization, Network Embedding, DNS Analysis

I Methodology

I introduce an MF-DNS-E (Matrix-Factorization-based DNS Embedding) method to automatically learn low-dimensional vector representations (i.e., embeddings) for domain names and IP addresses, which can be used to support DNS security analysis tasks such as malicious domain detection and IP reputation evaluation.

Let $N$ and $M$ be the total numbers of (i) domain names and (ii) IP addresses in a passive network traffic log, respectively. One can extract a bipartite graph $\mathbf{B} \in \mathbb{Z}_{+}^{N \times M}$, with $\mathbf{B}_{ij}$ as the number of queries in which a domain name $p_i$ is resolved to an IP address $q_j$. The corresponding adjacency matrix of this bipartite graph is

๐:=[๐ŸŽNร—M๐^๐^T๐ŸŽMร—N],{\bf{P}}:=\left[{\begin{array}[]{*{20}{c}}{\bf{0}}_{N\times M}&{{\bf{\hat{B}}}}\\ {{\bf{\hat{B}}}^{T}}&{\bf{0}}_{M\times N}\end{array}}\right], (1)

where each entry of $\mathbf{B}$ is normalized into $[0,1]$ via $\hat{\mathbf{B}}_{ij} = \mathbf{B}_{ij} / \mathbf{B}_{\max}$, with $\mathbf{B}_{\max}$ as the maximum entry of $\mathbf{B}$.
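As a minimal numpy sketch (using a hypothetical toy query-count matrix, not data from the paper), the normalization and the assembly of $\mathbf{P}$ in (1) can be written as:

```python
import numpy as np

# Hypothetical query-count matrix B (N=3 domains x M=2 IPs).
B = np.array([[4.0, 0.0],
              [2.0, 2.0],
              [0.0, 1.0]])
N, M = B.shape

# Normalize each entry into [0, 1] by the maximum entry.
B_hat = B / B.max()

# Assemble the (N+M) x (N+M) bipartite adjacency matrix P of Eq. (1).
P = np.block([[np.zeros((N, N)), B_hat],
              [B_hat.T, np.zeros((M, M))]])
```

Note that $\mathbf{P}$ is symmetric by construction, with its two diagonal blocks zero because there are no domain-domain or IP-IP edges in the bipartite graph.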

For all the domain names $\{p_i\}$, I further introduce (i) host-based and (ii) IP-based similarities between each pair of domain names, denoted as $\mathbf{S}_{\rm DH} \in \mathbb{R}^{N \times N}$ and $\mathbf{S}_{\rm DI} \in \mathbb{R}^{N \times N}$, respectively. Given a domain name $p_i$, let $H_{p_i}$ be the set of hosts that have the DNS query result of $p_i$, and let $I_{p_i}$ be the set of IPs resolved from $p_i$. $\mathbf{S}_{\rm DH}$ and $\mathbf{S}_{\rm DI}$ are defined as:

$(\mathbf{S}_{\rm DH})_{ij} := \frac{|H_{p_i} \cap H_{p_j}|}{|H_{p_i} \cup H_{p_j}|}, \quad (\mathbf{S}_{\rm DI})_{ij} := \frac{|I_{p_i} \cap I_{p_j}|}{|I_{p_i} \cup I_{p_j}|}.$ (2)

For all the IP addresses $\{q_j\}$, I also define the (iii) host-based and (iv) domain-based similarities, denoted as $\mathbf{S}_{\rm IH} \in \mathbb{R}^{M \times M}$ and $\mathbf{S}_{\rm ID} \in \mathbb{R}^{M \times M}$, respectively. Given an IP address $q_j$, let $H_{q_j}$ and $D_{q_j}$ be the (i) set of hosts requesting the DNS result of $q_j$ and (ii) set of domains resolved to $q_j$. $\mathbf{S}_{\rm IH}$ and $\mathbf{S}_{\rm ID}$ are defined as:

$(\mathbf{S}_{\rm IH})_{ij} := \frac{|H_{q_i} \cap H_{q_j}|}{|H_{q_i} \cup H_{q_j}|}, \quad (\mathbf{S}_{\rm ID})_{ij} := \frac{|D_{q_i} \cap D_{q_j}|}{|D_{q_i} \cup D_{q_j}|}.$ (3)
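All four similarity matrices in (2) and (3) are Jaccard similarities between sets. A small sketch (the resolved-IP sets below are hypothetical examples):

```python
import numpy as np

def jaccard_matrix(sets):
    """Pairwise Jaccard similarity |A intersect B| / |A union B|."""
    n = len(sets)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = sets[i] | sets[j]
            S[i, j] = len(sets[i] & sets[j]) / len(union) if union else 0.0
    return S

# Hypothetical resolved-IP sets I_{p_i} for three domain names.
I_p = [{"1.2.3.4", "5.6.7.8"}, {"1.2.3.4"}, {"9.9.9.9"}]
S_DI = jaccard_matrix(I_p)  # Eq. (2), IP-based domain similarity
```

The same helper applies unchanged to the host sets $H_{p_i}$, $H_{q_j}$, and domain sets $D_{q_j}$.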

Furthermore, I combine $\{\mathbf{S}_{\rm DH}, \mathbf{S}_{\rm DI}, \mathbf{S}_{\rm IH}, \mathbf{S}_{\rm ID}\}$ and $\mathbf{P}$ to construct a similarity-enhanced graph described by the following adjacency matrix:

๐’:=[(๐’DH+๐’DI)โ€‹/โ€‹2๐^๐^T(๐’IH+๐’ID)โ€‹/โ€‹2],{\bf{S}}:=\left[{\begin{array}[]{*{20}{c}}{{{({{\bf{S}}_{{\rm{DH}}}}+{{\bf{S}}_{{\rm{DI}}}})}\mathord{\left/{\vphantom{{({{\bf{S}}_{{\rm{DH}}}}+{{\bf{S}}_{{\rm{DI}}}})}2}}\right.\kern-1.2pt}2}}&{{\bf{\hat{B}}}}\\ {{{{\bf{\hat{B}}}}^{T}}}&{{{({{\bf{S}}_{{\rm{IH}}}}+{{\bf{S}}_{{\rm{ID}}}})}\mathord{\left/{\vphantom{{({{\bf{S}}_{{\rm{IH}}}}+{{\bf{S}}_{{\rm{ID}}}})}2}}\right.\kern-1.2pt}2}}\end{array}}\right], (4)

where all the elements of $\mathbf{S}$ lie within the value range $[0,1]$.
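The assembly of the similarity-enhanced adjacency matrix (4) is a direct block construction. A sketch with randomly generated placeholder inputs (symmetric similarity matrices with unit diagonals, standing in for the Jaccard matrices above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 2

def rand_similarity(n):
    """Placeholder for a Jaccard matrix: symmetric, in [0,1], unit diagonal."""
    A = rng.random((n, n))
    A = (A + A.T) / 2
    np.fill_diagonal(A, 1.0)
    return A

B_hat = rng.random((N, M))          # normalized bipartite block
S_DH, S_DI = rand_similarity(N), rand_similarity(N)
S_IH, S_ID = rand_similarity(M), rand_similarity(M)

# Similarity-enhanced adjacency matrix of Eq. (4).
S = np.block([[(S_DH + S_DI) / 2, B_hat],
              [B_hat.T, (S_IH + S_ID) / 2]])
```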

The $K$-order ($K \geq 1$) proximity of a node $v_i$ can be used to explore the implicit community structures of graphs [1, 2, 3, 4, 5, 6]. I apply the matrix-factorization-based objective [7] of DeepWalk [8] to $\mathbf{S}$, treating $\mathbf{S}$ as the adjacency matrix of the $(N+M)$-node heterogeneous graph. This objective aims to factorize the following matrix:

$\mathbf{M} := \log\left( w \left( \frac{1}{K} \sum_{k=1}^{K} (\mathbf{D}^{-1} \mathbf{S})^{k} \right) \mathbf{D}^{-1} \right) - \log b,$ (5)

where ๐ƒ=diag(degโก(v1),โ‹ฏ,degโก(vN+M)){\bf{D}}={\mathop{\rm diag}\nolimits}({{\deg}(v_{1})},\cdots,{{\deg}(v_{N+M})}) is the degree diagonal matrix w.r.t. ๐’{\bf{S}}; w=โˆ‘i=1N+Mdegโก(vi)w=\sum\nolimits_{i=1}^{N+M}{{{\deg}(v_{i})}} is defined as the volume of a graph; KK and bb are pre-set proximity order and number of negative samples.

The low-dimensional embeddings are then derived via the following objective, which is equivalent to fitting DeepWalk's random-walk-based transition probabilities over sampled random walks [7]:

argโกmin๐—,๐˜OGE(๐—,๐˜)=โ€–๐Œโˆ’๐—๐˜Tโ€–F2,\mathop{\arg\min}\limits_{{\bf{X}},{\bf{Y}}}{{\mathop{\rm O}\nolimits}_{{\rm{GE}}}}({\bf{X}},{\bf{Y}})=\left\|{{\bf{M}}-{\bf{X}}{{\bf{Y}}^{T}}}\right\|_{F}^{2}, (6)

The singular value decomposition (SVD) is then used to obtain the solution $\{\mathbf{X}^{*}, \mathbf{Y}^{*}\}$ via

$\mathbf{X}^{*} = \mathbf{U}_{:,1:d} \sqrt{\boldsymbol{\Sigma}_{1:d}}, \quad \mathbf{Y}^{*} = \mathbf{V}_{:,1:d} \sqrt{\boldsymbol{\Sigma}_{1:d}},$ (7)

with ๐šบ=diagโ€‹(ฮธ1,ฮธ2,โ‹ฏ,ฮธN+M){\bf{\Sigma}}={\rm{diag}}({\theta_{1}},{\theta_{2}},\cdots,{\theta_{N+M}}) as the diagonal matrix of singular values in descending order. Finally, I adopt ๐—โˆ—{\bf{X}}^{*} as the derived network embeddings, where I treat ๐—1:N,:โˆ—{\bf{X}}_{1:N,:}^{*} and ๐—(N+1):(N+M),:โˆ—{\bf{X}}_{(N+1):(N+M),:}^{*} as the domain name embeddings {๐ฎ1,โ‹ฏ,๐ฎN}\{{{\bf{u}}_{1}},\cdots,{{\bf{u}}_{N}}\} and IP address embeddings {๐ฏ1,โ‹ฏ,๐ฏM}\{{{\bf{v}}_{1}},\cdots,{{\bf{v}}_{M}}\}, respectively.

In addition, I incorporate two logistic regression classifiers (i.e., $C_{\rm MDD}$ and $C_{\rm IRE}$) into the unsupervised embedding objective (6) to support two DNS analysis tasks: (i) malicious domain detection and (ii) IP reputation evaluation.

Let $R = \{r_{l_1}, \cdots, r_{l_T}\}$ and $G = \{g_{l_1}, \cdots, g_{l_T}\}$ be (i) the results given by a classifier and (ii) the corresponding ground-truth labels, respectively. The logistic regression classifier can be optimized by minimizing the following binary cross-entropy objective:

$O_{\rm C}(G, R) = -\sum_{i=l_1}^{l_T} \left( g_i \log r_i + (1 - g_i) \log(1 - r_i) \right).$ (8)

To fully utilize the supervised training labels, I further apply the graph regularization technique. A constraint matrix $\mathbf{C} \in [0,1]^{(N+M) \times (N+M)}$ is first introduced to encode the training label information. For a pair of domain names $(p_i, p_j)$ in the training set, let $\mathbf{C}_{ij} = \mathbf{C}_{ji} = 1$ if $p_i$ and $p_j$ are both malicious (or both benign) domain names, and $\mathbf{C}_{ij} = \mathbf{C}_{ji} = 0$ otherwise. Similarly, for a pair of IP addresses $(q_i, q_j)$ in the training set, let $\mathbf{C}_{(N+i),(N+j)} = \mathbf{C}_{(N+j),(N+i)} = 1$ if $q_i$ and $q_j$ have the same reputation level, and $\mathbf{C}_{(N+i),(N+j)} = \mathbf{C}_{(N+j),(N+i)} = 0$ otherwise. For a domain name $p_i$ and an IP address $q_j$ in the training sets, let $\mathbf{C}_{i,(N+j)} = \mathbf{C}_{(N+j),i} = 1$ if $p_i$ is a benign (or malicious) domain name and $q_j$ has a normal (or poor) reputation, and $\mathbf{C}_{i,(N+j)} = \mathbf{C}_{(N+j),i} = 0$ otherwise. Finally, I set $\mathbf{C}_{i,:} = \mathbf{C}_{:,i} = 0.5$ and $\mathbf{C}_{(N+i),:} = \mathbf{C}_{:,(N+i)} = 0.5$ for the remaining domain names and IP addresses in the validation and test sets. Based on $\mathbf{C}$, the graph regularization objective is then defined as:

OR(๐—,๐‚)=12โ€‹โˆ‘i,j๐‚iโ€‹jโ€‹โ€–๐—i,:โˆ’๐—j,:โ€–22=tr(๐—Tโ€‹๐‹๐—),\leavevmode\resizebox{390.25534pt}{}{${{\mathop{\rm O}\nolimits}_{\rm{R}}}({\bf{X}},{\bf{C}})=\frac{1}{2}\sum\nolimits_{i,j}{{{\bf{C}}_{ij}}\left\|{{{\bf{X}}_{i,:}}-{{\bf{X}}_{j,:}}}\right\|_{2}^{2}={\mathop{\rm tr}\nolimits}({{\bf{X}}^{T}}{\bf{LX}})}$}, (9)

with ๐‹:=๐ƒCโˆ’๐‚{\bf{L}}:={{\bf{D}}_{\mathop{\rm C}\nolimits}}-{\bf{C}} as the Laplacian matrix of ๐‚{\bf{C}} and ๐ƒC=diag(โˆ‘j๐‚1โ€‹j,โ‹ฏ,โˆ‘j๐‚(N+M),j){{\bf{D}}_{\rm{C}}}={\mathop{\rm diag}\nolimits}(\sum\nolimits_{j}{{{\bf{C}}_{1j}},\cdots,\sum\nolimits_{j}{{{\bf{C}}_{(N+M),j}}}}) as the degree diagonal matrix w.r.t. ๐‚{\bf{C}}.

In addition to (9), I also develop an end-to-end logistic regression classifier to integrate the supervised training labels. Let $\{\mathbf{x}_i\} = \{\mathbf{X}_{i,:} = \mathbf{u}_i\} \cup \{\mathbf{X}_{N+i,:} = \mathbf{v}_i\}$ be the embeddings of all available DNS entities, including all the domain names and IP addresses. The logistic regression classifier takes each embedding $\mathbf{x}_i$ as its input and derives a classification result $r_i = \sigma(\mathbf{x}_i \mathbf{w} + \mathbf{b})$, with $\{\mathbf{w}, \mathbf{b}\}$ as learnable parameters. Let $R = \{r_1, \cdots, r_{N+M}\}$ and $G = \{g_1, \cdots, g_{N+M}\}$ be (i) the classification results and (ii) the ground-truth labels w.r.t. the DNS entities $\{p_1, \cdots, p_N, q_1, \cdots, q_M\}$. I adopt the following binary cross-entropy objective between $G$ and $R$:

$O_{\rm AC}(G, R) = -\sum_{i} m_i \left( g_i \log r_i + (1 - g_i) \log(1 - r_i) \right),$ (10)

where a mask $m_i$ is integrated to distinguish DNS entities in the training set from those in the validation and test sets. Concretely, $m_i = 1$ if a DNS entity $\mathbf{x}_i$ is in the training set and $m_i = 0$ otherwise.
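The masked objective (10) is a one-liner in numpy. The labels, predictions, and mask below are hypothetical; the epsilon clipping is an implementation detail to guard against $\log 0$:

```python
import numpy as np

def masked_bce(g, r, m, eps=1e-12):
    """Masked binary cross-entropy of Eq. (10); eps guards log(0)."""
    r = np.clip(r, eps, 1.0 - eps)
    return -np.sum(m * (g * np.log(r) + (1.0 - g) * np.log(1.0 - r)))

# Hypothetical ground truth g, predictions r, and training mask m:
# only the first two entities are in the training set.
g = np.array([1.0, 0.0, 1.0, 0.0])
r = np.array([0.9, 0.2, 0.4, 0.6])
m = np.array([1.0, 1.0, 0.0, 0.0])
loss = masked_bce(g, r, m)
```

The mask zeroes out the loss terms of validation and test entities, so no label information from those splits leaks into training.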

In summary, one can construct the following semi-supervised objective by combining the (i) unsupervised graph embedding objective (6), (ii) graph regularization objective (9), and (iii) auxiliary classifier objective (10):

argโกmin๐—,๐˜O(๐—,๐˜)=(OGE+ฮฑโ€‹OAC+ฮฒโ€‹OR),\mathop{\arg\min}\limits_{{\bf{X}},{\bf{Y}}}{\mathop{\rm O}\nolimits}({\bf{X}},{\bf{Y}})=({{\mathop{\rm O}\nolimits}_{{\rm{GE}}}}+\alpha{{\mathop{\rm O}\nolimits}_{{\rm{AC}}}}+\beta{{\mathop{\rm O}\nolimits}_{\rm{R}}}), (11)

where $\alpha$ and $\beta$ are hyper-parameters that balance $O_{\rm AC}$ and $O_{\rm R}$ against $O_{\rm GE}$.

To obtain the solution of objective (11), I first randomly initialize the parameters $\{\mathbf{X}, \mathbf{Y}, \mathbf{w}, \mathbf{b}\}$ and then apply gradient descent to iteratively update their values until convergence. Given the solution of (11) (denoted as $\{\mathbf{X}'^{*}, \mathbf{Y}'^{*}\}$), I use $\mathbf{X}'^{*}$ as the final low-dimensional embeddings of the associated DNS entities. The classification results for malicious domain detection and IP reputation evaluation can then be derived from the integrated end-to-end logistic regression classifier.
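To make the composition of (11) concrete, the sketch below evaluates the combined loss for given parameter values; all inputs are hypothetical placeholders, and in practice the parameters would be updated by gradient descent (e.g., via an autodiff framework):

```python
import numpy as np

def total_objective(X, Y, Mmat, C, g, m, w_lr, b_lr, alpha=0.1, beta=0.1):
    """Value of the semi-supervised objective (11): O_GE + alpha*O_AC + beta*O_R."""
    O_GE = np.linalg.norm(Mmat - X @ Y.T) ** 2                       # Eq. (6)
    r = 1.0 / (1.0 + np.exp(-(X @ w_lr + b_lr)))                     # logistic classifier
    r = np.clip(r, 1e-12, 1.0 - 1e-12)
    O_AC = -np.sum(m * (g * np.log(r) + (1 - g) * np.log(1 - r)))    # Eq. (10)
    L = np.diag(C.sum(axis=1)) - C                                   # Laplacian of C
    O_R = np.trace(X.T @ L @ X)                                      # Eq. (9)
    return O_GE + alpha * O_AC + beta * O_R
```

Since all three terms are non-negative for a symmetric non-negative $\mathbf{C}$, the combined objective is bounded below by zero.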

II Conclusions

In this paper, I proposed a novel MF-DNS-E method to (i) automatically learn the domain and IP embeddings by jointly exploring the homogeneous and heterogeneous high-order proximities between the two types of DNS entities while (ii) simultaneously supporting malicious domain detection and IP reputation evaluation. In future work, I intend to consider the joint optimization of other DNS entities and their associated applications (e.g., device type classification of end hosts). Moreover, exploring the dynamic nature of DNS query behaviors via existing embedding techniques for dynamic graphs [9, 10, 11, 12] is another focus of my future research.

References

  • [1] M. Qin, D. Jin, K. Lei, B. Gabrys, and K. Musial-Gabrys, "Adaptive community detection incorporating topology and content in social networks," Knowledge-Based Systems, vol. 161, pp. 342–356, 2018.
  • [2] M. Qin, K. Lei, B. Bai, and G. Zhang, "Towards a profiling view for unsupervised traffic classification by exploring the statistic features and link patterns," in Proceedings of the 2019 ACM SIGCOMM Workshop on Network Meets AI & ML, 2019, pp. 50–56.
  • [3] W. Li, M. Qin, and K. Lei, "Identifying interpretable link communities with user interactions and messages in social networks," in Proceedings of the 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 2019, pp. 271–278.
  • [4] M. Qin and K. Lei, "Dual-channel hybrid community detection in attributed networks," Information Sciences, vol. 551, pp. 146–167, 2021.
  • [5] M. Qin, C. Zhang, B. Bai, G. Zhang, and D.-Y. Yeung, "Towards a better trade-off between quality and efficiency of community detection: An inductive embedding method across graphs," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 17, no. 9, pp. 127:1–127:34, 2023.
  • [6] Y. Gao, M. Qin, Y. Ding, L. Zeng, C. Zhang, W. Zhang, W. Han, R. Zhao, and B. Bai, "RaftGP: Random fast graph partitioning," in 2023 IEEE High Performance Extreme Computing Conference (HPEC), 2023, pp. 1–7.
  • [7] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, "Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec," in Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), 2018.
  • [8] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2014.
  • [9] K. Lei, M. Qin, B. Bai, and G. Zhang, "Adaptive multiple non-negative matrix factorization for temporal link prediction in dynamic networks," in Proceedings of the 2018 ACM SIGCOMM Workshop on Network Meets AI & ML, 2018, pp. 28–34.
  • [10] K. Lei, M. Qin, B. Bai, G. Zhang, and M. Yang, "GCN-GAN: A non-linear temporal link prediction model for weighted dynamic networks," in Proceedings of the 2019 IEEE Conference on Computer Communications (INFOCOM), 2019, pp. 388–396.
  • [11] M. Qin and D.-Y. Yeung, "Temporal link prediction: A unified framework, taxonomy, and review," ACM Computing Surveys, vol. 56, no. 4, pp. 1–40, 2023.
  • [12] M. Qin, C. Zhang, B. Bai, G. Zhang, and D.-Y. Yeung, "High-quality temporal link prediction for weighted dynamic graphs via inductive embedding aggregation," IEEE Transactions on Knowledge & Data Engineering (TKDE), vol. 35, no. 9, pp. 9378–9393, 2023.