A Matrix Factorization Based Network Embedding Method for DNS Analysis
Abstract
In this paper, I explore the potential of network embedding (a.k.a. graph representation learning) to characterize DNS entities in passive network traffic logs. I propose an MF-DNS-E (Matrix-Factorization-based DNS Embedding) method to represent DNS entities (e.g., domain names and IP addresses), where a random-walk-based matrix factorization objective is applied to learn the corresponding low-dimensional embeddings.
Index Terms:
Matrix Factorization, Network Embedding, DNS AnalysisI Methodology
I introduce an MF-DNS-E (Matrix-Factorization-based DNS Embedding) method to automatically learn the low-dimensional vector representations (i.e., embeddings) for domain names and IP addresses, which can be used to support some DNS security analysis tasks including malicious domain detection and IP reputation evaluation.
Le and be the total numbers of (i) domain names and (ii) IP addresses in a passive network traffic log, respectively. One can extract a bipartite graph , with as the number of queries that a domain name is resolved to an IP address . The corresponding adjacency matrix of this bipartite graph is
(1) |
where each entry in is normalized into via , with as the maximum entry.
For all the domain names , I further introduce (i) host-based and (ii) IP-based similarities between each pair of domain names, which are denoted as and , respectively. Given a domain name , let be the set of hosts that have the DNS query result of . I further denote as the set of IPs resolved from . and are defined as:
(2) |
For all the IP addresses , I also define the (iii) host-based and (iv) domain-based similarities, denoted as and , respectively. Given an IP address , let and be the (i) set of hosts requesting the DNS result of and (ii) set of domains resolved to . and are defined as:
(3) |
Furthermore, I combine and to construct a similarity-enhanced graph described by the following adjacency matrix:
(4) |
where all the elements in are within the value range .
The -order () proximity of a node can be used to explore implicit community structures of graphs [1, 2, 3, 4, 5, 6]. I apply the matrix-factorization-based objective [7] of DeepWalk [8] to . Let be the adjacency matrix of the heterogeneous graph. This objective aims to factorize the following matrix:
(5) |
where is the degree diagonal matrix w.r.t. ; is defined as the volume of a graph; and are pre-set proximity order and number of negative samples.
The low-dimensional embeddings are then derived by fitting the random-walk-based transition probabilities to the sampled random walks equivalent to the following objective [7]:
(6) |
The singular value decomposition (SVD) is then used to get the solution via
(7) |
with as the diagonal matrix of singular values in descending order. Finally, I adopt as the derived network embeddings, where I treat and as the domain name embeddings and IP address embeddings , respectively.
In addition, I also incorporate two logistic regression classifiers (i.e., and ) to the unsupervised embedding objective (6) to support two DNS analysis tasks of (i) malicious domain detection and (ii) IP reputation evaluation.
Let and be (i) the result given by classifier and (ii) the ground-truth, respectively. The logistic regression classifier can be optimized by minimizing the following binary cross-entropy objective:
(8) |
To fully utilize the supervised training labels, I further apply the graph regularization technique. A constraint matrix is first introduced to represent the training label information. For a pair of domain names in the training set, let if are both malicious (or benign) domain names and otherwise. Similarly, for a pair of IP addresses in the training set, let if have the same reputation level and otherwise. For a given domain name and an IP address in the training sets, let if if is a benign (or malicious) domain name and is with normal (or poor) reputation and otherwise. Finally, I set and for the rest domain names and IP addresses in the validation and test sets. Based on , the graph regularization objective is then defined as:
(9) |
with as the Laplacian matrix of and as the degree diagonal matrix w.r.t. .
In addition to (9), I also develop an end-to-end logistic regression classifier to integrate the supervised training labales. Let to be the set of available DNS entities, including all the domain names and IP addresses. The logistic regression classifier takes each embedding as its input and then derives a classification result , with as learnable parameters. Let and be (i) the classification result and (ii) the ground-truth w.r.t. DNS entities . I adopt the following binary cross-entropy objective between and :
|
(10) |
where a mask is integrated to distinguish DNS entities in the training set from those in the validation and test sets. Concretely, if a DNS entity is in the training set and otherwise.
In summary, one can construct the following semi-supervised objective by combining objectives of the (i) unsupervised graph embedding objective (5), (ii) graph regularization objective (9), as well as (iii) auxiliary classifier (10):
(11) |
where and are hyper-parameters to balance and .
To obtain the solution of objective (11), I first randomly initialize parameters and then apply gradient descent to iteratively update their values until convergence. For the solution of (11) (notated as ), I use as the final low-dimensional embeddings for the associated DNS entities. The classification results of malicious domain detection and IP reputation evaluation can then be derived from the end-to-end logistic regression classifier integrated.
II Conclusions
In this paper, I proposed a novel JDE method to (i) automatically learn the domain and IP embeddings by jointly exploring the homogeneous and heterogeneous high-order proximities between two types of DNS entities while (ii) simultaneously supporting malicious domain detection and IP reputation evaluation. In my future work, I intend to consider the joint optimization of other DNS entities and their associated applications (e.g., device type classification of end hosts). Moreover, to further explore the dynamic natures of DNS query behaviors via existing embedding techniques for dynamic graphs [9, 10, 11, 12] is also my next research focus.
References
- [1] M.ย Qin, D.ย Jin, K.ย Lei, B.ย Gabrys, and K.ย Musial-Gabrys, โAdaptive community detection incorporating topology and content in social networks,โ Knowledge-Based Systems, vol. 161, pp. 342โ356, 2018.
- [2] M.ย Qin, K.ย Lei, B.ย Bai, and G.ย Zhang, โTowards a profiling view for unsupervised traffic classification by exploring the statistic features and link patterns,โ in Proceedings of the 2019 ACM SIGOMM Workshop on Network Meets AI & ML, 2019, pp. 50โ56.
- [3] W.ย Li, M.ย Qin, and K.ย Lei, โIdentifying interpretable link communities with user interactions and messages in social networks,โ in Proceedings of the 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 2019, pp. 271โ278.
- [4] M.ย Qin and K.ย Lei, โDual-channel hybrid community detection in attributed networks,โ Information Sciences, vol. 551, pp. 146โ167, 2021.
- [5] M.ย Qin, C.ย Zhang, B.ย Bai, G.ย Zhang, and D.-Y. Yeung, โTowards a better trade-off between quality and efficiency of community detection: An inductive embedding method across graphs,โ ACM Transactions on Knowledge Discovery from Data (TKDD), vol.ย 17, no.ย 9, pp. 127:1โ127:34, 2023.
- [6] Y.ย Gao, M.ย Qin, Y.ย Ding, L.ย Zeng, C.ย Zhang, W.ย Zhang, W.ย Han, R.ย Zhao, and B.ย Bai, โRaftgp: Random fast graph partitioning,โ in 2023 IEEE High Performance Extreme Computing Conference (HPEC), 2023, pp. 1โ7.
- [7] J.ย Qiu, Y.ย Dong, H.ย Ma, J.ย Li, K.ย Wang, and J.ย Tang, โNetwork embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec,โ in Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), 2018.
- [8] B.ย Perozzi, R.ย Al-Rfou, and S.ย Skiena, โDeepwalk: Online learning of social representations,โ in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2014.
- [9] K.ย Lei, M.ย Qin, B.ย Bai, and G.ย Zhang, โAdaptive multiple non-negative matrix factorization for temporal link prediction in dynamic networks,โ in Proceedings of the 2018 ACM SIGCOMM Workshop on Network Meets AI & ML, 2018, pp. 28โ34.
- [10] K.ย Lei, M.ย Qin, B.ย Bai, G.ย Zhang, and M.ย Yang, โGcn-gan: A non-linear temporal link prediction model for weighted dynamic networks,โ in Proceedings of the 2019 IEEE Conference on Computer Communications (INFOCOM), 2019, pp. 388โ396.
- [11] M.ย Qin and D.-Y. Yeung, โTemporal link prediction: A unified framework, taxonomy, and review,โ ACM Computing Surveys, vol.ย 56, no.ย 4, pp. 1โ40, 2023.
- [12] M.ย Qin, C.ย Zhang, B.ย Bai, G.ย Zhang, and D.-Y. Yeung, โHigh-quality temporal link prediction for weighted dynamic graphs via inductive embedding aggregation,โ IEEE Transactions on Knowledge & Data Engineering (TKDE), vol.ย 35, no.ย 9, pp. 9378โ9393, 2023.