A Novel Framework for the Analysis of Unknown Transactions in Bitcoin: Theory, Model, and Experimental Results

Maurantonio Caprolu1, Matteo Pontecorvi2, Matteo Signorini2, Carlos Segarra3 and Roberto Di Pietro1
1Division of Information and Computing Technology, College of Science and Engineering
Hamad Bin Khalifa University, Qatar Foundation - Doha, Qatar 2NOKIA Bell Labs - 91620 Nozay, France 3Imperial College, London, UK

Abstract

Bitcoin (BTC) is probably the most transparent payment network in the world, thanks to the full history of transactions available to the public. Though, Bitcoin is not a fully anonymous environment, rather a pseudonymous one, accounting for a number of attempts to beat its pseudonimity using clustering techniques. There is, however, a recurring assumption in all the cited deanonymization techniques: that each transaction output has an address attached to it. That assumption is false. An evidence is that, as of block height 591,872, there are several millions transactions with at least one output for which the Bitcoin Core client cannot infer an address.
In this paper, we present a novel approach based on sound graph theory for identifying transaction inputs and outputs. Our solution implements two simple yet innovative features: it does not rely on BTC addresses and explores all the transactions stored in the blockchain. All the other existing solutions fail with respect to one or both of the cited features. In detail, we first introduce the concept of Unknown Transaction and provide a new framework to parse the Bitcoin blockchain by taking them into account. Then, we introduce a theoretical model to detect, study, and classify—for the first time in the literature—unknown transaction patterns in the user network. Further, in an extensive experimental campaign, we apply our model to the Bitcoin network to uncover hidden transaction patterns within the Bitcoin user network. Results are striking: we discovered more than $30,000$ unknown transaction DAGs, with a few of them exhibiting a complex yet ordered topology and potentially connected to automated payment services. To the best of our knowledge, the proposed framework is the only one that enables a complete study of the unknown transaction patterns, hence enabling further research in the fields—for which we provide some directions.

I Introduction

Known as the first successful virtual currency with the potential to disrupt the banking system and provide peer-to-peer payments, Bitcoin has been widely adopted in ransomware campaigns such as Wannacry [40] and NotPetya [38]. The main reasons being the relative diffusion of Bitcoin, as well as privacy. Privacy is easily achievable with Bitcoin pseudonyms in the form of randomly generated addresses that can be used to send/receive money without being linked to any real identity.

Being organized into wallets, Bitcoin’s addresses can be easily and freely self-generated by end-users (neither banks nor trusted third parties are needed) without any limitation on their number. Indeed, using each address for a single transaction is a strongly advised common practice in the Bitcoin community and has always been its key component in providing privacy, since the resulting transactions’ network has always been assumed too complex to track money flows.

Such “privacy through complexity” approach has been enhanced in the last years by online services such as mixers and tumblers. Similar to the TOR project [19], aimed at concealing user’s location and network activities from anyone conducting network surveillance or traffic analysis, mixers and tumblers try to conceal users’ addresses that took part in some monetary transactions. The functioning of such services is quite simple and, similarly to TOR, they require to bounce bitcoins through peers in order to make their tracking hard. Furthermore, the bitcoins that are bounced and returned to the user are not the same that were initially sent, since they come from other sources (i.e. other addresses). However, during the last few years, it has been shown that the above “privacy through complexity” approach can be attacked by clustering the addresses into groups that are likely to belong to the same entity (a user, a shop, a mixer etc.). In 2015, David Nick described [34] some of the most famous heuristics being used still to date for Bitcoin address clustering, such as the shadow, consumer, optimal-change, and multi-input. These heuristics, applied to transactions data previously extracted from the blockchain using parsing algorithms, produce in output the clustered user network.

All such parsing algorithms suffer from the cognitive bias that Bitcoin transactions are just operations that link one address to another [27, 1]. However, such address-based linkability is not enforced directly in the Bitcoin protocol: the protocol only verifies that the locking and unlocking scripts do not produce any false statement [2]. This cited bias is also reinforced by the fact that most of the locking and unlocking scripts follow a specific pattern based on asymmetric encryption using randomly generated addresses. However, such a pattern is not mandatory. Consequently, transactions data parsed relying only on recognizable locking/unlocking scripts risk to be incomplete, and they often are. This incompleteness, in turn, causes a loss of reliability in all modern clustering and de-anonymization techniques. Moreover, intentionally crafted custom transactions could be used to hide illicit money flows, while being completely invisible to modern automatic parsers unable to decode output addresses. As an evidence, note that up to block with height 591,872, Bitcoin’s blockchain contains more than $22$ million transactions with at least one locking script (or output) not following any well-known locking/unlocking script. We will often refer to these outputs as unknown transaction outputs.

I-A Contribution

Our contributions are both theoretical and applied. At glance, we are the first, to the best of our knowledge, to provide a navigation tool within the Bitcoin blockchain that: does not rely on addresses and explores all the transactions stored in the blockchain. In detail, our contributions can be summarized as follow:

•

we introduce the general concept of unknown transactions, subsuming the less rigorous definition of non-standard transactions provided by the Bitcoin protocol;
•

we extend the definition of user network $\mathcal{U}$ by including unknown transactions;
•

we design a novel (theoretically sound) parsing methodology to study, and likely understand for the first time, unknown transaction patterns by using a specific class of graphs called T-DAGs;
•

we show that T-DAGs can be efficiently compared via isomorphism, offering a new mechanism for clustering similar transaction patterns.
•

we test our algorithms over the Bitcoin network, collecting and analysing all unknown transactions in the ledger from Bitcoin origins until block 591,872;
•

we further refine our results by removing trivial non-standard transactions already observed in the literature, revealing classes of hidden unknown transaction patterns never considered before;
•

to the best of our knowledge, this is the first study of the Bitcoin transaction network which includes unknown transactions.

With reference to the Bitcoin context, our contribution can be used to collect and observe the existing unknown transaction patterns in the ledger, as well as those that will be generated in the future. These patterns, previously ignored by any transactions analysis work, can be used to complete the user network $\mathcal{U}$ and the transaction network $\mathcal{T}$ . Our methodology can be applied to any clustering and de-anonymization technique to improve its effectiveness by leveraging a complete and reliable Bitcoin transactions database. Moreover, our novel solution can be extended immediately to any other system where transaction patterns can be modeled using T-DAGs.

Organization: The rest of the paper is organized as follows. In Section II we introduce preliminary concepts used in the paper. In Section III we summarize the state of the art for Bitcoin addresses clustering and graph isomorphism problem. In Section IV we define our theoretical model capable of extending user network in order to identify those sub-graphs that involve unknown transactions. In Section V, we describe the application of our theoretical model to the Bitcoin network, discussing the results achieved during our experimental evaluation. Finally, in Section VI we summarize our results, discuss the possible implications, and suggest future research lines.

II Background

II-A Preliminaries

Let $G=(V,E)$ be a graph with nodes $V$ and edges $E$ . We will consider both undirected graphs with $E\subseteq\binom{V}{2}$ , and directed graphs with $E\subseteq V\times V$ . An undirected graph $G$ is said to be connected if, for any $u,v\in V$ , there is a path from $u$ to $v$ . In a directed graph, edges are sometimes called arcs and the first vertex of an arc is referred to as the tail and the second one as the head. Given a directed graph $D=(V,E)$ , the underlying graph $G=(V,E^{\prime})$ is the undirected graph obtained with the same vertex set and an edge set corresponding of each arc in $E$ . That is, if $(u,v)$ is an arc in $D$ then both $(u,v)$ and $(v,u)$ are edges in $G$ . A directed graph is weakly connected if the underlying graph is connected. From now on, we will always refer to directed graphs and use edges and arcs interchangeably.

For an edge $e=\{u,v\}\in E$ , we say that $u$ is adjacent to $v$ , and vice-versa. If there exists a sequence of edges connecting two vertices $u$ and $v$ , we say that these two vertices are reachable. If $u$ is reachable starting from itself (through a non-empty sequence of edges) we say $G$ contains a cycle. If $G$ contains no cycles, $G$ is said to be acyclic. We call directed acyclic graphs DAGs. A tree is an undirected, connected, and acyclic graph. We use $K_{x,y}$ to indicate the complete bipartite graph between two sets of nodes $X$ and $Y$ with $|X|=x$ and $|Y|=y$ . Given a vertex $v\in V$ , $v$ ’s indegree, $\Gamma^{-}(v)$ , is the cardinal of edges (or arcs) that have $v$ as its head. Symmetrically, $v$ ’s outdegree, $\Gamma^{+}(v)$ , is the cardinal of edges (or arcs) that have $v$ as its tail. We call a vertex with zero indegree a source, and one with zero outdegree a sink. Given two graphs $G_{1}=(V_{1},E_{1})$ and $G_{2}=(V_{2},E_{2})$ , an isomorphism is a bijection function $f(.)$ between the sets of vertices, $V_{1}$ and $V_{2}$ , such that it maintains adjacency. That is, $(u_{1},u_{2})\in E_{1}\Rightarrow(f(u_{1}),f(u_{2}))\in E_{2}$ . A graph labeling is the assignment of labels to the vertices, edges, or both, of a graph. Given a graph $G$ , a canonical form is a labeled graph $\bar{G}$ , isomorphic to $G$ , such that $H$ is isomorphic to $G$ iff $\bar{G}=\bar{H}$ —i.e, they have identical canonical forms. An equivalence relation $\sim$ , is a binary relation on a set $S$ that is: (i) reflexive; (ii) symmetric; and, (iii) transitive. For any elements $a,b$ and $c$ in $S$ it holds: (i) $a\sim a$ , (ii) $a\sim b\Rightarrow b\sim a$ and (iii) $a\sim b\wedge b\sim c\Rightarrow a\sim c$ . An equivalence relation, $\sim$ , partitions a set in equivalence classes. A total ordering, $\prec$ , is a binary relation on a set $S$ that is: (i) antisymmetric; (ii) transitive; and, (iii) connex. That is, $\forall a,b,c\in S$ : (i) $a\prec b\wedge b\prec a\Rightarrow a=b$ ; (ii) $a\prec b\wedge b\prec c\Rightarrow a\prec c$ ; and, (iii) $a\prec b\cup b\prec a$ .

II-B Bitcoin Transaction Network and User Network

The Bitcoin network is composed by peers which can be either full nodes or clients, with full nodes validating transactions and blocks [10]. The Bitcoin protocol stores every transaction in a publicly distributed ledger, thus allowing everyone to read the transactions occurred among Bitcoin users. To ease the analysis of the money flow contained in the Bitcoin ledger, the Bitcoin data-set can be modeled with two different graphs (Transaction and User Network) as introduced in [37].
The Transaction Network $\mathcal{T}$ describes the Bitcoin flow between transactions over time [32]. Each node represents a transaction, and each directed edge between two nodes n1 and n2 represents a money flow that is an output for n1 and an input for n2. The network is a directed acyclic graph (DAG) since the output of a transaction can never be an input (either directly or indirectly) to the same transaction.
The User Network $\mathcal{U}$ describes the Bitcoin flow between users over time [28]. Each node represents a Bitcoin user, and each directed edge between two nodes $n_{1}$ and $n_{2}$ represents an inputs-outputs pair of a single transaction, where the inputs belongs to $n_{1}$ and the outputs belongs to $n_{2}$ . More in detail, the User Network is a weighted directed hypergraph $\mathcal{U}=(\mathcal{A},\mathcal{T})$ where $\mathcal{A}$ is the set of all the addresses used in the Bitcoin network, and $\mathcal{T}$ is the set of transactions. Every transaction $t\in\mathcal{T}$ can be modeled as a pair of ordered sets $(X,Y)$ with $X,Y\subseteq\mathcal{A}$ , where addresses included in $X$ are inputs of $t$ and addresses in $Y$ are outputs of $t$ .

Solution	Objective		Unknown Transactions		Considered Ledger Portion
Solution	Parser	Clustering	Considered	Study of Patters	From Block	To Block
Blocksci [21]	✓	✓	✗	✗	0	custom
Blockchain.com	✓	✓	✗	✗	0	last mined block
Bitcoincore	✓	✗	✗	✗	0	last mined block
[37]	✗	✓	✗	✗	0	130367
[1]	✓	✓	✗	✗	0	140000
[31]	✓	✓	✗	✗	0	231207
[23]	✗	✗	✗	✗	0	235000
[39]	✓	✓	✗	✗	0	267350
[27]	✗	✓	✗	✗	0	389799
[7]	✗	✗	✓	✗	0	453200
[14]	✗	✓	✗	✗	0	456520
Our Solution	✓	✓	✓	✓	0	591872

Table I: Comparison of our solution with popular parsers and state-of-the-art clustering techniques.

III Related Work

To the best of our knowledge, very few works have investigated unknown transactions in the Bitcoin ledger. In [7], the authors analyzed 1,887,708 transactions containing the OP_RETURN instruction. They found that $15\%$ of them are empty transactions, generated by different activities on the Bitcoin network, such as stress tests or DoS attacks. The remaining transactions are not empty but, similarly to the previous ones, they are not used for transferring funds. In fact, they have a different, specific goal: to store data in the Bitcoin ledger. The OP_RETURN transactions do not have a valid recipient, since they are not used to transfer funds. Therefore, they cannot be redeemed. For this reason, these transactions are not of particular interest for clustering and de-anonymizing techniques of Bitcoin users, i.e., they are present neither in $\mathcal{T}$ nor in $\mathcal{U}$ . A more in-depth analysis of non-standard transactions in the Bitcoin network has been proposed in [8]. The authors explored the ledger collecting and classifying both standard and non-standard transactions to understand why users sometimes do not adhere to the protocol. To achieve this goal, they mainly focus on analyzing non-standard transactions, classifying them into nine different typologies.
Although these studies analyzed some non-standard transactions, their purpose is only to analyze the semantic, considering every transaction as a stand-alone object. Consequently, such transactions are still ignored in the construction of both the user network $\mathcal{U}$ and the transaction network $\mathcal{T}$ , leading to incomplete and possibly unreliable data structures. To solve this problem, we first collected all the unknown transactions in the Bitcoin ledger, regardless of their semantic. We then focused on those that have a valid locking script as they have an impact on the de-anonymization and clustering techniques, neglected by all previous works in the field (see a summary in Section III-A). By using the proposed methodology, our framework is able to correctly parse unknown transactions, identify their patterns, and complete the user network $\mathcal{U}$ and the transaction network $\mathcal{T}$ with additional data never considered before.

III-A Bitcoin Addresses Clustering

Recently, security properties of blockchain-based protocols, with particular attention on the Bitcoin network, received increasing attention [15, 22, 24, 16, 12]. Among the different topics covered, several methods have been proposed to cluster bitcoin addresses [17]. These contributions build $\mathcal{U}$ by applying different heuristics to $\mathcal{T}$ . The most used heuristic, the Common Input, is based on the observation that all the inputs of a multi-input transaction belong to the same user [37]. This heuristic has been used to cluster Bitcoin addresses with the aim of deanonymizing users by using off-chain data [25] and study the Bitcoin network to uncover different properties [27, 35, 23]. Later, a more advanced heuristic called One-time Change was introduced [31, 14, 1, 39], based on the detection of the change among the output addresses.

All of these approaches built the transaction network $\mathcal{T}$ and the user network $\mathcal{U}$ by parsing the Bitcoin blockchain using the reference implementation (i.e. Bitcoin Core) or other blockchain explorer software. As explained in Section IV-A, current parsing algorithms, due to some incomplete assumptions, do not consider unknown transactions, providing incomplete and in some cases incorrect data to clustering and deanonymization algorithms.
Table I shows a comparison between the p0roposed solution and: (i) the Bitcoin reference implementation, i.e., Bitcoincore; (ii) blockchain explorer services, represented by the most popular one, blockchain.com; (iii) other platforms to analyze the Bitcoin blockchain, e.g., Blocksci; and, (iv) the most representative contributions in the literature that already used the user network $\mathcal{U}$ or the transaction network $\mathcal{T}$ .

III-B Graph Isomorphism

The graph isomorphism (GI) problem consists in determining whether given two graphs $\mathcal{G}$ and $\mathcal{V}$ there exists a bijection between both sets of vertices that preserves adjacencies. The lowest time bound for general GI stood since 1983 and until very recently at $\exp{\mathcal{O}\left(\sqrt{n\log n}\right)}$ [5]. However, polynomial time algorithms have long been known for different families of graphs. Among others, for graphs with bounded valence, the first polynomial time test dates from 1980 [26]. This work was used to solve the problem for graphs with bounded eigenvalue multiplicity [4] in $\mathcal{O}\left(n^{4m+c}\right)$ . For circulant graphs, a $\mathcal{O}(n^{2})$ testing algorithm is known [33]. The tree isomorphism problem was proven [18] to be linear in the number of vertices, that is, two trees $T_{1}$ and $T_{2}$ can be tested for isomorphism in $\mathcal{O}(n)$ comparing them in a bottom-up fashion. A result from 2016 [3] improved the 1983 bound for the GI problem to quasipolynomial time. The author proved an upper bound of $\exp{\left(\log n\right)^{\mathcal{O}(1)}}$ for testing wether two arbitrary graphs are isomoprphic or not.

A possible test of graph isomorphism is performed through canonical forms and canonical labelings. A canonical form of a graph is a representative of a class of graphs closed under isomorphisms, and with a linearly ordered vertex set [6]. That is, two graphs are isomorphic if and only if they yield the same canonical form. A canonical labeling is a string derived from a canonical form of a graph obtained through a defined mapping. The problem of obtaining canonical labelings for general graphs remains quasi-exponential in the number of vertices. A combinatorial method that runs in $\exp{n^{2/3}+o(1)}$ has long been known [6]. In the same paper the authors give a procedure to compute the canonical form of graphs of bounded valence (or degree) in polynomial time. Due to its relevant applications and the urge for practical solutions, implementations of graph canonization packaged in different software has proven to be successful. The McKay canonical labeling algorithm [30] packaged in the nauty software is one of the most celebrated solutions. It was the most popular solution from the early 80s until 2004, when a modern re-implementation [13], saucy, was introduced. Exploiting symmetry and sparsity in the search space pruning the latter managed to outperform the former by several orders of magnitude. The refinements later introduced in bliss [20] concluded in the release of Traces. The definition of the latter piece of software together with a comparison of all the software packages presented is discussed by its authors in [29].

When restricting the problem to rooted or un-rooted trees, linear is the best possible sequential runtime [18]. However, tree canonization has been proven to be in alternating logarithmic time [11]. The labeling therein defined is depth-first and the approach presented in this paper is breadth-first. In spite of that, the ordering relation we define is based on the same principles. However, a more recent approach [41] for tree canonical labeling uses a breadth-first approach. The authors therein do provide a less verbose labeling (at most $3n$ ) than ours, but this is due to the fact that they deal with specialized trees: ones where each vertex has an in-degree of at most 1.

IV Unknown Transactions Recognition

In this section, we introduce the general concept of unknown transactions, their classification, and the theoretical model to identify patterns within unknown transactions.

IV-A Unknown Transactions and Working Framework

The Bitcoin protocol provides its community with standard templates that must be used to create the locking and unlocking scripts that make up a transaction. The use of such templates is then enforced by miners using two functions, isStandardTx() and isStandard(), which check the compliance of each transaction’s inputs and outputs, respectively. In fact, a transaction is considered standard, and therefore accepted by the network, only if both functions return TRUE. If even one of them returns FALSE, the transaction is considered non-standard and discarded. This mechanism should prevent any use of Bitcoin transactions other than the ones conceived by the protocol, to avoid the spread of malicious transactions. However, even non-standard transactions can be included in the blockchain, thanks to miners who relax these controls [8].
Similar to the concept of non-standard transaction, we define unknown transactions as follow:

Definition 1 (Unknown Transaction)

We call a transaction (TX) unknown if it contains an input or an output with a Null value address, i.e. not correctly identified by the Bitcoin Core client.

This definition embraces a set of Bitcoin transactions, of which non-standards are currently a subset, regardless of what the protocol considers standard or non-standard. The concept of unknown transactions allows us to protect our framework from future variations of the Bitcoin protocol and guarantees compatibility with other systems.
Among unknown TXs, we are only interested in transactions whose output contains a Null address. This is because the address of an input is unequivocally determined by that of the output that it is spending. Hence, resolving the address of the output is equivalent to resolving the address of the associated input as well. Additionally, an input only exists to fund the outputs contained in its corresponding transaction. Thus, there is no such thing as an address for an input as it does not belong to someone. However, for simplicity, we agree to assign to an input the same address held by the output that it is spending. To unambiguously process the list of transaction hashes, we need to uniquely identify inputs and outputs. This is done using the Unique Transaction Input-Output Identifier, introduced below in Definition 2. But, before introducing such an Fidentifier, we motivate why it is required by our approach.

For elaborated blockchain analysis, it is a good idea to initially parse all the data from a running miner and, once the data is organized in a more accessible manner, apply further and more complex post-processing. However, we have discovered several imprecisions in the blockchain’s parsing process:

(i)

Excess of abstraction Blockchain parsers such as BlockSCI [21] introduce a completely new level of abstraction over the one already specified in the reference implementation [9]. Defining new wrappers, lots of different classes, and incomplete references can make a parser difficult to use and debug. Additionally, we discovered that, in the particular case of BlockSCI, there are errors in their parsing methodology;
(ii)

Excess of identifiers Bitcoin’s blockchain is an environment based on uniqueness. Every item must be uniquely identified and hashing algorithms already provide a way to do so. However, some parsers [21] insist on giving an alternative enumeration for transactions and addresses. This makes databases harder to navigate and makes it non-intuitive to mimic the client’s behavior or debug the processed data;
(iii)

Using Public Keys as Identifiers As introduced in Section I, to the best of our knowledge, existing works [27, 1], clustering Bitcoin Addresses to find real end-users, assume that each transaction output must have an address assigned to it. This is false. In fact, up to block with height $481823$ , Bitcoin’s blockchain contains $3255688$ unknown transactions.

Given the above problems, our proposed framework aims to provide a reference for storing Bitcoin’s data in a database; minimizing the amount of abstraction involved, reusing whenever possible the identifiers provided by the reference implementation, and keeping the structure simple and clear.

To be consistent with both (i) and (ii), we only introduce the critical functionalities not covered by the Bitcoin core ¹¹1The core client cannot find which transaction input is spending a given unspent output.. This way, parsers using our framework can be easily compared against each other, the Bitcoin Core Client, or even web explorers. Further layers of abstraction depending on the application should be detached from the parsing phase to avoid situations where complex post-processing is discredited by incorrect data parsing. This means that all non-relevant information for blockchain navigation is not included in the framework.

The Framework Our framework uses only minimal abstraction and provides a robust, reliable, and fast way to navigate through the Bitcoin’s blockchain. It is also easily portable: all applications that query or do some sort of post-processing with Bitcoin’s data can use it. To fulfill these conditions and to preserve minimality, only the necessary attributes are included. All other features included in the reference implementation, that provide key information about each transaction, but do not improve the exploration of the blockchain, are discarded. In fact, they can be easily obtained by using the identifiers provided by our framework, together with any Bitcoin client.

We present a database layout that only contains two types of entities: block and tx.

(i)

block: represents a block in the blockchain. It is uniquely identified by two parameters: hash and height. Both the parameters can be used to retrieve a block element from the database without ambiguity. Each block element has an additional parameter, tx. tx is an array of hashes, each one referencing a transaction included in the block (see next item for a description on the tx entity). Each element in the array can be uniquely identified, and accessed, by the index of their position within the array. This way, the $m$ -th transaction of the $n$ -th block can be identified without uncertainty;
(ii)

tx: represents a confirmed transaction (TX) in the blockchain. Since the Bitcoin’s ledger contains different cases of transactions with the same hash²²2Blocks 91812 and 91842 contain a transaction with hash: “d5d27987d2a3dfc724e359870c6644b40e497bdc0589a033220fe15429d88599”., this attribute cannot be used as a unique identifier. We realized that the Bitcoin Core client still uses the hash attribute to uniquely identify a transaction, causing a loss of information: searching for a particular transaction hash, the Core client returns only the last occurrence of that hash in the ledger. As a result, any transaction stored in the blockchain with a hash equal to a more recent transaction will never be returned by the client. For this reason, we uniquely identify a transaction using its attribute pair <blockhash, hash> which represents the hash of the block they belong to and its hash, respectively. The vin attribute is an array of pairs <txID, txID[vout]>. It represents the set of inputs contained in the transaction. Each input can be uniquely identified by their index within the transaction input array ³³3Since inputs do not have a global unique identifier, and since the reference client implementation does not define input entities, we have chosen not to do so either.. The txID attribute from the pair is the hash attribute of the TX that contains the output that the input is spending and txID[vout] is the index of the spent output within the TX that contains it. Symmetrically, the vout attribute is an array of pairs <txID, txID[vin]> where, if the output is spent by some input in the future, the TX hash, and the index within the transaction where the output is spent, are included. If the TX is unspent, both attributes are set to null. Each output can be uniquely identified through the hash of the transaction they are contained in and their index in the output array (tx.vout).

The above introduced structure leads to Definition 2, which we will use often in the rest of the paper.

Definition 2 (TIO)

A Unique Transaction Input-Output (TIO) is an identifier that can uniquely identify all the inputs and outputs contained in confirmed transactions within the blockchain. We denote the set of all inputs and outputs as $<\text{TIO{}}>$ .

A first contribution of our framework is the possibility to travel to the future in the blockchain. This allows us to easily identify the paths followed by bitcoins through the blockchain history. Definition 3 formalizes some new terminology related to our traveling mechanism.

Definition 3 (Traveling the Blockchain)

Given a TIO, we define the current terms:

(i)

If the TIO corresponds to an Input:

(a) The spending output of TIO refers to the output that this input is using;

(b) The funded outputs of TIO refers to the outputs that this input is providing bitcoins to. By Bitcoin design, we assume that the funded outputs for an input are all the outputs contained in the same transaction than the input.
(ii)

If the TIO corresponds to an Output:

(a) The spending inputs of TIO are all the inputs that funded this output. By Bitcoin design, we assume that all the spending inputs for an output are all the inputs contained in the same transaction than the output.

(b) If the output is spent, the funded input is the input that is spending the output. Note that, this input will appear in a more recent transaction than the one containing the output.

A brief summary of the data structure used in our framework is provided in Table II.

Table II: Summary table of our data structure.

block tx $\ast$ hash $\ast$ hash $\ast$ height $+$ blockhash $+$ tx $\coloneqq\left[+\text{ {hash} }\right]_{<n>(\ast)}$ $+$ vin $\coloneqq\left[\begin{array}[]{c}+\text{ {txID} }\\ +\text{ {txID[vout]} }\end{array}\right]_{<n>(\ast)}$ $+$ vout $\coloneqq\left[\begin{array}[]{c}+\text{ {txID} }\\ +\text{ {txID[vin]} $(\#)$ }\end{array}\right]_{<n>(\ast)}$ Legend: $\ast\coloneqq$ Attribute is a unique identifier for the entity. $+\coloneqq$ Attribute of a given entity. $\#\coloneqq$ New attributes that do not appear in the reference implementation. $\left[\cdots\right]_{<n>(\ast)}\coloneqq$ Array of elements with the attributes specified between brackets. These elements can be uniquely identified within their container by their position in the array (indexed by an integer $n$ ).

Locking Script

In addition to its TIO, we are also interested in the locking script for an output. By locking script, we refer to the script that has to be redeemed in order to spend the output. In the Bitcoin Core reference implementation, it is referred as scriptPubKey.
In the later stages of our methodology, we will use TIOs to build the Unknown TX T-DAGs. Instead, the locking scripts will be used to filter our results by removing T-DAGs generated by transactions with purposes other than the transfer of crypto coins.

IV-B Unknown TX T-DAG Construction

In this section, we lay the blockchain data in a graph using the framework defined in IV-A, introduce the concept of Unknown TX graphs and study the derived T-DAGs. The study of these directed graphs will enable us to describe, tailor and identify unknown transaction patterns on the Blockchain.

Definition 4 (TIO graph)

Let $<\text{TX}>$ be the set of confirmed transactions in the blockchain. Let $G_{\text{TIO{}}}=(V,E)$ be a directed unweighted graph such that:

(i)

$V=\hskip 3.0pt<\text{TIO{}}>$
(ii)

$E=\left\{\bigcup\limits_{t\in<TX>}\left\{E\left(K_{|I_{t}|,|O_{t}|}\right)\bigcup\limits_{\begin{subarray}{c}o\in O_{t}\\ o\not\in\text{UTXO}\end{subarray}}\left(o,\text{gFI}(o)\right)\right\}\right\}$

where gFI returns the funded input of a given output, given a transaction $t$ , $I_{t}$ and $O_{t}$ denote $t$ ’s set of inputs and outputs respectively, and UTXO is the unspent transaction output store.

Lemma 1.

The TIO graph, $G_{\text{TIO{}}}$ , is a directed acyclic graph (DAG).

Proof.

Nodes in the graph represent validated inputs or outputs in the blockchain. This means that, when they were broadcast to the network, each miner checked them. For an input or an output to be validated, they must always point to an event that happened in the past. Each edge then goes from an event that happened further in the past to a more recent one. This timestamp characteristic is sufficient to ensure that there are no cycles. ∎

Definition 5 ( $\alpha$ -nodes)

An $\alpha$ -node is a set of vertices $S$ from $G_{\text{TIO{}}}$ such that, exists a transaction $T$ such that its set of inputs $I_{T}=S$ and

(i)

$S$ is a coinbase⁴⁴4A coinbase transaction is a special transaction in the Bitcoin protocol creating new coins as mining rewards [2]. transaction, or
(ii)

$\exists\,s\in S$ such that $s$ is spending an output with a BTC Address.

Remark 1.

For each transaction $T$ , its set of inputs $I_{T}$ fulfills:

(i)

$I_{T}$ is an $\alpha$ -node, or
(ii)

$\forall s\in I_{T}$ , $s$ is spending an output with a None address.

The introduction of $\alpha$ -nodes and the previous remark identifies a natural contracted graph of $G_{\text{TIO{}}}$ .

Definition 6 (Contracted TIO graph)

The Contracted TIO graph, $G_{\text{TIO{}}}^{*}$ , is the graph resulting of applying the following two transformations to $G_{\text{TIO{}}}$ :

(i)

Identify (contract) all vertices [36] in an $\alpha$ -node. Repeat for each different $\alpha$ -node contained in $G_{\text{TIO{}}}$ .
(ii)

For each transaction T fulfilling the second condition in Remark 1,

(a) for each spending output o of each input in $I_{T}$ , add an edge from o to each output in T.

(b) remove every vertex in $I_{T}$ , as well as its inbound and outbound edges.

Remark 2.

The transformations applied to $G_{\text{TIO{}}}$ do not introduce cycles and, as a consequence, $G_{\text{TIO{}}}^{\ast}$ is also a DAG.

To define the subgraphs in the TIO graph relevant for our research, we still have to introduce some more concepts.

Definition 7 (Termination application)

Let $f$ be a function, $f:\,\,<\text{TIO{}}>\longrightarrow\{0,1\}$ defined as follows:

f(x)=\left\{\begin{array}[]{cl}0&\text{ if $x$'s address is {None}}\\ 1&\text{otherwise}\end{array}\right.

Given a weakly connected single-source DAG, we call all nodes that are not the source nor sinks internal nodes.

Definition 8 (Unknown TX graph)

An Unknown TX graph is a single-source, weakly connected, maximal induced subgraph of $G_{\text{TIO{}}}^{*}$ such that:

(i)

The source $s$ is an $\alpha$ -node.
(ii)

Each sink $t$ fulfills $f(t)=1$ .
(iii)

Each internal node $v$ fulfills $f(v)=0$ .

Unknown TX graphs will be our object of study for the rest of the paper. From their construction, we observe the following points.

Definition 9 (T-DAG)

A T-DAG is a single-source directed unlabeled acyclic weakly connected graph.

Remark 3.

Lemma 1 and Remark 2 prove that an Unknown TX graph is a T-DAG.

From now on, we will refer to Unknown TX graphs as Unknown TX T-DAGs⁵⁵5Unlike trees, a vertex in a T-DAG may have more than one parent..

Remark 4.

If we fix a source $s$ , then there exists only one Unknown TX T-DAG with $s$ as its root. We can then denote as $G(s)$ the Unknown TX DAG generated by a given root $s$ .

Definition 10 (Set of Unknown TX T-DAGs)

We define the set of Unknown TX T-DAGs, $\mathcal{D}$ , as follows:

\begin{split}\mathcal{D}=&\left\{G(s):s\text{ is an }\alpha\text{-node and $G(s)$ has at least two vertices}\right\}\end{split}

Algorithm 1 presents a procedure to generate the Unknown TX T-DAG given an $\alpha$ -node $s$ .

Algorithm 1 Unknown TX T-DAG Generation from its root.

1:procedure T-DAG Generation(

s

)

G\leftarrow\text{Graph}()

S\leftarrow\text{Stack}()

G

.addNode(

s

)

5: for all

tx\_out

in getFundOutput(

s

) do

G

.addNode(

tx\_out

)

G

.addEdge(

s

tx\_out

)

S

.push(

tx\_out

)

9: while !

S

.isEmpty() do

10:

tx\_out\leftarrow S.

pop()

11: if !

f(tx\_out)

then

12:

tx\_in\leftarrow

getFundInput(

tx\_out

)

13: for all

new\_out

in getFundOutput(

tx\_in

) do

14:

G

.addNode(

new\_out

)

15:

G

.addEdge(

tx\_out

new\_out

)

16:

S

.push(

new\_out

)

An example of an Unknown TX DAG is presented in Figure 1. Note that, we attach the associated address for each node.

Refer to caption — Figure 1: Representation of an Unknown TX T-DAG.

T-DAG Compression

Before any further processing of the Unknown TX T-DAGs, we are interested in pruning the inner nodes that were published to the blockchain without a locking mechanism. These are trivial locking scripts that can be solved by anyone at any moment. It is easy to verify that a pruned maximal unknown TX T-DAG is still a T-DAG whose leaves fulfill the termination condition. The pruning process involves removing the trivial output and linking the funded outputs with the spending ones. Figure 2 illustrates the compression process ⁶⁶6Note that, since inputs are initially removed from the compressed graph, we actually link funded outputs with the spending outputs associated to the removed inputs. Nonetheless, in Figure 2, we have included the input for better comprehension..

From now on, when we refer to an Unknown TX DAG we will actually be referring to its compressed version.

IV-C Chain Abstractions and Post-Processing

After compressing an Unknown TX DAG, we need to store it. To do so, we will use a canonical representation of our graph. A canonical labeling is a string derived from a graph such that two different graphs are isomorphic if and only if they yield the same string. A canonical representation (i.e. the canonical labeling associated to a set of isomorphic graphs) is then the representative of a graph isomorphism class. In this section we present a total ordering for T-DAG isomorphism classes and a canonical labeling for T-DAGs.

A total ordering for T-DAG isomorphism classes

Let us first introduce the notation that we will use in this section. Let $T$ be a T-DAG (see Definition 9) with its root denoted by $t$ . Given a T-DAG $T=(V,E)$ , we say that $|T|\coloneqq|V|$ . Given a vertex $v\in V$ , $\Gamma^{+}(v)$ is its outdegree. Lastly, given a T-DAG $T$ the children of the root $t$ are $t_{1},\dots,t_{\Gamma^{+}(t)}$ . The set $(T_{1},\dots,T_{\Gamma^{+}(t)})$ denotes the maximal collection of sub DAGs induced on $T$ having $t_{1},\dots,t_{\Gamma^{+}(t)}$ as roots.

Definition 11 ( $\prec$ - relation)

Given two T-DAGs, $S$ rooted at $s$ and $T$ rooted at $t$ , we say $S\prec T$ if

(i)

$|S|<|T|$ , or
(ii)

$|S|=|T|\wedge\Gamma^{+}(s)<\Gamma^{+}(t)$ , or
(iii)

$|S|=|T|\wedge\Gamma^{+}(s)=\Gamma^{+}(t)=k\,\wedge$ for the first index $i\leq k$ for the ordered sets $(S_{1},\dots,S_{k})$ and $(T_{1},\dots,T_{k})$ (where $S_{1}\preceq\dots\preceq S_{k}$ and $T_{1}\preceq\dots\preceq T_{k}$ ) where $S_{i}$ differs from $T_{i}$ , it holds $S_{i}\prec T_{i}$ .

Definition 12 ( $\equiv$ - equality)

Given two T-DAGs $T$ and $S$ , we say $T\equiv S$ if neither $T\prec S$ nor $T\succ S$ hold.

Let $\cong$ be the isomorphism operator. We derive the following lemma:

Lemma 2.

Given two T-DAGs $T$ and $S$ with $|T|=|S|=n$ , then $T\equiv S\Leftrightarrow T\cong S$ .

Proof.

We prove each implication separately,

$[\Rightarrow]$ Let $S_{n}$ be the symmetric group acting on the vertices of $T$ , $V(T)$ . Given $\sigma\in S_{n}$ , we denote the action of $\sigma$ on $v\in V(T)$ by $\sigma(v)$ . We can naturally extend the definition to sets of vertices, $S\subseteq V,\sigma(S)=\{\sigma(v):v\in S\}$ , and to the T-DAG itself $\sigma(T)\coloneqq(\sigma(V),E^{\prime})$ , where $E^{\prime}=\{(\sigma(u),\sigma(v)):(u,v)\in E(T)\}$ .

Remark 5.

Let $S$ and $T$ be two T-DAGs, $|T|=|S|=n$ . Then,

S\cong T\Leftrightarrow\exists\,\sigma\in S_{n}\text{ such that }\sigma(S)=T

We now prove this direction of the lemma by induction on the size of the T-DAG, $|T|$ .

If $|T|=1$ : then both $S$ and $T$ are T-DAGs formed by a single vertex, hence they are the same graph and therefore isomorphic taking the identity permutation.

If $|T|=n$ : to prove the inductive step we assume that for any pair of T-DAGs with size $<n$ , then $T\equiv S\Rightarrow T\cong S$ . If now $|T|=n$ , $T\equiv S\Rightarrow|T|=|S|=n$ , $\Gamma^{+}(t)=\Gamma^{+}(s)=k$ , and $(T_{1},\dots,T_{k})\equiv(S_{1},\dots,S_{k})$ pairwise, where $S_{1}\preceq\dots\preceq S_{k}$ and $T_{1}\preceq\dots\preceq T_{k}$ . That is, $\forall i\in{1,\dots k}\left\{\begin{array}[]{c}T_{i}\equiv S_{i}\\ |T_{i}|=|S_{i}|<n\end{array}\right.\overset{\tiny{Ind.H}}{\Longrightarrow}T_{i}\cong S_{i}\overset{\tiny{R.~{}\ref{remark:permutation}}}{\Longrightarrow}\exists\sigma_{i}\in S_{|T_{i}|}\text{ such that }\sigma_{i}(T_{i})=S_{i}$ . We now consider the following permutation: $\sigma=\sigma_{1}\circ\dots\circ\sigma_{k}\circ(t\rightarrow s)$ , the composition of all the permutations that match each subtree and the map from a root to the other. $\sigma$ fulfils that $\sigma(T)=S\overset{\tiny{R.~{}\ref{remark:permutation}}}{\Longrightarrow}T\cong S$

$[\Leftarrow]$ We argue again by induction on the size of the T-DAG. The base case is the same as before so we do not repeat it. For the induction step, we have:

If $|T|=n$ : $T\cong S\Rightarrow|T|=|S|=n\wedge\Gamma^{+}(t)=\Gamma^{+}(s)=k$ . Additionally, ordering the subtrees $(T_{1},\dots,T_{k})$ , $(S_{1},\dots,S_{k})$ such that $T_{1}\preceq\dots\preceq T_{k}$ and $S_{1}\preceq\dots\preceq S_{k}$ necessarily $T_{i}\cong S_{i}\forall i\in\{1,\dots,n\}$ with $|T_{i}|=|S_{i}|<n\overset{Ind.H}{\Longrightarrow}T_{i}\equiv S_{i}\Rightarrow(T_{1},\dots,T_{k})\equiv(S_{1},\dots,S_{k})\Rightarrow T\equiv S$ .
∎

Remark 6.

The operators $(\prec,\succ,\equiv)$ induce a total ordering on T-DAGs isomorphism classes.

Proof.

Given $T$ and $S$ two T-DAGs,

(i)

Antisymmetry: $S\preceq T\wedge S\succeq T\Leftrightarrow\neg(S\succ T)\wedge\neg(S\prec T)\Leftrightarrow S\equiv T\Leftrightarrow S\cong T$
(ii)

Transitivity: clearly holds by definition.
(iii)

Connex Property: $S\preceq T\vee S\succeq T\Leftrightarrow\neg(A\succ B)\vee\neg(A\prec B)\Leftrightarrow\neg(A\succ B\wedge A\prec B)\Leftrightarrow\neg 0=1$

∎

Canonical labeling for T-DAGs

We now introduce a canonical labeling for T-DAGs. We provide unique representatives for T-DAG isomorphism classes and their string representation.

Definition 13 ( $\Delta$ -operator)

The indegree operator ( $\Delta$ ) is a total ordering on equivalence classes of the $\equiv$ -relation. Let $T$ be a T-DAG such that $T_{1}\preceq\dots\preceq T_{\Gamma^{+}(t)}$ . That is, it takes a set of representatives of the $\equiv$ -relation and orders it. Formally,

\begin{split}\Delta:&\{T_{1},\dots,T_{\Gamma^{+}(t)}\}/\equiv\longrightarrow\{T_{1},\dots,T_{\Gamma^{+}(t)}\}/\equiv\\ &\{\overline{T_{i}},\dots,\overline{T_{i+k}}\}\longmapsto\Delta(\{\bar{T_{i}},\dots,\overline{T_{i+k}}\})\coloneqq(\{\overline{T_{i}},\dots,\overline{T_{i+k}}\},\leq_{\ast})\end{split}

where $k\in\mathbb{N}$ and $(\{\overline{T_{i}},\dots,\overline{T_{i+k}}\},\leq_{\ast})$ is the totally ordered set according to the following relation:

\begin{split}\overline{T_{i}}\leq_{*}\overline{T_{j}}\Leftrightarrow&\left(\{|\Gamma^{-}(t_{i_{1}})|,\dots,|\Gamma^{-}(t_{i_{\Gamma^{+}(t_{i})}})|\},\leq\right)\leq\\ &\left(\{|\Gamma^{-}(t_{j_{1}})|,\dots,|\Gamma^{-}(t_{j_{\Gamma^{+}(t_{j})}})|\},\leq\right)\end{split}

That is, $\overline{T_{i}}\leq_{\ast}\overline{T_{j}}$ iff the non-decreasing indegree sequence of $t_{i}$ ’s children is pairwise smaller than that of $t_{j}$ .

Naturally, applying the operator to the whole set (i.e. $\Delta(T)$ ) means applying it element-wise in the quotient set, reordering only elements that were considered $\equiv$ -equal.

Definition 14 (T-DAG isomorphism classes representative)

Given a T-DAG, $T$ , we reorder it so that $T_{1}\preceq\dots\preceq T_{\Gamma^{+}(t)}$ . We denote the reordered T-DAG with $T^{\ast}$ . The representative of $T$ ’s isomorphism class $\bar{T}$ is defined as $\bar{T}\coloneqq\Delta(T^{\ast})$ .

Lemma 3.

Given $T$ and $S$ T-DAGs, $T\cong S\Rightarrow\bar{T}=\bar{S}$ . Thus $\bar{T}$ is well defined.

Proof.

Given a vertex $v\in V(T)$ , the height of $v$ is the number of edges of the longest path between $v$ and one of its leafs. Let $V(T)_{h}\subset V(T),$ be the set of vertices with height equal to $h$ . We prove that, given an $h$ , the set of maximal induced T-DAGs rooted at $V(T)_{h}$ and $V(S)_{h}$ , reordered with $\preceq$ and then with $\Delta$ , are the same. In particular, when $h$ equals the height of the T-DAG (i.e. the height of its root), this yields $\bar{T}=\bar{S}$ . We proceed by induction on the height $h$ .

If the height is 0: $T\cong S\Rightarrow|\{v\in V(T):\Gamma^{+}(v)=0\}|=|\{v\in V(S):\Gamma^{+}(v)=0\}|=k$ . In fact both T-DAGs have the same number of leafs and therefore $\{t_{1},\dots,t_{k}\}=\{s_{1},\dots,s_{k}\}$ .

If the height is $h$ : we assume that, for heights $\leq h$ , the set of T-DAGs reordered with $\preceq$ and then with the $\Delta$ operator are the same. $T\cong S\Rightarrow|V(T)_{h+1}|=|\{v\in V(T):\text{height}(v)=h+1\}|=|\{v\in V(S):\text{height}(v)=h+1\}|=|V(S)_{h+1}|=k$ . We now order $V(T)_{h+1}$ and $V(S)_{h+1}$ in non-decreasing outdegree order and we apply the $\Delta$ operator to $\{T_{1},\dots,T_{k}\}$ and $\{S_{1},\dots,S_{k}\}$ , the T-DAGs with roots in $V(T)_{h+1}$ and $V(S)_{h+1}$ . We will refer to the before-mentioned roots as $\{t_{1},\dots,t_{k}\}$ and $\{s_{1},\dots,s_{k}\}$ , and to the $j-$ th sibling of the $i-$ th root as $t_{i}^{j}$ , where $j\in\{1,\dots,\Gamma^{+}(t_{i})\}$ (with $T_{i}^{j}$ being the T-DAGs rooted at this nodes). $\{T_{1},\dots,T_{k}\}$ and $\{S_{1},\dots,S_{k}\}$ ordered in this manner satisfy $T_{1}\preceq\dots\preceq T_{k}$ and $S_{1}\preceq\dots\preceq S_{k}$ . Furthermore, for each T-DAG $T_{i}$ , $i\in\{1,\dots,k\}$ , with root $t_{i}$ at layer with height $h$ , we have that

\begin{split}\left.\begin{array}[]{l}\Gamma^{+}(t_{i})=\Gamma^{+}(s_{i})\\ \forall j\in\{1,\dots,\Gamma^{+}(t_{i})\}\left\{\begin{array}[]{c}\Gamma^{-}(t_{i}^{j})=\Gamma^{-}(s_{i}^{j})\\ \text{Ind. H}\Rightarrow T_{i}^{j}=S_{i}^{j}\end{array}\right.\end{array}\right\}\Rightarrow T_{i}=S_{i}\end{split}

All vertices with height $h+1$ have as children the roots of T-DAGs with heights $\leq h$ .

If now we make $h+1$ equal $T$ and $S$ ’s height, we have $V(T)_{h+1}=t$ and $V(S)_{h+1}=s$ . Therefore, the set of maximal induced T-DAGS are $\{T\}$ and $\{S\}$ respectively. We have proven that, reordering with $\preceq$ and $\Delta$ , both sets are equal. Hence, $\bar{T}=\bar{S}$ . ∎

Remark 7.

It holds $T\cong\bar{T}$ .

Proof.

From Definition 14 we observe that, in order to obtain $\bar{T}$ from $T$ , we reorder the subtrees non-decreasingly and we apply the $\Delta$ operator. Let then $\sigma$ be a permutation such that $\sigma(T)$ generates $T_{1}\preceq\dots\preceq T_{\Gamma^{+}(t)}$ . We then consider $\mu$ as the permutation resulting of doing $\sigma$ and $\Delta$ one after the other in this order. From Remark 5 it follows that $\mu(T)=\bar{T}\overset{R.~{}\ref{remark:permutation}}{\Longrightarrow}T\cong\bar{T}$ . ∎

Definition 15 (T-DAG labeling)

Given a T-DAG $T$ we identify its vertices traversing $T$ breadth-first with a FIFO queue and, starting from the root, for each new vertex (not identified that we dequeue) we assign it the current vertex count value, increment the count by one and queue its set of children. The labeling lbl $(T)$ , associated to $T$ , is the string result from traversing the identified $T$ breadth-first with a FIFO queue and, starting from the root, for each vertex (not processed that we dequeue) we append each of its children identifier to the labeling and queue each of its children. We separate children of the same parent with a comma ’,’, sets of siblings with a colon ’:’, and we denote the end of the label with a semi-colon ’;’ ⁷⁷7We are aware that these separators depend heavily on the labeling implementation.. We will refer to the identifier (the label) of a vertex $v$ obtained through this procedure as id $(v)$ .

Definition 16 (T-DAG canonical labeling)

Given a T-DAG $T$ , the canonical labeling of $T$ , c $(T)$ is the labeling of its isomorphism class representative, $\bar{T}$ . That is, c $(T)\coloneqq\text{lbl}(\bar{T})$ .

In a nutshell, the canonical labeling is obtained with a total ordering for T-DAGs, an indegree-based operator and an additional labeling Sufficiency is proven in Lemma 7.

Before proving the sufficiency of the three operations, we introduce additional concepts regarding labelings.

Definition 17 (Labeling clause)

Given a labeling lbl $(T)$ of a T-DAG $T$ , a clause in the labeling is a set of identifiers contained within either two colons, a colon and a semi-colon, or the beginning of the labeling and a colon.

Note that, given a labeling lbl $(T)$ , we can index the clauses by order of appearance starting from $0$ . Thus, we can think of lbl $(T)$ as an ordered array of clauses: lbl $(T)=[c_{0},\dots,c_{n}]$ , where $n=|V(T)|$ . Thus, an upper-bound to get the length of each clause, by preprocessing the label, is $\mathcal{O}(m)$ , where $m=|E(T)|$ .

Given a T-DAG labeling lbl $(T)$ , we can obtain an array out_deg that in the $i$ -th position contains the outdegree of $v\in V(T)$ such that id $(v)=i$ . Algorithm 2 presents a pseudo-code to do so. The proofs of correctness (Lemma 4) and complexity (Lemma 5) are omitted for space reasons.

Algorithm 2 Outdegree sequence from a label.

1:procedure Outdegree Parsing(lbl

(T)

)

m\leftarrow\max(\text{lbl}(T))

out\_deg\leftarrow\text{zeros-array}(m+1)

processed\leftarrow\text{zeros-array}(m+1)

Q\leftarrow\text{FIFO\_Queue}()

n\_id\leftarrow 0

7: for

i:1

\#

clauses in lbl(

T

) do

8: while !

Q.\text{empty}()

n\_id\leftarrow Q.\text{dequeue}()

10: if !

processed[n\_id]

then

11: break

12: if !

processed[n\_id]

then

13:

processed[n\_id]\leftarrow\text{True}

14:

out\_deg[n\_id]\leftarrow\text{len}(c_{i})

15: for all

id

c_{i}

16:

Q

.queue(

id

)

17: return

out\_deg

Lemma 4.

Algorithm 2 is correct.

Lemma 5.

The Algorithm described in Algorithm 2 is linear in the number of edges of the labeled T-DAG $T$ . Thus, if $m=|E(T)|$ , the complexity is $\mathcal{O}(m)$ .

Lemma 6.

Let $S$ and $T$ be T-DAGs with $|S|=|T|$ and $\Gamma^{+}(s)=\Gamma^{+}(t)=k$ . Let $\bar{S}$ and $\bar{T}$ be their isomorphism class representatives.

\bar{S}\ncong\bar{T}\Rightarrow\exists\,u\in V(\bar{S}),v\in V(\bar{T})\text{ s.t. }

\text{id}(u)=\text{id}(v)\text{ but }\Gamma^{+}(u)\neq\Gamma^{+}(v)

where id is the identifier assigned to each vertex when labeled.

Proof.

W.l.o.g. we assume $\bar{S}\prec\bar{T}$ . This implies that either $(i)$ , $(ii)$ , or $(iii)$ from Definition 11 must hold. Let us define the following set of indices, $D\coloneqq\{i\in\{1,\dots,k\}:\bar{S_{i}}\ncong\bar{T_{i}}\}$ . From the hypothesis we can ensure that $D\neq\emptyset$ . For each index $i\in D$ , we take $u_{i}$ the first vertex traversing $\bar{S_{i}}$

breadth-first with a FIFO queue such that condition $(ii)$ fails. Note that, from Definition 12 and Lemma 2, $\bar{S_{i}}\ncong\bar{T_{i}}$ implies that at some point of the recursion $(i)$ or $(ii)$ will fail. And if $(i)$ fails, necessarily does $(ii)$ . Let $U$ be, $U\coloneqq\{u_{i}:i\in D\}$ , and $u^{\ast}$ ,

u^{\ast}\coloneqq\underset{u\in U}{\text{argmin}}\{id(u)\}

Let $v^{\ast}$ be the corresponding vertex in $V(\bar{T})$ that made condition $(ii)$ fail for $u^{\ast}$ . For all vertices with a smaller id, the outdegree is the same. Additionally, since $\bar{S}$ and $\bar{T}$ are both ordered non-decreasingly, when traversed breadth-first with a FIFO queue, we can affirm that $u^{\ast}$ and $v^{\ast}$ will be given the same identifier but they have different outdegrees. ∎

Lemma 7.

Given two T-DAGs $T$ and $S$ , it holds $T\cong S\Leftrightarrow c(T)=c(S)$ . Thus, the canonical labeling presented in Definition 16 is well defined.

Proof.

We prove each implication separately,

$[\Rightarrow]$ This implication is an immediate consequence of Lemma 3. $T\cong S\Rightarrow\bar{T}=\bar{S}\Rightarrow c(T)=c(S)$ , since the process described in Definition 16 is deterministic.

$[\Leftarrow]$ We will prove this implication by contrapositive. We will see that $T\ncong S\Rightarrow c(T)\neq c(S)$ . Consider two T-DAGs, $T$ and $S$ , such that $T\ncong S$ .

Since $T\ncong S$ then it must be $\bar{T}\ncong\bar{S}$ what implies $\bar{T}\not\equiv\bar{S}$ . Therefore, either $\bar{T}\prec\bar{S}$ or $\bar{S}\prec\bar{T}$ must hold. Without loss of generality we assume $\bar{S}\prec\bar{T}$ . Lemma 4 proofs that, given a T-DAG labeling, lbl $(T)$ , we can obtain $|T|$ and $T$ ’s outdegree sequence. As a consequence, lbl $(\bar{S})=$ lbl $(\bar{T})\Rightarrow|\bar{S}|=|\bar{T}|$ , and two vertices with the same id have the same outdegree. Furthermore, from Definition 11 we recall:

\bar{S}\prec\bar{T}\Rightarrow\left\{\begin{array}[]{l}(i)|\bar{S}|<|\bar{T}|\text{ , or}\\ (ii)|\bar{S}|=|\bar{T}|\wedge\Gamma^{+}(\bar{s})<\Gamma^{+}(\bar{t})\text{ , or}\\ (iii)|\bar{S}|=|\bar{T}|\wedge\Gamma^{+}(\bar{s})=\Gamma^{+}(\bar{t})=k\wedge\\ \hskip 19.0pt\left(\bar{S_{1}},\dots,\bar{S_{k}}\right)\prec\left(\bar{T_{1}},\dots,\bar{T_{k}}\right)\end{array}\right.

$(i)\Rightarrow|\bar{S}|\neq|\bar{T}|\overset{L.~{}\ref{lemma:outdegree-sequence}}{\Longrightarrow}\text{ lbl}(\bar{S})\neq\text{ lbl}(\bar{T})\Rightarrow\text{ c}(S)\neq\text{ c}(T)$

$(ii)\Rightarrow\Gamma^{+}(\bar{s})\neq\Gamma^{+}(\bar{t})\Rightarrow$ clause $\#1$ in each label will have a different size $\Rightarrow\text{ c}(S)\neq\text{ c}(T)$

If condition $(i)$ or $(ii)$ happen, we are done with the proof. Otherwise we can assume $|\bar{S}|=|\bar{T}|$ and $\Gamma^{+}(\bar{s})=\Gamma^{+}(\bar{t})=k$ .

From Lemma 4 we state the following:

\text{lbl}(\bar{S})=\text{ lbl}(\bar{T})\Rightarrow\forall v\in V(\bar{S}),u\in V(\bar{T});

\text{ id}(v)=\text{ id}(u)\Rightarrow\Gamma^{+}(v)=\Gamma^{+}(u)

That is, if two vertices have the same identifier (and the T-DAGs they belong to have the same label), they must have the same outdegree.

Equivalently, if this does not hold, then two T-DAGs cannot have the same labeling. Formally,

\exists v\in V(\bar{S}),\exists u\in V(\bar{T})\text{ s.t. }\text{ id}(v)=\text{ id}(u)\text{ but }

\Gamma^{+}(v)\neq\Gamma^{+}(u)\Rightarrow\text{lbl}(\bar{S})\neq\text{lbl}(\bar{T})

Then, if we assume $(iii)$ holds,

$(iii)\Rightarrow|\bar{S}|=|\bar{T}|\wedge\Gamma^{+}(\bar{s})=\Gamma^{+}(\bar{t})=k$ we also have $\bar{S}\ncong\bar{T}\overset{L.~{}\ref{lemma:exist-diferent-outdegree},L.~{}\ref{lemma:outdegree-sequence}}{\Longrightarrow}\text{ lbl}(\bar{T})\neq\text{ lbl}(\bar{S})\Leftrightarrow\text{ c}(S)\neq\text{ c}(T)$

And we have proven that $T\ncong S\Rightarrow c(T)\neq c(S)$ . Thus, $c(T)=c(S)\Rightarrow T\cong S$ . ∎

To obtain the canonical labeling from any given T-DAG, we first obtain its isomorphism class representative and then the labeling induced from it. Figure 3 presents a bottom up strategy to obtain the class representative and the associated canonical labeling (it should be read from left to right, top to bottom). In the first four graphs, we perform a bottom up approach to reorder the T-DAG. Bold vertices are those already sorted and the dashed boxes mark which sets of siblings are to be sorted next. Finally, given the class representative ( $\bar{T}$ ), we provide the canonical labeling.

Using the constructions and the efficient labeling derived in this section, we can state our main theoretical result.

Theorem 1

The set of unknown inputs/outputs induces a family of subgraphs (patterns), within the Bitcoin User Network, which can be efficiently labeled and tested for isomorphism.

To the best of our knowledge, this is the first result that systematically addresses unknown transactions, which are often neglected in the current technical literature. In the next section, we will use our isomorphism algorithm to cluster TX T-DAGs that interact with the same patterns. Note that, while presented only for unknown transactions, our approach immediately extends to any transaction system where patterns can be modeled by T-DAGs.

V Experimental Results

In this section, we describe the methodology adopted to implement and test our model on the Bitcoin ledger and discuss the achieved results. Our approach consists of multiple steps. First, we parsed the Bitcoin blockchain according to the framework introduced in Section IV-A. Blocks and transactions information were retrieved by querying directly a Bitcoin full-node, importing data into a MySQL database. We then retrieved all the unknown transactions with a valid locking script from our database, and we applied our methodology to build TX T-DAGs and study their patterns. Finally, we clustered all the TX T-DAGs found in the previous step according to their isomorphism classes to group all similar patters that could have been created by the same entity. The above steps are described in more details in the following subsections.

V-A Database

Complete and reliable Bitcoin blockchain data are essential to correctly build the TX T-DAGs. The official software release used by the Bitcoin protocol, i.e. Bitcoincore, is not suitable for this purpose as it is not designed for the analysis of the data contained in the blockchain. To the best of our knowledge, all freely available blockchain explorer tools suffer from the problems described in Section IV-A. That is, they do not properly handle custom transactions. For this reason, we have designed a MySQL database to store transaction data retrieved by parsing the Bitcoin ledger using our model discussed in Section IV-A. We imported all the Bitcoin blockchain data starting from the genesis block 0 (mined on 2009-01-03), up to block 591,872 (mined on 2019-08-26). Our database consists of more than 448 million Bitcoin transactions, over 1.1 billion transaction inputs, over 1.2 billion transaction outputs and around 550 million Bitcoin addresses. We hosted our MySQL database on a Dell Poweredge R740 Server, equipped with 2 CPU Intel® Xeon® Gold 6144 3.5G, RAM 512GB, running Ubuntu Server 18.04.4 LTS. To parse the Bitcoin ledger, we used Bitcoin Core Daemon v.0.18.0.0, running on a Dell XPS laptop equipped with an Intel® Xeon® CPU E3-1505M 2.80GHz, RAM 32GB - OS: Ubuntu 18.04.4 LTS.

Number of

isomorphic

TX T-DAG

Height

Cardinality

Number

of edges

Number

of roots

29218

607

40002

40000

20000

…

2260

6878

7176

514

2073

4144

383

1568

3096

381

1059

1058

Table III: 10 most common isomorphism classes and some other more complex patterns.

V-B Unknown TX T-DAG Construction

After the parsing phase, our database contains all the transactions included in the Bitcoin ledger, both unknown and standard. At this point, we started analyzing the unknown ones, identifying around 22 million $\alpha$ -nodes that can generate an Unknown TX T-DAG (DAGs where the inner nodes are only unknown transaction outputs). We then built a forest by iterating Algorithm 1 for each $\alpha$ -node. Each weakly connected component of this forest represents an Unknown TX T-DAG originating from one or more alpha nodes. We used the library networkx 2.3 (together with the Python 3.5 interpreter) to create and manage the forest containing all the Unknown TX T-DAGs.

V-C Pruning Phase

By construction, Unknown TX T-DAGs represent transaction patterns in the blockchain network generated by unknown transactions. The root denotes the set of (standard) inputs that generated the pattern. Each inner node represents an unknown transaction output, i.e. an output with a Null value, that has been spent. Finally, each leaf represents either a known transaction output (with a valid Bitcoin address attached to it) or an unspent transaction output (either with a valid or a null address). Therefore, an unknown TX T-DAG of height 1 is trivial: in fact, such graph represents a single transaction with an unknown output that has not been spent. An output of this type could be an invalid/unspendable output or simply a custom but valid output that has not been spent in the considered blockchain portion. In the first case, since its locking script is malformed, this output is impossible to redeem. Therefore, the corresponding Unknown TX T-DAG can never grow further. In the second case, however, if the unknown output is spent in the future, our methodology will capture this event during the update of our database, and the associated Unknown TX T-DAG structure will be updated and considered for further analysis. Following this observation, we pruned the forest by dropping the weakly connected components of height 1 as they do not represent a relevant pattern for either $\mathcal{T}$ or $\mathcal{U}$ .
Finally, we obtained a forest with 803,782 nodes and 797,432 edges, having 30,333 weakly connected components left, i.e., our Unknown TX T-DAGs. These T-DAGs can be easily integrated into the transaction network $\mathcal{T}$ and the user network $\mathcal{U}$ . In this way, bot the blockchain analysis tools and the proposals in the literature which use these data structures will rely on complete information, never considered before.

V-D Isomorphism Detection

In the last step of our methodology, we clustered the 30,333 Unknown TX T-DAGs, obtained in the previous phase, according to their isomorphism classes. For each T-DAG, we built its isomorphism class representative (definition 14) by using the total ordering introduced in Section IV-C Before performing the clustering procedure, we augmented every TX T-DAGs with more than one root, such as the one depicted in Figure 4.2. In particular, for each graph of the cited type, we created a new root connected to each of the old ones. Consequently, each graph $g$ with multiple roots is converted into a new graph $g^{*}$ having a single root and the same nodes of $g$ plus one (the new root). Once all the Unknown TX T-DAGs were standardized to have a single root, we clustered them according to their isomorphism class representative. We identified 273 different isomorphism classes. Figure 4 shows the 10 most common classes, while Table III reports the height, cardinality, number of edges, and number of roots for the same classes and some other ones characterized by a more complex pattern.

V-E Discussion

Our approach is, to the best of our knowledge, the only one capable of identifying transaction inputs and outputs without relying on Bitcoin addresses, providing a framework to correctly handle unknown transactions. This feature allowed us to detect unknown transaction patterns not captured by state of the art blockchain analysis techniques. The 30,000+ Unknown TX T-DAGs discovered, can be easily integrated into the Bitcoin User Network, finally providing a complete transaction history, taking into account also unknown transactions. It is worth highlighting that even the simplest of the identified structures, such as the one in Figure 4.1, that appears more than 29,000 times in the Bitcoin ledger, is sufficient to make current parsers not to consider potentially relevant data. Indeed, although the transaction contains the root (the payer address) and the leaf (the recipient address), the User Network built without considering unknown transactions will not contain the middle node (unknown output), hence breaking the connection between the two users. Instead, integrating the User Network with our T-DAGs, automatically increases the accuracy of any Bitcoin clustering heuristic used so far, as well as any deanonymization technique, by simply considering never-used before data. In addition, our isomorphism classes could lead to new clustering techniques: non-trivial patterns, i.e. transaction schemes not originated from a normal user behavior, can be used to cluster services/entities that exhibit the same patterns. Other than the 10 most common isomorphism classes shown in Figure 4, we found several other patterns that deserve particular attention from a semantic point of view. As an example, we discovered an Unknown TX T-DAG of height 2260, having 6878 nodes, and 7176 edges. This complex pattern has 10 roots, i.e., is generated by 10 different standard transactions and therefore potentially by up to 10 different entities. Started on 2014-02-19, and concluded on 2014-12-08, such a pattern moved a total of 247.36 Bitcoins that, according to the historical Bitcoin prices, were worth about 92,500 US Dollars at the time of the last transaction (December 2014)—a complete analysis of the just introduced graph, and other interesting ones, will be provided in future work.
As a further step to refine our results, we have removed T-DAGs generated by trivial transactions, e.g., transactions redeemable by anyone, and T-DAGs generated by non-standard transactions with a known semantic, e.g., transactions generated by software bugs, network tests, and crypto challenges, just to name a few. The goal is to focus only on interesting T-DAGs, i.e., those that may have been created to hide malicious behavior, discarding those generated by known and legitimate activities. In order to perform this filtering, we checked the locking script at the root of each T-DAGs. In particular, we have created regular expressions for each non-standard locking script with trivial semantics observed in the literature. Then, we used these regular expressions to filter our T-DAGs.
After this step, we discarded the graphs listed below as generated by non-standard locking scripts observed in [8]:

•

2 T-DAGs generated by P2PKH NOP transactions, that have been probably used to test the OP_NOP operator;
•

2 T-DAGs generated by OP_MIN OP_EQUAL transactions, that anyone can easily unlock without any private key;
•

5 T-DAGs generated by Pay to Hash transactions, considered as “contest” in the network to find the correct value of the hash in the transactions;
•

1 T-DAGs generated by OP_IF transactions, that could be used to make a P2SH that can be unlocked by revealing only the redeem script;

Moreover, we discarded also the following graphs generated by trivial locking scripts never observed before:

•

2 T-DAGs generated with the OP_CHECKMULTISIG operator, that anyone can easily unlock without any private key;

The remaining 30,221 T-DAGs could be generated either by an unknown transaction never observed before or by a non-standard transaction known in the literature but with non-trivial semantics. Such graphs could be linked to malicious behavior and deserve a careful investigation in future work.

VI Conclusion and Future Work

We have shown that the current assumption that each transaction output has an address attached to it, is false. We have identified transactions that violate the cited assumption, labelling them unknown transactions. These unknown transactions imply, among other, that current clustering techniques are incomplete. Starting from the above observation, we proposed a theoretical model rooted on sound graph theory to detect, study, and classify patterns of unknown transactions. Exploring the Bitcoin network via our new tool we unveiled non-trivial classes of transaction patterns never considered before. We were able to identify over 30,000 unknown TX T-DAGs. Each of them represents a money flow that is invisible to standard parsing techniques. Some of these patterns show a high level of sophistication, with a complex topology potentially associated with automated payment services. Our novel approach to the Bitcoin graph opens up a brand new vein of research. For instance, the semantic associated to the discovered patterns is still to be explored. Furthermore, by extending the theoretical model to every transaction (using for example flow theory to remove cycles in the user network or, in general, focusing on transaction patterns that are represented by T-DAGs), it could be possible to study and cluster other types of non-standard behaviors within the Bitcoin environment. Moreover, our solution could be adapted to detect anomalous behaviour in the flourishing field of blockchain-based applications. In conclusion, we believe that the contribution provided in this work, from both a theoretical and practical point of view, other than being interesting on their own—providing for the first time a complete view of the bitcoin blockchain—, also pave the way for further research and applications in the Bitcoin domain and its spin-off technologies.

Acknowledgment

This publication was partially supported by the award NPRP 11S-0109-180242 from the Qatar National Research Fund (QNRF), a member of The Qatar Foundation. The information and views set out in this publication are those of the authors and do not necessarily reflect the official opinion of the QNRF.

References

[1] E. Androulaki, G. O. Karame, M. Roeschlin, T. Scherer, and S. Capkun, “Evaluating user privacy in bitcoin,” in Financial Cryptography and Data Security. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 34–51.
[2] A. Antonopoulos, Mastering Bitcoin, 2nd ed. 5 St George’s Yard Farnham, Surrey: O’Reily, 2017.
[3] L. Babai, “Graph isomorphism in quasipolynomial time [extended abstract],” in Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, ser. STOC ’16. New York, NY, USA: ACM, 2016, pp. 684–697. [Online]. Available: http://doi.acm.org.recursos.biblioteca.upc.edu/10.1145/2897518.2897542
[4] L. Babai, D. Y. Grigoryev, and D. M. Mount, “Isomorphism of graphs with bounded eigenvalue multiplicity,” in Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing, ser. STOC ’82. New York, NY, USA: ACM, 1982, pp. 310–324. [Online]. Available: http://doi.acm.org/10.1145/800070.802206
[5] L. Babai, W. Kantor, and E. Luks, “Computational complexity and the classification of finite simple groups,” in 24th Annual Symposium on Foundations of Computer Science (sfcs 1983). Tucson, AZ: IEEE, 1983, pp. 162–171.
[6] L. Babai and E. M. Luks, “Canonical labeling of graphs,” in Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, ser. STOC ’83. New York, NY, USA: ACM, 1983, pp. 171–183. [Online]. Available: http://doi.acm.org.recursos.biblioteca.upc.edu/10.1145/800061.808746
[7] M. Bartoletti and L. Pompianu, “An analysis of bitcoin op_return metadata,” in International Conference on Financial Cryptography and Data Security, Springer. Cham: Springer International Publishing, 2017, pp. 218–230.
[8] S. Bistarelli, I. Mercanti, and F. Santini, “An analysis of non-standard transactions,” Frontiers in Blockchain, vol. 2, p. 7, 2019. [Online]. Available: https://www.frontiersin.org/article/10.3389/fbloc.2019.00007
[9] Bitcoin Project 2009-2020, “Bitcoin development,” https://bitcoin.org/en/development, accessed: March 2021.
[10] ——, “What is a full node?” https://bitcoin.org/en/full-node, accessed: March 2021.
[11] S. R. Buss, “Alogtime algorithms for tree isomorphism, comparison, and canonization,” in Computational Logic and Proof Theory. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp. 18–33.
[12] G. G. Dagher, B. Bünz, J. Bonneau, J. Clark, and D. Boneh, “Provisions: Privacy-preserving proofs of solvency for bitcoin exchanges,” in Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’15. New York, NY, USA: ACM, 2015, pp. 720–731. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813674
[13] P. T. Darga, M. Liffiton, K. Sakallah, and I. Markov, “Exploiting structure in symmetry detection for cnf,” in Proceedings of the 41st Annual Design Automation Conference, ser. DAC ’04. New York, NY, USA: ACM, 2004, pp. 530–534. [Online]. Available: http://doi.acm.org.recursos.biblioteca.upc.edu/10.1145/996566.996712
[14] D. Ermilov, M. Panov, and Y. Yanovich, “Automatic bitcoin address clustering,” in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). Cancun, ROO ,77500, Mexico: IEEE, Dec 2017, pp. 461–466.
[15] A. Gervais, G. O. Karame, K. Wüst, V. Glykantzis, H. Ritzdorf, and S. Capkun, “On the security and performance of proof of work blockchains,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’16. New York, NY, USA: ACM, 2016, pp. 3–16. [Online]. Available: http://doi.acm.org/10.1145/2976749.2978341
[16] A. Gervais, H. Ritzdorf, G. O. Karame, and S. Capkun, “Tampering with the delivery of blocks and transactions in bitcoin,” in Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’15. New York, NY, USA: ACM, 2015, pp. 692–705. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813655
[17] M. Harrigan and C. Fretter, “The unreasonable effectiveness of address clustering,” in 2016 Intl IEEE Conferences on Ubiquitous Intelligence Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld). Toulouse, France: IEEE, July 2016, pp. 368–373.
[18] J. Hopcroft, A. Aho, and J. Ullman, The Design and Analysis of Computer Algorithms. Reading, Massachusetts: Addison-Wesley, 1974.
[19] H. A. Jawaheri, M. A. Sabah, Y. Boshmaf, and A. Erbad, “Deanonymizing tor hidden service users through bitcoin transactions analysis,” Computers & Security, vol. 89, p. 101684, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404818309908
[20] T. Junttila and P. Kaski, “Engineering an efficient canonical labeling tool for large and sparse graphs,” in 2007 Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX). Astor Crowne Plaza, New Orleans, Louisiana: SIAM, 2007, pp. 135–149.
[21] H. Kalodner, S. Goldfeder, A. Chator, M. Möser, and A. Narayanan, “BlockSci: Design and applications of a blockchain analysis platform,” ArXiv e-prints, vol. abs/1709.02489, Sep. 2017.
[22] L. Kiffer, R. Rajaraman, and a. shelat, “A better method to analyze blockchain consistency,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’18. New York, NY, USA: ACM, 2018, pp. 729–744. [Online]. Available: http://doi.acm.org/10.1145/3243734.3243814
[23] D. Kondor, M. Pósfai, I. Csabai, and G. Vattay, “Do the rich get richer? an empirical analysis of the bitcoin transaction network,” PloS one, vol. 9, no. 2, p. e86197, 2014.
[24] Y. Kwon, D. Kim, Y. Son, E. Vasserman, and Y. Kim, “Be selfish and avoid dilemmas: Fork after withholding (faw) attacks on bitcoin,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’17. New York, NY, USA: ACM, 2017, pp. 195–209. [Online]. Available: http://doi.acm.org/10.1145/3133956.3134019
[25] M. Lischke and B. Fabian, “Analyzing the bitcoin network: The first four years,” Future Internet, vol. 8, no. 1, 2016. [Online]. Available: http://www.mdpi.com/1999-5903/8/1/7
[26] E. Luks, “Isomorphism of graphs of bounded valence can be tested in polynomial time,” in 21st Annual Symposium on Foundations of Computer Science (sfcs 1980). Syracuse, NY: IEEE, Oct 1980, pp. 42–49.
[27] D. Maesa, A. Marino, and L. Ricci, “Uncovering the bitcoin blockchain: An analysis of the full users graph,” in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). Alberta, Canada: IEEE, Oct 2016, pp. 537–546.
[28] D. D. F. Maesa, A. Marino, and L. Ricci, “An analysis of the bitcoin users graph: inferring unusual behaviours,” in Studies in Computational Intelligence. Springer International Publishing, Nov. 2016, pp. 749–760. [Online]. Available: https://doi.org/10.1007/978-3-319-50901-3_59
[29] B. McKay and A. Piperno, “Practical graph isomorphism, II,” CoRR, vol. abs/1301.1493, 2013. [Online]. Available: http://arxiv.org/abs/1301.1493
[30] B. D. McKay, “Computing automorphisms and canonical labellings of graphs,” in Combinatorial Mathematics, D. A. Holton and J. Seberry, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1978, pp. 223–232.
[31] S. Meiklejohn, M. Pomarole, G. Jordan, K. Levchenko, D. McCoy, G. Voelker, and S. Savage, “A fistful of bitcoins: Characterizing payments among men with no names,” in Proceedings of the 2013 Conference on Internet Measurement Conference, ser. IMC ’13. New York, NY, USA: ACM, 2013, pp. 127–140. [Online]. Available: http://doi.acm.org/10.1145/2504730.2504747
[32] A. P. Motamed and B. Bahrak, “Quantitative analysis of cryptocurrencies transaction graph,” Applied Network Science, vol. 4, no. 1, Dec. 2019. [Online]. Available: https://doi.org/10.1007/s41109-019-0249-6
[33] M. Muzychuk, “A solution of the isomorphism problem for circulant graphs,” Proceedings of the London Mathematical Society, vol. 88, no. 1, pp. 1–41, 2004.
[34] J. D. Nick, “Data-Driven De-Anonymization in Bitcoin,” Master’s thesis, ETH Zurich, 2015.
[35] M. Ober, S. Katzenbeisser, and K. Hamacher, “Structure and anonymity of the bitcoin transaction graph,” Future Internet, vol. 5, no. 2, pp. 237–250, 2013. [Online]. Available: http://www.mdpi.com/1999-5903/5/2/237
[36] S. Pemmaraju and S. Skiena, Computational Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Cambridge: Cambridge University Press, 2003.
[37] F. Reid and M. Harrigan, An Analysis of Anonymity in the Bitcoin System. New York, NY: Springer New York, 2013, pp. 197–223. [Online]. Available: https://doi.org/10.1007/978-1-4614-4139-7_10
[38] C. SecTech, “Notpetya attack,” https://cyber-sectech.fandom.com/wiki/NotPetya_Attack, 2017, accessed: March 2021.
[39] M. Spagnuolo, F. Maggi, and S. Zanero, “Bitiodine: Extracting intelligence from the bitcoin network,” in International Conference on Financial Cryptography and Data Security. Springer, 2014, pp. 457–468.
[40] Wikipedia, “Wannacry ransomware attack,” https://en.wikipedia.org/wiki/WannaCry_ransomware_attack, 2019, accessed: March 2021.
[41] C. Yun, Y. Yirong, and M. Richard, “Canonical forms for labelled trees and their applications in frequent subtree mining,” Knowledge and Information Systems, vol. 8, no. 2, pp. 203–234, Aug 2005. [Online]. Available: https://doi.org/10.1007/s10115-004-0180-7

A Novel Framework for the Analysis of Unknown Transactions in Bitcoin: Theory, Model, and Experimental Results

Abstract

I Introduction

I-A Contribution

II Background

II-A Preliminaries

II-B Bitcoin Transaction Network and User Network

III Related Work

III-A Bitcoin Addresses Clustering

III-B Graph Isomorphism

IV Unknown Transactions Recognition

IV-A Unknown Transactions and Working Framework

Definition 1 (Unknown Transaction)

Definition 2 (TIO)

Definition 3 (Traveling the Blockchain)

Locking Script

IV-B Unknown TX T-DAG Construction

Definition 4 (TIO graph)

Lemma 1.

Proof.

Definition 5 (α\alpha-nodes)

Remark 1.

Definition 6 (Contracted TIO graph)

Remark 2.

Definition 7 (Termination application)

Definition 8 (Unknown TX graph)

Definition 9 (T-DAG)

Remark 3.

Remark 4.

Definition 10 (Set of Unknown TX T-DAGs)

T-DAG Compression

IV-C Chain Abstractions and Post-Processing

A total ordering for T-DAG isomorphism classes

Definition 11 (≺\prec - relation)

Definition 12 (≡\equiv - equality)

Lemma 2.

Proof.

Remark 5.

Remark 6.

Proof.

Canonical labeling for T-DAGs

Definition 13 (Δ\Delta-operator)

Definition 14 (T-DAG isomorphism classes representative)

Lemma 3.

Proof.

Remark 7.

Proof.

Definition 15 (T-DAG labeling)

Definition 16 (T-DAG canonical labeling)

Definition 17 (Labeling clause)

Lemma 4.

Lemma 5.

Lemma 6.

Proof.

Lemma 7.

Proof.

Theorem 1

V Experimental Results

V-A Database

V-B Unknown TX T-DAG Construction

V-C Pruning Phase

V-D Isomorphism Detection

V-E Discussion

VI Conclusion and Future Work

Acknowledgment

References

Definition 5 ( $\alpha$ -nodes)

Definition 11 ( $\prec$ - relation)

Definition 12 ( $\equiv$ - equality)

Definition 13 ( $\Delta$ -operator)