
Georgia Institute of Technology, USA

Advanced Graph-Based Deep Learning for Probabilistic Type Inference

Fangke Ye    Jisheng Zhao    Vivek Sarkar
Abstract

Dynamically typed languages such as JavaScript and Python have emerged as the most popular programming languages in use today. However, when possible to do so, there are also important benefits that accrue from including static type annotations in dynamically typed programs, e.g., improved documentation, improved static analysis of program errors, and improved code optimization. One approach to obtaining these benefits is gradual typing, exemplified by the TypeScript programming system, which allows programmers to specify partially typed programs and then uses static analysis to infer as many of the remaining types as possible. However, in general, static type inference is unable to infer all types in a program; and, in practice, the effectiveness of static type inference depends on the complexity of the program's structure and the initial types specified by the programmer. As a result, there is a strong motivation for new approaches that can advance the state of the art in statically predicting types in dynamically typed programs, and that do so with acceptable performance for use in interactive programming environments.

Previous work has demonstrated the promise of probabilistic type analysis techniques that use deep learning methods such as recurrent neural networks and graph neural networks (GNNs) to predict types for variable declarations and occurrences. In this paper, we advance past work by introducing a range of graph-based deep learning models that operate on a novel type flow graph (TFG) representation. The TFG represents an input program’s elements as graph nodes connected with syntax edges and over-approximated data flow edges, and our GNN models are trained to predict the type labels in the TFG for a given input program.

We study different design choices for our GNN-based type inference system for the 100 most common types in our evaluation corpus, and show that our best GNN configuration for accuracy (R-GNN_NS-CTX) achieves a top-1 accuracy of 87.76%. This outperforms the two most closely related deep learning type inference approaches from past work – DeepTyper with a top-1 accuracy of 84.62% and LambdaNet with a top-1 accuracy of 79.45%. Alternatively, we can state the error (100% - accuracy) for R-GNN_NS-CTX is 0.80× that of DeepTyper and 0.60× that of LambdaNet. Further, the average inference throughput of R-GNN_NS-CTX is 353.8 files/second, compared to 186.7 files/second for DeepTyper and 1,050.3 files/second for LambdaNet. If inference throughput is a higher priority, then the recommended model to use from our approach is the next best GNN configuration from the perspective of accuracy (R-GNN_NS), which achieved a top-1 accuracy of 86.89% and an average inference throughput of 1,303.9 files/second. In summary, our work introduces advances in graph-based deep learning that yield superior accuracy and performance to past work on probabilistic type analysis, while also providing a range of GNN models that could be applicable in the future to other graph structures used in program analysis beyond the TFG.

keywords:
Static Analysis, Type Inference, Machine Learning

1 Introduction

Dynamically typed languages such as JavaScript and Python have emerged as some of the most popular programming languages in use today, as evidenced by their popularity ranks on GitHub (see The State of the Octoverse: https://octoverse.github.com/). Compared with statically typed programs, dynamically typed programs are usually more succinct and easier to modify. However, when possible to do so, there are also important benefits that accrue from including static type annotations in dynamically typed programs, e.g., improved documentation and static analysis of program errors. The classical approach to address this problem is to perform static type inference, usually accomplished by applying control and data flow analyses to infer the relevant type lattices for expressions in a program. However, in general, static type inference is unable to infer all types in a dynamically typed program; and, in practice, the effectiveness of static type inference depends on the complexity of the program's structure, e.g., use of implicit polymorphic variables, object properties, and dynamic type evaluation. An alternate gradual typing approach is exemplified by the TypeScript programming system, which allows programmers to specify partially typed programs and then uses static analysis to infer as many remaining types as possible. While an improvement over a fully static approach, gradual typing is also encumbered by the inherent challenges faced by static analysis. As a result, there is a strong motivation for new approaches that can advance the state of the art in statically predicting types in dynamically typed programs, and that do so with acceptable performance for use in interactive programming environments.

Previous work has demonstrated the promise of probabilistic type analysis techniques that use deep learning methods to predict types for variable declarations and occurrences. One approach from past work, DeepTyper, formulates type prediction as a sequence tagging problem and trains a recurrent neural network (RNN) model to solve that problem [11]. Another approach from past work, LambdaNet, employs an auxiliary analysis to infer type dependences between program elements and extract them into a graphical representation, on which it trains a graph neural network (GNN) model to perform type prediction [25]. The introduction of these approaches is a natural follow-on to work from the past two decades on applying machine learning to address a number of compilation-related problems, and that has been enabled by the rapid growth of open source online code repositories such as the open-source projects hosted on GitHub. Examples of programming tasks that have been aided by the use of machine learning include property prediction, code search, anomaly detection, and code completion [1].

In this paper, we advance past work by introducing a GNN-based type inference system consisting of a range of GNN models operating on a novel type flow graph (TFG) representation. The TFG represents an input program's elements as graph nodes connected with syntax edges and over-approximated data flow edges, and our GNN models are trained to predict type labels from the TFG of a given input program. The TFG is efficient to construct, and can be built via simple bottom-up traversals of the program's abstract syntax tree. Our GNN models learn to propagate type information on the graph and predict types for the graph nodes. We study different design choices for our GNN-based type inference system for the 100 most common types in our evaluation corpus, and show that our best GNN configuration for accuracy (R-GNN_NS-CTX) achieves a top-1 accuracy of 87.76%. This outperforms the two most closely related deep learning type inference approaches from past work – DeepTyper with a top-1 accuracy of 84.62% and LambdaNet with a top-1 accuracy of 79.45%. Alternatively, we can state the error (100% - accuracy) for R-GNN_NS-CTX is 0.80× that of DeepTyper and 0.60× that of LambdaNet. Further, the average inference throughput of R-GNN_NS-CTX is 353.8 files/second, compared to 186.7 files/second for DeepTyper and 1,050.3 files/second for LambdaNet. If inference throughput is a higher priority, then the recommended model to use from our approach is the next best GNN configuration from the perspective of accuracy (R-GNN_NS), which achieved a top-1 accuracy of 86.89% and an average inference throughput of 1,303.9 files/second. Thus, our work introduces advances in graph-based deep learning that yield superior accuracy and performance to past work on probabilistic type analysis, while also providing a range of GNN models that could be applicable in the future to other graph structures used in program analysis beyond the TFG.

A summary of the contributions of this paper is as follows:

  • We propose a probabilistic analysis framework for type inference based on a family of GNNs.

  • We introduce a lightweight type flow graph structure that efficiently captures type flow information from a program’s abstract syntax tree.

  • We propose several design choices for our family of GNNs, and evaluate their accuracy and efficiency as described above.

  • We compare our approach with two state-of-the-art deep learning based approaches for probabilistic type inference (DeepTyper and LambdaNet) and show that our approach yields superior accuracy and performance to both past approaches.

The rest of this paper is organized as follows. Section 2 provides background on type inference and graph neural networks. Section 3 introduces our GNN-based type inference framework, including graph construction and model architecture. In Section 4, we evaluate our approach with different design choices and compare it with previous work. Section 5 presents related work, and Section 6 concludes the paper.

2 Background

2.1 Static Type Inference

The goal of type inference is to automatically deduce the type of an expression in a program. The deduced types can be either partial or complete, depending on the complexity of the target programming language and the robustness of the type inference analysis.

Static type inference algorithms infer type information by solving type constraints collected from the target program. A type constraint captures a relationship between the types of expressions in the program, and solving the constraints aggregates and propagates type information across those expressions. This formulation allows type inference to be treated as a graph analysis problem.
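To make the constraint-solving view concrete, the following toy Python sketch (our illustration, not part of any existing type inference toolchain; the constraint encoding and all names are hypothetical) unifies simple type-equality constraints collected from a small program:

# A minimal sketch of constraint-based type inference for a toy language, using unification.
# Types are concrete names ("number", "string") or type variables ("t_x", "t_y", ...).
def unify(constraints):
    """Solve type-equality constraints; return a substitution map for type variables."""
    subst = {}

    def resolve(t):
        # Follow the substitution chain until a concrete type or a free variable is reached.
        while t in subst:
            t = subst[t]
        return t

    for lhs, rhs in constraints:
        a, b = resolve(lhs), resolve(rhs)
        if a == b:
            continue
        if a.startswith("t_"):       # a is a type variable: bind it
            subst[a] = b
        elif b.startswith("t_"):     # b is a type variable: bind it
            subst[b] = a
        else:
            raise TypeError(f"cannot unify {a} with {b}")
    return subst

# Constraints for:  x = 1;  y = x + 2;  where numeric "+" yields a number.
constraints = [
    ("t_x", "number"),     # x = 1
    ("t_y", "t_plus"),     # y = x + 2
    ("t_plus", "t_x"),     # operands and result of numeric "+" share a type
]
print(unify(constraints))  # resolving t_y -> t_plus -> number gives y : number

Viewing type variables as nodes and constraints as edges between them is precisely what allows type inference to be cast as a graph analysis problem.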

2.2 Graph Neural Networks

Graph neural networks (GNNs) are neural network models that operate directly on graph-structured data. A graph is defined as $G=(V,E)$, where $V$ is the set of nodes and $E$ is the set of edges. The neighborhood of a node $v$ is defined as $\mathcal{N}(v)=\{u \in V \mid (u,v) \in E\}$. A node $v$ can have attributes represented as a feature vector $\mathbf{x}_v$, and similarly, an edge $(u,v)$ can have attributes represented as a feature vector $\mathbf{e}_{uv}$. GNNs learn representations of graph structures by iteratively propagating the states of nodes to their neighbors and transforming the node states into output representations. Let $\mathbf{h}_v$ be the state vector of node $v$ and $K$ be the number of propagation steps. (In this paper, we use the term propagation step to refer to the computation in one iteration; a propagation step is also called a GNN layer in the literature.) A general form of a GNN can be defined as follows. (Although there exists a more general form that allows edge features to be updated during propagation [6], the one presented here is sufficient to formulate all the GNN architectures discussed in this paper.)

$\mathbf{h}_{v}^{(0)} = \mathbf{x}_{v}, \quad v \in V$  (1)
$\mathbf{m}_{uv}^{(k)} = f_{\mathrm{message}}^{(k)}\left(\mathbf{h}_{u}^{(k-1)}, \mathbf{e}_{uv}\right), \quad k = 1..K,\ v \in V,\ u \in \mathcal{N}(v)$  (2)
$\mathbf{a}_{v}^{(k)} = f_{\mathrm{aggregate}}^{(k)}\left(\mathbf{h}_{v}^{(k-1)}, \left\{\mathbf{m}_{uv}^{(k)} \mid u \in \mathcal{N}(v)\right\}\right), \quad k = 1..K,\ v \in V$  (3)
$\mathbf{h}_{v}^{(k)} = f_{\mathrm{update}}^{(k)}\left(\mathbf{h}_{v}^{(k-1)}, \mathbf{a}_{v}^{(k)}\right), \quad k = 1..K,\ v \in V$  (4)

The state of each node, $\mathbf{h}_v$, is initialized with its feature vector $\mathbf{x}_v$ in Equation 1. At step $k$, a node receives messages from its neighbors through the corresponding edges. A message vector $\mathbf{m}_{uv}$ is generated from a neighbor's hidden state $\mathbf{h}_u^{(k-1)}$ at step $(k-1)$ and the feature vector of the edge that the message passes through (i.e., $\mathbf{e}_{uv}$), using the message function $f_{\mathrm{message}}^{(k)}$ in Equation 2. The messages are then aggregated into a single vector using the aggregation function $f_{\mathrm{aggregate}}^{(k)}$ in Equation 3. At the end of step $k$, the node's state $\mathbf{h}_v^{(k)}$ is updated from its state at the previous step, $\mathbf{h}_v^{(k-1)}$, and the aggregated message vector $\mathbf{a}_v^{(k)}$ using the update function $f_{\mathrm{update}}^{(k)}$ in Equation 4.

The functions $f_{\mathrm{message}}^{(k)}$, $f_{\mathrm{aggregate}}^{(k)}$, and $f_{\mathrm{update}}^{(k)}$ are neural networks containing trainable parameters. These functions can be the same and share parameters across different steps (i.e., $f^{(1)} = f^{(2)} = \ldots = f^{(K)}$), or they can be different functions with a separate set of parameters for each step. Following the taxonomy in [26], we use the term recurrent graph neural networks to refer to GNNs with shared functions and parameters for all steps, and convolutional graph neural networks to refer to GNNs with separate functions and parameters for different steps.
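To make this general form concrete, the following minimal PyTorch sketch (our illustration; the function signatures, shapes, and the example message/aggregate/update choices are assumptions rather than any published code) runs the $K$-step propagation loop of Equations 1-4:

# Generic K-step message passing following Equations 1-4, with the message, aggregation,
# and update functions passed in as callables (shared across steps, i.e., the recurrent style).
import torch

def propagate(x, edges, edge_feat, f_message, f_aggregate, f_update, K):
    """x: [num_nodes, d] node features; edges: list of (u, v) pairs; edge_feat: dict (u, v) -> tensor."""
    h = x                                                                  # Equation 1: h_v^(0) = x_v
    for _ in range(K):
        new_h = []
        for v in range(x.size(0)):
            neighbors = [u for (u, w) in edges if w == v]
            msgs = [f_message(h[u], edge_feat[(u, v)]) for u in neighbors] # Equation 2
            a_v = f_aggregate(h[v], msgs)                                  # Equation 3
            new_h.append(f_update(h[v], a_v))                              # Equation 4
        h = torch.stack(new_h)
    return h

# Tiny usage with element-wise messages, mean aggregation, and an additive update:
x = torch.randn(3, 4)
edges = [(0, 1), (1, 2), (2, 0)]
efeat = {e: torch.ones(4) for e in edges}
h = propagate(x, edges, efeat,
              f_message=lambda h_u, e: h_u * e,
              f_aggregate=lambda h_v, msgs: torch.stack(msgs).mean(0) if msgs else h_v,
              f_update=lambda h_v, a_v: h_v + a_v,
              K=2)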

2.3 Machine Learning Assisted Static Analysis

There has been a large body of work that leverages machine learning to assist static analysis. Based on their target tasks, these approaches can be grouped into the following categories:

Program Property Prediction. In this class of tasks, the goal is to assign property labels to (part of) a program. The typical approach is to apply supervised learning to build a model that predicts the property labels from a program’s representation. The program representations that have been adopted by previous work include token sequences [11], abstract syntax trees [4, 21], and some more complex graphs [2, 23].

Learning for Parametric Program Analysis. To trade off precision against cost, parametric analyses have been applied in real-world solutions [28, 18]. Unlike manual parameterization and heuristic parameter search, which rely on the expertise of users or heuristic developers, statistical learning approaches can learn from automatically generated data and have shown a good capacity for finding optimal parameter configurations, including the granularity of program abstraction [15, 19].

Program Specification Identification. This scenario focuses on mining program specifications from code corpora, including API specifications [17, 8], the behaviors of a library [5], and code idioms [3]. These techniques can be applied to detecting API misuses, performing code completion, etc.

In this paper, we focus on type inference, which can be categorized as a program property prediction problem. We investigate the approach of using a graph structure to represent the input program and building machine learning models to predict program properties. A GNN-based learning framework is designed to learn from graph-structured data extracted from code and to predict types for program elements.

3 Methodology

This section describes our methodology, which uses a GNN-based machine learning framework to perform probabilistic type inference. Section 3.1 describes the problem statement and the proposed learning framework. Section 3.2 gives the definition of the graph we use as input to our GNN models. The design of the GNN architecture is introduced in Section 3.3.

3.1 Problem Statement

We formulate type inference as a graph node label prediction problem. The prediction is performed at the file level for better scalability and flexibility. Our GNN-based type inference framework is shown in Figure 1. For each source file, we construct a graph whose nodes correspond to program elements in the file. The graph edges include syntax edges extracted from the abstract syntax tree and over-approximated data flow edges derived from name-matching-based static analysis. A GNN model is trained to predict types for the graph nodes from a fixed type vocabulary. The target programming language we choose is JavaScript, and the unit of program element for prediction is the identifier, which covers variables, function parameters, and object properties in JavaScript. Following the methodology of previous work [11], we train the model on a dataset of TypeScript programs. The type labels are obtained from the TypeScript compiler, and the graphs are constructed from the code without using any information from existing type annotations.

Figure 1: The graph neural network based type inference framework. (a) Model training. (b) Type inference.

3.2 Graph Definition

A data-driven approach relies on high-quality data containing features that are relevant to the target problem. For the type inference problem, the relevant features include the source of type information (e.g., literals), the code patterns that have implications about types, and the paths of type propagation. Ideally, the input data to the learning model should at least cover those features. In this work, we defined a type flow graph (TFG) that carries those features mentioned above by encoding syntactic and approximate semantic information in a program.

A TFG is a directed graph whose nodes represent program elements in the code and edges encode type-relevant relationships between the nodes. It is constructed from a program’s abstract syntax tree (AST). We give the definitions of the nodes and edges and how they can be extracted in the following paragraphs.

3.2.1 Graph Node

The nodes in a TFG represent the program elements that either carry types or provide hints about types. They are categorized into the following node types:

  • Identifier node (IdentNode): It represents an identifier token in a program and corresponds to an occurrence of a named element in the program, including variables, object properties, and functions.

  • Token node (TokNode): It represents a non-identifier token that appears as leaves in an AST. Most of the token nodes in a program are literals.

  • Expression node (ExprNode): It represents an expression in a program, including binary and unary expressions as well as function and variable declarations.

  • Variable symbol node (VarSymNode): It represents a variable and its occurrences within the live range.

  • Object property node (ObjPropNode): It represents an object property name in a program. It can be viewed as an over-approximation of an object property symbol. (This is equivalent to applying a flow-insensitive and context-insensitive alias analysis for object properties, which is over-approximated since it only checks whether properties' names are the same.)

  • Context node (CtxNode): It represents the context where an expression appears. Every CtxNode is associated with an expression node in the AST whose parent is a statement node.

Each node has an associated feature. An IdentNode’s feature is its name and a TokNode’s feature is its token type. A CtxNode’s feature is a (statement_type, child_name) pair, where statement_type is the type of a statement node in the AST and child_name is the name of a child node of that statement node (e.g., for an if-statement node with an expression child node serving as its condition, the expression node in the TFG will be connected to a CtxNode with the feature (IfStmt, condition)). The features of other nodes are their node types.

IdentNodes and ExprNodes are the units for type prediction. TokNodes serve as sources of type information since they contain literals that have known types. A VarSymNode acts as a “hub” that helps the type information exchange between different occurrences of the same variable. And an ObjPropNode has a similar functionality for object properties that share the same name. A CtxNode provides hints about the type of an expression based on its role in the enclosing statement.

3.2.2 Graph Edge

The functionality of graph edges is to establish information flow paths between the nodes. To enable efficient information exchange, we make all edges bi-directional, and each edge type we describe below has a corresponding backward edge type in order to distinguish between a forward edge and a backward edge. The edges are categorized into the following types:

  • Expression edge (ExpEdge): It connects an IdentNode/TokNode/ExprNode to an ExprNode. The source and the destination of this edge should correspond to a child-parent pair in the AST.

  • Variable symbol edge (VarSymEdge): It connects an IdentNode to a VarSymNode. The IdentNode must represent a variable whose symbol corresponds to the VarSymNode.

  • Object property edge (ObjPropEdge): It connects an IdentNode to an ObjPropNode. The IdentNode must represent an object property whose name corresponds to the ObjPropNode.

  • Return edge (RetEdge): It connects an ExprNode representing a returned expression to an ExprNode representing its enclosing function declaration.

  • Call edge (CallEdge): It can connect an ExprNode representing a returned expression to an ExprNode representing a call expression which is a potential caller of the enclosing function of that returned expression. It can also connect an ExprNode representing an argument in a call expression to an ExprNode representing a parameter of a function which is a potential callee of that call expression.

  • Context edge (CtxEdge): It connects a CtxNode to an ExprNode it is associated with.

Each edge has an associated feature. The feature of an ExpEdge, written as (c, p), and its backward edge, (p, c), is an (expression_type, child_name, direction) tuple, where expression_type is the expression type of p, child_name is the child name of c in the AST, and direction is either f for a forward edge or b for a backward edge. For the other edge types, the features are the edge types themselves.

function foo(a) {
if (a.val) x = "Hello";
return x;
}
r.val = true;
let c = foo(r);
Figure 2: An example of a TFG: (a) the code example shown above, (b) its abstract syntax tree, and (c) the corresponding type flow graph.

Figure 2 gives an example of a TFG. Figure 2(a) shows a piece of code and Figure 2(b) displays its AST. The corresponding TFG is shown in Figure 2(c). To simplify the presentation, the edge features for ExpEdges are not shown in this example. It can be seen that ExpEdges establish type information flow paths within expressions. VarSymEdges and ObjPropEdges allow type information exchange between the different occurrences of the same variable and of object properties with the same name. RetEdges allow type information to flow between a function declaration and its return statements. CallEdges enable information exchange between call sites and their potential callees. (The call graph construction uses the same name matching mechanism as object properties' and is thus also over-approximated.) CtxEdges allow expression nodes to get type hints from their context within a statement.

3.2.3 Graph Extraction

Graph extraction is based on a bottom-up traversal of the input program's AST. For each AST node that matches a TFG node pattern, the graph extractor creates the corresponding TFG node and edges based on the definitions described above. For CallEdge creation, the graph extractor runs a lightweight pre-pass over the program file to collect function declaration information for later use.
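As a rough illustration of this bottom-up construction, the following Python sketch (our illustration; it assumes a toy dict-based AST rather than the TypeScript compiler's ASTs, and omits VarSymNode, ObjPropNode, CtxNode, and CallEdge handling) creates IdentNodes, TokNodes, and ExprNodes together with forward/backward ExpEdges:

# Bottom-up TFG extraction over a toy AST: children are visited first, then the parent's
# ExprNode is created and connected to its children with forward ("f") and backward ("b") ExpEdges.
def extract_tfg(ast_node, nodes=None, edges=None):
    if nodes is None:
        nodes, edges = [], []
    if ast_node["kind"] == "Identifier":
        nid = len(nodes); nodes.append(("IdentNode", ast_node["name"]))
    elif ast_node["kind"] == "Literal":
        nid = len(nodes); nodes.append(("TokNode", ast_node["tok_type"]))
    else:                                   # any other kind becomes an ExprNode
        child_ids = []
        for child_name, child in ast_node["children"]:
            cid, _, _ = extract_tfg(child, nodes, edges)
            child_ids.append((child_name, cid))
        nid = len(nodes); nodes.append(("ExprNode", ast_node["kind"]))
        for child_name, cid in child_ids:
            # Edge feature: (expression_type, child_name, direction), as defined in Section 3.2.2.
            edges.append((cid, nid, (ast_node["kind"], child_name, "f")))
            edges.append((nid, cid, (ast_node["kind"], child_name, "b")))
    return nid, nodes, edges

# x = "Hello"   (an assignment expression with two children)
ast = {"kind": "Assign", "children": [
    ("left",  {"kind": "Identifier", "name": "x"}),
    ("right", {"kind": "Literal", "tok_type": "StringLiteral"}),
]}
_, nodes, edges = extract_tfg(ast)
print(nodes)   # [('IdentNode', 'x'), ('TokNode', 'StringLiteral'), ('ExprNode', 'Assign')]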

3.3 Machine Learning Model: Graph Neural Network

Here we introduce our machine learning model design for static type inference. The learning model is presented as a customizable system composed of a core GNN model with different building blocks and several other components that convert node/edge features into vector representations.

3.3.1 Overall Model Architecture

As shown in Figure 3, the learning system contains six components that are organized into three phases: (1) feature embedding for nodes and edges; (2) GNN message propagation; and (3) label prediction.

Figure 3: Overall model architecture.

The feature embedding phase (components 1 and 2) translates node/edge features in the type flow graph to vector representations (i.e., embedding vectors). For edge feature embedding and the basic design of node feature embedding (demonstrated in Figure 5(a)), we assign a trainable vector to each feature in the vocabulary. Section 3.3.2 discusses several alternative methods for embedding identifier names, which are the features of IdentNodes.

In the GNN message propagation phase, the node feature embedding vectors are used as node initial states and the edge embedding vectors are used to generate messages on the corresponding edges. The GNN iteratively propagates and updates node states for $K$ steps using the message, aggregation, and update functions (components 3, 4, 5). Section 3.3.3 shows how we define those functions as components of our GNN architecture.

In the label prediction phase (component 6), we use a fully-connected layer followed by a softmax function to transform a node's final state after $K$ steps of propagation into a probability distribution over candidate types, and we output the most probable type as the inferred type for the node.
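The following condensed PyTorch skeleton (our sketch; the module layout, tensor shapes, and layer names are illustrative assumptions, not the authors' implementation) shows how the three phases fit together for a recurrent GNN configuration:

# Three-phase model: (1) embed node/edge features, (2) run K propagation steps,
# (3) predict a type distribution per node with a linear layer and (log-)softmax.
import torch
import torch.nn as nn

class TFGTypePredictor(nn.Module):
    def __init__(self, num_node_feats, num_edge_feats, num_types, d_h=128, d_e=256, K=8):
        super().__init__()
        self.K = K
        self.node_emb = nn.Embedding(num_node_feats, d_h)    # phase 1: node feature embedding
        self.edge_emb = nn.Embedding(num_edge_feats, d_e)    # phase 1: edge feature embedding
        self.msg_in = nn.Linear(d_h, d_e)                    # message function components (Section 3.3.3)
        self.msg_out = nn.Linear(d_e, d_h)
        self.gru = nn.GRUCell(d_h, d_h)                      # recurrent update function
        self.classify = nn.Linear(d_h, num_types)            # phase 3: label prediction

    def forward(self, node_feats, edge_index, edge_feats):
        h = self.node_emb(node_feats)                        # [N, d_h] initial node states
        e = self.edge_emb(edge_feats)                        # [E, d_e]
        src, dst = edge_index                                # [E] source ids, [E] destination ids
        for _ in range(self.K):                              # phase 2: K propagation steps
            m = self.msg_out(self.msg_in(h[src]) * e)        # per-edge messages
            agg = torch.zeros_like(h).index_add_(0, dst, m)  # sum messages per destination node
            deg = torch.zeros(h.size(0), 1).index_add_(0, dst, torch.ones(len(dst), 1))
            h = self.gru(agg / deg.clamp(min=1), h)          # mean aggregation + GRU update
        return torch.log_softmax(self.classify(h), dim=-1)   # per-node type distribution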

3.3.2 IdentNode state initialization

Identifier names can sometimes provide hints about the types of associated program elements. For example, a variable named "num" usually has a numeric type. We design our model to utilize such type information by treating identifier names as node features of the corresponding IdentNodes and encoding them into real-valued embedding vectors, which are used as the initial states of those IdentNodes. The GNN then iteratively propagates the type information encoded in the state vectors to other nodes. Figure 5 shows four methods of encoding identifier names that are adopted by our model. Figure 5(a) is a basic method that uses a single embedding layer mapping each unique name to a trainable parameter vector. Figures 5(b) and 5(c) demonstrate two techniques to improve the name encoding, which are described in detail in the following paragraphs. The last method, shown in Figure 5(d), is the combination of the two techniques.

Figure 5: IdentNode state initialization: (a) basic name embedding, (b) name segmentation, (c) contextualized initialization, and (d) the combination of name segmentation and contextualized initialization.
Name Segmentation

The simple embedding method in Figure 5(a) requires a fixed-size vocabulary of node identifier names. However, the size of such a vocabulary can be huge due to the diversity of identifier names in programs, making it difficult to learn the information encoded in the names. Previous work has shown that segmenting identifier names can help reduce the vocabulary size and boost the performance of machine learning models for code [13, 2]. Similar to those prior works, we introduce the name segmentation mechanism shown in Figure 5(b). First, an identifier name is split into several segments. Then each of the segments is embedded into a vector through an embedding layer. Finally, a bi-directional RNN is applied to the sequence of segment embedding vectors of an identifier name to generate a single encoding vector, which is used as the initial state vector of the IdentNode corresponding to the identifier name. We perform name segmentation by first splitting an identifier name into subtokens according to the CamelCase and snake_case patterns, and then further segmenting the subtokens with byte pair encoding [10, 22].
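The first splitting stage can be illustrated with a few lines of Python (our sketch; the subsequent byte pair encoding merges, which are learned on the training corpus, are omitted here):

# Split an identifier on snake_case and CamelCase boundaries into lower-cased subtokens.
import re

def split_identifier(name):
    subtokens = []
    for part in name.split("_"):                                # snake_case boundaries
        part = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", part)     # lower/digit followed by upper
        part = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", part)  # acronym followed by a word
        subtokens.extend(p.lower() for p in part.split())
    return subtokens

print(split_identifier("maxHTTPRetryCount"))   # ['max', 'http', 'retry', 'count']
print(split_identifier("user_id"))             # ['user', 'id']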

Contextualized Initialization

The context (surrounding tokens) of an identifier name can also provide useful type information, and thus it could be beneficial to encode it into the initial state of an IdentNode. Although propagation on the type flow graph can help gather information from the context, the size of such context is limited by the number of GNN propagation steps, which is usually small. As shown in Figure 5(c), to encode information from a broader range of context for IdentNodes, we introduce a bi-directional RNN layer that takes the sequence of code token embedding vectors as input and generates an output vector for each token. The output vectors that correspond to identifiers are used as the initial states of IdentNodes. We refer to this RNN layer as the contextual layer.
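A minimal PyTorch sketch of the contextual layer (our illustration; the vocabulary size, dimensions, and the final projection are assumptions loosely based on the hyperparameters listed in Section 4) is shown below:

# Run a bi-directional RNN over the whole token sequence and take its outputs at identifier
# positions as the initial states of the corresponding IdentNodes.
import torch
import torch.nn as nn

token_emb = nn.Embedding(10_000, 128)
contextual = nn.GRU(input_size=128, hidden_size=128, bidirectional=True, batch_first=True)
project = nn.Linear(2 * 128, 128)              # fold both directions back to the node-state size

tokens = torch.randint(0, 10_000, (1, 50))     # one file of 50 token ids (dummy data)
ident_positions = torch.tensor([3, 17, 42])    # positions of identifier tokens in that file

out, _ = contextual(token_emb(tokens))         # [1, 50, 256]
ident_init_states = project(out[0, ident_positions])   # [3, 128] IdentNode initial states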

3.3.3 Graph Neural Network Architecture

The GNN architecture we designed follows the general framework described in Equations 1, 2, 3 and 4. We describe the instantiation of the framework in the following paragraphs. Unless otherwise specified, we define the functions below as components of recurrent GNNs and thus do not distinguish between parameters in different propagation steps.

Message Function

The functionality of the message function is to produce a message vector $\mathbf{m}_{uv} \in \mathbb{R}^{d_h}$, passed from $u$ to its neighbor $v$, that encodes the type information of $u$. In the TFG, we introduce a feature for each edge that is embedded into a vector $\mathbf{e}_{uv} \in \mathbb{R}^{d_e}$. The edge feature embedding vector is integrated into the message vector by the following message function:

$\mathbf{m}_{uv}^{(k)} = \mathbf{W}_{MO}\left(\left(\mathbf{W}_{MI}\mathbf{h}_{u}^{(k-1)} + \mathbf{b}_{MI}\right) \odot \mathbf{e}_{uv}\right) + \mathbf{b}_{MO}, \quad v \in V,\ u \in \mathcal{N}(v)$  (5)

$\mathbf{W}_{MI} \in \mathbb{R}^{d_e \times d_h}$, $\mathbf{W}_{MO} \in \mathbb{R}^{d_h \times d_e}$, $\mathbf{b}_{MI} \in \mathbb{R}^{d_e}$, and $\mathbf{b}_{MO} \in \mathbb{R}^{d_h}$ are learnable parameters; $\odot$ stands for element-wise multiplication.

Alternatively, we can ignore edge features in the TFG and use the following identity function as the message function:

$\mathbf{m}_{uv}^{(k)} = \mathbf{h}_{u}^{(k-1)}, \quad v \in V,\ u \in \mathcal{N}(v)$  (6)
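The two message-function variants can be sketched in PyTorch as follows (our illustration; the dimensions follow the hyperparameters in Section 4, and the linear layers' built-in biases play the roles of b_MI and b_MO):

# Equation 5: mix the sender's state with the edge-feature embedding via an element-wise product.
# Equation 6: pass the sender's state through unchanged (edge features ignored).
import torch
import torch.nn as nn

d_h, d_e = 128, 256
W_MI, W_MO = nn.Linear(d_h, d_e), nn.Linear(d_e, d_h)

def message_with_edge_features(h_u, e_uv):
    return W_MO(W_MI(h_u) * e_uv)       # Equation 5

def message_identity(h_u, e_uv=None):
    return h_u                          # Equation 6

h_u, e_uv = torch.randn(d_h), torch.randn(d_e)
print(message_with_edge_features(h_u, e_uv).shape)   # torch.Size([128])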
Aggregation Function

After the calculation of the message vectors $\mathbf{m}_{uv}$ for all $u \in \mathcal{N}(v)$, the aggregation function aggregates those message vectors for $v$ into a vector $\mathbf{a}_v \in \mathbb{R}^{d_h}$. A simple solution is to apply the mean aggregation shown below:

$\mathbf{a}_{v}^{(k)} = \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} \mathbf{m}_{uv}^{(k)}, \quad v \in V$  (7)

During type information propagation on the graph, it is likely that neighbors do not pass equally important or useful information to a node. For example, the type information coming from a literal node may be more informative than that coming from an identifier node. Consider this simple example: x = "hello" + y. The string literal node "hello" provides stronger type information than the identifier node y. Thus it could be beneficial to assign different weights to messages from different neighbors during aggregation.

To this end, we introduce an alternative aggregation process that contains an attention mechanism to weigh the importance of different incoming messages. It is adapted from the one in graph attention networks [24] and is shown in the following equations:

$\alpha_{uv}^{(k)} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{w}^{\top}\left[\mathbf{W}_{QK}\mathbf{h}_{v}^{(k-1)};\, \mathbf{W}_{QK}\mathbf{m}_{uv}^{(k)}\right]\right)\right)}{\sum_{u^{\prime} \in \mathcal{N}(v)} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{w}^{\top}\left[\mathbf{W}_{QK}\mathbf{h}_{v}^{(k-1)};\, \mathbf{W}_{QK}\mathbf{m}_{u^{\prime}v}^{(k)}\right]\right)\right)}, \quad v \in V,\ u \in \mathcal{N}(v)$  (8)
$\mathbf{a}_{v}^{(k)} = \sum_{u \in \mathcal{N}(v)} \alpha_{uv}^{(k)} \mathbf{W}_{V} \mathbf{m}_{uv}^{(k)}, \quad v \in V$  (9)

$\mathbf{W}_{QK} \in \mathbb{R}^{d_h \times d_h}$, $\mathbf{W}_{V} \in \mathbb{R}^{d_h \times d_h}$, and $\mathbf{w} \in \mathbb{R}^{2d_h}$ are parameters to be learned. $\mathrm{LeakyReLU}$ is a non-linear activation function [27].
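For a single node, the attention-based aggregation can be sketched as follows (our illustration; a single attention head, with the learnable vector w represented as a 1-output linear layer):

# Equations 8-9: score each incoming message against the node's previous state, softmax over
# the neighborhood, then take the weighted sum of linearly transformed messages.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_h = 128
W_QK = nn.Linear(d_h, d_h, bias=False)
W_V = nn.Linear(d_h, d_h, bias=False)
w = nn.Linear(2 * d_h, 1, bias=False)

def attention_aggregate(h_v_prev, messages):
    """h_v_prev: [d_h]; messages: [num_neighbors, d_h] -> aggregated vector [d_h]."""
    q = W_QK(h_v_prev).expand(messages.size(0), d_h)                  # repeat the query per neighbor
    scores = F.leaky_relu(w(torch.cat([q, W_QK(messages)], dim=-1)))  # Equation 8 logits
    alpha = torch.softmax(scores, dim=0)                              # normalize over the neighborhood
    return (alpha * W_V(messages)).sum(dim=0)                         # Equation 9

print(attention_aggregate(torch.randn(d_h), torch.randn(5, d_h)).shape)   # torch.Size([128])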

Update Function

In type inference, type information could be propagated many steps across multiple expressions or even procedure boundaries. To enable effective learning from the potential long-term dependencies, our recurrent GNN architectures integrate the gated recurrent unit (GRU) [9] as our update function. This update function has also been adopted in other recurrent GNN architectures [14]. The update function can be written as:

$\mathbf{h}_{v}^{(k)} = \mathrm{GRU}\left(\mathbf{a}_{v}^{(k)}, \mathbf{h}_{v}^{(k-1)}\right), \quad v \in V$  (10)

Our GNN architecture can also be configured as a convolutional GNN, i.e., the parameters in the functions defined above are not shared between propagation steps. In this case, the update function is not a recurrent unit, but the one specified below:

$\mathbf{h}_{v}^{(k)} = \mathrm{ReLU}\left(\mathbf{W}_{h}^{(k)}\mathbf{h}_{v}^{(k-1)} + \mathbf{b}_{v}^{(k)}\right), \quad v \in V$  (11)

$\mathbf{W}_{h}^{(k)} \in \mathbb{R}^{d_h \times d_h}$ and $\mathbf{b}_{v}^{(k)} \in \mathbb{R}^{d_h}$ are learnable parameters of step $k$.

As an exception, we do not update the states for TokNodes and CtxNodes (i.e., their update function is the identity function). This is because the type information they contain is known in advance and thus no information update is needed.
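A short PyTorch sketch of this update step (our illustration; the boolean mask marking TokNodes and CtxNodes is an assumed input) is given below:

# Equation 10 (GRU update) for ordinary nodes; identity update for TokNodes/CtxNodes,
# whose type information is known in advance and is therefore kept frozen.
import torch
import torch.nn as nn

d_h = 128
gru = nn.GRUCell(d_h, d_h)

def update(h_prev, a, frozen_mask):
    """h_prev, a: [N, d_h]; frozen_mask: [N] bool, True for TokNodes/CtxNodes."""
    h_new = gru(a, h_prev)                                     # Equation 10
    return torch.where(frozen_mask.unsqueeze(-1), h_prev, h_new)

h_prev, a = torch.randn(4, d_h), torch.randn(4, d_h)
frozen = torch.tensor([False, True, False, True])              # e.g., nodes 1 and 3 are TokNodes
print(update(h_prev, a, frozen).shape)                         # torch.Size([4, 128])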

4 Evaluation

In this section, we evaluate our GNN based type inference system to answer the following research questions:

  • RQ.1: What are the impacts of our GNN model design choices on the accuracy of type inference?

  • RQ.2: How efficient is our GNN-based type inference?

  • RQ.3: How does our approach compare with state-of-the-art deep learning type inference approaches?

Experimental Platform

We conducted our experiments on a server that has two Intel Xeon E5-2623 v4 processors with 128GB of RAM and an NVIDIA Tesla V100 GPU with 16GB of graphics memory. The neural network models were trained and evaluated on the GPU.

Dataset and Preprocessing

We constructed our dataset based on a corpus consisting of popular TypeScript projects on GitHub that were also used in past work on type inference for JavaScript [11]. We obtained the projects from that set that are still publicly available, and, following the methodology in [11], removed all files containing more than 5,000 tokens so as to enable efficient batching. The resulting set of projects was then randomly split into training, validation, and test subsets, which respectively contain 789, 99, and 99 projects, or 74,801, 3,729, and 3,838 files.

Our toolchain was used to construct a type flow graph (TFG) for each file. Figure 6 shows the distributions of the TFG size in the training, validation, and test sets, in terms of numbers of nodes and edges. Note that TypeScript type annotations are not used when constructing the TFGs.

Figure 6: Distributions of the numbers of nodes/edges in a graph for the training, validation, and test sets, which contain 74,801, 3,729, and 3,838 files/graphs respectively, shown as box-and-whisker plots. Each box spans values from the first quartile (Q1) to the third quartile (Q3), with the median shown as a horizontal line in the box. The whiskers indicate the values Q1 − 1.5×(Q3 − Q1) and Q3 + 1.5×(Q3 − Q1) for the respective datasets.

To identify type labels in the training set, we used the TypeScript compiler to extract the types for all nodes in our TFGs, including identifier nodes and other expression nodes. Since TypeScript permits partial type annotations, these type labels can include types provided by the developer as well as types inferred by the TypeScript compiler. The extracted types were then preprocessed with the following steps: (1) function types are mapped to their return types; (2) the type parameters in parametric types are removed (e.g., Array<number> becomes Array); (3) literal types are mapped to their base types (e.g., the literal type "a"|"b" becomes string); (4) types with names that consist of a single character are filtered out, as they usually represent type parameters (e.g., T, U, V). A type vocabulary was built after the preprocessing, and the models were trained to predict types from this vocabulary. Similar to [25], we picked the 100 most frequent types from the training set, excluding the any type. We did not use a larger type vocabulary because our focus is on predicting types that are used across different projects rather than types defined locally in a project. Prediction of locally defined types is beyond the scope of this paper.
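For illustration, the four preprocessing steps could look roughly like the following Python sketch (our illustration; the actual pipeline works on the TypeScript compiler's structured type objects rather than plain strings, and step (3) here only handles the string-literal case from the example above):

# Normalize an extracted type name: (1) function types -> return type, (2) drop type parameters,
# (3) map string literal types to string, (4) filter out single-character type-parameter names.
import re

def normalize_type(name):
    name = name.strip()
    if "=>" in name:                                    # (1) function type -> its return type
        name = name.split("=>")[-1].strip()
    name = re.sub(r"<.*>$", "", name)                   # (2) Array<number> -> Array
    if re.fullmatch(r'(".*"\s*\|?\s*)+', name):         # (3) "a"|"b" -> string
        name = "string"
    if len(name) == 1:                                  # (4) T, U, V, ... -> filtered out
        return None
    return name

for t in ['(x: number) => string', 'Array<number>', '"a" | "b"', 'T', 'boolean']:
    print(t, "->", normalize_type(t))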

We constructed fixed-sized vocabularies for graph edge features, non-name node features, and identifier names (or name segments) for them to be embedded as trainable vectors in the neural networks. We included all edge features in the training set in the edge vocabulary. The resulting edge feature vocabulary has a size of 210. We also constructed a vocabulary of non-name features in the same manner and its size is 184. For names, we constructed a vocabulary of full names when we do not perform name segmentation, and a vocabulary of name segments otherwise. The full name vocabulary contains the 10,000 most frequent identifier names in the training set. We also include a special UNKNOWN token to represent out-of-vocabulary names. For name segmentation, we generated the segment vocabulary by performing up to 10,000 merges during byte pair encoding on the training set.

When performing inference on files in the validation and test sets, we followed the methodology adopted by prior work [11, 25] and only used labels available in type annotations provided by the developer (ignoring the types inferred by the TypeScript compiler). Doing so only assigns labels to identifier nodes. We then repeated steps (1) to (4) listed above for these extracted types in the validation and test sets. We also removed edges with out-of-vocabulary edge features from the type flow graphs built from the validation and test sets.

Evaluated Model Designs

As mentioned in Section 3.3.3, we studied different GNN model design choices, including the setup for node state initialization and the GNN architecture. The motivation for evaluating different model designs is to gain insight as to which is the best learning system for type inference, and to answer RQ.1: “What are the impacts of our GNN model design choices on the accuracy of type inference?”

Table 1 lists the various GNN designs (along with their different design options) that are encompassed by our evaluation. R-GNN is our baseline model, which is a recurrent GNN (i.e., a GNN that shares parameters across propagation steps) that uses the GRU update function and mean message aggregation. It takes in edge features but does not perform name segmentation or contextualized initialization. C-GNN is a variant of R-GNN that does not share parameters across steps and uses the update function defined in Equation 11. R-GAT extends R-GNN with the attention-based message aggregation function defined in Equations 8 and 9. R-GNN_NS introduces name segmentation in the node initial state generation phase to R-GNN. Similarly, R-GNN_CTX introduces the contextual layer to R-GNN. R-GNN_NS-CTX adds both name segmentation and the contextual layer into R-GNN. We observe that none of these eight GNN variants have been studied before in past work on type inference.

All of the GNN designs described above are based on the TFG with edge features. To evaluate the impact of not including edge features in TFGs, and the effectiveness of the attention mechanism in such situations, we built two more models that do not take edge features as input. They are R-GNN_NEF and R-GAT_NEF, which correspond to R-GNN and R-GAT, respectively.

Table 1: Designs of GNN Architectures.
Model | GNN Type | Attention | Name Segmentation | Contextual Layer | Edge Features
C-GNN | Convolutional | – | – | – | ✓
R-GNN | Recurrent | – | – | – | ✓
R-GAT | Recurrent | ✓ | – | – | ✓
R-GNN_NS | Recurrent | – | ✓ | – | ✓
R-GNN_CTX | Recurrent | – | – | ✓ | ✓
R-GNN_NS-CTX | Recurrent | – | ✓ | ✓ | ✓
R-GNN_NEF | Recurrent | – | – | – | –
R-GAT_NEF | Recurrent | ✓ | – | – | –
Evaluation Metrics

We evaluated the models in terms of their type prediction accuracy and efficiency. The accuracy metrics include top-1 and top-5 accuracy for predicting types in the test set. These two metrics are obtained by having the model output the top-1 and top-5 most probable types for each prediction location and then computing how often the ground truth type label matches the output type(s). We divided the types in the vocabulary into two categories, the 10 most frequent types in the training set and the other 90 types, and measured the top-1 and top-5 accuracy for all 100 types and for the two categories separately. The efficiency metrics we used are the number of parameters in the model and its inference throughput (files/sec).
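Computing these accuracy metrics from model outputs is straightforward; a short sketch (our illustration, assuming per-node log-probabilities and integer ground-truth labels) is shown below:

# Top-k accuracy: the fraction of labeled nodes whose ground-truth type is among the k most
# probable types predicted for that node.
import torch

def topk_accuracy(log_probs, labels, k):
    """log_probs: [N, num_types]; labels: [N] -> scalar accuracy."""
    topk = log_probs.topk(k, dim=-1).indices             # [N, k]
    return (topk == labels.unsqueeze(-1)).any(dim=-1).float().mean().item()

log_probs = torch.log_softmax(torch.randn(1000, 100), dim=-1)   # dummy predictions over 100 types
labels = torch.randint(0, 100, (1000,))                          # dummy ground-truth labels
print(topk_accuracy(log_probs, labels, 1), topk_accuracy(log_probs, labels, 5))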

Implementation Details

We used the TypeScript compiler to extract type labels and generate ASTs for the source code files in the dataset. Our TFGs were constructed from those ASTs. The neural network architectures used in our experiments were built using PyTorch [20]. In all experiments, we fixed the batch size to 64 and trained the models for 60 epochs using the AdamW optimizer [16] with a learning rate of $10^{-3}$. We evaluated the model on the validation set after each epoch and used the model from the epoch that produced the lowest validation loss for evaluation on the test set. We used 128-dimensional node type/name embeddings, 32-dimensional name segment embeddings, 32-dimensional RNN hidden states for name segment sequence encoding, 256-dimensional edge type embeddings, 128-dimensional node states, and 128-dimensional RNN hidden states for both directions in the contextual layer. For the comparison with DeepTyper and LambdaNet, we implemented their models in our evaluation framework and used the same hyperparameters as suggested in their papers and open-source implementations.
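The training and model-selection loop described above can be summarized in a few lines of PyTorch (our sketch; the placeholder model, dummy data loaders, and evaluate helper stand in for the paper's GNN variants and TFG dataset):

# 60 epochs of AdamW (lr = 1e-3) over batches, keeping the checkpoint with the lowest validation loss.
import torch
import torch.nn as nn

model = nn.Linear(128, 100)                 # placeholder for a GNN type predictor over 100 types
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

train_loader = [(torch.randn(64, 128), torch.randint(0, 100, (64,))) for _ in range(10)]  # dummy data
val_loader = [(torch.randn(64, 128), torch.randint(0, 100, (64,))) for _ in range(2)]

def evaluate(m, loader):
    m.eval()
    with torch.no_grad():
        return sum(criterion(m(x), y).item() for x, y in loader) / len(loader)

best_val_loss, best_state = float("inf"), None
for epoch in range(60):
    model.train()
    for node_states, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(node_states), labels)
        loss.backward()
        optimizer.step()
    val_loss = evaluate(model, val_loader)
    if val_loss < best_val_loss:
        best_val_loss, best_state = val_loss, {k: v.clone() for k, v in model.state_dict().items()}

model.load_state_dict(best_state)           # this checkpoint is evaluated on the test set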

4.1 RQ.1: What are the impacts of our GNN model design choices on the accuracy of type inference?

As discussed in Section 2.2, our GNN-based type inference system allows multiple design choices across the different components. We studied how they could affect the accuracy of type prediction by evaluating various combinations of those design choices on our dataset.

4.1.1 The Number of GNN Propagation Steps

The GNN architecture includes a $K$-step iterative process of propagating information on the input graph. This means a node can gather type information from nodes that are at most $K$ hops away. Intuitively, a larger value of $K$ would likely lead to a higher type prediction accuracy since more information can be utilized for making the prediction. However, as the value of $K$ increases, the model needs more computation resources. To select an appropriate value of $K$, we conducted an empirical study: we trained the R-GNN model with values of $K$ ranging from 2 to 12 and measured the resulting accuracy on the test set. The results are shown in Figure 7. As can be seen from the figure, the top-1 accuracy increases monotonically as the value of $K$ gets larger, but begins to saturate around $K=8$. Based on this observation, we chose $K=8$ for the rest of our experiments.

Figure 7: The impact of the number of propagation steps ($K$) on type prediction accuracy for the R-GNN model.

4.1.2 Recurrent GNN vs. Convolutional GNN

A recurrent GNN architecture uses the same set of functions (message, aggregation, and update functions) and parameters across steps to aggregate messages from neighbors and update node states, while a convolutional GNN does not share parameters or even functions across steps. As a result, recurrent GNNs can learn to propagate the same kind of information between the nodes over several time steps, while convolutional GNNs usually learn a different kind of node representation at each step. We hypothesize that recurrent GNN architectures fit the type inference problem better because they can potentially model type propagation rules on TFGs. To validate this hypothesis, we compared the performance of two architectures: R-GNN, our baseline recurrent GNN, and its convolutional variant, C-GNN. As shown in Part 1 of Table 2, R-GNN outperforms C-GNN on all six accuracy metrics, indicating that our hypothesis holds true for the datasets used in our evaluation. However, since the difference in the accuracy metrics is small in some cases, further study may be warranted to better understand the capability of both architectures on type inference as part of future work.

Table 2: The accuracy comparison among various GNN designs and prior work.
Model | All 100 Types (Top-1 / Top-5) | Top-10 Frequent Types (Top-1 / Top-5) | Other 90 Types (Top-1 / Top-5)
Part 1: recurrent GNN vs. convolutional GNN
C-GNN | 83.10% / 96.99% | 85.70% / 98.68% | 52.33% / 77.04%
R-GNN | 84.27% / 97.61% | 86.51% / 99.00% | 57.77% / 81.23%
Part 2: the impact of the attention mechanism and edge features
R-GAT | 83.62% / 97.37% | 86.22% / 98.94% | 52.91% / 78.84%
R-GNN_NEF | 79.91% / 97.17% | 82.37% / 99.96% | 50.82% / 75.99%
R-GAT_NEF | 80.45% / 97.08% | 82.83% / 98.80% | 52.37% / 76.83%
Part 3: the impact of name segmentation and the contextual layer
R-GNN_NS | 86.89% / 97.66% | 89.43% / 98.80% | 56.81% / 84.25%
R-GNN_CTX | 86.43% / 98.00% | 88.29% / 99.08% | 64.43% / 85.21%
R-GNN_NS-CTX | 87.76% / 97.99% | 90.31% / 99.18% | 57.56% / 83.91%
Part 4: state-of-the-art deep learning type inference
DeepTyper | 84.62% / 97.79% | 86.88% / 99.08% | 57.94% / 82.57%
LambdaNet | 79.45% / 96.83% | 82.32% / 98.61% | 41.84% / 73.52%

4.1.3 The Attention Mechanism and Edge Features

The attention mechanism in message aggregation allows for assigning different importance to messages received from the neighborhood of a node, and thus can help filter out noise or irrelevant information coming from the neighbors. Here we study its effectiveness when applied to our type inference framework. Based on the results shown in Parts 1 and 2 of Table 2, the accuracy of R-GNN, which applies mean aggregation (Equation 7), is higher than that of R-GAT, which applies attention-based aggregation (Equations 8 and 9), across all accuracy metrics. This implies that attention did not provide a benefit for our framework on the inference problem studied in our evaluation.

Since the messages are the result of combining neighbors' node states and edge features (see Equation 5), the edge features may already encode enough information about a message's importance, which could reduce the usefulness of an attention mechanism. To assess the impact of edge features and their interaction with the attention mechanism, we also compared the accuracy of models that do not take edge features as input. As shown in Part 2 of Table 2, the two models, R-GNN_NEF and R-GAT_NEF, which correspond to R-GNN and R-GAT but with no edge feature inputs, yield worse accuracy than their counterparts, indicating that edge features play an important role in the message aggregation process. However, R-GAT_NEF still fails to show a significant advantage over R-GNN_NEF. This indicates that the attention mechanism may not be effective on our TFG structure, regardless of whether edge features are present.

4.1.4 IdentNode State Initialization

There is a general belief that identifier names convey programmers' intuition related to types. The IdentNode state initialization techniques discussed in Section 3.3.2 explore the potential of using identifier names and their context to help with type prediction. There are two improvements in node state initialization relative to the baseline method used in R-GNN, which simply assigns a trainable embedding vector to a name and uses it as the initial state for the corresponding identifier nodes. R-GNN_NS incorporates the first improvement – name segmentation – into R-GNN, and R-GNN_CTX adds the second improvement – the contextual layer – to R-GNN. R-GNN_NS-CTX combines the two improvements. In Part 3 of Table 2, we compare the accuracy of the three improved models against R-GNN. All three models produced higher accuracy than R-GNN, and the model that combines the two techniques (R-GNN_NS-CTX) gave better accuracy than the two models that incorporate only one of the techniques (R-GNN_NS and R-GNN_CTX). This indicates that the combination of name segmentation and the contextual layer can effectively capture type-relevant information and produce high-quality name embeddings to serve as IdentNode initial states, which benefits GNN-based type prediction.

4.2 RQ.2: How efficient is our GNN-based type inference?

For our approach to be applicable to real-world interactive programming environments, it is important for our techniques to be efficient in practice. To evaluate the efficiency of the models in our framework, we measured their sizes (i.e., the number of model parameters) and inference throughputs. The results are shown in Figure 8. Here we only compare the models that take in edge features.

The number of parameters in a model affects its applicability in several ways. A larger model requires more memory to load and can sometimes incur a larger computation workload. The blue bars in Figure 8(a) show the number of parameters of our GNN models. C-GNN contains the largest number of parameters due to its non-recurrent design that requires a separate set of parameters for each step. The size of R-GAT is slightly larger than that of R-GNN because of its attention mechanism, which requires extra parameters. Among the R-GNN variants, the two with name segmentation are significantly smaller than the others thanks to their smaller name segment embedding size (32-dimensional), although the contextual layer increases the model size by bringing in more parameters.

Figure 8(b) compares the computation efficiency of the models by showing their inference throughputs. This metric is computed as the number of source files in the test set divided by the total runtime of performing inference on those files. The batch size used for inference is set to 64 graphs for all models. We took six measurements for each model and report their mean and standard deviation. C-GNN and R-GNN are the two simplest models in terms of architecture, requiring less computation during inference, and thus produced the highest throughputs. The attention mechanism in R-GAT adds computation overhead, causing it to produce a lower throughput than R-GNN. Similarly, name segmentation and the contextual layer both slow down the models that contain them. It is worth noting that the addition of the contextual layer brings more computation overhead because it is based on a recurrent neural network, whose workload is determined by the length of the input code token sequence.

Figure 8: The comparison of model size and computation efficiency among our GNN architectures and models proposed in prior work. (a) Number of parameters in a model. (b) Inference throughput on the test set; the error bars show the standard deviation of 6 measurements.

4.3 RQ.3: How does our approach compare with state-of-the-art deep learning type inference approaches?

We compared our approach with two state-of-the-art deep learning type inference approaches: DeepTyper [11] and LambdaNet [25].

DeepTyper treats a source file as a sequence of tokens and runs a two-layer bidirectional RNN over it to predict types for identifier tokens. To improve the consistency of type predictions for the same variable, the model also includes a consistency layer between the two RNN layers that takes the mean of the hidden states of identifiers with the same name and adds it back to those hidden states. We implemented DeepTyper in our evaluation framework and used the same hyperparameters as suggested in their paper and open-source implementation (300-dimensional embeddings and 650-dimensional hidden layers; https://github.com/DeepTyper/DeepTyper).
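The consistency layer can be sketched as follows (our illustration; DeepTyper's actual implementation differs in details such as batching and weighting):

# For identifier occurrences that share a name, add the mean of their hidden states back to
# each occurrence; non-identifier tokens (name id -1) are left unchanged.
import torch

def consistency_layer(hidden, name_ids):
    """hidden: [T, d] per-token hidden states; name_ids: [T] long, -1 for non-identifier tokens."""
    out = hidden.clone()
    for name in name_ids[name_ids >= 0].unique():
        idx = (name_ids == name).nonzero(as_tuple=True)[0]
        out[idx] = hidden[idx] + hidden[idx].mean(dim=0)
    return out

hidden = torch.randn(6, 8)
name_ids = torch.tensor([0, -1, 1, 0, -1, 1])     # tokens 0 & 3 share a name; 2 & 5 share another
print(consistency_layer(hidden, name_ids).shape)  # torch.Size([6, 8])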

LambdaNet is a GNN-based neural type inference model. It extracts a project-level type dependency graph (TDG) through static analysis and uses a convolutional GNN to perform type inference for the graph nodes, each of which is associated with a unique variable or an expression. LambdaNet's TDG includes a manually defined set of edge types and labels. In contrast, our TFG automatically derives hundreds of edge labels from AST node types and parent-child relationships, which provide richer information to the GNN model and are also more easily adaptable to other languages. We also implemented LambdaNet in our evaluation framework. Unlike the original LambdaNet, our implementation gives a prediction for each occurrence of a variable instead of one prediction per declaration, and uses a prediction space consisting of types in a fixed vocabulary. These changes allow us to compare the efficacy of LambdaNet's TDG and our TFG without the potential interference introduced by different prediction schemes. We did not change the neural network architecture and used the hyperparameters suggested in their paper and open-source implementation (32-dimensional embeddings and hidden layers, six steps of graph propagation; https://github.com/MrVPlusOne/LambdaNet).

The reason why we chose to reimplement the two state-of-the-art approaches in our own framework, instead of using their released code, is to obtain an apples-to-apples comparison. We studied both DeepTyper’s and LambdaNet’s code repositories and had the following observations:

  • Their training code is tightly coupled with their data processing code, making it hard to reuse their code with a different data preprocessing approach such as that in our approach.

  • Their implementations are based on different deep learning frameworks which could result in efficiency variations for the same model, bringing challenges to our efficiency comparison.

  • LambdaNet’s authors implemented a variant of DeepTyper in their framework for comparison. However, our preliminary experiments showed that some changes introduced by the variant had a negative impact on accuracy on our dataset.

Based on those observations, we decided to use our own implementations of the two approaches in the experiments.

Part 4 of Table 2 shows the accuracy DeepTyper and LambdaNet achieved in our experiments. DeepTyper delivered a similar level of accuracy to R-GNN, our baseline GNN model. However, it is outperformed in top-1 accuracy by the three variants of R-GNN that integrate name segmentation and/or the contextual layer. LambdaNet failed to reach the same level of accuracy as R-GNN. In terms of top-1 accuracy for predicting all 100 types, our best model, R-GNN_NS-CTX, outperformed DeepTyper by 3.14% (absolute) and LambdaNet by 8.31%. Note that since we ported both prior approaches into our experimental framework, which contains different data processing procedures, the accuracy numbers we obtained cannot be directly compared with the ones reported in their papers. This is especially the case for LambdaNet due to the changes we made to it in our implementation.

The orange bars in Figure 8 show the sizes and inference throughputs of DeepTyper and LambdaNet. As can be seen from Figure 8(a), DeepTyper contains significantly more parameters than the other models because of its large hidden state size. In contrast, LambdaNet employs a relatively small hidden state size, resulting in a more compact model. The sizes of our GNN models fall in between and are closer to LambdaNet's. As shown in Figure 8(b), DeepTyper yielded the lowest throughput among all the models being compared, because it runs multiple RNN layers over the code token sequence, which can be very long. Our models with the contextual layer (R-GNN_CTX and R-GNN_NS-CTX) are affected by the same issue, but still achieved higher throughputs than DeepTyper because the contextual layer uses only a single RNN layer. The other four GNN models, which do not use the contextual layer, achieved higher throughputs than LambdaNet. The reason is that LambdaNet's TDGs generally contain more edges than our TFGs, and therefore require more computation during message passing and aggregation.
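
To illustrate why per-step cost tracks edge count, the following is a minimal PyTorch sketch of one relational message-passing step (our own illustration, not the code of either system): one message is computed per edge and then aggregated at the destination nodes, so a denser graph does proportionally more work per propagation step.

import torch

def message_passing_step(node_states, edges, edge_label_emb, linear):
    # node_states: (N, d); edges: (E, 3) long tensor of (src, dst, label) triples;
    # edge_label_emb: an nn.Embedding over edge labels with dimension d;
    # linear: an nn.Linear(2 * d, d) message function.
    src, dst, label = edges[:, 0], edges[:, 1], edges[:, 2]
    # One message per edge: source state combined with the edge-label embedding,
    # so the cost of this step is proportional to the number of edges E.
    messages = linear(torch.cat([node_states[src], edge_label_emb(label)], dim=-1))
    # Sum-aggregate the messages at their destination nodes.
    aggregated = torch.zeros_like(node_states)
    aggregated.index_add_(0, dst, messages)
    return aggregated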

5 Related Work

As mentioned earlier, there has been a significant amount of work during the past two decades on applying machine learning to address a number of compilation-related problems, and this work has been enabled by the rapid growth of open source online code repositories. We only discuss the most closely related efforts from past work in this section.

JSNice [21] is a program property predictor for variable names and type information. It applies probabilistic reasoning to infer primitive types in JavaScript code. Its input is a dependency network among the variables in a JavaScript program. The trained model learns statistical correlations among the nodes of the networks extracted from the code corpus and predicts type information for variables by ranking candidates by probability. More recently, a deep-neural-network-based type inference mechanism, DeepTyper, was proposed [11]. Inspired by past work on natural language processing, DeepTyper takes a tokenized source program as input and produces type tokens for identifiers. Its architecture employs a bidirectional gated recurrent unit network to capture a large context around each token, and it also incorporates naming information (e.g., variable names) into the type inference. LambdaNet [25] used a graph neural network to perform type inference. It leveraged an auxiliary analysis to build a dataflow-graph-like structure – the type dependency graph – as input, which establishes relationships among program elements. Section 4 included a detailed comparison of the accuracy and efficiency of our approach relative to DeepTyper and LambdaNet.

The application of GNNs to program analyses can be seen in other recent work as well. One example is the gated graph neural network (GGNN) [14], which has been used to learn program representations [2]. The graph used in that work extends the abstract syntax tree with a few additional edge types that capture extra information, including lexical order and variable usages, and the approach is applied to program property prediction tasks such as variable misuse detection. [7] introduced a generative code modeling approach, also based on GGNNs, that can be applied to program repair and variable-misuse tasks. [12] used a combination of sequence layers and graph message-passing layers to capture contextual information from the input program; by leveraging more program context through the sequence layers, their model outperformed the earlier work in [2]. [23] introduced a reinforcement learning system to infer loop invariants in an input program, using a GNN to construct a structural external memory representation of the program.
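
For reference, a GGNN propagation step can be sketched as follows (dimensions, names, and the per-edge-type linear message functions are an illustrative simplification of Li et al. [14]): neighbor messages are aggregated per node and fed to a GRU cell that updates the node state.

import torch
import torch.nn as nn

class GGNNStep(nn.Module):
    def __init__(self, dim, num_edge_types):
        super().__init__()
        # One linear message function per edge type.
        self.msg = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_edge_types)])
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, node_states, edges_by_type):
        # node_states: (N, d); edges_by_type[t]: (E_t, 2) long tensor of (src, dst).
        agg = torch.zeros_like(node_states)
        for t, edges in enumerate(edges_by_type):
            if edges.numel() == 0:
                continue
            src, dst = edges[:, 0], edges[:, 1]
            agg.index_add_(0, dst, self.msg[t](node_states[src]))
        # GRU update: aggregated messages are the input, previous states the hidden state.
        return self.gru(agg, node_states)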

Compared with past work, our approach introduces a lightweight GNN-based type inference system that constructs a simple graph structure (the type flow graph) via an efficient traversal of the abstract syntax tree. To improve accuracy while maintaining high efficiency, we evaluated several GNN design choices, including a recurrent update process, an attention mechanism, and a contextual layer that improves graph node state initialization. Based on our neural architecture design approach, our GNN-based system shows strong potential as a viable candidate for lightweight analyses with superior accuracy for use in a variety of interactive programming tasks.

6 Conclusions and Future Work

In this paper, we advanced past work by introducing a range of graph-based deep learning models that operate on a novel type flow graph (TFG) representation. Our GNN-based models learn to propagate type information on the graph and then predict types for the graph nodes. We studied different design choices for our GNN-based type inference system for the 100 most common types in our evaluation corpus, and showed that our best GNN configuration for accuracy (R-GNN_NS-CTX) achieves a top-1 accuracy of 87.76%. This outperforms the two most closely related deep learning type inference approaches from past work – DeepTyper, with a top-1 accuracy of 84.62%, and LambdaNet, with a top-1 accuracy of 79.45%. Alternatively, we can state that the error (100% - accuracy) of R-GNN_NS-CTX is 0.80× that of DeepTyper and 0.60× that of LambdaNet. Further, the average inference throughput of R-GNN_NS-CTX is 353.8 files/second, compared to 186.7 files/second for DeepTyper and 1,050.3 files/second for LambdaNet. If inference throughput is a higher priority, then the recommended model from our approach is the next-best GNN configuration in terms of accuracy (R-GNN_NS), which achieved a top-1 accuracy of 86.89% and an average inference throughput of 1,303.9 files/second. Thus, our work introduces advances in graph-based deep learning that yield superior accuracy and performance relative to past work on probabilistic type analysis, while also providing a range of GNN models that could be applicable in the future to other graph structures used in program analysis beyond the TFG.

References

  • [1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Comput. Surv., 51(4), July 2018. doi:10.1145/3212695.
  • [2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. CoRR, abs/1711.00740, 2017. URL: http://arxiv.org/abs/1711.00740, arXiv:1711.00740.
  • [3] Miltiadis Allamanis and Charles Sutton. Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 472–483, New York, NY, USA, 2014. ACM. URL: http://doi.acm.org/10.1145/2635868.2635901, doi:10.1145/2635868.2635901.
  • [4] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: learning distributed representations of code. PACMPL, 3(POPL):40:1–40:29, 2019. doi:10.1145/3290353.
  • [5] Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. Active learning of points-to specifications. In Jeffrey S. Foster and Dan Grossman, editors, Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018, pages 678–692. ACM, 2018. doi:10.1145/3192366.3192383.
  • [6] Peter Battaglia, Jessica Blake Chandler Hamrick, Victor Bapst, Alvaro Sanchez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andy Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Jayne Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. arXiv, 2018. URL: https://arxiv.org/pdf/1806.01261.pdf.
  • [7] Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, and Oleksandr Polozov. Generative code modeling with graphs. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL: https://openreview.net/forum?id=Bke4KsA5FX.
  • [8] Victor Chibotaru, Benjamin Bichsel, Veselin Raychev, and Martin T. Vechev. Scalable taint specification inference with big code. In Kathryn S. McKinley and Kathleen Fisher, editors, Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, Phoenix, AZ, USA, June 22-26, 2019, pages 760–774. ACM, 2019. doi:10.1145/3314221.3314648.
  • [9] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi, editors, Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, pages 103–111. Association for Computational Linguistics, 2014. URL: https://www.aclweb.org/anthology/W14-4012/, doi:10.3115/v1/W14-4012.
  • [10] Philip Gage. A new algorithm for data compression. C Users J., 12(2):23–38, February 1994.
  • [11] Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and Miltiadis Allamanis. Deep learning type inference. In Gary T. Leavens, Alessandro Garcia, and Corina S. Pasareanu, editors, Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018, pages 152–162. ACM, 2018. doi:10.1145/3236024.3236051.
  • [12] Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David Bieber. Global relational models of source code. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020. URL: https://openreview.net/pdf?id=Hkx6hANtwH.
  • [13] Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, and Andrea Janes. Big code != big vocabulary: Open-vocabulary models for source code. arXiv preprint arXiv:2003.07914, 2020.
  • [14] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL: http://arxiv.org/abs/1511.05493.
  • [15] Percy Liang and Mayur Naik. Scaling abstraction refinement via pruning. In Mary W. Hall and David A. Padua, editors, Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, June 4-8, 2011, pages 590–601. ACM, 2011. doi:10.1145/1993498.1993567.
  • [16] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.
  • [17] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Bayesian specification learning for finding API usage errors. In Eric Bodden, Wilhelm Schäfer, Arie van Deursen, and Andrea Zisman, editors, Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, pages 151–162. ACM, 2017. doi:10.1145/3106237.3106284.
  • [18] Hakjoo Oh, Wonchan Lee, Kihong Heo, Hongseok Yang, and Kwangkeun Yi. Selective context-sensitivity guided by impact pre-analysis. In Michael F. P. O’Boyle and Keshav Pingali, editors, ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, Edinburgh, United Kingdom - June 09 - 11, 2014, pages 475–484. ACM, 2014. doi:10.1145/2594291.2594318.
  • [19] Hakjoo Oh, Hongseok Yang, and Kwangkeun Yi. Learning a strategy for adapting a program analysis via bayesian optimisation. In Jonathan Aldrich and Patrick Eugster, editors, Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, pages 572–588. ACM, 2015. doi:10.1145/2814270.2814309.
  • [20] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 8024–8035, 2019. URL: http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.
  • [21] Veselin Raychev, Martin T. Vechev, and Andreas Krause. Predicting program properties from "big code". In Sriram K. Rajamani and David Walker, editors, Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015, pages 111–124. ACM, 2015. doi:10.1145/2676726.2677009.
  • [22] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909, 2015. URL: http://arxiv.org/abs/1508.07909, arXiv:1508.07909.
  • [23] Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, and Le Song. Learning loop invariants for program verification. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7762–7773, 2018. URL: http://papers.nips.cc/paper/8001-learning-loop-invariants-for-program-verification.
  • [24] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL: https://openreview.net/forum?id=rJXMpikCZ.
  • [25] Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. LambdaNet: Probabilistic type inference using graph neural networks. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020. URL: https://openreview.net/pdf?id=Hkx6hANtwH.
  • [26] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2020.
  • [27] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853, 2015. URL: http://arxiv.org/abs/1505.00853, arXiv:1505.00853.
  • [28] Xin Zhang, Mayur Naik, and Hongseok Yang. Finding optimum abstractions in parametric dataflow analysis. In Hans-Juergen Boehm and Cormac Flanagan, editors, ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, Seattle, WA, USA, June 16-19, 2013, pages 365–376. ACM, 2013. doi:10.1145/2491956.2462185.