
Improving Dynamic Code Analysis by Code Abstraction

Isabella Mastroeni
Department of Computer Science, University of Verona (Italy)
[email protected]

Vincenzo Arceri
Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice (Italy)
[email protected]

Abstract

In this paper, our aim is to propose a model for code abstraction, based on abstract interpretation, allowing us to improve the precision of a recently proposed static analysis by abstract interpretation of dynamic languages. The problem we tackle here is that the analysis may add some spurious code to the string-to-execute abstract value and this code may need some abstract representations in order to make it analyzable. This is precisely what we propose here, where we drive the code abstraction by the analysis we have to perform.

1 Introduction

The possibility of dynamically building code instructions as the result of text manipulation is a key aspect in dynamic programming languages. In this scenario, programs can turn text, which can be built at run-time, into executable code [26]. These features are often used in code protection and tamper-resistant applications, employing camouflage for escaping attack or detection [22], in malware, in mobile code, in web servers, in code compression, and in code optimization, e.g., in Just-in-Time (JIT) compilers, employing optimized run-time code generation.
While the use of dynamic code generation may simplify considerably the art and performance of programming, this practice is also highly dangerous, making the code prone to unexpected behaviors and malicious exploits of its dynamic vulnerabilities, such as code/object-injection attacks for privilege escalation, database corruption, and malware propagation. It is clear that more advanced and secure functionalities based on string-to-code statements could be permitted if we better master how to safely generate, analyze, debug, and deploy programs that dynamically generate and manipulate code.

There are many good reasons to analyze programs that build strings which can later be executed as code. An interesting example is code obfuscation. Recently, several techniques have been proposed for JavaScript code obfuscation¹, meaning that client-side code protection is also becoming an increasingly important problem to be tackled by the research community and by practitioners. Hence, it is not always possible to simply ignore eval without giving up the possibility of analyzing the rest of the program [4].

¹ https://www.daftlogic.com/projects-online-javascript-obfuscator.htm, http://www.danstools.com/javascript-obfuscate/, http://javascript2img.com/, https://javascriptobfuscator.herokuapp.com/, https://javascriptobfuscator.com/

The Context: Analyzing Dynamic Code.

A major problem in the presence of dynamic code generation is that static analysis becomes extremely hard, if not impossible. This happens because program data structures, such as the control-flow graph and the system of recursive equations associated with the program in question, are themselves dynamically mutating objects. Recently [4], the problem of analyzing dynamic code has been tackled by treating code as any other dynamic structure that can be statically analyzed by abstract interpretation, and by treating the abstract interpreter as any other program function that can be recursively called. In [4], we provide a static analyzer architecture for a core dynamic language containing non-removable eval statements; it still has some limitations in terms of precision, but it provides the necessary ground for studying more precise solutions to the problem. In particular,

$\bullet$

We have designed an automata-based string abstract domain [5] for analyzing string values during execution. Automata (FA) provide the perfect choice for abstracting strings that may be executed by eval since they allow us to over-approximate the set of possible values of string variables by keeping enough information for both analyzing properties of string variables that are never executed by an eval during computation and for extracting the potential executable sub-language.
$\bullet$

In order to statically analyze the code potentially executed by an eval, we have designed a systematic process for extracting from the (abstract) argument of eval (i.e., from the FA collection of its potential arguments) an over-approximation of executable code that this collection contains. Clearly, this approximation must keep a form that the analyzer can interpret.
$\bullet$

We designed a static analyzer for dynamic languages performing a recursive call of the interpreter on the (over-approximated) code that eval may execute.

The Problem: Improve Precision Analysis by Abstracting Code.

This analysis provides a first step towards the analysis of dynamic languages but still suffers from some important precision loss [4]. In particular, there are forms of FA (which occur when the string is dynamically generated by loops) that prevent the generation of a control flow graph ($\mathsf{CFG}$) able to approximate the code executed by an eval. For instance, when the FA accepts a language such as $\{\,\mathtt{x=(5)}^{n}\mathtt{;} \mid n>0\,\}$, the analysis in [4] cannot extract, from the FA, the $\mathsf{CFG}$ approximating the eval argument. To better explain the problem, consider the code in Fig. 1, where the value of i is statically unknown. In Fig. 1, we draw the automaton $A$ representing the abstract value of str before the eval execution. The problem is that $A$ has a cycle not involving a whole statement [4]. This situation makes the analyzer unable to build a $\mathsf{CFG}$ over-approximating the code potentially executed since, intuitively, such a $\mathsf{CFG}$ should be infinite. Indeed, only an infinite $\mathsf{CFG}$ could capture all the possible assignments described by the FA, namely all the assignments to the variable $\mathtt{x}$ of any number formed only by the digit $\mathtt{5}$ (i.e., x=5;, x=55;, x=555;, $\dots$).

    str = "x=5";
    while (i < 3) {
      str = str + "5";
      i = i + 1;
    }
    str = str + ";";
    eval(str);

Figure 1: The automaton $A$ such that $\mathcal{L}(A)=\{\,\mathtt{x=5}^{n}\mathtt{;} \mid n>0\,\}$, where $\mathtt{5}^{n}$ means $\mathtt{5}$ repeated $n$ times.
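To see concretely why the value of str ranges over the language of $A$, the fragment of Fig. 1 can be replayed for several initial values of i. The following Python sketch is only an illustration under our own assumptions: `build_str` is a hypothetical transliteration of the fragment (without the final eval), and the regular expression `x=5+;` is our rendering of $\mathcal{L}(A)$.

```python
import re

# Our rendering of L(A) = { x=5^n; | n > 0 } as a regular expression (assumption).
LANG_A = re.compile(r"x=5+;")

def build_str(i: int) -> str:
    """Hypothetical transliteration of the Fig. 1 fragment for a given initial i."""
    s = "x=5"
    while i < 3:
        s = s + "5"
        i = i + 1
    return s + ";"

# i is statically unknown, so we sample a few initial values.
samples = sorted({build_str(i) for i in range(-3, 6)}, key=len)
for s in samples:
    # Every generated string belongs to L(A).
    assert LANG_A.fullmatch(s) is not None
print(samples)
```

Each smaller initial value of i adds one more digit to the assigned number, which is why no finite set of assignment statements covers all the strings accepted by $A$.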

In order to overcome this limitation, at least for a set of potential eval patterns, we propose to define a form of abstract $\mathsf{CFG}$ able to finitely represent a potentially infinite set of $\mathsf{CFG}$s, e.g., we look for a $\mathsf{CFG}$ representing x=5^∗.
Unfortunately, things are not as easy as they may seem, since this abstract code representation has to be built in such a way that the analyzer is still able to interpret it.

Contribution.

The main contributions for tackling the problem above are:

$\bullet$

We first define the notion of abstract $\mathsf{CFG}$ , based on the idea of making it possible to still perform a given analysis. The idea is to leave the control structure unchanged while approximating the edge labels (the statements to execute) to sets of labels, i.e., those sharing a fixed abstract property.
$\bullet$

We show how completeness of code abstraction w.r.t. the semantic observation models the possibility, for the static analyzer, of interpreting also the abstract code, and we show how we can make any code abstraction complete.
$\bullet$

We provide a systematic approach, based on the one proposed in [4], allowing us to analyze also the eval patterns described above, for which, instead, the analysis in [4] loses precision.

2 The Core Language: Imp

The language is quite standard (see Fig. 2²), and each statement is annotated with a label $\ell\in\mathit{Lab}$ (not part of the syntax) corresponding to the program point of the statement³.

² We use $n$ to denote the semantic value corresponding to the syntactic symbol n.
³ We suppose that there exists a function that, given a well-written program, labels it with a fresh label for each program point.

$$\begin{array}{rcl}
\mathrm{Exp}\ni\mathtt{e} & ::= & \mathtt{a}\mid\mathsf{s}\\
\mathrm{AExp}\ni\mathtt{a} & ::= & \mathtt{x}\mid\mathtt{n}\mid\mathtt{a}+\mathtt{a}\mid\mathtt{a}-\mathtt{a}\mid\mathtt{a}*\mathtt{a}\\
\mathrm{BExp}\ni\mathtt{b} & ::= & \mathtt{x}\mid\mathtt{true}\mid\mathtt{false}\mid\mathtt{e}=\mathtt{e}\mid\mathtt{e}>\mathtt{e}\mid\mathtt{e}<\mathtt{e}\mid\mathtt{b}\wedge\mathtt{b}\mid\neg\mathtt{b}\\
\mathrm{SExp}\ni\mathsf{s} & ::= & \mathtt{x}\mid\mathtt{"}\sigma\mathtt{"}\mid\mathtt{concat}(\mathsf{s},\mathsf{s})\mid\mathtt{substr}(\mathsf{s},\mathtt{a},\mathtt{a})\\
\mathrm{Comm}\ni\mathtt{c} & ::= & {}^{\ell_1}\mathbf{skip}\,{}^{\ell_2}\mid{}^{\ell_1}\mathtt{x:=e}\,{}^{\ell_2}\mid{}^{\ell_1}\mathtt{c};{}^{\ell_2}\mathtt{c}\,{}^{\ell_3}\mid{}^{\ell_1}\mathtt{if}\ (\mathtt{b})\ \{{}^{\ell_2}\mathtt{c}\,{}^{\ell_3}\}\ \{{}^{\ell_4}\mathtt{c}\,{}^{\ell_5}\}\,{}^{\ell_6}\\
 & \mid & {}^{\ell_1}\mathtt{while}\ (\mathtt{b})\ \{{}^{\ell_2}\mathtt{c}\,{}^{\ell_3}\}\,{}^{\ell_4}\mid{}^{\ell_1}\mathtt{eval}(\mathsf{s})\,{}^{\ell_2}\\
\mathsf{Imp}\ni\mathtt{P} & ::= & {}^{\ell_\iota}\mathtt{c};{}^{\ell_2}\qquad\text{where }\mathsf{Id}\ni\mathtt{x}\text{ (Identifiers)},\ n\in\mathbb{Z},\ \sigma\in\Sigma^{*}
\end{array}$$

Figure 2: Syntax of Imp

In order to analyze a program $\mathsf{P}\in\mathsf{Imp}$, we need to model it by building a corresponding control flow graph [28] ($\mathsf{CFG}$ for short), which embeds the control structure in the graph structure and leaves on the edges (or, equivalently, on the nodes) only the access to states, i.e., manipulations of the state (assignments) and guards. The approach we use is quite standard, and we follow [28] for the construction of the control flow graph. For technical details see [4]; here we show the construction on the example in Fig. 3, where $i$ denotes the node corresponding to the program point $\ell_i$.

Note that, by construction [4], the language of the $\mathsf{CFG}$ edge labels is an intermediate language slightly different from the Imp grammar. Edge labels correspond to a primitive statement (i.e., an assignment or eval) or a boolean guard, namely they form the language $\Psi$ generated by the grammar $\mathtt{l}::=\;\mathtt{x:=e}\mid\mathtt{b}\mid\mathtt{eval}(\mathsf{s})$.
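As an illustration of a graph whose edges carry labels in $\Psi$, the following Python sketch encodes a $\mathsf{CFG}$ for the program of Fig. 1. It is a hand-built example under our own assumptions: the class names `Edge` and `CFG`, the node numbering, and the textual form of the labels are hypothetical, and the graph is not produced by the construction of [28] or [4].

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    src: int    # node of the source program point
    label: str  # a label in Psi: an assignment, a guard, or eval(s)
    dst: int    # node of the target program point

@dataclass
class CFG:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, src: int, label: str, dst: int) -> None:
        self.nodes.update((src, dst))
        self.edges.append(Edge(src, label, dst))

    def successors(self, node: int) -> list:
        """Return the (label, target) pairs of the edges leaving `node`."""
        return [(e.label, e.dst) for e in self.edges if e.src == node]

# Hand-built CFG for the program of Fig. 1 (node numbering is an assumption).
g = CFG()
g.add_edge(1, 'str:="x=5"', 2)
g.add_edge(2, 'i<3', 3)                    # guard: enter the loop body
g.add_edge(3, 'str:=concat(str,"5")', 4)
g.add_edge(4, 'i:=i+1', 2)                 # back edge to the loop head
g.add_edge(2, '¬(i<3)', 5)                 # guard: exit the loop
g.add_edge(5, 'str:=concat(str,";")', 6)
g.add_edge(6, 'eval(str)', 7)

print(g.successors(2))
```

The control structure (the while loop) only survives as the shape of the graph, while the edges carry assignments, guards, and the eval.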

Concrete Semantics.

The concrete semantics of our language Imp is intuitive and is fully reported in [3]. Since our aim is to analyze Imp programs by analyzing their $\mathsf{CFG}$s, we focus here only on the interpretation of $\mathsf{CFG}$ labels [28]. In particular, we have to specify the semantics associated with each possible edge of the $\mathsf{CFG}$. In other words, we have to formalize how each statement transforms the current state, which is represented as a store, namely as an association between identifiers and values. It is well known that static program analysis works by computing an (abstract) collecting semantics, namely, for each program point $\ell$ and for each variable x, it computes the set of values that x can have at $\ell$ in any computation. Hence, we define (collecting) memories $\mathbb{m}$, associating with each variable a set of values. The basic values of Imp are integers, booleans and strings, hence we define the set of memories as $\mathbb{M}\stackrel{\text{def}}{=}\mathsf{Var}\rightarrow(\wp(\mathbb{Z})\cup\mathsf{Bool}\cup\wp(\Sigma^{*}))$, ranged over by the meta-variable $\mathbb{m}$, where $\mathsf{Bool}=\wp(\{\mathtt{false},\mathtt{true}\})$. Let us denote by $\mathbb{V}$ this domain of collections of values $\wp(\mathbb{Z})\cup\mathsf{Bool}\cup\wp(\Sigma^{*})$. The update of memory $\mathbb{m}$ for a variable x with a set of values $v$ is denoted $\mathbb{m}[\mathtt{x}/v]$.
The partial order $\sqsubseteq$ between memories is defined as $\mathbb{m}_1\sqsubseteq\mathbb{m}_2\Leftrightarrow\forall\mathtt{x}\in\mathsf{Id}.\>\mathbb{m}_1(\mathtt{x})\subseteq\mathbb{m}_2(\mathtt{x})$. Finally, lub and glb of memories are computed point-wise, i.e., $\mathbb{m}_1\sqcup\mathbb{m}_2\stackrel{\text{def}}{=}\lambda\mathtt{x}.\>\mathbb{m}_1(\mathtt{x})\cup\mathbb{m}_2(\mathtt{x})$ and $\mathbb{m}_1\sqcap\mathbb{m}_2\stackrel{\text{def}}{=}\lambda\mathtt{x}.\>\mathbb{m}_1(\mathtt{x})\cap\mathbb{m}_2(\mathtt{x})$.
The collecting (input/output) semantics of statements $\mathtt{c}\in\Psi$ is defined as the function $\llbracket\mathtt{c}\rrbracket:\mathbb{M}\longrightarrow\mathbb{M}$. We denote by $\llparenthesis\cdot\rrparenthesis$ the collecting semantics of expressions, defined as the additive lift⁴ to sets of memories of the standard expression semantics. We abuse notation by denoting as $\llbracket\cdot\rrbracket$ also its additive lift to sets of statements.

⁴ Let $f:S\rightarrow S$ be a generic function; by additive lift we mean its extension to sets of elements, i.e., $\forall X\subseteq S$ we define $f(X)\stackrel{\text{def}}{=}\{\,f(x) \mid x\in X\,\}$. If $f:S\rightarrow\wp(S)$, then its lift to sets is $f(X)\stackrel{\text{def}}{=}\bigcup\{\,f(x) \mid x\in X\,\}$.

$$\begin{array}{rcl}
\llbracket\mathtt{x:=e}\rrbracket\mathbb{m} & = & \mathbb{m}[\mathtt{x}/\llparenthesis\mathtt{e}\rrparenthesis\mathbb{m}]\qquad\llbracket\mathtt{b}\rrbracket\mathbb{m}=\mathbb{m}\sqcap\bigsqcup\{\,\mathbb{m} \mid \llparenthesis\mathtt{b}\rrparenthesis\mathbb{m}=\mathtt{true}\,\}\\
\llbracket\mathtt{eval}(\mathsf{s})\rrbracket\mathbb{m} & = & \llbracket\llparenthesis\mathsf{s}\rrparenthesis\mathbb{m}\Cap\mathsf{Imp}\rrbracket\mathbb{m}
\end{array}$$

where $\Cap$ is the intersection in the set of Imp programs. By computing the traces of application of this transfer function, starting from any possible input memory, we precisely compute the maximal trace semantics [23].
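The point-wise order on collecting memories and the assignment case of the transfer function can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions: memories are dictionaries from variable names to finite `frozenset`s, an expression is modelled as a hypothetical Python function from a concrete store to a value, and guards and eval are left out.

```python
from functools import reduce
from itertools import product

def leq(m1, m2):
    """Point-wise inclusion between memories over the same variables."""
    return all(m1[x] <= m2[x] for x in m1)

def lub(m1, m2):
    return {x: m1[x] | m2[x] for x in m1}

def glb(m1, m2):
    return {x: m1[x] & m2[x] for x in m1}

def update(m, x, values):
    """m[x/v]: a copy of m where x is bound to the set of values v."""
    new = dict(m)
    new[x] = frozenset(values)
    return new

def eval_expr(expr, m):
    """Additive lift of `expr` (store -> value) to a collecting memory."""
    names = sorted(m)
    stores = (dict(zip(names, vals)) for vals in product(*(m[x] for x in names)))
    return frozenset(expr(store) for store in stores)

def assign(x, expr):
    """Collecting semantics of the label x := e."""
    return lambda m: update(m, x, eval_expr(expr, m))

def exec_set(stmts, m):
    """Additive lift to a non-empty set of statements: join of the single effects."""
    return reduce(lub, (stmt(m) for stmt in stmts))

m = {"x": frozenset({1, 2}), "y": frozenset({10})}
inc = assign("x", lambda s: s["x"] + 1)
print(sorted(inc(m)["x"]))
```

Since the memories are non-relational, `eval_expr` enumerates every combination of the values recorded for the single variables, which is exactly where relations between variables are lost.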

Static Analysis on $\mathsf{CFG}$ : Semantic Abstraction.

It is well known that when we perform static analysis on a $\mathsf{CFG}$, we interpret, on the corresponding abstract domain, all the edges, and more specifically all the labels (in $\Psi$) [28]. This is also a quite standard approach, but we recall it here to fix the notation used. We assume that values are abstracted on the coalesced sum [3] of the $\mathsf{Sign}$ abstract domain for integers, of the concrete domain for booleans, and of the (deterministic) finite state automata abstract domain for strings [3]⁵. Let us consider an abstraction $\rho\in\mathit{uco}(\mathbb{V})$⁶ of the values manipulated by our language; we denote by $\mathbb{M}^{\rho}:\mathsf{Var}\longrightarrow\rho(\mathbb{V})$ the set of (collecting) memories where sets of values are abstracted by $\rho$, ranged over by $\mathbb{m}^{\rho}$.
In the following, we abuse notation by applying $\rho$ to memories in $\mathbb{M}$, simply by defining $\rho(\mathbb{m})\in\mathbb{M}^{\rho}$ as $\rho(\mathbb{m}):\mathtt{x}\in\mathsf{Var}\mapsto\rho(\mathbb{m}(\mathtt{x}))$⁷. In this way, we can see abstract memories as sets of concrete memories, and therefore as particular collecting memories, i.e., $\mathbb{M}^{\rho}\subseteq\mathbb{M}$. Finally, we can define the abstract edge effect $\llbracket\cdot\rrbracket^{\rho}$ [28], telling us how to abstractly interpret each edge of the $\mathsf{CFG}$:

⁵ A string static analyzer using the finite state automata abstract domain has been developed and is available in [3].
⁶ For the sake of simplicity, here we abuse notation by considering a unique $\rho$, which is indeed the coalesced sum of three abstractions, one for integers, one for booleans and one for strings.
⁷ For the sake of simplicity of presentation and implementation, we have considered here non-relational abstractions of data; nevertheless, we believe that it is possible to easily extend our work to relational abstractions.

$$\begin{array}{rcl}
\llbracket\mathtt{x:=e}\rrbracket^{\rho}\mathbb{m}^{\rho} & = & \mathbb{m}^{\rho}[\mathtt{x}/\rho(\llparenthesis\mathtt{e}\rrparenthesis^{\rho}\mathbb{m}^{\rho})]\qquad\qquad\llbracket\mathtt{b}\rrbracket^{\rho}\mathbb{m}^{\rho}=\mathbb{m}^{\rho}\sqcap\rho(\sqcup\{\,\mathbb{m} \mid \mathtt{true}\in\llparenthesis\mathtt{b}\rrparenthesis^{\rho}\mathbb{m}^{\rho}\,\})\\
\llbracket\mathtt{eval}(\mathsf{s})\rrbracket^{\rho}\mathbb{m}^{\rho} & = & \llbracket\llparenthesis\mathsf{s}\rrparenthesis^{\rho}\mathbb{m}^{\rho}\Cap\mathsf{Imp}\rrbracket^{\rho}\mathbb{m}^{\rho}
\end{array}$$

where $\llparenthesis\cdot\rrparenthesis^{\rho}\stackrel{\text{def}}{=}\rho\circ\llparenthesis\cdot\rrparenthesis\circ\rho$. The semantics of a path in the $\mathsf{CFG}$ is the composition of the interpretation of each edge, and the interpretation of an edge is the interpretation, given above, of its label [28].

This is clearly what happens when the $\mathsf{CFG}$ is not abstracted, namely when the edge labels are single statements. Finally, since we deal with potentially abstract $\mathsf{CFG}$s, we have to say how we execute them, potentially on an abstract semantics. The idea is simple: since we move from executing single statements to executing sets of statements, we simply take as execution of the abstract $\mathsf{CFG}$ the additive lift of the executions of the single statements. Since the semantics is always additive⁸, in order to guarantee that everything works, the semantic abstraction $\rho$ must also be additive. Hence, in the rest of the paper we always require $\rho$ to be additive.

⁸ A function is said to be additive if it commutes with least upper bounds.
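The execution of a set of statements as the join of the single executions can be sketched on the integer component of the abstraction alone. The Python fragment below is a minimal sketch under our own assumptions: the elements of $\mathsf{Sign}$ are encoded as strings, `alpha`, `join` and `abs_add` are hypothetical helper names, only labels of the form x := x + n are handled, and the boolean and string components of the coalesced sum are omitted.

```python
# Symbolic encoding of the Sign elements (names are our own choice).
BOT, NEG, ZERO, POS, TOP = "bot", "neg", "zero", "pos", "top"

def alpha(values):
    """Smallest Sign element containing a finite set of integers."""
    values = set(values)
    if not values:
        return BOT
    if values == {0}:
        return ZERO
    if all(v > 0 for v in values):
        return POS
    if all(v < 0 for v in values):
        return NEG
    return TOP

def join(a, b):
    """Least upper bound in the Sign domain."""
    if a == b or b == BOT:
        return a
    if a == BOT:
        return b
    return TOP

def abs_add(a, b):
    """Abstract addition on Sign elements."""
    if BOT in (a, b):
        return BOT
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if a == b and a in (POS, NEG):
        return a
    return TOP

def abs_assign_add(x, n):
    """Abstract edge effect of the label x := x + n on a Sign memory."""
    def effect(m):
        new = dict(m)
        new[x] = abs_add(m[x], alpha({n}))
        return new
    return effect

def abs_exec_set(effects, m):
    """Execute a non-empty set of labels: variable-wise join of the single effects."""
    results = [f(m) for f in effects]
    out = results[0]
    for r in results[1:]:
        out = {v: join(out[v], r[v]) for v in out}
    return out

labels = [abs_assign_add("x", 1), abs_assign_add("x", 7)]
print(abs_exec_set(labels, {"x": POS}))
```

In this encoding the two labels x := x + 1 and x := x + 7 have the same abstract effect, so executing the set costs no precision with respect to executing either label alone.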

3 Semantic-driven Code Abstraction

In this section, we study how we can model a syntactic abstraction of the $\mathsf{CFG}$ and how it relates to the semantic abstraction, i.e., to the code analysis.

Modeling Code Abstraction.

Following the standard approach for abstracting objects, we should abstract each $\mathsf{CFG}$ into a set of $\mathsf{CFG}$s sharing an invariant property, i.e., an equivalence class of $\mathsf{CFG}$s. In particular, since we aim at abstracting code ($\mathsf{CFG}$) without changing the analysis performed on the code, we choose to abstract a $\mathsf{CFG}$ by abstracting its edge labels and leaving its control structure unchanged. In other words, an abstract $\mathsf{CFG}$, denoted $\mathsf{CFG}^{\#}$, is a pair $\langle\mathit{Nodes},\mathit{Edges}^{\#}\rangle$, where we leave the nodes unchanged, while the edge labels are abstracted to sets of labels. Formally, $\mathit{Edges}^{\#}\subseteq\mathit{Nodes}\times\wp(\Psi)\times\mathit{Nodes}$, where $\Psi$ is the $\mathsf{CFG}$ label language.

Given $\eta\in\mathit{uco}(\wp(\Psi))$, $\mathtt{G}^{\eta}\stackrel{\text{def}}{=}\langle\mathit{Nodes}(\mathtt{G}),\mathit{Edges}^{\eta}(\mathtt{G})\rangle$ is the $\mathsf{CFG}^{\#}$ built from a $\mathsf{CFG}$ $\mathtt{G}$ in terms of $\eta$, where $\mathit{Edges}^{\eta}(\mathtt{G})\subseteq\mathit{Nodes}(\mathtt{G})\times\eta(\wp(\Psi))\times\mathit{Nodes}(\mathtt{G})$.

As an example, consider the $\mathsf{CFG}$ in Fig. 3; Fig. 4 shows the $\mathsf{CFG}^{\#}$ where numerical expressions are abstracted by $\{\,\mathtt{m} \mid m\in\mathsf{Sign}(n)\,\}$⁹ (where $\mathsf{Sign}$ is the well-known sign abstraction $\mathsf{Sign}\in\mathit{uco}(\wp(\mathbb{Z}))$ such that $\mathsf{Sign}(\wp(\mathbb{Z}))=\{\top,\mathbb{Z}^{+},\mathbb{Z}^{-},\{0\},\varnothing\}$). For instance, x:=x+1 is abstracted to $\mathtt{x:=x+}\mathbb{Z}^{+}$, where $\mathtt{x+}\mathbb{Z}^{+}\stackrel{\text{def}}{=}\{\,\mathtt{x+n} \mid n\in\mathbb{Z}^{+}\,\}$, since $\mathsf{Sign}(1)=\mathbb{Z}^{+}$.

⁹ We use $n$ to denote the semantic value corresponding to the syntactic symbol n.
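The sign-driven abstraction of edge labels can be mimicked at the level of label text. The following Python sketch is only illustrative and rests on our own assumptions: labels are plain strings, edges are `(source, label, target)` triples, numerals are unsigned, string constants are left untouched, and the names `abstract_label` and `abstract_cfg` are hypothetical. Each numeral is replaced by the name of its sign class, which stands for the set of all numerals in that class.

```python
import re

# A token is either a quoted string constant or an unsigned numeral.
TOKEN = re.compile(r'"[^"]*"|\b\d+\b')

def sign_class(n: int) -> str:
    """Name of the Sign element abstracting the single integer n."""
    if n > 0:
        return "Z+"
    if n < 0:
        return "Z-"
    return "{0}"

def abstract_label(label: str) -> str:
    """Replace every numeral in a label by its sign class; keep strings as they are."""
    def repl(match):
        tok = match.group()
        return tok if tok.startswith('"') else sign_class(int(tok))
    return TOKEN.sub(repl, label)

def abstract_cfg(edges):
    """Abstract the labels only: nodes and control structure stay unchanged."""
    return [(src, abstract_label(lab), dst) for (src, lab, dst) in edges]

edges = [(1, "x:=x+1", 2), (2, "i<3", 3), (3, 'str:="x=5"', 4)]
for edge in abstract_cfg(edges):
    print(edge)
```

Note that the graph shape is untouched: only the middle component of each edge changes, in line with the definition of $\mathit{Edges}^{\eta}$.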

Abstracting Code vs Abstracting Semantics.

As previously noted, we aim at characterizing code abstractions, for dynamically generated code, for which the given analysis works precisely. Formally, let us consider the following equation:

$$\forall\mathbb{m}^{\rho}\in\mathbb{M}^{\rho}\subseteq\mathbb{M}.\>\forall\varphi\in\Psi.\>\llbracket\eta(\varphi)\rrbracket\mathbb{m}^{\rho}=\llbracket\eta(\varphi)\rrbracket^{\rho}\mathbb{m}^{\rho}\qquad(1)$$

If this equality does not hold it means that the abstract semantic interpretation ${\llbracket\cdot\rrbracket}^{\rho}$ merges predicates distinguished by $\eta$ . Namely, when the program is observed by means of its (abstract) semantics the actual abstraction of predicates is not precisely $\eta$ , but it is $\eta$ affected in some way by ${\llbracket\cdot\rrbracket}^{\rho}$ . By changing the point of view, we have that, in this case, the analysis cannot precisely interpret the abstract code, since $\eta$ abstracts the code by distinguishing information that $\rho$ cannot distinguish.
As an example, consider the sign domain above. When $\eta(\mathtt{x:=5})=\{\,\mathtt{x:=n} \mid 1\leq n\leq 5\,\}$, the equation does not hold, since the concrete semantics of this set does not yield every positive value for x. If instead $\eta(\mathtt{x:=5})=\{\,\mathtt{x:=n} \mid n\in\mathbb{Z}^{+}\cup\{0\}\,\}$, then Eq. 1 holds, since its concrete semantics is precisely the set of non-negative values. It is worth noting that Eq. 1 is a forward completeness [16] of the code abstraction w.r.t. the semantic interpretation, meaning that the semantic abstraction does not add imprecision to the code one.
In order to investigate the relation existing between the code abstraction $\eta$ and the semantic abstraction $\rho$ , we observe that, whenever we have a semantic abstraction $\rho$ , we have a natural code abstraction induced by $\rho$ . Namely, by only observing (abstract) information about the computation, we cannot distinguish statements with the same (abstract) semantics, independently from what any possible code abstraction does. For instance, if we analyze parity of program variables, we are unable to distinguish x:=2 from x:=4, independently from how a potential code abstraction $\eta$ is defined on x:=2. The first step consists in defining a code abstraction for expressions in terms of semantic one. Consider $\rho\in\mbox{\it uco}(\mathbb{V})$ , we define $\widehat{\rho}(\mbox{\sf e})$ inductively on the expressions structure

$$
\begin{array}{ll}
\widehat{\rho}(\mathtt{a}) &: \left\{\begin{array}{l}
\widehat{\rho}(\mathtt{a}_1\ \mathtt{op}\ \mathtt{a}_2) \stackrel{\text{def}}{=} \{\, \mathtt{a}'\ \mathtt{op}\ \mathtt{a}'' \mid \mathtt{a}'\in\widehat{\rho}(\mathtt{a}_1),\ \mathtt{a}''\in\widehat{\rho}(\mathtt{a}_2) \,\} \stackrel{\text{def}}{=} \widehat{\rho}(\mathtt{a}_1)\ \mathtt{op}\ \widehat{\rho}(\mathtt{a}_2)\\
\widehat{\rho}(\mathtt{x}) \stackrel{\text{def}}{=} \mathtt{x},\qquad \widehat{\rho}(\mathtt{n}) \stackrel{\text{def}}{=} \{\, \mathtt{m} \mid m\in\rho(\{n\}) \,\}
\end{array}\right.\\[2ex]
\widehat{\rho}(\mathtt{b}) &: \left\{\begin{array}{l}
\widehat{\rho}(\mathtt{b}_1\ \mathtt{bop}\ \mathtt{b}_2) \stackrel{\text{def}}{=} \widehat{\rho}(\mathtt{b}_1)\ \mathtt{bop}\ \widehat{\rho}(\mathtt{b}_2),\qquad \widehat{\rho}(\neg\mathtt{b}) \stackrel{\text{def}}{=} \neg\widehat{\rho}(\mathtt{b})\\
\widehat{\rho}(\mathtt{x}) \stackrel{\text{def}}{=} \mathtt{x},\qquad \widehat{\rho}(\mathtt{true}) \stackrel{\text{def}}{=} \{\, \mathtt{t} \mid t\in\rho(\mathtt{true}) \,\},\qquad \widehat{\rho}(\mathtt{false}) \stackrel{\text{def}}{=} \{\, \mathtt{t} \mid t\in\rho(\mathtt{false}) \,\}
\end{array}\right.\\[2ex]
\widehat{\rho}(\mathsf{s}) &: \left\{\begin{array}{l}
\widehat{\rho}(\mathtt{concat}(\mathsf{s}_1,\mathsf{s}_2)) \stackrel{\text{def}}{=} \mathtt{concat}(\widehat{\rho}(\mathsf{s}_1),\widehat{\rho}(\mathsf{s}_2)),\\
\widehat{\rho}(\mathtt{substr}(\mathsf{s},\mathtt{a}_1,\mathtt{a}_2)) \stackrel{\text{def}}{=} \mathtt{substr}(\widehat{\rho}(\mathsf{s}),\widehat{\rho}(\mathtt{a}_1),\widehat{\rho}(\mathtt{a}_2))\\
\widehat{\rho}(\mathtt{x}) \stackrel{\text{def}}{=} \mathtt{x},\qquad \widehat{\rho}(\texttt{"}\sigma\texttt{"}) \stackrel{\text{def}}{=} \{\, \texttt{"}\delta\texttt{"} \mid \delta\in\rho(\sigma) \,\}
\end{array}\right.
\end{array}
$$
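The inductive definition of $\widehat{\rho}$ can be illustrated on a toy instance. The following Python sketch is only an illustration under stated assumptions: a small finite range stands in for the integer values, $\rho$ is taken to be a parity closure, and the tuple encoding of expressions together with the names `rho_parity` and `rho_hat` are hypothetical and not part of the formal development.

```python
from itertools import product

# Assumption: a small finite universe stands in for the integers.
UNIVERSE = range(-4, 5)


def rho_parity(values):
    """Parity closure on sets of integers, restricted to UNIVERSE."""
    parities = {v % 2 for v in values}
    return frozenset(v for v in UNIVERSE if v % 2 in parities)


def rho_hat(expr, rho=rho_parity):
    """Set of concrete expressions induced by rho, by induction on expr.

    Hypothetical encoding: ("num", n), ("var", name), ("op", symbol, e1, e2).
    """
    kind = expr[0]
    if kind == "num":
        # rho_hat(n) = { m | m in rho({n}) }
        return frozenset(("num", m) for m in rho({expr[1]}))
    if kind == "var":
        # rho_hat(x) = x: variables are left untouched
        return frozenset({expr})
    if kind == "op":
        # rho_hat(a1 op a2) = rho_hat(a1) op rho_hat(a2)
        _, symbol, left, right = expr
        return frozenset(
            ("op", symbol, l, r)
            for l, r in product(rho_hat(left, rho), rho_hat(right, rho))
        )
    raise ValueError(f"unknown expression kind: {kind}")
```

On this instance, `rho_hat(("num", 2))` and `rho_hat(("num", 4))` coincide, which matches the observation that a parity analysis cannot distinguish x:=2 from x:=4.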

At this point, we can characterize the $\mathsf{CFG}$ labels abstraction $\overline{\Upsilon}[\rho]:\wp(\Psi)\longrightarrow\wp(\Psi)$ , as the additive lift of the function

$$
\begin{array}{rl}
\overline{\Upsilon}[\rho](\texttt{x:=}\mathsf{e}) &\stackrel{\text{def}}{=} \texttt{x:=}\widehat{\rho}(\mathsf{e}) \stackrel{\text{def}}{=} \{\, \texttt{x:=}\mathsf{e}' \mid \mathsf{e}'\in\widehat{\rho}(\mathsf{e}) \,\}\\
\overline{\Upsilon}[\rho](\mathsf{b}) &\stackrel{\text{def}}{=} \widehat{\rho}(\mathsf{b}) \qquad\qquad \overline{\Upsilon}[\rho](\texttt{eval}(\mathsf{s})) \stackrel{\text{def}}{=} \texttt{eval}(\widehat{\rho}(\mathsf{s}))
\end{array}
$$

where $\texttt{eval}(\widehat{\rho}(\mathsf{s}))$ is treated as the implicit representation of all the statements that it can execute, namely it represents the (potentially infinite) set $\{\, \mathtt{c} \mid \llbracket\mathtt{c}\rrbracket\mathbb{m}\sqsubseteq\llbracket\llparenthesis\mathsf{s}\rrparenthesis^{\rho}\Cap\mathsf{Imp}\rrbracket^{\rho}\mathbb{m} \,\}$.

The following result is immediate by construction.

Proposition 3.1

Given $\rho\in\mbox{\it uco}(\mathbb{V})$ , then $\overline{\Upsilon}[\rho]\in\mbox{\it uco}(\wp(\Psi))$ and it is additive.
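Proposition 3.1 can be sanity-checked by brute force on a tiny instance. The sketch below is an illustration, not a proof: it assumes a four-element universe, a parity closure for $\rho$, labels restricted to assignments of constants encoded as pairs `("x", n)`, and the helper names `upsilon` and `is_uco_and_additive` are hypothetical.

```python
from itertools import chain, combinations

# Assumption: tiny finite stand-in for the integers.
UNIVERSE = range(4)


def rho_parity(values):
    """Parity closure on sets of integers, restricted to UNIVERSE."""
    parities = {v % 2 for v in values}
    return frozenset(v for v in UNIVERSE if v % 2 in parities)


def upsilon_label(label, rho=rho_parity):
    """Upsilon[rho] on a single label ("x", n), read as x := n."""
    var, n = label
    return frozenset((var, m) for m in rho({n}))


def upsilon(labels, rho=rho_parity):
    """Additive lift to sets of labels: union of the pointwise images."""
    return frozenset(chain.from_iterable(upsilon_label(l, rho) for l in labels))


def powerset(items):
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


SUBSETS = powerset(("x", n) for n in UNIVERSE)


def is_uco_and_additive(f):
    """Check extensivity, idempotence, monotonicity and additivity on SUBSETS."""
    extensive = all(s <= f(s) for s in SUBSETS)
    idempotent = all(f(f(s)) == f(s) for s in SUBSETS)
    monotone = all(f(s) <= f(t) for s in SUBSETS for t in SUBSETS if s <= t)
    additive = all(f(s | t) == f(s) | f(t) for s in SUBSETS for t in SUBSETS)
    return extensive and idempotent and monotone and additive
```

Calling `is_uco_and_additive(upsilon)` exhaustively checks the four properties on the sixteen label sets of this instance.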

Finally, in order to show that this code abstraction can be used to force satisfiability of Eq. 1, we have first to characterize the meaning of interpreting an edge label abstracted by $\overline{\Upsilon}[\rho]$ :

$$
\begin{array}{rcl}
\llbracket\texttt{x:=}\widehat{\rho}(\mathsf{e})\rrbracket\mathbb{m} &=& \bigsqcup\{\, \llbracket\texttt{x:=}\mathsf{e}'\rrbracket\mathbb{m} \mid \mathsf{e}'\in\widehat{\rho}(\mathsf{e}) \,\} \qquad\qquad \llbracket\widehat{\rho}(\mathsf{b})\rrbracket\mathbb{m} = \bigsqcup\{\, \llbracket\mathsf{b}'\rrbracket\mathbb{m} \mid \mathsf{b}'\in\widehat{\rho}(\mathsf{b}) \,\}\\
\llbracket\texttt{eval}(\widehat{\rho}(\mathsf{s}))\rrbracket\mathbb{m} &=& \bigsqcup\{\, \llbracket\mathtt{c}\rrbracket\mathbb{m} \mid \llbracket\mathtt{c}\rrbracket\mathbb{m}\sqsubseteq\llbracket\llparenthesis\mathsf{s}\rrparenthesis^{\rho}\Cap\mathsf{Imp}\rrbracket^{\rho}\mathbb{m} \,\}
\end{array}
$$

Then we have the following results

Lemma 3.2

Given $\rho\in\mathit{uco}(\mathbb{V})$ additive, then $\forall\mathsf{e}.\ \forall\mathbb{m}\in\mathbb{M}^{\rho}.\ \llparenthesis\widehat{\rho}(\mathsf{e})\rrparenthesis\mathbb{m}=\llparenthesis\mathsf{e}\rrparenthesis^{\rho}\mathbb{m}$ (trivially implying $\mathsf{e}'\in\widehat{\rho}(\mathsf{e})\ \Leftrightarrow\ \forall\mathbb{m}\in\mathbb{M}^{\rho}.\ \llparenthesis\mathsf{e}'\rrparenthesis\mathbb{m}\subseteq\llparenthesis\mathsf{e}\rrparenthesis^{\rho}\mathbb{m}$) and $\forall\Phi\in\wp(\Psi).\ \forall\mathbb{m}\in\mathbb{M}^{\rho}.\ \llbracket\overline{\Upsilon}[\rho](\Phi)\rrbracket\mathbb{m}=\llbracket\Phi\rrbracket^{\rho}\mathbb{m}$.

Proof 3.3.

Let us prove first the property for expressions by induction on the syntactic structure of e.

$\mathsf{e}=\mathtt{n}$: $\llparenthesis\widehat{\rho}(\mathsf{e})\rrparenthesis\mathbb{m}=\llparenthesis\widehat{\rho}(\mathtt{n})\rrparenthesis\mathbb{m}\stackrel{\text{def}}{=}\rho(n)$, while $\llparenthesis\mathsf{e}\rrparenthesis^{\rho}\mathbb{m}=\llparenthesis\mathtt{n}\rrparenthesis^{\rho}\mathbb{m}=\rho(n)$ (where $\llparenthesis\mathtt{n}\rrparenthesis\mathbb{m}=n$);
$\mathsf{e}=\mathtt{x}$: $\llparenthesis\widehat{\rho}(\mathsf{e})\rrparenthesis\mathbb{m}=\llparenthesis\widehat{\rho}(\mathtt{x})\rrparenthesis\mathbb{m}\stackrel{\text{def}}{=}\llparenthesis\mathtt{x}\rrparenthesis\mathbb{m}=\mathbb{m}(\mathtt{x})$, while $\llparenthesis\mathsf{e}\rrparenthesis^{\rho}\mathbb{m}=\llparenthesis\mathtt{x}\rrparenthesis^{\rho}\mathbb{m}=\rho(\mathbb{m}(\mathtt{x}))=\mathbb{m}(\mathtt{x})$ (since $\mathbb{m}\in\mathbb{M}^{\rho}$);
$\mathsf{e}=\mathsf{e}_1\ \mathtt{op}\ \mathsf{e}_2$: Let op be any arithmetic or boolean operator. Then
$\llparenthesis\widehat{\rho}(\mathsf{e})\rrparenthesis\mathbb{m}=\llparenthesis\widehat{\rho}(\mathsf{e}_1\ \mathtt{op}\ \mathsf{e}_2)\rrparenthesis\mathbb{m}\stackrel{\text{def}}{=}\llparenthesis\widehat{\rho}(\mathsf{e}_1)\ \mathtt{op}\ \widehat{\rho}(\mathsf{e}_2)\rrparenthesis\mathbb{m}=\llparenthesis\widehat{\rho}(\mathsf{e}_1)\rrparenthesis\mathbb{m}\ \mathtt{op}\ \llparenthesis\widehat{\rho}(\mathsf{e}_2)\rrparenthesis\mathbb{m}=\llparenthesis\mathsf{e}_1\rrparenthesis^{\rho}\mathbb{m}\ \mathtt{op}\ \llparenthesis\mathsf{e}_2\rrparenthesis^{\rho}\mathbb{m}$ by inductive hypothesis.
But this is precisely $\llparenthesis\mathsf{e}_1\ \mathtt{op}\ \mathsf{e}_2\rrparenthesis^{\rho}\mathbb{m}$, since op is computed on the semantics as additive lift to sets.
Analogously, we can prove all the other cases.

Now, let us prove the property for single $\mathsf{CFG}$ edge labels, again by induction on the syntactic structure. Note that, since $\rho$ is additive, ${\llbracket\cdot\rrbracket}^{\rho}$ is additive as well, because the concrete semantics is also additive on sets of statements.

$$
\begin{array}{lll}
\llbracket\overline{\Upsilon}[\rho](\texttt{x:=}\mathsf{e})\rrbracket\mathbb{m} &=& \llbracket\texttt{x:=}\widehat{\rho}(\mathsf{e})\rrbracket\mathbb{m}\\
&=& \bigsqcup\{\, \llbracket\texttt{x:=}\mathsf{e}'\rrbracket\mathbb{m} \mid \mathsf{e}'\in\widehat{\rho}(\mathsf{e}) \,\}\\
&=& \bigsqcup\{\, \mathbb{m}[\mathtt{x}/\llparenthesis\mathsf{e}'\rrparenthesis\mathbb{m}] \mid \mathsf{e}'\in\widehat{\rho}(\mathsf{e}) \,\}\\
&=& \mathbb{m}[\mathtt{x}/\bigcup\{\, \llparenthesis\mathsf{e}'\rrparenthesis\mathbb{m} \mid \mathsf{e}'\in\widehat{\rho}(\mathsf{e}) \,\}]\\
&=& \mathbb{m}[\mathtt{x}/\bigcup\{\, \llparenthesis\mathsf{e}'\rrparenthesis\mathbb{m} \mid \llparenthesis\mathsf{e}'\rrparenthesis\mathbb{m}\subseteq\llparenthesis\mathsf{e}\rrparenthesis^{\rho}\mathbb{m} \,\}]\\
&=& \mathbb{m}[\mathtt{x}/\llparenthesis\mathsf{e}\rrparenthesis^{\rho}\mathbb{m}] = \llbracket\texttt{x:=}\mathsf{e}\rrbracket^{\rho}\mathbb{m}
\end{array}
$$

$$
\begin{array}{lll}
\llbracket\overline{\Upsilon}[\rho](\mathsf{b})\rrbracket\mathbb{m} &=& \llbracket\widehat{\rho}(\mathsf{b})\rrbracket\mathbb{m}\\
&=& \bigsqcup\{\, \llbracket\mathsf{b}'\rrbracket\mathbb{m} \mid \mathsf{b}'\in\widehat{\rho}(\mathsf{b}) \,\}\\
&=& \bigsqcup\{\, \mathbb{m}\sqcap\bigsqcup\{\, \mathbb{m} \mid \llparenthesis\mathsf{b}'\rrparenthesis\mathbb{m}=\mathtt{true} \,\} \mid \mathsf{b}'\in\widehat{\rho}(\mathsf{b}) \,\}\\
&=& \mathbb{m}\sqcap\bigsqcup\{\, \mathbb{m} \mid \llparenthesis\mathsf{b}'\rrparenthesis\mathbb{m}=\mathtt{true},\ \mathsf{b}'\in\widehat{\rho}(\mathsf{b}) \,\}\\
&=& \mathbb{m}\sqcap\bigsqcup\{\, \mathbb{m} \mid \llparenthesis\mathsf{b}'\rrparenthesis\mathbb{m}=\mathtt{true},\ \llparenthesis\mathsf{b}'\rrparenthesis\mathbb{m}\subseteq\llparenthesis\mathsf{b}\rrparenthesis^{\rho}\mathbb{m} \,\}\\
&=& \mathbb{m}\sqcap\bigsqcup\{\, \mathbb{m} \mid \mathtt{true}\in\llparenthesis\mathsf{b}\rrparenthesis^{\rho}\mathbb{m} \,\} = \llbracket\mathsf{b}\rrbracket^{\rho}\mathbb{m}
\end{array}
$$

$$
\begin{array}{lll}
\llbracket\overline{\Upsilon}[\rho](\texttt{eval}(\mathsf{s}))\rrbracket\mathbb{m} &=& \llbracket\texttt{eval}(\widehat{\rho}(\mathsf{s}))\rrbracket\mathbb{m}\\
&=& \bigsqcup\{\, \llbracket\mathtt{c}\rrbracket\mathbb{m} \mid \llbracket\mathtt{c}\rrbracket\mathbb{m}\sqsubseteq\llbracket\llparenthesis\mathsf{s}\rrparenthesis^{\rho}\Cap\mathsf{Imp}\rrbracket^{\rho}\mathbb{m} \,\}\\
\text{By additivity of } \llbracket\cdot\rrbracket^{\rho} &=& \llbracket\llparenthesis\mathsf{s}\rrparenthesis^{\rho}\Cap\mathsf{Imp}\rrbracket^{\rho}\mathbb{m} = \llbracket\texttt{eval}(\mathsf{s})\rrbracket^{\rho}\mathbb{m}
\end{array}
$$

Finally, for each set of labels $\Phi$, we have that $\llbracket\overline{\Upsilon}[\rho](\Phi)\rrbracket\mathbb{m}=\bigsqcup_{\varphi\in\Phi}\llbracket\overline{\Upsilon}[\rho](\varphi)\rrbracket\mathbb{m}=\bigsqcup_{\varphi\in\Phi}\llbracket\varphi\rrbracket^{\rho}\mathbb{m}=\llbracket\Phi\rrbracket^{\rho}\mathbb{m}$, since all the involved functions are additive by definition or by construction.

Then we have that:

Theorem 3.4.

Let $\rho\in\mathit{uco}(\mathbb{V})$ be additive, and let $\eta\in\mathit{uco}(\wp(\Psi))$. Then $\overline{\eta}_{\uparrow}\stackrel{\text{def}}{=}\overline{\Upsilon}[\rho]\circ\eta$ satisfies Eq. 1.

Proof 3.5.

It is worth noting that we trivially have, by abstraction, that $\forall\varphi\in\Psi.\ \llbracket\eta_{\uparrow}(\varphi)\rrbracket\subseteq\llbracket\eta_{\uparrow}(\varphi)\rrbracket^{\rho}$. Let us prove the other inclusion: $\forall\varphi\in\Psi$

$$
\begin{array}{llr}
\llbracket\eta_{\uparrow}(\varphi)\rrbracket &= \llbracket\overline{\Upsilon}[\rho]\circ\eta(\varphi)\rrbracket &\\
&= \llbracket\overline{\Upsilon}[\rho]\circ\overline{\Upsilon}[\rho]\circ\eta(\varphi)\rrbracket &\qquad[\text{By properties of uco}]\\
&= \llbracket\overline{\Upsilon}[\rho]\circ\eta(\varphi)\rrbracket^{\rho} &[\text{By Lemma 3.2}]\\
&= \llbracket\eta_{\uparrow}(\varphi)\rrbracket^{\rho} &
\end{array}
$$
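The effect of Theorem 3.4 can be observed on a small instance. The following Python sketch makes several simplifying assumptions that are not in the paper: a finite range stands in for the integers, $\rho$ is a sign closure distinguishing negative, zero and positive values, a label x:=n is encoded by the integer n alone, and `eta` is a hypothetical code abstraction mapping x:=n with $1\leq n\leq 5$ to the whole range. The names `concrete`, `abstract`, `eta_up` and `satisfies_eq1` are likewise illustrative.

```python
# Assumption: finite stand-in for the integers.
UNIVERSE = range(-6, 7)


def sign_of(v):
    return (v > 0) - (v < 0)


def rho_sign(values):
    """Sign closure: all universe values sharing a sign with some input value."""
    signs = {sign_of(v) for v in values}
    return frozenset(v for v in UNIVERSE if sign_of(v) in signs)


def concrete(labels):
    """Collecting semantics of a set of labels x := n: the values x may take."""
    return frozenset(labels)


def abstract(labels, rho=rho_sign):
    """Semantics of the same labels as observed through rho."""
    return rho(concrete(labels))


def eta(n):
    """Hypothetical code abstraction: x := n becomes {x := k | 1 <= k <= 5}."""
    return frozenset(range(1, 6)) if 1 <= n <= 5 else frozenset({n})


def upsilon(labels, rho=rho_sign):
    """Code abstraction induced by rho on labels x := n (additive lift)."""
    return frozenset(m for n in labels for m in rho({n}))


def eta_up(n):
    """Composition Upsilon[rho] after eta."""
    return upsilon(eta(n))


def satisfies_eq1(code_abstraction):
    """Eq. 1 on this instance: concrete and abstract semantics coincide."""
    return all(concrete(code_abstraction(n)) == abstract(code_abstraction(n))
               for n in UNIVERSE)
```

Here `satisfies_eq1(eta)` is false, because the labels produced by `eta` for x:=5 only cover the values 1 to 5 while the sign analysis observes every positive value, whereas `satisfies_eq1(eta_up)` is true.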

This result tells us that by taking a code abstraction more abstract than (or equal to) $\overline{\Upsilon}[\rho]$ , we guarantee that the abstract interpretation $\rho$ can be performed on the abstracted program (Eq. 1). We have so far proved that it is always possible to force Eq. 1, in order to make it possible to continue the analysis (observing $\rho$ ) also on the abstracted code. In the following we show how this framework can be integrated with the existing analysis of dynamic code [4] in order to improve its precision.

4 An Improved Dynamic Code Analysis

In this section we show how the constructive code abstraction characterization, provided in the previous section, can be used for representing the code approximation which soundly captures the potential code executed by a string-to-code statement. As we will show, without abstracting code, we cannot capture situations where the collecting semantics on strings generates sets of statements that cannot be represented by using the concrete syntax. Nevertheless, we must also observe that the analyzer cannot change dynamically with the generated code, hence the abstraction must be driven by the semantic property analyzed. This means that, without using the proposed framework, the analysis would surely be less precise in those situations where code abstraction becomes a necessity.

Let us summarize how we propose to exploit the framework:

$\bullet$

Consider a fixed semantic abstraction $\rho\in\mbox{\it uco}(\mathbb{V})$ and a corresponding static analyzer, designed in such a way that it can interpret also code abstracted by $\overline{\Upsilon}[\rho]$ .
$\bullet$

Analyze the program, and when an eval is met, extract the language of its argument. If the language is infinite (under specific conditions that we will discuss) build the abstract $\mathsf{CFG}$ approximating it and extract the corresponding code abstraction $\eta$ . In general, this code abstraction $\eta$ is not more abstract than $\overline{\Upsilon}[\rho]$ (the code abstraction already embedded in the static analyzer, depending only on $\rho$ );
$\bullet$

Build $\overline{\Upsilon}[\rho]\circ\eta$ in order to make also the generated code (approximated by the generated abstract $\mathsf{CFG}$ ) analyzable by the static analysis for $\rho$ .

Analyzing Dynamic Code.

Let $\rho$ be a static analysis performing, in particular, $\rho_{\scriptscriptstyle S}\in\mathit{uco}(\wp(\mathbb{S}))$ on strings, where $\mathbb{S}=\mathcal{K}^{*}$ denotes strings over a finite alphabet $\mathcal{K}$. Note that our analyzer has to work on any (abstract) $\mathsf{CFG}$ that can be dynamically generated, hence it has to be designed with this purpose in mind. In particular, as we will show, we will generate only abstract $\mathsf{CFG}$s with a code abstraction $\eta$ complete w.r.t. $\rho$. This means, by construction, that $\eta$ must be more abstract than $\overline{\Upsilon}[\rho]$, i.e., each set of elements in $\eta$ corresponds to a subset of the elements (abstract predicates) of $\overline{\Upsilon}[\rho]$. Hence, in order to guarantee that the predicates of any complete $\eta$ can be interpreted, it is sufficient to design the analyzer so that it soundly interprets any abstract predicate in $\overline{\Upsilon}[\rho]$. For instance, $\overline{\Upsilon}[\mathsf{Sign}]$ is the abstraction containing all the predicates, involving integers, of the form x:=S, x<S, etc., with $\mathtt{S}\in\mathsf{Sign}$; e.g., an abstract predicate is $\texttt{x:=}\mathbb{Z}^{+}$, and the analyzer for $\mathsf{Sign}$ should be able to interpret also such abstract predicates.
Let x be the input string parameter of an eval statement; we denote by $\mathcal{S}^{\rho_{\scriptscriptstyle S}}(\mathtt{x})$ the abstract value for x computed by the analysis on $\rho_{\scriptscriptstyle S}$. For example, suppose that the collection of values for the string x before the eval is $\{\texttt{a:=0},\texttt{a:=1}\}$. By defining $\rho_{\scriptscriptstyle S}$ as the $k$-bounded string set abstract domain [2], with $k=2$, we have $\mathcal{S}^{\rho_{\scriptscriptstyle S}}(\mathtt{x})=\{\texttt{a:=0},\texttt{a:=1}\}$, while by using the prefix abstract domain $\overline{\mathcal{PR}}$ [9], we have $\mathcal{S}^{\overline{\mathcal{PR}}}(\mathtt{x})=\{\,\texttt{a:=}s \mid s\in\mathbb{S}\,\}$. When the abstracted string and the abstraction are clear from the context, we simply denote this set by $\mathcal{S}$ and we assume (for the sake of simplicity) that any string in $\mathcal{S}$ is an executable language statement\footnote{Note that this assumption corresponds to a decidable condition, hence it is possible to check it and to implement ad hoc solutions when it does not hold.}. In the following, we abuse notation by denoting by $\mathcal{S}$ also the automaton recognizing the language.
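The difference between the two string abstractions on $\{\texttt{a:=0},\texttt{a:=1}\}$ can be reproduced with a minimal Python sketch. It rests on simplified readings of the two domains that are assumptions of this illustration, not the definitions of [2] and [9]: the bounded set keeps the strings exactly when there are at most $k$ of them and otherwise gives up (encoded as `None`), and the prefix domain keeps only the longest common prefix. All function names are hypothetical.

```python
import os


def bounded_string_set(strings, k):
    """Simplified k-bounded string set: exact set if small enough, else top (None)."""
    strings = frozenset(strings)
    return strings if len(strings) <= k else None


def prefix_abstraction(strings):
    """Simplified prefix domain: the longest common prefix of the strings."""
    return os.path.commonprefix(list(strings))


def prefix_represents(prefix, candidate):
    """A prefix p stands for every string that starts with p."""
    return candidate.startswith(prefix)


values = {"a:=0", "a:=1"}
exact = bounded_string_set(values, k=2)   # the two strings are kept as they are
prefix = prefix_abstraction(values)       # only the common prefix "a:=" survives
```

Under these assumptions the prefix abstraction also represents strings such as `a:=42`, which were never computed by the program; this is the kind of spurious code that the code abstraction has to cope with.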
Consider for example, the program reported in Fig. 5(a), a program building and manipulating the string str at run-time, which is, afterwards, interpreted as executable code, being the input parameter of the string-to-code statement eval. Since the value of N is unknown at compile-time, we cannot predict the precise number of iterations of the while-loop. In this case, a suitable string abstract analysis would approximate the value of str, before the eval execution, to an abstract value corresponding to an over-approximation of the possible values for str, which may be also, due to abstraction, an infinite set of strings, and therefore an infinite set of possible programs.

For instance, in the example, if we abstract strings into the regular expression abstract domain [8] (or equivalently into the finite state automata abstract domain [3]), the value of str after the while loop will be the abstract value $\mathtt{x:=5(5)^{*};}$ corresponding to an infinite set of programs, i.e., x:=5;, x:=55;, x:=555;, …. In this case, the common practice for analyzing eval is simply to give up on the analysis, for example by halting it and throwing an exception [17] or by forbidding the use of eval [18].
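To make the abstract value $\mathtt{x:=5(5)^{*};}$ concrete, the following Python sketch reads it as a regular expression over program text and tests membership with the standard `re` module. The helper names `may_be_executed` and `sample_programs` are hypothetical and serve only as an illustration.

```python
import re

# The abstract value x:=5(5)*; read as a regular expression over program text.
STR_LANGUAGE = re.compile(r"x:=5(5)*;")


def may_be_executed(program_text):
    """True if the text belongs to the language approximating str."""
    return STR_LANGUAGE.fullmatch(program_text) is not None


def sample_programs(count):
    """The first `count` programs of the language, by increasing length."""
    return ["x:=" + "5" * n + ";" for n in range(1, count + 1)]
```

Since `sample_programs` yields a distinct member of the language for every length, no finite enumeration of the programs that eval may run is possible, which is why an abstract representation of the code is needed.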

Let $\rho_{\scriptscriptstyle\mathcal{CS}}$ be the abstract domain for all the possible values (integers, strings and booleans) [4]. Note that, $\overline{\Upsilon}[\rho_{\scriptscriptstyle\mathcal{CS}}]$ contains, for integers, predicates like the ones in the abstract $\mathsf{CFG}$ in Fig. 4.
The analysis $\rho_{\scriptscriptstyle\mathcal{CS}}$ at point $\ell_{3}$, due to the widening¹¹ applied in the analysis of the while loop [3], abstracts the value of str to the infinite language $\{\,\mathtt{x:=}s \mid s\in(5)^{+}\,\}$ (namely, x is assigned any value represented by a finite, non-empty sequence of $5$s). Hence, at point $\ell_{8}$ the analysis abstracts str to the set of strings $\mathcal{S}_{\mathtt{str}}=\{\,\mathtt{if(x<5)\{x:=}s\mathtt{\}else\{x:=1\}} \mid s\in(5)^{+}\,\}$, meaning that the true-branch of the string that may be transformed by eval may be either x:=5, or x:=55, or x:=555, and so on. The automaton corresponding to the abstract value of str is reported in Fig. 6, and it denotes an infinite language, i.e., an infinite set of possible statements.
Unfortunately, this is a problem for the analysis provided in [4], where the language containing all the possible strings would be returned, losing any precision.

¹¹ Widening is a fix-point accelerator used in infinite domains with infinite ascending chains, namely where the semantic fix-point computation may diverge. In this case we use a widening on automata defined in [8].

Generating the Code: From Automata to $\mathsf{CFG}$s.

At this point, we have the (potentially infinite) language of the eval argument (and hence an automaton $\mathcal{S}$), and the goal is to generate a $\mathsf{CFG}$ modeling an over-approximation of the executable code contained in the language of the automaton $\mathcal{S}$. The idea is to generate a $\mathsf{CFG}$ from a language of strings, i.e., from an automaton, by parsing the paths of the automaton. Indeed, we have defined and implemented an algorithm¹², reported in Alg. 1, performing abstract parsing on automata: given an automaton $\mathcal{S}$, it returns the $\mathsf{CFG}$ $\mathcal{P}$ that over-approximates, for each (executable) $s\in\mathcal{S}$, the concrete execution of eval.

¹² In the following, we only discuss the main parts of the algorithm due to space limitations.

The idea of Alg. 1 is to perform a depth-first search on the automaton and, when a language statement is recognized, to generate an edge in the $\mathsf{CFG}$. This phase is handled by lines 3-13 of Alg. 1, building the set of nodes Nodes and the set of edges Edges of the resulting $\mathsf{CFG}$ $\mathcal{P}$. The set $W$ contains the states of the finite state automaton for which we still have to generate edges in the $\mathsf{CFG}$ and it is initialized, at line 2, with the initial state $q_{0}$. At this point, Alg. 1 looks for language statements readable from any path of the input automaton starting from a state $q$, taken from $W$, by means of the module $\mathrm{ReduceStmts}$ (line 5). In particular, $\mathrm{ReduceStmts}$ returns a set of triples $(q^{\prime},\mbox{\tt c},q^{\prime\prime})$, where each returned triple means that a language statement c has been recognized from $q^{\prime}\in Q$ to $q^{\prime\prime}\in Q$.

Input: $\mathcal{S}=(Q,\mathcal{K},\delta,q_{0},F)$
Output: $\mathsf{CFG}$ $\mathcal{P}$ over-approximating executable strings of $\mathcal{S}$

1  $\mathcal{S}=\mathrm{ReduceCycles}(\mathcal{S})$;
2  $\mbox{\sl Nodes}\leftarrow\varnothing$; $\mbox{\sl Edges}\leftarrow\varnothing$; $W\leftarrow\{q_{0}\}$; $visited\leftarrow\varnothing$;
3  while $W\neq\varnothing$ do
4      select and remove $q$ from $W$;
5      $stmts\leftarrow\mathrm{ReduceStmts}(\mathcal{S},q)$;
6      foreach $(q^{\prime},\mbox{\tt c},q^{\prime\prime})\in stmts$ do
7          $\mbox{\sl Nodes}\leftarrow\mbox{\sl Nodes}\cup\{\mathsf{lab}(q^{\prime}),\mathsf{lab}(q^{\prime\prime})\}$;
8          $\mbox{\sl Edges}\leftarrow\mbox{\sl Edges}\cup\{(\mathsf{lab}(q^{\prime}),\mbox{\tt c},\mathsf{lab}(q^{\prime\prime}))\}$;
9          $visited\leftarrow visited\cup\{q^{\prime}\}$;
10         $W\leftarrow W\cup\{q^{\prime\prime}\}$;
11         $W\leftarrow W\smallsetminus visited$;
12     end foreach
13 end while
14 return $\mathcal{P}=\langle\mbox{\sl Nodes},\mbox{\sl Edges}\rangle$;

Algorithm 1

The set returned by $\mathrm{ReduceStmts}$ corresponds to the set of statements of $\mathcal{P}$ readable from the state $q$, hence they are added to Edges, substituting the reached states with the corresponding labels by means of the function $\mathsf{lab}$ (lines 7-8). At this point, we need to look for the statements that can be read from $q^{\prime\prime}$; hence, $q^{\prime\prime}$ is added to $W$ so that it is eventually processed in a later iteration of the while loop at lines 3-13. When there are no more states of $\mathcal{S}$ to be processed, namely when $W$ is empty, the $\mathsf{CFG}$ $\mathcal{P}=\langle\mbox{\sl Nodes},\mbox{\sl Edges}\rangle$ is returned (line 14), with entry label $\mathsf{lab}(q_{0})$ and exit labels the ones associated with the states in $F$.
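The worklist phase of Alg. 1 can be sketched in Python as follows. This is only an illustration under strong assumptions, not the authors' implementation: the automaton is assumed to be already acyclic, transitions are labeled with single characters, every statement is assumed to end with `;`, and `reduce_stmts` is a hypothetical, much simplified version of $\mathrm{ReduceStmts}$.

```python
def reduce_stmts(delta, q):
    """Hypothetical ReduceStmts: statements readable from q, each ending at ';'."""
    found = set()
    stack = [(q, "")]                      # depth-first search from q
    while stack:
        state, text = stack.pop()
        for symbol, target in delta.get(state, []):
            if symbol == ";":              # a whole statement has been read
                found.add((q, text + symbol, target))
            else:
                stack.append((target, text + symbol))
    return found


def automaton_to_cfg(delta, q0, lab=lambda q: f"l{q}"):
    """Worklist loop mirroring the while loop of Alg. 1 (cycle reduction omitted)."""
    nodes, edges = set(), set()
    worklist, visited = {q0}, set()
    while worklist:
        q = worklist.pop()
        for src, stmt, dst in reduce_stmts(delta, q):
            nodes |= {lab(src), lab(dst)}
            edges.add((lab(src), stmt, lab(dst)))
            visited.add(src)
            worklist.add(dst)
            worklist -= visited
    return nodes, edges


# Toy automaton recognizing "x:=5;y:=1;" and "x:=7;y:=1;"
delta = {
    0: [("x", 1)], 1: [(":", 2)], 2: [("=", 3)], 3: [("5", 4), ("7", 4)],
    4: [(";", 5)], 5: [("y", 6)], 6: [(":", 7)], 7: [("=", 8)],
    8: [("1", 9)], 9: [(";", 10)],
}
nodes, edges = automaton_to_cfg(delta, 0)
```

On this toy automaton, the two statements readable from the initial state become two parallel edges between the same pair of labels, which is the situation the code abstraction introduced next has to handle.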

Problems arise when the automaton contains cycles (namely, when the automaton denotes an infinite language). In this case, Alg. 1 first transforms, at line 1, the input automaton, over the alphabet $\mathcal{K}$, into an automaton without cycles, over the alphabet $\mathcal{K}\cup\wp(\mathcal{K}^{*})$, by means of the module $\mathrm{ReduceCycles}$. Given an input automaton $\mathcal{S}$, we retrieve the cycles of $\mathcal{S}$ using the well-known Tarjan’s algorithm [27] for identifying cycles. Then, for each detected cycle of $\mathcal{S}$, we check whether the string read by the cycle is a whole statement $\mathsf{r}$ or not. In the first case, we substitute the cycle reading the string $\mathsf{r}$ in the automaton, i.e., $\mathsf{r}^{*}$, with the automaton reading the string corresponding to the statement while(true){ $\mathsf{r}$ } over the alphabet $\mathcal{K}$. Otherwise, if the cycle does not read a whole statement, the idea is to collapse the cycle into a single transition, labeled with the regular expression corresponding to what is read in the cycle, i.e., denoting a set of strings over $\mathcal{K}$ (an element of $\wp(\mathcal{K}^{*})$). Hence the resulting automaton is over the alphabet $\mathcal{K}\cup\wp(\mathcal{K}^{*})$. In Fig. 7 we report an example of the application of the $\mathrm{ReduceCycles}$ algorithm.
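The cycle-detection step can be illustrated with a standard formulation of Tarjan's strongly connected components algorithm. The sketch below covers only this step: the rewriting performed by $\mathrm{ReduceCycles}$ (into `while(true){...}` or into a regular-expression transition) is not shown, and the graph encoding and the helper `cyclic_components` are assumptions of this example.

```python
def tarjan_scc(graph):
    """Tarjan's algorithm: strongly connected components of a directed graph."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:             # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(component)

    for v in list(graph):
        if v not in index:
            visit(v)
    return sccs


def cyclic_components(graph):
    """Components that contain a cycle: more than one state, or a self-loop."""
    return [c for c in tarjan_scc(graph)
            if len(c) > 1 or any(v in graph.get(v, []) for v in c)]


# States of an automaton for x:=5(5)*; with labels dropped: state 4 loops on '5'.
automaton_graph = {0: [1], 1: [2], 2: [3], 3: [4], 4: [4, 5], 5: []}
cycles = cyclic_components(automaton_graph)
```

Here the only cyclic component is the state carrying the self-loop on `5`; since `5` alone is not a whole statement, this is the case in which the cycle would be collapsed into a single transition labeled with a regular expression.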

As an example, note that by applying Alg. 1 to the automaton for $\mathcal{S}_{\mathtt{str}}$ in Fig. 6, we generate the $\mathsf{CFG}$ $\mathcal{P}_{\mathtt{str}}$, depicted in Fig. 5(b). It is worth noting that the $\mathsf{CFG}$ obtained so far may contain abstract expressions on edges, hence edges may represent an infinite collection of statements. At this point, we need to approximate these edges to make it possible to analyze the $\mathsf{CFG}$.

Making the Code Analyzable: Abstracting the $\mathsf{CFG}$ .

Let us recall that we have to perform the analysis $\rho$ also on the resulting code, in order to continue the static analysis. Hence, as observed before, we have to combine the code abstraction corresponding to the generated (abstract) $\mathsf{CFG}$ with the code abstraction induced by the semantic abstraction $\rho$, i.e., $\overline{\Upsilon}[\rho]$, which models, as code abstraction, the analysis.
First of all, we have to formally characterize the abstraction $\eta$ induced by the construction of the $\mathsf{CFG}$ given above, namely we characterize how the construction abstracts together different predicates. Let us build a code abstraction starting from the $\mathsf{CFG}$ $\mathcal{P}=\langle\mbox{\sl Nodes},\mbox{\sl Edges}\rangle$ built in Alg. 1. In particular, let $\mbox{\sl Merge}\stackrel{\mbox{\tiny def}}{=}\{\,\{\,\varphi\in\Psi \mid \langle\ell^{\prime},\varphi,\ell^{\prime\prime}\rangle\in\mbox{\sl Edges}\,\} \mid \ell^{\prime},\ell^{\prime\prime}\in\mbox{\sl Nodes}\,\}\subseteq\Psi$ be the set of collections of predicates between any pair of states in the $\mathsf{CFG}$. We define

$$\eta^{\mathcal{P}}(\wp(\Psi))\stackrel{\mbox{\tiny def}}{=}\wp(\{\,X\in\mbox{\sl Merge} \mid \forall Y\in\mbox{\sl Merge}\smallsetminus\{X\}.\>X\cap Y=\varnothing\,\})\in\mbox{\it uco}(\wp(\Psi))\qquad(2)$$

Note that this abstraction, being characterized starting from the $\mathsf{CFG}$, is defined only in terms of a finite subset of $\Psi$, namely on the predicates in the given $\mathsf{CFG}$, i.e., $\Psi^{\mathcal{P}}\stackrel{\mbox{\tiny def}}{=}\Psi\cap\{\,\varphi \mid \langle\ell^{\prime},\varphi,\ell^{\prime\prime}\rangle\in\mbox{\sl Edges}\,\}$.
In the example, $\Psi^{\mathcal{P}_{\mathtt{str}}(\wp(\Psi))}=\{\{\,\mathtt{x:=}s \mid s\in(5)^{+}\,\},\{\mathtt{x:=1}\},\{\mathtt{(x<5)}\},\{\neg\mathtt{(x<5)}\}\}$, hence we have that $\eta^{\mathcal{P}_{\mathtt{str}}}=\wp(\Psi^{\mathcal{P}_{\mathtt{str}}})$, being $\Psi^{\mathcal{P}_{\mathtt{str}}}$ already a partition. In Fig. 8(a) this abstraction is partially depicted.
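The construction of Merge and of the pairwise-disjoint blocks used in Eq. 2 can be sketched in Python. The edge list below is a hypothetical finite stand-in for $\mathcal{P}_{\mathtt{str}}$ (Fig. 5(b) is not reproduced here): the infinite set of assignments x:=s is sampled by two elements, node labels are invented, negation is written `!`, and node pairs without edges (which would contribute the empty set) are skipped.

```python
from collections import defaultdict


def merge_sets(edges):
    """Collect, for each pair of nodes, the predicates labeling the edges between them."""
    by_pair = defaultdict(set)
    for src, predicate, dst in edges:
        by_pair[(src, dst)].add(predicate)
    return {frozenset(preds) for preds in by_pair.values()}


def disjoint_blocks(merge):
    """Keep the sets of Merge that are disjoint from every other set of Merge."""
    return {x for x in merge
            if all(x.isdisjoint(y) for y in merge if y != x)}


# Hypothetical finite sample of the edges of P_str
cfg_edges = [
    ("l0", "(x<5)", "l1"),
    ("l0", "!(x<5)", "l2"),
    ("l1", "x:=5", "l3"),
    ("l1", "x:=55", "l3"),
    ("l2", "x:=1", "l3"),
]
merge = merge_sets(cfg_edges)
blocks = disjoint_blocks(merge)
```

In this sample every set in `merge` is disjoint from the others, so `blocks` coincides with `merge`, in line with the observation that the predicates of the example already form a partition.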

Figure 8: (a) Code abstraction $\eta^{\mathcal{P}_{\mathtt{str}}}$ w.r.t. the $\mathsf{CFG}$ reported in Fig. 5(b), (b) Code abstraction $\overline{\Upsilon}[\mathsf{\rho_{\scriptscriptstyle\mathcal{CS}}}]^{\mathcal{P}_{\mathtt{str}}}$

Finally, we need to satisfy Eq. 1 (completeness) between the code abstraction $\eta^{\mathcal{P}}$, built so far, and the static analysis, modeled as a semantic abstraction $\rho$, performing $\rho_{\mbox{\tiny\sl S}}$ (introduced above) on strings. Clearly, we have no guarantee that $\eta^{\mathcal{P}}$ satisfies Eq. 1; hence, we have to (further) abstract the $\mathsf{CFG}$ in order to guarantee completeness w.r.t. the performed static analysis, namely in order to make it possible to perform the given static analysis on the code in the generated $\mathsf{CFG}$. As observed in the previous section, in order to force completeness, we have to combine the desired abstraction $\eta^{\mathcal{P}}$ on predicates with the code abstraction $\overline{\Upsilon}[\rho]$. Formally, in order to allow this operation, since $\eta^{\mathcal{P}}$ is defined on $\Psi^{\mathcal{P}}$, we also have to restrict $\overline{\Upsilon}[\rho]$ to $\Psi^{\mathcal{P}}$ (denoted $\overline{\Upsilon}[\rho]^{\mathcal{P}}$). This abstraction is obtained by intersecting the meaning of each of its elements (i.e., its concretization) with the set of predicates in the $\mathsf{CFG}$. In the running example, we have to compute $\overline{\Upsilon}[\mathsf{\rho_{\scriptscriptstyle\mathcal{CS}}}]^{\mathcal{P}_{\mathtt{str}}}$, which is the code abstraction induced by $\mathsf{Sign}$ on the predicates in $\mathcal{P}_{\mathtt{str}}$. For instance, all the predicates in $\{\,\mathtt{x:=}s \mid s\in(5)^{+}\,\}$ and the predicate x:=1 cannot be distinguished when integers are abstracted by observing only their signs. The resulting abstraction is depicted in Fig. 8(b), where the abstract predicate x:= $\mathbb{Z}^{+}$ corresponds, in the concrete, to the set of predicates $\{\,\mathtt{x:=}s \mid s\in(5)^{+}\,\}\cup\{\mathtt{x:=1}\}$, while x< $\mathbb{Z}^{+}$ and $\neg$ (x< $\mathbb{Z}^{+}$ ) correspond, respectively, to $\{\mathtt{(x<5)}\}$ and to $\{\neg\mathtt{(x<5)}\}$ (all the other elements correspond to $\bot$).
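How a sign abstraction makes concrete predicates indistinguishable can be sketched in Python by replacing each integer literal with its sign and grouping the predicates that end up with the same abstract form. This is only an illustration: predicates are plain strings, the labels `Z+`, `Z-` and `0` are naming assumptions, negation is written `!`, and the infinite set of assignments x:=s is sampled by three elements.

```python
import re
from collections import defaultdict


def sign(n):
    """Sign of an integer, using hypothetical labels Z+, Z- and 0."""
    return "Z+" if n > 0 else "Z-" if n < 0 else "0"


def abstract_predicate(predicate):
    """Replace every integer literal occurring in a predicate by its sign."""
    return re.sub(r"-?\d+", lambda m: sign(int(m.group())), predicate)


def group_by_abstraction(predicates):
    """Map each abstract predicate to the concrete predicates it represents."""
    groups = defaultdict(set)
    for p in predicates:
        groups[abstract_predicate(p)].add(p)
    return dict(groups)


# Finite sample of the predicates of P_str
predicates = ["x:=5", "x:=55", "x:=555", "x:=1", "x<5", "!(x<5)"]
groups = group_by_abstraction(predicates)
```

All sampled assignments collapse into the single abstract predicate `x:=Z+`, while the guard and its negation each remain in a block of their own.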

Finally, we aim at building a code abstraction which can be interpreted by the initial abstract interpreter $\rho$, namely, one that satisfies Eq. 1. By Th. 3.4 such an abstraction is $\overline{\eta}_{\uparrow}^{\mathcal{P}}=\overline{\Upsilon}[\rho]^{\mathcal{P}}\circ\eta^{\mathcal{P}}$.

Corollary 4.1.

Let $\rho\in\mbox{\it uco}(\mathbb{V})$ be additive. Then the code abstraction $\overline{\eta}_{\uparrow}^{\mathcal{P}}=\overline{\Upsilon}[\rho]^{\mathcal{P}}\circ\eta^{\mathcal{P}}\in\mbox{\it uco}(\Psi^{\mathcal{P}})$ is complete w.r.t. the semantic abstraction $\rho$, i.e., it satisfies Eq. 1.

Hence, in our example, the code abstraction $\overline{\eta}_{\uparrow}^{\mathcal{P}_{\mathtt{str}}}=\overline{\Upsilon}[\mathsf{\rho_{\scriptscriptstyle\mathcal{CS}}}]^{\mathcal{P}_{\mathtt{str}}}\circ\eta^{\mathcal{P}_{\mathtt{str}}}$ satisfies Eq. 1. In particular, we can observe that $\overline{\eta}_{\uparrow}^{\mathcal{P}_{\mathtt{str}}}=\overline{\Upsilon}[\mathsf{\rho_{\scriptscriptstyle\mathcal{CS}}}]^{\mathcal{P}_{\mathtt{str}}}$. Finally, we have to abstract the $\mathsf{CFG}$ $\mathcal{P}$, previously generated, by applying $\overline{\eta}_{\uparrow}^{\mathcal{P}}$ to each edge of the $\mathsf{CFG}$. In our example, the abstract $\mathsf{CFG}$ obtained by abstracting $\mathcal{P}_{\mathtt{str}}$ by means of $\overline{\eta}_{\uparrow}^{\mathcal{P}_{\mathtt{str}}}$ is depicted in Fig. 9.
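The observation that, in the example, the composed abstraction coincides with $\overline{\Upsilon}[\mathsf{\rho_{\scriptscriptstyle\mathcal{CS}}}]^{\mathcal{P}_{\mathtt{str}}}$ can be spelled out as follows. The derivation assumes, as is standard for upper closure operators, that a closure is identified with the set of its fix-points, so that $\eta^{\mathcal{P}_{\mathtt{str}}}=\wp(\Psi^{\mathcal{P}_{\mathtt{str}}})$ means that $\eta^{\mathcal{P}_{\mathtt{str}}}$ is the identity.

```latex
\begin{align*}
\eta^{\mathcal{P}_{\mathtt{str}}} = \wp(\Psi^{\mathcal{P}_{\mathtt{str}}})
  &\;\Longrightarrow\; \forall X\subseteq\Psi^{\mathcal{P}_{\mathtt{str}}}.\;
     \eta^{\mathcal{P}_{\mathtt{str}}}(X) = X \\
  &\;\Longrightarrow\; \overline{\eta}_{\uparrow}^{\mathcal{P}_{\mathtt{str}}}(X)
     = \overline{\Upsilon}[\rho_{\scriptscriptstyle\mathcal{CS}}]^{\mathcal{P}_{\mathtt{str}}}
       \bigl(\eta^{\mathcal{P}_{\mathtt{str}}}(X)\bigr)
     = \overline{\Upsilon}[\rho_{\scriptscriptstyle\mathcal{CS}}]^{\mathcal{P}_{\mathtt{str}}}(X)
\end{align*}
```

In other words, when the predicates of the generated $\mathsf{CFG}$ already form a partition, all the abstraction applied to the edges comes from the analysis itself.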

A Taste of Implementation.

A static analyzer based on finite state automata is available at [3]. Moreover, we have implemented Alg. 1 in order to validate our approach¹³. The implementation of a static analysis of abstract $\mathsf{CFG}$s is at an early stage of development and is left as future work. Nevertheless, the implementation of Alg. 1 is able to parse executable automata and to abstract them into abstract $\mathsf{CFG}$s, as previously described. In order to make these abstract $\mathsf{CFG}$s effectively analyzable, we are currently extending the static analyzer, and the underlying abstract interpreter, to parse, and thus analyze, also abstract predicates.

¹³ Available at https://github.com/SPY-Lab/java-fsm-library/tree/abstract-parser

5 Conclusion

We conclude by highlighting the value, in the context of static analysis, of the framework presented in this paper. What we propose here is a precision improvement of [4], an analysis that attacks an extremely hard problem in static program analysis by abstract interpretation, since the standard static analysis assumption (i.e., the program code we want to analyze must be static) is broken when we have to deal with string-to-code statements. In [4], we have shown that even without this assumption, it is still possible for static analysis to semantically analyze dynamically mutating code in a meaningful and sound way. It has been the very first proof of concept for a sound static analysis for self-modifying code based on bounded reflection for a high-level script-like programming language. In this paper, we improve this approach by characterizing code transformations that do not lose precision w.r.t. a fixed abstract semantics/analysis of the code. The idea we develop consists of embedding the property to analyze in the code transformation in order to make the property analysis work also on the transformed code (as it happens in dynamic code analysis). Hence, the main contribution is to make even more precise the first truly dynamic static analyzer, which has the feature to keep the analysis going on, even when code is dynamically built.
Clearly, the framework improved here is still at an early stage and there is much work to do, not only on the presented algorithm and its implementation, which clearly have to be further developed, but also on making the approach more precise and general. As far as the algorithm is concerned, we have not explicitly provided soundness and completeness proofs or discussions. In particular, completeness holds under decidable hypotheses (the input automaton has to recognize only executable strings), here only briefly treated, and therefore these aspects need further formal development.
On the other hand, a direction for improving precision can be that of integrating the proposed static analysis in a hybrid solution, by using, for instance, taint analysis (or other dynamic analyses) for driving when to apply static analysis, or considering more advanced forms of automata-based domains for abstracting strings, such as the one reported in [24]. Finally, we have considered only eval as a string-to-code statement, while there are other ways of dynamically executing code built out of strings that should be investigated. However, we strongly believe that the same approach used for eval could be easily applied to any other string-to-code statement. Moreover, we believe that this framework could be instantiated in order to deal with other forms of code transformations, maybe by considering more general $\mathsf{CFG}$ abstractions.

From a more theoretical point of view, interesting future work consists of exploiting the proposed approach for analyzing code in order to investigate, on dynamic languages, several application contexts where static analysis by abstract interpretation has been exploited. First of all, we could trace (abstract) flows of information during execution [15, 21, 19, 20, 13, 12, 11] in order to tackle different security issues, such as the detection of (abstract) code injections [7, 6] or the formal characterization of dynamic code obfuscators and of their potency [10, 14]. Moreover, the ability to analyze malware code could be exploited for extracting code properties which could be used for analyzing code similarity [25], a technique useful, for instance, to identify or at least classify malicious code.

References

[2] Roberto Amadini, Graeme Gange, François Gauthier, Alexander Jordan, Peter Schachte, Harald Søndergaard, Peter J. Stuckey & Chenyi Zhang (2018): Reference Abstract Domains and Applications to String Analysis. Fundam. Informaticae 158(4), pp. 297–326, 10.3233/FI-2018-1650.
[3] Vincenzo Arceri & Isabella Mastroeni (2019): An Automata-based Abstract Semantics for String Manipulation Languages. In Alexei Lisitsa & Andrei P. Nemytykh, editors: Proceedings Seventh International Workshop on Verification and Program Transformation, VPT@Programming 2019, Genova, Italy, 2nd April 2019, EPTCS 299, pp. 19–33, 10.4204/EPTCS.299.5.
[4] Vincenzo Arceri & Isabella Mastroeni (2021): Analyzing Dynamic Code: A Sound Abstract Interpreter for Evil Eval. ACM Trans. Priv. Secur. 24(2), pp. 10:1–10:38, 10.1145/3426470.
[5] Vincenzo Arceri, Isabella Mastroeni & Sunyi Xu (2020): Static Analysis for ECMAScript String Manipulation Programs. Appl. Sci. 10, p. 3525, 10.3390/app10103525.
[6] Musard Balliu & Isabella Mastroeni (2010): A Weakest Precondition Approach to Robustness. Trans. Comput. Sci. 10, pp. 261–297, 10.1007/978-3-642-17499-5_11.
[7] Samuele Buro & Isabella Mastroeni (2018): Abstract Code Injection - A Semantic Approach Based on Abstract Non-Interference. In Isil Dillig & Jens Palsberg, editors: Verification, Model Checking, and Abstract Interpretation - 19th International Conference, VMCAI 2018, Los Angeles, CA, USA, January 7-9, 2018, Proceedings, Lecture Notes in Computer Science 10747, Springer, pp. 116–137, 10.1007/978-3-319-73721-8_6.
[8] Tae-Hyoung Choi, Oukseh Lee, Hyunha Kim & Kyung-Goo Doh (2006): A Practical String Analyzer by the Widening Approach. In Naoki Kobayashi, editor: Programming Languages and Systems, 4th Asian Symposium, APLAS 2006, Sydney, Australia, November 8-10, 2006, Proceedings, Lecture Notes in Computer Science 4279, Springer, pp. 374–388, 10.1007/11924661_23.
[9] Giulia Costantini, Pietro Ferrara & Agostino Cortesi (2015): A suite of abstract domains for static analysis of string values. Softw. Pract. Exp. 45(2), pp. 245–287, 10.1002/spe.2218.
[10] Roberto Giacobazzi, Neil D. Jones & Isabella Mastroeni (2012): Obfuscation by partial evaluation of distorted interpreters. In Oleg Kiselyov & Simon J. Thompson, editors: Proceedings of the ACM SIGPLAN 2012 Workshop on Partial Evaluation and Program Manipulation, PEPM 2012, Philadelphia, Pennsylvania, USA, January 23-24, 2012, ACM, pp. 63–72, 10.1145/2103746.2103761.
[11] Roberto Giacobazzi & Isabella Mastroeni (2004): Proving Abstract Non-interference. In Jerzy Marcinkowski & Andrzej Tarlecki, editors: Computer Science Logic, 18th International Workshop, CSL 2004, 13th Annual Conference of the EACSL, Karpacz, Poland, September 20-24, 2004, Proceedings, Lecture Notes in Computer Science 3210, Springer, pp. 280–294, 10.1007/978-3-540-30124-0_23.
[12] Roberto Giacobazzi & Isabella Mastroeni (2010): Adjoining classified and unclassified information by abstract interpretation. J. Comput. Secur. 18(5), pp. 751–797, 10.3233/JCS-2009-0382.
[13] Roberto Giacobazzi & Isabella Mastroeni (2010): A Proof System for Abstract Non-interference. J. Log. Comput. 20(2), pp. 449–479, 10.1093/logcom/exp053.
[14] Roberto Giacobazzi & Isabella Mastroeni (2012): Making Abstract Interpretation Incomplete: Modeling the Potency of Obfuscation. In Antoine Miné & David Schmidt, editors: Static Analysis - 19th International Symposium, SAS 2012, Deauville, France, September 11-13, 2012. Proceedings, Lecture Notes in Computer Science 7460, Springer, pp. 129–145, 10.1007/978-3-642-33125-1_11.
[15] Roberto Giacobazzi & Isabella Mastroeni (2018): Abstract Non-Interference: A Unifying Framework for Weakening Information-flow. ACM Trans. Priv. Secur. 21(2), pp. 9:1–9:31, 10.1145/3175660.
[16] Roberto Giacobazzi & Elisa Quintarelli (2001): Incompleteness, Counterexamples, and Refinements in Abstract Model-Checking. In Patrick Cousot, editor: Static Analysis, 8th International Symposium, SAS 2001, Paris, France, July 16-18, 2001, Proceedings, Lecture Notes in Computer Science 2126, Springer, pp. 356–373, 10.1007/3-540-47764-0_20.
[17] Simon Holm Jensen, Peter A. Jonsson & Anders Møller (2012): Remedying the eval that men do. In Mats Per Erik Heimdahl & Zhendong Su, editors: International Symposium on Software Testing and Analysis, ISSTA 2012, Minneapolis, MN, USA, July 15-20, 2012, ACM, pp. 34–44, 10.1145/2338965.2336758.
[18] Vineeth Kashyap, Kyle Dewey, Ethan A. Kuefner, John Wagner, Kevin Gibbons, John Sarracino, Ben Wiedermann & Ben Hardekopf (2014): JSAI: a static analysis platform for JavaScript. In Shing-Chi Cheung, Alessandro Orso & Margaret-Anne D. Storey, editors: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, (FSE-22), Hong Kong, China, November 16 - 22, 2014, ACM, pp. 121–132, 10.1145/2635868.2635904.
[19] Isabella Mastroeni (2013): Abstract interpretation-based approaches to Security - A Survey on Abstract Non-Interference and its Challenging Applications. In Anindya Banerjee, Olivier Danvy, Kyung-Goo Doh & John Hatcliff, editors: Semantics, Abstract Interpretation, and Reasoning about Programs: Essays Dedicated to David A. Schmidt on the Occasion of his Sixtieth Birthday, Manhattan, Kansas, USA, 19-20th September 2013, EPTCS 129, pp. 41–65, 10.4204/EPTCS.129.4.
[20] Isabella Mastroeni & Durica Nikolic (2010): Abstract Program Slicing: From Theory towards an Implementation. In Jin Song Dong & Huibiao Zhu, editors: Formal Methods and Software Engineering - 12th International Conference on Formal Engineering Methods, ICFEM 2010, Shanghai, China, November 17-19, 2010. Proceedings, Lecture Notes in Computer Science 6447, Springer, pp. 452–467, 10.1007/978-3-642-16901-4_30.
[21] Isabella Mastroeni & Damiano Zanardini (2017): Abstract Program Slicing: An Abstract Interpretation-Based Approach to Program Slicing. ACM Trans. Comput. Log. 18(1), pp. 7:1–7:58, 10.1145/3029052.
[22] Nikos Mavrogiannopoulos, Nessim Kisserli & Bart Preneel (2011): A taxonomy of self-modifying code for obfuscation. Comput. Secur. 30(8), pp. 679–691, 10.1016/j.cose.2011.08.007.
[23] Antoine Miné (2013): Static analysis by abstract interpretation of concurrent programs. (Analyse statique par interprétation abstraite de programmes concurrents). Available at https://tel.archives-ouvertes.fr/tel-00903447.
[24] Luca Negrini, Vincenzo Arceri, Pietro Ferrara & Agostino Cortesi (2021): Twinning Automata and Regular Expressions for String Static Analysis. In Fritz Henglein, Sharon Shoham & Yakir Vizel, editors: Verification, Model Checking, and Abstract Interpretation - 22nd International Conference, VMCAI 2021, Copenhagen, Denmark, January 17-19, 2021, Proceedings, Lecture Notes in Computer Science 12597, Springer, pp. 267–290, 10.1007/978-3-030-67067-2_13.
[25] Mila Dalla Preda, Roberto Giacobazzi, Arun Lakhotia & Isabella Mastroeni (2015): Abstract Symbolic Automata: Mixed syntactic/semantic similarity analysis of executables. In Sriram K. Rajamani & David Walker, editors: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015, ACM, pp. 329–341, 10.1145/2676726.2676986.
[26] Gregor Richards, Christian Hammer, Brian Burg & Jan Vitek (2011): The Eval That Men Do - A Large-Scale Study of the Use of Eval in JavaScript Applications. In Mira Mezini, editor: ECOOP 2011 - Object-Oriented Programming - 25th European Conference, Lancaster, UK, July 25-29, 2011 Proceedings, Lecture Notes in Computer Science 6813, Springer, pp. 52–78, 10.1007/978-3-642-22655-7_4.
[27] Robert Endre Tarjan (1972): Depth-First Search and Linear Graph Algorithms. SIAM J. Comput. 1(2), pp. 146–160, 10.1137/0201010.
[28] Reinhard Wilhelm, Helmut Seidl & Sebastian Hack (2013): Compiler Design - Syntactic and Semantic Analysis. Springer, 10.1007/978-3-642-17540-4.