I’ve often noticed that hard problems are often found in many forms in seemingly irrelevant projects. I ran into an interesting one the other week. I am running some experiments that test verification conditions (VCs), which are algorithms that produce a formula that can be tested with a SMT solver to demonstrate some forms of program correctness. VCs are a very useful technique right now because there are some really great SMT solvers out there – we often use STP, Yices, Boolector, Z3, and CVC3.
You might be familiar with a very common VC, Forward Symbolic Execution (FSE). Dijkstra also created his own VC, called Weakest Precondition (WP), which is generally defined on a language called the Guarded Command Language (GCL). The specifics of the GCL do not matter for this blog post. What matters is that it is a structured programming language; that is, it contains no jumps or goto statements. The only control flow operators are the sequence operator, and a conditional operator. Many followup algorithms for WP have followed in the footsteps of Dijkstra and defined their algorithms to work on the GCL.
The problem is that the GCL is not a real programming language, and so
we must first convert programs to GCL to run these algorithms on
them. Of course, for me, this means converting BAP programs to GCL.
It turns out that my adviser solved this problem while working on his
PhD. His dissertation gives an algorithm that visits a program in CFG
form in reverse topological order. When visiting a conditional branch
node n
that has
two successors s1
and s2
, the GCL programs gcl1
and gcl2
are
computed from s1
and s2
(don’t worry about the details). gcl1
and gcl2
are split into three parts: a common suffix, and a prefix
for gcl1
and gcl2
that is not shared. This is then converted to
(if (e) gclprefix1 else gclprefix2); suffix
. The idea is to avoid
duplicating nodes in the GCL as much as possible.
The problem I ran into was when I started looking at large programs, the GCL size was growing exponentially. After looking at some of the examples, I realized this was a problem I had seen before! The above algorithm is actually performing structural analysis: it is taking an unstructured program (think of a CFG) and trying to give it structure in the form of if-then-else or sequences.
A few months ago I was working on a more general structural analysis algorithm. One interesting thing we found is that the short-circuit behavior from high-level languages causes unstructured CFGs at the binary level. For instance, the program:
1 2 3 4 |
|
creates the following CFG:
Notice that BB_6 and BB_7 are reachable from BB_5 but not dominated by it; this is not good for structural analysis!
The high-level point of this post is that many hard problems are related in some way. While working on the VCs, I had already concluded that algorithms that avoid the GCL conversion process entirely are preferable. I probably would not have taken the time to figure out precisely why the GCL conversion had so much code duplication in it, had I not remembered seeing a similar problem before. It always pays to try to relate seemingly new hard problems back to similar ones that you’ve solved or seen solved.