I am a big believer in the appliation of semantics-based program analysis to binary analysis. In other words, I think that we should examine program semantics rather than program syntax. Syntax-oriented analyses operate at the assembly level, e.g.,
while semantics-oriented analyses look at a lifted form of assembly, like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
However, over the past few days, I’ve concluded that there are some cases in which it may be preferable to use syntax. To give you some perspective, I’ve been working on our VSA implementation, as I’ve described in my previous blog posts. One of the problems I have been working on is adding high level edge conditions at branch points.
What are high level edge conditions? Well, to really grok what that means, you need to see how BAP would model a branch. Let’s use the below program as an example, which merely does a compare and branch.
Now, before we go further, stop and ask yourself “What is the
high-level branch condition?” You probably just thought
eax. And you’d be right! Let’s see what a semantics-oriented answer
would be from BAP.
Here, the branch condition is
~(cf | zf). Gee, thanks BAP. If we
apply copy propagation, which is essentially substitution, we get
~(ebx < eax | 0 == ebx - eax). Obviously, with a little
simplification, this turns into
ebx >= eax & ebx != eax, or
My point here is that recovering the high-level condition is awkward. In semantics approaches, we lift from syntax to semantics, and yet to recover this high-level condition we are effectively undoing this transformation. I’ve seen this come up in a few places, with our VSA implementation only being the most recent.
It would be nice if there was a nice, principled way of recovering higher-level conditions. For example, if we could create a lattice that compares how “high level” expressions are, and automatically map an expression to a higher expression on the lattice, I’d feel much better about this problem. As things stand, I have defined a number of simplifications that get the job done, but they are somewhat specific to our lifting mechanism. If we change how we lift instructions, we could break the high-level condition recovery, which just seems ugly.
On the bright side, BAP’s VSA implementation is good enough to compute the target addresses of switch jump tables now. So hopefully tomorrow I’ll have CFG recovery at least partially working! Here’s the “new” version of the edge conditions for the above example: