Syntax over Semantics

I am a big believer in the appliation of semantics-based program analysis to binary analysis. In other words, I think that we should examine program semantics rather than program syntax. Syntax-oriented analyses operate at the assembly level, e.g.,

add %eax, %ebx

while semantics-oriented analyses look at a lifted form of assembly, like:

addr 0x0 @asm "add    %eax,%ebx"
label pc_0x0
T_t1:u32 = R_EBX:u32
T_t2:u32 = R_EAX:u32
R_EBX:u32 = R_EBX:u32 + T_t2:u32
R_CF:bool = R_EBX:u32 < T_t1:u32
R_AF:bool = 0x10:u32 == (0x10:u32 & (R_EBX:u32 ^ T_t1:u32 ^ T_t2:u32))
R_OF:bool = high:bool((T_t1:u32 ^ ~T_t2:u32) & (T_t1:u32 ^ R_EBX:u32))
R_PF:bool =
  ~low:bool(R_EBX:u32 >> 7:u32 ^ R_EBX:u32 >> 6:u32 ^ R_EBX:u32 >>
  5:u32 ^
            R_EBX:u32 >> 4:u32 ^ R_EBX:u32 >> 3:u32 ^ R_EBX:u32 >>
            2:u32 ^
            R_EBX:u32 >> 1:u32 ^ R_EBX:u32)
R_SF:bool = high:bool(R_EBX:u32)
R_ZF:bool = 0:u32 == R_EBX:u32

However, over the past few days, I’ve concluded that there are some cases in which it may be preferable to use syntax. To give you some perspective, I’ve been working on our VSA implementation, as I’ve described in my previous blog posts. One of the problems I have been working on is adding high level edge conditions at branch points.

What are high level edge conditions? Well, to really grok what that means, you need to see how BAP would model a branch. Let’s use the below program as an example, which merely does a compare and branch.

cmp %eax, %ebx
ja foo

Now, before we go further, stop and ask yourself “What is the high-level branch condition?” You probably just thought ebx > eax. And you’d be right! Let’s see what a semantics-oriented answer would be from BAP.

Here, the branch condition is ~(cf | zf). Gee, thanks BAP. If we apply copy propagation, which is essentially substitution, we get ~(ebx < eax | 0 == ebx - eax). Obviously, with a little simplification, this turns into ebx >= eax & ebx != eax, or ebx > eax.

My point here is that recovering the high-level condition is awkward. In semantics approaches, we lift from syntax to semantics, and yet to recover this high-level condition we are effectively undoing this transformation. I’ve seen this come up in a few places, with our VSA implementation only being the most recent.

It would be nice if there was a nice, principled way of recovering higher-level conditions. For example, if we could create a lattice that compares how “high level” expressions are, and automatically map an expression to a higher expression on the lattice, I’d feel much better about this problem. As things stand, I have defined a number of simplifications that get the job done, but they are somewhat specific to our lifting mechanism. If we change how we lift instructions, we could break the high-level condition recovery, which just seems ugly.

On the bright side, BAP’s VSA implementation is good enough to compute the target addresses of switch jump tables now. So hopefully tomorrow I’ll have CFG recovery at least partially working! Here’s the “new” version of the edge conditions for the above example:

Ed's Blog

A PhD Student's Musings

Syntax Over Semantics

Comments