Ed's Blog

A PhD Student's Musings

Representing Binary Programs

On BAP

I’m one of the main developers of BAP, the Binary Analysis Platform. BAP has a design borrowed from modern optimizing compilers like LLVM and gcc. At the heart of these compilers is a language- and architecture-independent intermediate language (IL). An independent IL is important because it connects all the pieces of the compiler together.

The parser of a compiler reads high-level language code (like Java or C) and converts it to IL. The IL is usually optimized, and then a code generation algorithm converts it to platform-specific assembly code. The nice thing about an independent IL is that the parser doesn’t need to know anything about the optimization or code generation routines, because they all speak the same “language”: IL!

Although BAP isn’t really a compiler, it looks a lot like one. Instead of parsing high-level languages, it reads binary code. This binary code is lifted to IL, which is then analyzed. (I recently wrote a code generator for BAP too, but that’s not publicly available yet.)

The BAP IL is designed to be really simple. In fact, there are only a few kinds of statements:

  • Assert. Checks that a condition holds.
  • Assignment. Assigns the value of an expression to a variable.
  • Labels. A label names a line in the IL so that it can be jumped to.
  • Jumps and conditional jumps. Transfer control, either unconditionally or based on a condition.
  • Special. Represents system calls and other special events.

Because there are so few types of statements in our IL, it’s really easy to analyze. Imagine writing an analysis that works directly on x86 assembly. You’d have to implement your analysis for hundreds of instructions. Yuck. With BAP, you just write your analysis on the IL, which only has a few types of statements. And the BAP IL is not tied to any one architecture, so your analysis will probably work on both ARM and x86 if you’re careful.
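To make that concrete, here is a minimal sketch, in OCaml, of what such a statement type could look like. These are not BAP’s actual type definitions; the constructor names and the string-based expression type are simplifications of mine:

(* A hypothetical, stripped-down IL statement type. BAP's real AST
   carries more detail (value types like u32, attributes), but the
   shape is the same: a handful of constructors. *)
type exp =
  | Var of string                 (* e.g. R_EBX *)
  | Int of int64                  (* e.g. 0x10 *)
  | BinOp of string * exp * exp   (* e.g. "+", "^", "<<" *)

type stmt =
  | Assert of exp
  | Assign of string * exp        (* variable := expression *)
  | Label of string               (* names a line so it can be jumped to *)
  | Jmp of exp                    (* unconditional jump *)
  | CJmp of exp * exp * exp       (* condition, taken target, fall-through *)
  | Special of string             (* system calls and other events *)

(* An analysis is one pattern match over six constructors instead of a
   case for hundreds of x86 opcodes. Example: collect every variable
   that is assigned somewhere in a program. *)
let assigned_vars (prog : stmt list) : string list =
  List.filter_map (function Assign (v, _) -> Some v | _ -> None) prog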

Let’s look at some BAP IL. Here’s a simple sequence of x86 assembly instructions:

addl %eax, %ebx
shl %cl, %ebx
jc error

error:
  nop

Below is the BAP IL for this assembly code. It might seem overly verbose to you, but this actually makes things easier.

Sample BAP IL
addr 0x0 @asm "add    %eax,%ebx"
label pc_0x0
T_t1:u32 = R_EBX:u32
T_t2:u32 = R_EAX:u32
R_EBX:u32 = R_EBX:u32 + T_t2:u32
R_CF:bool = R_EBX:u32 < T_t1:u32
R_AF:bool = 0x10:u32 == (0x10:u32 & (R_EBX:u32 ^ T_t1:u32 ^ T_t2:u32))
R_OF:bool = high:bool((T_t1:u32 ^ ~T_t2:u32) & (T_t1:u32 ^ R_EBX:u32))
R_PF:bool =
  ~low:bool(R_EBX:u32 >> 7:u32 ^ R_EBX:u32 >> 6:u32 ^ R_EBX:u32 >> 5:u32 ^
            R_EBX:u32 >> 4:u32 ^ R_EBX:u32 >> 3:u32 ^ R_EBX:u32 >> 2:u32 ^
            R_EBX:u32 >> 1:u32 ^ R_EBX:u32)
R_SF:bool = high:bool(R_EBX:u32)
R_ZF:bool = 0:u32 == R_EBX:u32
addr 0x2 @asm "shl    %cl,%ebx"
label pc_0x2
T_origDEST:u32 = R_EBX:u32
T_origCOUNT:u32 = R_ECX:u32 & 0x1f:u32
R_EBX:u32 = R_EBX:u32 << (R_ECX:u32 & 0x1f:u32)
R_CF:bool =
  if T_origCOUNT:u32 == 0:u32 then R_CF:bool else
  low:bool(T_origDEST:u32 >> 0x20:u32 - T_origCOUNT:u32)
R_OF:bool =
  if T_origCOUNT:u32 == 0:u32 then R_OF:bool else
  if T_origCOUNT:u32 == 1:u32 then high:bool(R_EBX:u32) ^ R_CF:bool else
  unknown "OF undefined after shift":bool
R_SF:bool =
  if T_origCOUNT:u32 == 0:u32 then R_SF:bool else high:bool(R_EBX:u32)
R_ZF:bool =
  if T_origCOUNT:u32 == 0:u32 then R_ZF:bool else 0:u32 == R_EBX:u32
R_PF:bool =
  if T_origCOUNT:u32 == 0:u32 then R_PF:bool else
  ~low:bool(R_EBX:u32 >> 7:u32 ^ R_EBX:u32 >> 6:u32 ^ R_EBX:u32 >> 5:u32 ^
            R_EBX:u32 >> 4:u32 ^ R_EBX:u32 >> 3:u32 ^ R_EBX:u32 >> 2:u32 ^
            R_EBX:u32 >> 1:u32 ^ R_EBX:u32)
R_AF:bool =
  if T_origCOUNT:u32 == 0:u32 then R_AF:bool else
  unknown "AF undefined after shift":bool
addr 0x4 @asm "jb     0x0000000000000006"
label pc_0x4
cjmp R_CF:bool, 6:u32, "nocjmp0"
label nocjmp0
addr 0x6 @asm "nop"
label pc_0x6

BAP is very good at representing the effects of binary code such as the snippet shown above. Notice that it explicitly computes the values of all of the processor’s status flags that change, since these are used for control flow. It can also represent an entire binary program in terms of the program’s control flow, as a Control Flow Graph (CFG). Here’s what the CFG for the above assembly program looks like:

Each edge represents a possible transfer of control flow.

We can see from the CFG that even though there is a conditional jump in the assembly code at address 0x4, the destination will be the same regardless of whether the jump is taken! (This is because I made a quick toy example; this would not usually happen.)
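Under the hood, a CFG is just a graph whose nodes are basic blocks and whose edges are possible control transfers. Here is a minimal sketch of a representation, in the same hypothetical OCaml style as above; BAP’s actual CFG module is much richer:

(* A toy CFG: basic blocks keyed by label, edges as an adjacency list.
   This is a sketch, not BAP's actual representation. *)
module StringMap = Map.Make (String)

type 'stmt cfg = {
  blocks : 'stmt list StringMap.t;   (* label -> statements in the block *)
  succs  : string list StringMap.t;  (* label -> successor block labels *)
}

(* For the example above, both out-edges of the block at pc_0x4 would
   lead to pc_0x6, which is why the jump direction doesn't matter. *)
let successors g label =
  match StringMap.find_opt label g.succs with
  | Some ss -> ss
  | None -> []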

Function Analysis

What happens when we start to reason about functions? Let’s say we use this assembly file:

f1:     call f2
        ret
f2:     mov $42, %eax
        ret

We get the following CFG:

But something has gone wrong! In the assembly code, we see that f1 calls f2, and then executes a ret instruction. But in the CFG, we call f2, and then the ret jumps to BB_Indirect, which points to every block! BB_Indirect essentially means we can’t narrow down exactly where the jump will go. This makes sense for a ret instruction because a well-behaved function should return control to its caller. Since there could be multiple call sites, we can’t narrow the target of the ret down to a single address.
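One way to read BB_Indirect: when the analysis cannot resolve a jump target, the only sound answer is “it could go anywhere”. A tiny OCaml sketch of that conservative successor computation, with hypothetical names:

(* When a jump target is a known label we get one successor edge; when
   it is computed at runtime (like the return address popped by ret),
   the sound over-approximation is an edge to every block. *)
type target = Known of string | Indirect

let jump_successors (all_labels : string list) = function
  | Known l -> [ l ]        (* direct jump: exactly one edge *)
  | Indirect -> all_labels  (* ret, computed jumps: the BB_Indirect case *)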

In some situations, this generic treatment of jumps is desirable. For instance, it allows us to reason about buffer overflows, because it does not assume functions always return to their callers. However, in some scenarios, we may know (or hope) that functions are well behaved, and we should be able to use this knowledge to make analysis easier.

Representing Functions in BAP

First, I read over the Boogie paper to see what ideas we could borrow from it. Boogie does a ton of things, so it is a superset of the BAP IL and most likely always will be. Even so, since its goal is to enable verification, and some of the work we do in BAP is verification or verification-like, we can benefit from some of its design decisions.

At a high level, all code in Boogie lives in functions. There are global variables, and local variables that are accessible only within the function that defines them. Interestingly, functions that modify global variables must explicitly use a Modifies clause to indicate that they change the variable. This obviously helps inter-function analysis scale. However, imagine a function that modifies a global variable memory representing all of memory. The Modifies clause doesn’t distinguish between memory being completely zeroed out and a single bit being changed. Boogie addresses this with frame properties, which are essentially postconditions stating that some part of the input state remained unchanged. For instance, we could write \forall a : addr. a != 0x1234 => memory[a] = old(memory[a]) to say that memory at all addresses other than 0x1234 did not change. This makes perfect sense in the context of verification, since the end goal is to create a verification condition (VC) and prove it with a theorem prover. These frame conditions serve as hints to the theorem prover.
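Frame properties are easy to play with even outside a theorem prover. Here is a toy, executable version of that exact property in OCaml; the map-based memory model and the hard-coded address are my own simplifications:

(* Does every address other than 0x1234 hold its old value? A sketch:
   it assumes old_mem and new_mem cover the same set of addresses. *)
module AddrMap = Map.Make (Int64)

let frame_holds (old_mem : int AddrMap.t) (new_mem : int AddrMap.t) : bool =
  AddrMap.for_all
    (fun a v -> a = 0x1234L || AddrMap.find_opt a old_mem = Some v)
    new_mem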

BAP can be used for non-verification tasks too, though. For instance, we might want to run a data flow analysis over a binary program, or optimize it. These tasks could also benefit from knowing what parts of memory can be modified by a function. However, since these tasks don’t operate on logical formulas, the frame conditions don’t really help here. Instead, we’d need to record this information more explicitly. One way to do this would be to explicitly describe address expressions that can change, like R_ESP:u32 - 0x8:u32 or 0x1234:u32.
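A sketch of what such an explicit summary might look like; this is a hypothetical design of mine, not anything BAP implements:

(* Hypothetical function summary: the address expressions a function
   may write, recorded as syntax (not a logical formula) so a dataflow
   analysis or optimizer can consume them directly. *)
type addr_exp =
  | Const of int64                 (* e.g. 0x1234 *)
  | RegOffset of string * int64    (* e.g. RegOffset ("R_ESP", -8L) *)

type mod_summary = {
  modifies_regs : string list;     (* registers the function may clobber *)
  modifies_mem  : addr_exp list;   (* memory the function may write *)
}

(* f2 from the earlier example writes R_EAX and no memory. *)
let f2_summary = { modifies_regs = [ "R_EAX" ]; modifies_mem = [] }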

Boogie also allows each function to be tagged with pre- and post-conditions. This seems like a good idea. For instance, in BAP we might add a post-condition to make sure the binary code that we thought was a function actually behaves like one. What if we could prove function-like behavior using abstract interpretation or something similar? Then we could add the constraint as a free or unchecked postcondition, which is taken as an axiom.
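A sketch of how checked versus free conditions might be attached to a function; again these are hypothetical names, not BAP or Boogie syntax:

(* A Checked condition must be proved; a Free condition was established
   elsewhere (say, by abstract interpretation) and is assumed as an
   axiom by later analyses. *)
type 'cond status = Checked of 'cond | Free of 'cond

type 'cond annotations = {
  pre  : 'cond status list;
  post : 'cond status list;
}

(* "Behaves like a function" proved once, then reused everywhere else
   as a free postcondition. *)
let f_annots = { pre = []; post = [ Free "returns to its caller" ] }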

Like Boogie, I think it’s a good idea to represent a program as a set of functions (and a designated start function). Any instructions that don’t seem to have function-like behavior can be included in the start function.

One last issue is how we identify functions. We can use debug symbols if they are present. If not, we can use guess and check. For instance, we could guess the start and end of functions by doing a quick analysis of call and ret instructions. Or, we could use randomized testing to look for ret-like behavior. Once we have a guess, we could use abstract interpretation or verification conditions to verify that our guess is right.
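A sketch of the guess step: scan the disassembly for direct call targets and treat each as a candidate function entry. The instruction type here is a stand-in of mine for real disassembly, which needs much more care (indirect calls, tail calls, data mixed into code):

(* Guess step of guess-and-check: every direct call target is a
   candidate function entry point. *)
type insn = Call of int64 | Ret | Other

let guess_entries (disasm : (int64 * insn) list) : int64 list =
  disasm
  |> List.filter_map (function (_, Call target) -> Some target | _ -> None)
  |> List.sort_uniq compare

(* In the f1/f2 example, the call in f1 makes f2's address a candidate. *)
let entries = guess_entries
    [ (0x0L, Call 0x6L); (0x5L, Ret); (0x6L, Other); (0xbL, Ret) ]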
