I’m one of the main developers of BAP, the Binary Analysis Platform. BAP’s design is borrowed from modern optimizing compilers like LLVM and GCC. At the heart of these compilers is a language- and architecture-independent intermediate language (IL). An independent IL is important because it connects all the pieces of the compiler together.
The parser of a compiler reads high-level language code (like Java or C) and converts it to IL. Then, the IL is usually optimized, before a code generation algorithm converts the IL to platform-specific assembly code. The nice thing about an independent IL is that the parser doesn’t need to know anything about the optimization or code generation routines, because they all speak the same “language”: IL!
Although BAP isn’t really a compiler, it looks a lot like one. Instead of parsing high-level languages, it reads binary code. This binary code is then lifted to IL, and then analyzed. (I recently wrote a code generator for BAP too, but that’s not publicly available yet.)
The BAP IL is designed to be really simple. In fact, there are only a few kinds of statements:
- Assignments. A variable is assigned the value of an expression.
- Labels. A label names a line in the IL so that it can be jumped to.
- Jumps and conditional jumps.
- Special. Represents system calls and other special events.
Because there are so few types of statements in our IL, it’s really easy to analyze. Imagine writing an analysis that works directly on x86 assembly. You’d have to implement your analysis for hundreds of instructions. Yuck. With BAP, you just write your analysis on the IL, which only has a few types of statements. And, the BAP IL is not tied to any one architecture, so your analysis will probably work on ARM as well as x86 if you’re careful.
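To see why so few statement kinds makes analysis easy, here is a hypothetical sketch (not BAP’s actual data structures; the tags and helper names are made up for illustration) of an IL like the one described above, and an analysis that only has to handle a handful of cases:

```python
# Hypothetical sketch (not BAP's actual data structures) of an IL with
# only a few statement kinds.  Statements are modeled as tagged tuples.
#
# expressions: ("var", name) | ("int", n) | ("binop", op, lhs, rhs)
# statements:  ("move", var, exp)            assignment
#              ("label", name)               names a line for jumps
#              ("jmp", tgt) / ("cjmp", cond, t_true, t_false)
#              ("special", text)             system calls etc.

def vars_of_exp(exp):
    """Collect the variables an expression reads."""
    tag = exp[0]
    if tag == "var":
        return [exp[1]]
    if tag == "int":
        return []
    if tag == "binop":
        return vars_of_exp(exp[2]) + vars_of_exp(exp[3])
    raise ValueError(f"unknown expression kind: {tag}")

def var_written(stmt):
    """The variable a statement writes, if any: only assignments write."""
    return stmt[1] if stmt[0] == "move" else None
```

An analysis written against these few cases works on anything lifted to the IL, whether the original code was x86 or ARM.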
Let’s look at some BAP IL. Here’s a simple sequence of x86 assembly instructions:
[x86 assembly listing: 6 lines]
Below is the BAP IL for this assembly code. It might seem overly verbose to you, but this actually makes things easier.
[BAP IL listing: 44 lines]
BAP is very good at representing the effects of binary code such as the snippet shown above. Notice that it explicitly computes the values of all of the processor’s status flags that change, since these are used for control flow. It can also represent an entire binary program in terms of the program’s control flow; this is called a Control Flow Graph (CFG). Here’s what the CFG for the above assembly program looks like:

Each edge represents a possible transfer of control flow.
We can see from the CFG that even though there is a conditional jump in the assembly code at address 0x4, the destination will be the same regardless of whether the jump is taken! (This is because I made a quick toy example; this would not usually happen.)
What happens when we start to reason about functions? Let’s say we use this assembly file:
[assembly listing: 4 lines]
We get the following CFG:
But something has gone wrong! In the assembly code, we see a call to f2, which then executes a ret instruction. But in the CFG, control reaches f2, and then the ret jumps to BB_Indirect. BB_Indirect essentially means we can’t narrow down exactly where the jump will go. This makes sense for a ret instruction, because a well-behaved function should return control to its caller. Since there could be multiple call sites, we can’t narrow the target of the ret down to a single address.
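One hypothetical way to model this (not BAP’s actual representation; the node names and helper are illustrative) is to give the CFG a distinguished BB_Indirect node that unresolved jumps point to:

```python
# Hypothetical CFG sketch: jumps whose target can't be resolved, such as
# a ret with unknown call sites, get an edge to a special BB_Indirect node.
BB_INDIRECT = "BB_Indirect"

# Edges as (source, destination) pairs.  f2's ret can't be narrowed to a
# single address, so its edge points at BB_INDIRECT.
edges = [
    ("f1_entry", "f2_entry"),
    ("f2_entry", BB_INDIRECT),
]

def unresolved_jumps(edges):
    """Sources whose outgoing edge has an unknown destination."""
    return [src for src, dst in edges if dst == BB_INDIRECT]
```

Any analysis walking this graph must conservatively assume an edge into BB_Indirect can lead anywhere.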
In some situations, this generic treatment of jumps is desirable. For instance, it allows us to reason about buffer overflows, because it does not assume functions always return to their callers. However, in some scenarios, we may know (or hope) that functions are well behaved, and we should be able to use this knowledge to make analysis easier.
Representing Functions in BAP
First, I read over the Boogie paper to see what ideas we could borrow from it. Boogie does a ton of things, and so it is a superset of the BAP IL, and most likely always will be. Even so, since its goal is to enable verification, and some of the work we do in BAP is verification or verification-like, we can benefit from some of its design decisions.
At a high level, all code in Boogie happens in functions. There are global variables, and local variables that are accessible only by the function they are defined in. Interestingly, functions that modify global variables must explicitly use a Modifies clause to indicate that they change the variable. This obviously helps scale inter-function analysis. However, imagine a function that modifies a single variable representing all of memory. The Modifies clause doesn’t distinguish between memory being completely zeroed out and a single bit being changed. Boogie addresses this with frame properties, which are essentially postconditions stating that some part of the input state remained unchanged. For instance, we could write \forall a : addr. a != 0x1234 => memory[a] = old(memory[a]) to say that memory at all addresses other than 0x1234 did not change.
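To make the frame property concrete, here is a hypothetical check (not Boogie’s machinery; the function name and memory encoding are assumptions) that memory is unchanged outside a set of permitted addresses, mirroring the formula above:

```python
# Hypothetical sketch of checking a frame property: memory may differ
# from its old value only at the addresses listed in may_change,
# mirroring: forall a. a != 0x1234 => memory[a] = old(memory[a]).
def frame_holds(old_mem, new_mem, may_change):
    """old_mem and new_mem map addresses to byte values."""
    for addr, value in old_mem.items():
        if addr in may_change:
            continue  # this address is allowed to change
        if new_mem.get(addr) != value:
            return False  # an address outside the frame changed
    return True
```

In actual verification this condition would be discharged symbolically by a theorem prover rather than checked over concrete states, but the meaning is the same.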
This makes perfect sense in the context of verification, since the end goal is to create a verification condition (VC) and prove it with a theorem prover. These frame conditions serve as hints to the theorem prover.
BAP can be used for non-verification tasks too, though. For instance, we might want to run a data flow analysis over a binary program, or optimize it. These tasks could also benefit from knowing what parts of memory can be modified by a function. However, since these tasks don’t operate on logical formulas, the frame conditions don’t really help here. Instead, we’d need to record this information more explicitly. One way to do this would be to explicitly describe the address expressions that can change, like R_ESP:u32 - 0x8:u32.
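A sketch of what recording this might look like (the summary shape and expression encoding are assumptions for illustration, not BAP’s API):

```python
# Hypothetical per-function summary listing the address expressions the
# function may write, e.g. R_ESP:u32 - 0x8:u32.  Expressions are tagged
# tuples, as a stand-in for a real IL expression type.
f2_summary = {
    "name": "f2",
    "may_modify": [("binop", "-", ("var", "R_ESP"), ("int", 0x8))],
}

def may_write(summary, addr_exp):
    """True if the summary admits a write to this address expression."""
    return addr_exp in summary["may_modify"]
```

A data flow analysis could then avoid invalidating its facts about memory locations that a called function’s summary says are never written.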
Boogie also allows each function to be tagged with pre- and postconditions. This seems like a good idea. For instance, in BAP we might add a postcondition to make sure the binary code that we thought was a function actually behaves like a function. What if we could prove function-like behavior using abstract interpretation or something similar? Then we could add the constraint as a free or unchecked postcondition, which is taken as an axiom.
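For instance, a function spec might be recorded like this (a hypothetical shape borrowing Boogie’s free-postcondition idea; the field names and formula string are made up):

```python
# Hypothetical function spec: postconditions carry a "free" flag meaning
# "assumed as an axiom rather than checked", as in Boogie.
f2_spec = {
    "name": "f2",
    "requires": [],
    "ensures": [
        # proved separately (e.g. by abstract interpretation), so free
        {"formula": "returns_to_caller", "free": True},
    ],
}

def axioms(spec):
    """The postconditions an analysis may assume without re-checking."""
    return [c["formula"] for c in spec["ensures"] if c["free"]]
```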
As in Boogie, I think it’s a good idea to represent a program as a set of functions (plus a designated start function). Any instructions that don’t appear to have function-like behavior can be placed in the start function.
One last issue is how we identify functions. We can use debug symbols if they are present. If not, we can use guess and check. For instance, we could guess the start and end of functions by doing a quick analysis of ret instructions. Or, we could use randomized testing to look for ret-like behavior. Once we have a guess, we could use abstract interpretation or verification conditions to verify that our guess is right.