Lately, I’ve been thinking about binary instrumentation. Binary instrumentation is awesome. Dynamic binary instrumentation (DBI) frameworks like Pin allow you to effectively insert your own code in between a binary program’s instructions. Such a capability is obviously very powerful. In our research, we use Pin to record execution traces that we can then analyze with BAP.
Pin is a great tool. It has a nice API that makes many things easy, and I’ve never found an instruction it can’t handle. (It’s made by Intel, so it figures that they could completely model their own architecture.) Pin works by reading binary code, adding the user’s specified instrumentation, and then just-in-time (JIT) compiling the whole thing. This got me thinking: BAP can understand binary code and allows users to modify it using a visitor interface. But the BAP interpreter is really slow.
How slow? Let’s create a simple program and find out:
This program computes the sum of the first 100,000 numbers. Shouldn’t take too long to execute, right?
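As a rough C++ equivalent of the computation described here (not the original listing), the loop might look like the sketch below. It assumes that "the first 100,000 numbers" means 1 through 100,000 and uses a 64-bit accumulator so the total does not overflow; the names `sum_first` and `check_sum` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Sums the integers 1..n with an explicit counting loop.
// Assumption: a 64-bit accumulator, so the total for n = 100,000 fits.
std::uint64_t sum_first(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}

// Cross-checks the loop against the closed form n(n+1)/2.
void check_sum() {
    assert(sum_first(100000) == 100000ULL * 100001ULL / 2);
}
```

The point of such a benchmark is that the work is nothing but loop overhead, so the time it takes mostly reflects the cost per iteration of whatever is executing it.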
Ouch! Almost 30 seconds. That is fewer than 4,000 loop iterations per second. So I wondered: how difficult is it to do JIT? It’s surprisingly easy! There are plenty of JIT frameworks to choose from these days. I chose LLVM because I was already familiar with it, and because there is an OCaml LLVM interface. Because BAP and LLVM are fairly well designed, it only took me about 48 hours to implement a BAP IL to LLVM IL converter. Let’s run the code again with the JIT and see how long it takes.
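To give a feel for what an IL-to-LLVM converter does, here is a minimal sketch in C++. It is not the BAP converter: the toy IL (`Expr`, with constants, one variable `x`, addition, and multiplication) is hypothetical, and the sketch emits textual LLVM IR instead of building it through the LLVM API. The idea is the same, though: walk the IL tree and emit one LLVM instruction per IL operation, naming each intermediate result.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical toy IL: 32-bit integer expressions over a single variable x.
struct Expr {
    enum Kind { Const, Var, Add, Mul };
    Kind kind = Const;
    int value = 0;                    // used when kind == Const
    std::unique_ptr<Expr> lhs, rhs;   // used when kind == Add or Mul
};
using ExprPtr = std::unique_ptr<Expr>;

ExprPtr constant(int v) {
    ExprPtr e(new Expr());
    e->kind = Expr::Const;
    e->value = v;
    return e;
}

ExprPtr var() {
    ExprPtr e(new Expr());
    e->kind = Expr::Var;
    return e;
}

ExprPtr binop(Expr::Kind k, ExprPtr l, ExprPtr r) {
    ExprPtr e(new Expr());
    e->kind = k;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

class Emitter {
public:
    // Emits instructions for e and returns the operand that names its value.
    std::string emit(const Expr& e) {
        switch (e.kind) {
        case Expr::Const: return std::to_string(e.value);
        case Expr::Var:   return "%x";
        case Expr::Add:
        case Expr::Mul: {
            assert(e.lhs && e.rhs);
            std::string l = emit(*e.lhs);
            std::string r = emit(*e.rhs);
            std::string tmp = "%t" + std::to_string(next_++);
            body_ << "  " << tmp << " = "
                  << (e.kind == Expr::Add ? "add" : "mul")
                  << " i32 " << l << ", " << r << "\n";
            return tmp;
        }
        }
        return "";  // not reached for well-formed input
    }
    std::string body() const { return body_.str(); }

private:
    std::ostringstream body_;
    int next_ = 0;  // counter for fresh temporary names
};

// Wraps the emitted instructions in an LLVM function taking x as an argument.
std::string translate(const Expr& e) {
    Emitter em;
    std::string result = em.emit(e);
    return "define i32 @f(i32 %x) {\nentry:\n" + em.body() +
           "  ret i32 " + result + "\n}\n";
}
```

For example, translating `(x + 1) * 2`, built as `binop(Expr::Mul, binop(Expr::Add, var(), constant(1)), constant(2))`, yields a function whose body is an `add` into `%t0`, a `mul` into `%t1`, and a `ret` of `%t1`. A real converter would also have to handle control flow, memory, and differing bit widths.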
Holy smokes, that was fast! Let’s see how this got converted to LLVM IL:
And here is the x86 assembly:
One small issue is how to deal with memory: sometimes we don’t want a bad memory read or write to crash our whole evaluation. There are two modes in the BAP to LLVM conversion. The first mode does no sandboxing: a memory write in the BAP IL is translated directly to an LLVM memory write. The second mode replaces all memory operations in the BAP IL with calls to C++ functions that read and write a std::map object.
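The second mode can be sketched roughly as follows. This is an illustration under assumptions, not the code that ships with BAP: the class `SandboxMemory`, the functions `sandbox_load32` and `sandbox_store32`, the byte-granular layout, little-endian ordering, and reading unwritten addresses as zero are all choices made here for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Sandboxed guest memory: one map entry per byte that has been written.
class SandboxMemory {
public:
    void store8(std::uint64_t addr, std::uint8_t v) { bytes_[addr] = v; }

    // Assumption: addresses that were never written read as 0.
    std::uint8_t load8(std::uint64_t addr) const {
        auto it = bytes_.find(addr);
        return it == bytes_.end() ? 0 : it->second;
    }

    // 32-bit accesses are split into bytes, little-endian.
    void store32(std::uint64_t addr, std::uint32_t v) {
        for (int i = 0; i < 4; ++i) {
            store8(addr + i, static_cast<std::uint8_t>(v >> (8 * i)));
        }
    }

    std::uint32_t load32(std::uint64_t addr) const {
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i) {
            v |= static_cast<std::uint32_t>(load8(addr + i)) << (8 * i);
        }
        return v;
    }

private:
    std::map<std::uint64_t, std::uint8_t> bytes_;
};

static SandboxMemory g_memory;

// C linkage gives the helpers unmangled names, which makes them easy to
// reference as call targets from generated code.
extern "C" std::uint32_t sandbox_load32(std::uint64_t addr) {
    return g_memory.load32(addr);
}

extern "C" void sandbox_store32(std::uint64_t addr, std::uint32_t value) {
    g_memory.store32(addr, value);
}

// Small self-check: a write to an arbitrary address lands in the map
// instead of touching real process memory.
void sandbox_self_check() {
    sandbox_store32(0xdeadbeef, 0x11223344u);
    assert(sandbox_load32(0xdeadbeef) == 0x11223344u);
}
```

The trade-off is speed: every guest memory access becomes a function call plus a map lookup, whereas the unsandboxed mode compiles down to a plain load or store.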
Look for the LLVM JIT code to appear in a new BAP release coming soon!