How Do You Prove A Program Property?
It is widely known in the software world that proving semantic properties of programs is often an exercise in futility - a general solution is at least as hard as the Halting Problem (per Rice's Theorem) and once we've invoked the Halting Problem it's game over. The task is, as we say in the biz, undecidable.
Here are some examples of properties we might be interested in proving:
- whether a variable is a constant
- whether a conditional branch is never taken
- whether a buffer is never accessed past its boundaries
- etc.
Despite the odds being against us, sometimes we can prove semantic properties exactly; other times we can only approximate them - but well enough for practical use. This blog series is about Abstract Interpretation, a verification technique used to over-approximate program properties. In this first entry we'll set the stage and develop basic intuition for the use of abstractions when reasoning about programs. So let's start with an example.
The following C-code implements an opaque predicate - a computation that evaluates to a constant. For a conditional expression this means it's always true or always false regardless of the input arguments.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | |
Opaque predicates are common in adversarial code and malware as an obfuscation technique (e.g. see Takahiro Haruyama's VirusBulletin article), though legitimate uses exist too.
The else branch in the code above is never taken, so we'd do well to remove it, thus making the program simpler and faster. But how can we be sure that foo always returns an even number?
To our human intelligence this may be obvious:
- If
xis even, thenx * (x - 1)is even too, as multiplying by an even number always gives us an even number. - If
xis odd, then(x - 1)must be even as decrementing an odd number gives us an even number.
Ideally we'd like to be able to prove this automatically. And there are three broad approaches we can take.
Brute Force
Basically run foo against all int32_t values and observe that it never produces an odd number. We'd have to run the program over four billion times - totally impractical, doesn't scale and mentioned here for completeness only.
SMT
Use a Satisfiability Modulo Theories (SMT) solver and, if we're lucky, it should magically give us the answer. Let's implement foo in Z3, a popular SMT solver (we'll use Python bindings as it's easier to follow compared to the original Lisp-like syntax).
1 2 3 4 5 6 7 | |
Running this code prints out no solution - this tells us that Z3 couldn't find a counterexample, which effectively proves that all evaluations of foo produce even bit vectors.
SMT-based approach to proving program properties is used in verifying compilers and Symbolic Execution with moderate success. It's not without downsides, however: it is computationally expensive and may time out on complex expressions. Symbolic Execution specifically does not have a baked-in strategy to deal with loops so its analyses don't have a clearly defined termination condition.
But what if we could have some of the SMT's rigor without the computational cost? This is where we might want to turn to abstraction.
For more information on SMT check out The Calculus of Computation by Bradley and Manna - an excellent introduction into the theory. If you want something more practical, check out angr, a popular symbolic execution framework.
Abstraction
If we don't want to use the heavy machinery of SMT solvers, we may revisit the brute force approach and see if we can make it practical. Recall that evenness means that the least significant bit (LSB) of a value is zero: 2 = 0b10, 4 = 0b100, 6 = 0b110 etc. Instead of thinking in terms of int32_t, let's introduce a new type: parity_t and a function to map or abstract our integers to it.
1 2 3 4 5 6 7 8 9 10 11 | |
Then we can redefine the multiplication and subtraction operators in terms of parity_t:
1 2 3 4 | |
The implementation of these two functions can be described by the following operation tables:
1 2 3 4 5 | |
Before we move further, we may want to ask ourselves whether this is a good abstraction. I think it is, for two reasons:
PARITY_EVENrepresents the set of all evenint32_tvalues, andPARITY_ODD- all the odd ones. Combined they cover the whole range of possible values, so we are not missing anything.- Parity only depends on the LSB. For any arithmetic or logical operation we genuinely don't care about the rest of the bit vectors involved.
Here's my lousy C code based on the tables above:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | |
Finally, we can abstract the function foo using the abstract operations above and execute it for every possible parity_t:
1 2 3 4 5 6 7 8 9 10 | |
This should produce the following output:
1 2 3 | |
Since we agreed earlier that this is a good abstraction, we have now proven that for every definite parity - both EVEN and ODD - the result is EVEN. And since PARITY_EVEN and PARITY_ODD cover the full range of int32_t, we have considered all possibilities.
UNKNOWN here is abstract (as opposed to the other two, which are concrete) - meaning it can be either odd or even, we just don't have enough information to tell. And, as we can see, our abstract domain is not powerful enough to figure out that foo_abstract(PARITY_UNKNOWN) must also return EVEN. This is because our parity_* functions treat their operands as independent. For instance, parity_mul doesn't "know" that x and x - 1 are related, and as such it outputs UNKNOWN. We'll talk about relational abstract domains at some point, but not for a minute.
A full parity analysis of, say, C would require implementing a parity_* variant for every operation the language supports, which is cumbersome, so nobody actually does that. Instead, this kind of analysis is done at the level of intermediate representation (IR). Modern compilers rarely generate assembly code directly, they typically generate some IR (e.g. LLVM), on which they perform analyses and optimizations, and then translate it to the underlying assembly code. Not only are IRs simple (usually under 100 operations), they also have accompanying translators that support most common architectures. Similarly, reverse engineering tools lift machine code to an IR. Later in the series we'll be using Ghidra and taking advantage of its excellent IR Pcode.