Bits of Entropy - How Do You Prove A Program Property?

2026-03-07

It is widely known in the software world that proving semantic properties of programs is often an exercise in futility - a general solution is at least as hard as the Halting Problem (per Rice's Theorem) and once we've invoked the Halting Problem it's game over. The task is, as we say in the biz, undecidable.

Here are some examples of properties we might be interested in proving:

whether a variable is a constant
whether a conditional branch is never taken
whether a buffer is never accessed past its boundaries
etc.

Despite the odds being against us, sometimes we can prove semantic properties exactly; other times we can only approximate them - but well enough for practical use. This blog series is about Abstract Interpretation, a verification technique used to over-approximate program properties. In this first entry we'll set the stage and develop basic intuition for the use of abstractions when reasoning about programs. So let's start with an example.

The following C code implements an opaque predicate - a computation that always evaluates to the same constant. For a conditional expression this means it's always true or always false regardless of the input arguments.

#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

// Always returns an even number.
int32_t foo(int32_t x) {
    // For brevity, let's assume this doesn't overflow as signed overflow is UB in C
    return x * (x - 1);
}

bool is_even(int32_t x) {
    return (x & 1) == 0;
}

int main(int argc, char* argv[]) {
    if (argc != 2) { return 1; }

    int32_t x = atoi(argv[1]);
    if ( x == 0) { return 2; }

    // Opaque predicate: always true
    if (is_even(foo(x))) {
        printf("Even\n");
    } else {
        // This branch is never taken
        printf("Odd\n");
    }
    return 0;
}

Opaque predicates are common in adversarial code and malware as an obfuscation technique (e.g. see Takahiro Haruyama's Virus Bulletin article), though legitimate uses exist too.

The else branch in the code above is never taken, so we'd do well to remove it, thus making the program simpler and faster. But how can we be sure that foo always returns an even number?

To our human intelligence this may be obvious:

If x is even, then x * (x - 1) is even too, as multiplying by an even number always gives us an even number.
If x is odd, then (x - 1) must be even as decrementing an odd number gives us an even number, so the product is again even.

Ideally we'd like to be able to prove this automatically. And there are three broad approaches we can take.

Brute Force

Run foo against every int32_t value and observe that it never produces an odd number. That means evaluating the function over four billion times (2^32 inputs). This doesn't scale: a 64-bit argument or a second parameter puts it out of reach, so it's mentioned here for completeness only.

SMT

Use a Satisfiability Modulo Theories (SMT) solver and, if we're lucky, it should magically give us the answer. Let's implement foo in Z3, a popular SMT solver (we'll use the Python bindings as they're easier to follow than the native Lisp-like SMT-LIB syntax).

import z3
# Remember, computers are logic machines, they don't do arithmetic,
# they simulate it. Therefore we use 32-bit BitVec to represent int32_t!
x = z3.BitVec('x', 32)
foo = x * (x - 1)
# Find an x, such that `foo` is odd
z3.solve( (foo & 1) != 0)

Running this code prints no solution. This means the constraint is unsatisfiable: no 32-bit x makes foo odd, which proves that every evaluation of foo produces an even bit vector. (Bit-vector arithmetic wraps around, so this holds even when the multiplication overflows.)

The SMT-based approach to proving program properties is used in verifying compilers and Symbolic Execution with moderate success. It's not without downsides, however: it is computationally expensive and may time out on complex expressions. Symbolic Execution specifically has no baked-in strategy for dealing with loops, so its analyses don't have a clearly defined termination condition.

But what if we could have some of the SMT's rigor without the computational cost? This is where we might want to turn to abstraction.

For more information on SMT check out The Calculus of Computation by Bradley and Manna - an excellent introduction to the theory. If you want something more practical, check out angr, a popular symbolic execution framework.

Abstraction

If we don't want to use the heavy machinery of SMT solvers, we may revisit the brute force approach and see if we can make it practical. Recall that evenness means that the least significant bit (LSB) of a value is zero: 2 = 0b10, 4 = 0b100, 6 = 0b110 etc. Instead of thinking in terms of int32_t, let's introduce a new type: parity_t and a function to map or abstract our integers to it.

typedef enum {
    PARITY_EVEN = 0,
    PARITY_ODD,
    PARITY_UNKNOWN,
    // Keeps track of the number of possible values
    PARITY_NUMBER_OF_ELEMENTS,
} parity_t;

parity_t parity_abstract(int32_t x) {
    return (parity_t)(x & 1);
}

Then we can redefine the multiplication and subtraction operators in terms of parity_t:

// Abstract subtraction in the parity domain.
parity_t parity_sub(parity_t a, parity_t b);
// Abstract multiplication in the parity domain.
parity_t parity_mul(parity_t a, parity_t b);

The implementation of these two functions can be described by the following operation tables:

sub     || even    | odd     | unknown        mul     || even    | odd     | unknown 
--------------------------------------        --------------------------------------
even    || even    | odd     | unknown        even    || even    | even    | even
odd     || odd     | even    | unknown        odd     || even    | odd     | unknown
unknown || unknown | unknown | unknown        unknown || even    | unknown | unknown

Before we move further, we may want to ask ourselves whether this is a good abstraction. I think it is, for two reasons:

PARITY_EVEN represents the set of all even int32_t values, and PARITY_ODD - all the odd ones. Combined they cover the whole range of possible values, so we are not missing anything.
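Parity only depends on the LSB. For the operations foo uses (subtraction and multiplication), the LSB of the result depends only on the LSBs of the operands, so we genuinely don't care about the rest of the bit vectors involved. This is not true of every operation: the parity of x >> 1, for example, depends on bit 1 of x.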

Here's my lousy C code based on the tables above:

parity_t parity_sub(parity_t a, parity_t b) {
    if (a == PARITY_UNKNOWN || b == PARITY_UNKNOWN) {
        return PARITY_UNKNOWN;
    }
    // Other cases behave like regular subtraction (ignoring the borrow).
    // EVEN - ODD is -1 as an integer; masking with 1 maps it back to ODD.
    return (parity_t)(((int)a - (int)b) & 1);
}

parity_t parity_mul(parity_t a, parity_t b) {
    if (a == PARITY_EVEN || b == PARITY_EVEN) {
        return PARITY_EVEN;
    }

    // If we're here, a and b are either ODD or UNKNOWN
    if (a == PARITY_ODD && b == PARITY_ODD) {
        return PARITY_ODD;
    }

    return PARITY_UNKNOWN;
}

Finally, we can abstract the function foo using the abstract operations above and execute it for every possible parity_t:

parity_t foo_abstract(parity_t xa) {
    return parity_mul(xa, parity_sub(xa, parity_abstract(1)));
}

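// Helper assumed by main below (it is not shown in the original listing).
const char* parity_to_str(parity_t p) {
    switch (p) {
        case PARITY_EVEN: return "EVEN";
        case PARITY_ODD:  return "ODD";
        default:          return "UNKNOWN";
    }
}

int main(void) {
    for (parity_t xa = 0; xa < PARITY_NUMBER_OF_ELEMENTS; ++xa) {
        parity_t res = foo_abstract(xa);
        printf("foo_abstract(%s) = %s\n", parity_to_str(xa), parity_to_str(res));
    }
    return 0;
}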

This should produce the following output:

foo_abstract(EVEN) = EVEN
foo_abstract(ODD) = EVEN
foo_abstract(UNKNOWN) = UNKNOWN

Since we agreed earlier that this is a good abstraction, we have now proven that for every definite parity - both EVEN and ODD - the result is EVEN. And since PARITY_EVEN and PARITY_ODD cover the full range of int32_t, we have considered all possibilities.

UNKNOWN here means the value can be either odd or even - we just don't have enough information to tell. And, as we can see, our abstract domain is not powerful enough to figure out that foo_abstract(PARITY_UNKNOWN) must also return EVEN. This is because our parity_* functions treat their operands as independent: parity_sub(UNKNOWN, ODD) yields UNKNOWN, and parity_mul then sees two UNKNOWN operands without "knowing" that x and x - 1 are related, so it outputs UNKNOWN. We'll talk about relational abstract domains at some point, but not just yet.

A full parity analysis of, say, C would require implementing a parity_* variant for every operation the language supports. This is cumbersome, but sometimes an analysis like this has to be performed on the abstract syntax tree. Other times it is done at the level of an intermediate representation (IR). For example, modern compilers rarely generate assembly code directly; they typically generate some IR (e.g. LLVM IR), on which they perform analyses and optimizations, and then translate it to the underlying assembly code. Not only are IRs simple (usually under 100 operations), they also come with accompanying translators that support most common architectures. Similarly, reverse engineering tools lift machine code to an IR first. Later in the series we'll be using Ghidra and taking advantage of its excellent IR, P-code.