Bits of Entropy

Hamming Weight: Reinventing Efficient Implementation


I came across this wonderful algorithm not long ago (source):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public long evaluateUnary(int sizeout, int sizein, long val) {
    val = (val & 0x5555555555555555L) + ((val >>> 1) & 0x5555555555555555L);
    val = (val & 0x3333333333333333L) + ((val >>> 2) & 0x3333333333333333L);
    val = (val & 0x0f0f0f0f0f0f0f0fL) + ((val >>> 4) & 0x0f0f0f0f0f0f0f0fL);
    val = (val & 0x00ff00ff00ff00ffL) + ((val >>> 8) & 0x00ff00ff00ff00ffL);
    val = (val & 0x0000ffff0000ffffL) + ((val >>> 16) & 0x0000ffff0000ffffL);
    int res = (int) (val & 0xff);
    res += (int) ((val >> 32) & 0xff);
    return res;
}

It's a common implementation of Population Count, a.k.a. Hamming Weight, which simply counts the number of ones in the binary representation of a number. Initially, I couldn't quite grok how this sequence of arithmetic and logical operations achieves that. So, I thought it would be fun to reinvent this implementation by starting from the simplest, most straightforward, and inefficient version, and gradually working my way to something that looks like the Java code above. If this sounds like fun to you, give it a go before reading on, and then see if you've arrived at the solution the same way.

To keep things simple let's approximate the time complexity by counting the total number of operations used (+-s, &-s and >>-s). The efficient implementation above has 24 operations, that's our goal.

Here is my first attempt:

1
2
3
4
5
6
7
def popcount0(x: int) -> int:
  total = 0  # Running total of ones
  # Loop for 64 steps
  for i in range(64):
    # Isolate the n-th bit and add to the running total
    total += (x >> i) & 1
  return total

Upon each step of the loop we have one addition, one shift and one AND, which makes 64 * 3 = 192 operations -- 8 times slower than the target!

The obvious way to optimize this is to do a divide-and-conquer kind of thing: a 64-bit number can be viewed as two 32-bit numbers glued together. Then total also serves as two concatenated 32-bit numbers counting ones in the corresponding 32-bit halves. We'll just need to do a tiny bit of extra work in the end to add the two halves of total together. A 32-bit number can have at most 32 ones. The number 32 can be represented by just 6 bits, which means that we don't have to worry about the rightmost half of total spilling over into the leftmost one.

Here's what it looks like:

1
2
3
4
5
6
7
8
9
def popcount1(x: int) -> int:
  total = 0
  for i in range(32):
    # Shift, then isolate 32nd and 64th bits and add to the total
    total += (x >> i) & 0x0000000100000001
  # Add the two halves of `total` together. 
  # Number 0x3f (0b111111) is used to isolate the last
  # 6 bits of each 32-bit half.
  return (total & 0x3f) + ((total >> 32) & 0x3f)

This uses just 100 operations, nearly two times faster than our previous attempt!

Now that we've tasted success, nothing stops us from further dividing the task. After all a 64-bit number can be viewed as four 16-bit numbers glued together. But wait... by the same logic we can divide a 64-bit number into eight 8-bit numbers, or 16 four-bit numbers or 32 two-bit numbers, or 64 one-bit numbers. However the more we subdivide t the more work we'll need to do post-loop. And when we subdivide by 64 we're effectively back at the original slow implementation.

To drive this point home here's the subdivision by four, which takes 60 operations:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
def popcount2(x: int) -> int:
  total = 0
  for i in range(16):
    total += (x >> i) & 0x0001000100010001
  # At this point post-loop extra compupation has
  # gotten a bit too verbose, so we need another loop.
  # return (total & 0x1f) + ((total >> 16) & 0x1f) \
  #   + ((total >> 32) & 0x1f) \
  #   + ((total >> 48) & 0x1f)
  # The second loop 
  total_ = 0
  for i in range(0, 64, 16):
    total_ += (total >> i) & 0x1f
  return total_

Now let's look at subdivision by 32, which takes 112 operations, so we're not even improving anymore:

1
2
3
4
5
6
7
8
def popcount5(x: int) -> int:
  # 0x55.. is 0x01010101..
  total = (x & 0x5555555555555555) + ((x>>1) & 0x5555555555555555)

  total_ = 0
  for i in range(0, 64, 2):
    total_ += (total >> i) & 3
  return total_

We no longer need the first loop as it's just two operations, but the second loop now runs for 32 steps, no wonder it's getting slower again. Here we're counting the number of ones in every two-bit subdivision of a 64-bit number and then the loop adds all those counts together. But what if we take those two-bit counts and, before going into the loop, add them together into four-bit subdivisions? This will shorten the loop and somewhat improve the performance.

To illustrate what I mean, here's an example. Consider an 8-bit number 147 (0b10010011). It has 4 ones. Subdivide it into four two-bit parts: 10-01-00-11 and count the number of ones in each part: 01-01-00-10 (or 1-1-0-2 in decimal).

1
2
3
4
  10010011 --> (>>=1) -->  01001001
& 01010101               & 01010101
----------               ----------
  00010001       +         01000001 = 01010010

Now let's take the result 01-01-00-10 and add the first and the second two-bit parts together into a four-bit nibble: 01 + 01 = 0010; do the same with the third and the fourth two-bits: 00 + 10 = 0010. Computationally it's almost the same operation as the above except we shift everything by two bits and AND (&) by 0b00110011:

1
2
3
4
  01010010 --> (>>=2) --> 00010100
& 00110011              & 00110011
----------              ----------
  00010010       +        00010000 = 00100010

We get two 4-bit nibbles 0010-0010 (or 2-2 in decimal). Then we can just add these two nibbles together to get 4 -- the correct Hamming Weight of 147.

Let's apply this to our implementation:

1
2
3
4
5
6
7
8
9
def popcount6(x: int) -> int:
  total = (x & 0x5555555555555555) + ((x>>1) & 0x5555555555555555)
  # 0x33.. is 0b011011..
  total = (total & 0x3333333333333333) + ((total>>2) & 0x3333333333333333)

  total_ = 0
  for i in range(0, 64, 4):
    total_ += (total >> i) & 0b1111
  return total_

And notice that the algorithm is looking more and more like the efficient implementation and it takes 56 operations -- we're improving again! We can apply this idea until we've gotten rid of the loop entirely. Our 4-bit nibbles can be grouped into bytes (8 bits), those into half-words (16 bits), those ones into words (32 bits) and those into a single 64-bit number holding the result. It's all the same operation, the only thing changing is the size of the shift and the multiplier. And in the end we get this:

1
2
3
4
5
6
7
def popcount9(x: int) -> int:
  total = (x & 0x5555555555555555) + ((x>>1) & 0x5555555555555555)
  total = (total & 0x3333333333333333) + ((total>>2) & 0x3333333333333333)
  total = (total & 0x0f0f0f0f0f0f0f0f) + ((total>>4) & 0x0f0f0f0f0f0f0f0f)
  total = (total & 0x00ff00ff00ff00ff) + ((total>>8) & 0x00ff00ff00ff00ff)
  total = (total & 0x0000ffff0000ffff) + ((total>>16) & 0x0000ffff0000ffff)
  return (total & 0xffff) + ((total>>32) & 0xffff)

And we're done.

Was this useful? In the grand scheme of things probably not, but it was a fun little puzzle.