Hamming Weight: Reinventing an Efficient Implementation
I came across this wonderful algorithm not long ago (source):
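In Java it reads roughly like this (a sketch of the usual shift-and-mask form; the method name and formatting here are mine):

```java
long hammingWeight(long x) {
    x = (x & 0x5555555555555555L) + ((x >>> 1) & 0x5555555555555555L);
    x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
    x = (x & 0x0F0F0F0F0F0F0F0FL) + ((x >>> 4) & 0x0F0F0F0F0F0F0F0FL);
    x = (x & 0x00FF00FF00FF00FFL) + ((x >>> 8) & 0x00FF00FF00FF00FFL);
    x = (x & 0x0000FFFF0000FFFFL) + ((x >>> 16) & 0x0000FFFF0000FFFFL);
    x = (x & 0x00000000FFFFFFFFL) + ((x >>> 32) & 0x00000000FFFFFFFFL);
    return x;
}
```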
It's a common implementation of Population Count, a.k.a. Hamming Weight, which simply counts the number of ones in the binary representation of a number. Initially, I couldn't quite grok how this sequence of arithmetic and logical operations achieves that. So, I thought it would be fun to reinvent this implementation by starting from the simplest, most straightforward, and inefficient version, and gradually working my way to something that looks like the Java code above. If this sounds like fun to you, give it a go before reading on, and then see if you've arrived at the solution the same way.
To keep things simple, let's approximate the time complexity by counting the total number of operations used (+'s, &'s and >>'s). The efficient implementation above has 24 operations; that's our goal.
Here is my first attempt:
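Sketched in Java (names are placeholders), it just peels off one bit at a time:

```java
long hammingWeight(long x) {
    long total = 0;
    for (int i = 0; i < 64; i++) {
        total += x & 1;   // add the lowest bit to the running count
        x >>>= 1;         // move the next bit into the lowest position
    }
    return total;
}
```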
At each step of the loop we have one addition, one shift, and one AND, which makes 64 * 3 = 192 operations -- eight times slower than the target!
The obvious way to optimize this is to do a divide-and-conquer kind of thing: a 64-bit number can be viewed as two 32-bit numbers glued together. Then total also serves as two concatenated 32-bit numbers, each counting the ones in the corresponding 32-bit half. We'll just need to do a tiny bit of extra work at the end to add the two halves of total together. A 32-bit number can have at most 32 ones, and the number 32 can be represented by just 6 bits, which means we don't have to worry about the rightmost half of total spilling over into the leftmost one.
Here's what it looks like:
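A sketch of this version, assuming a mask with a single one at the bottom of each 32-bit half:

```java
long hammingWeight(long x) {
    long total = 0;
    for (int i = 0; i < 32; i++) {
        total += x & 0x0000000100000001L;  // count into both 32-bit halves at once
        x >>>= 1;
    }
    // add the two 32-bit halves of total together
    return (total & 0xFFFFFFFFL) + ((total >>> 32) & 0xFFFFFFFFL);
}
```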
This uses just 100 operations, nearly two times faster than our previous attempt!
Now that we've tasted success, nothing stops us from further dividing the task. After all, a 64-bit number can be viewed as four 16-bit numbers glued together. But wait... by the same logic we can divide a 64-bit number into eight 8-bit numbers, 16 four-bit numbers, 32 two-bit numbers, or 64 one-bit numbers. However, the more we subdivide, the more work we'll need to do post-loop. And when we subdivide by 64 we're effectively back at the original slow implementation.
To drive this point home here's the subdivision by four, which takes 60 operations:
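A sketch, assuming the post-loop work is itself written as a second loop over the four 16-bit parts:

```java
long hammingWeight(long x) {
    long total = 0;
    for (int i = 0; i < 16; i++) {
        total += x & 0x0001000100010001L;  // count into all four 16-bit parts at once
        x >>>= 1;
    }
    long result = 0;
    for (int i = 0; i < 4; i++) {
        result += total & 0xFFFF;          // add up the four 16-bit counts
        total >>>= 16;
    }
    return result;
}
```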
Now let's look at subdivision by 32, which takes 112 operations, so we're not even improving anymore:
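A sketch of the shape of this version (exactly how the pre-loop step is written affects the precise operation count):

```java
long hammingWeight(long x) {
    // the first loop collapses into a couple of statements
    // that build the 32 two-bit counts
    long total = x & 0x5555555555555555L;
    total += (x >>> 1) & 0x5555555555555555L;
    // the second loop adds the 32 two-bit counts together
    long result = 0;
    for (int i = 0; i < 32; i++) {
        result += total & 0b11;
        total >>>= 2;
    }
    return result;
}
```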
We no longer need the first loop, since it amounts to just two operations, but the second loop now runs for 32 steps, so no wonder it's getting slower again. Here we're counting the number of ones in every two-bit subdivision of the 64-bit number, and then the loop adds all those counts together. But what if we take those two-bit counts and, before going into the loop, add them together into four-bit subdivisions? This will shorten the loop and improve the performance somewhat.
To illustrate what I mean, here's an example. Consider the 8-bit number 147 (0b10010011). It has 4 ones. Subdivide it into four two-bit parts: 10-01-00-11, and count the number of ones in each part: 01-01-00-10 (or 1-1-0-2 in decimal).
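In code that step might look like this (using an 8-bit example value in an int):

```java
int x = 0b10010011;                               // 147
x = (x & 0b01010101) + ((x >>> 1) & 0b01010101);  // per-pair counts
// x is now 0b01010010, i.e. 1-1-0-2
```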
Now let's take the result 01-01-00-10 and add the first and the second two-bit parts together into a four-bit nibble: 01 + 01 = 0010; do the same with the third and the fourth parts: 00 + 10 = 0010. Computationally it's almost the same operation as above, except we shift everything by two bits and AND (&) with 0b00110011:
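Sketched the same way:

```java
x = (x & 0b00110011) + ((x >>> 2) & 0b00110011);  // per-nibble counts
// x is now 0b00100010, i.e. 2-2
```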
We get two 4-bit nibbles, 0010-0010 (or 2-2 in decimal). Then we can just add these two nibbles together to get 4 -- the correct Hamming Weight of 147.
Let's apply this to our implementation:
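A sketch: two loop-free combining steps build the 4-bit nibble counts, and the loop only has to add up 16 nibbles:

```java
long hammingWeight(long x) {
    // build the two-bit counts, then combine them into four-bit (nibble) counts
    long total = (x & 0x5555555555555555L) + ((x >>> 1) & 0x5555555555555555L);
    total = (total & 0x3333333333333333L) + ((total >>> 2) & 0x3333333333333333L);
    long result = 0;
    for (int i = 0; i < 16; i++) {
        result += total & 0xF;   // add up the 16 nibble counts
        total >>>= 4;
    }
    return result;
}
```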
Notice that the algorithm is looking more and more like the efficient implementation, and it takes 56 operations -- we're improving again! We can keep applying this idea until we've gotten rid of the loop entirely. Our 4-bit nibbles can be grouped into bytes (8 bits), those into half-words (16 bits), those into words (32 bits), and those into a single 64-bit number holding the result. It's all the same operation; the only thing changing is the size of the shift and the mask. And in the end we get this:
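Sketched out, each of the six combining lines is four operations (a shift, two ANDs, and an addition), for 24 in total:

```java
long hammingWeight(long x) {
    x = (x & 0x5555555555555555L) + ((x >>> 1) & 0x5555555555555555L);   // 2-bit counts
    x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);   // 4-bit counts
    x = (x & 0x0F0F0F0F0F0F0F0FL) + ((x >>> 4) & 0x0F0F0F0F0F0F0F0FL);   // 8-bit counts
    x = (x & 0x00FF00FF00FF00FFL) + ((x >>> 8) & 0x00FF00FF00FF00FFL);   // 16-bit counts
    x = (x & 0x0000FFFF0000FFFFL) + ((x >>> 16) & 0x0000FFFF0000FFFFL);  // 32-bit counts
    x = (x & 0x00000000FFFFFFFFL) + ((x >>> 32) & 0x00000000FFFFFFFFL);  // full 64-bit count
    return x;
}
```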
And we're done.
Was this useful? In the grand scheme of things probably not, but it was a fun little puzzle.