Hash functions are an essential part of computer science. They map arbitrary-length inputs to fixed-length outputs. There are general-purpose and cryptographic hash functions. Cryptographic hash functions are one-way functions that provide certain guarantees on the complexity of finding collisions. Non-cryptographic or general-purpose hash functions do not provide such guarantees and are used for implementing hash tables. Their output should be evenly distributed and look random so that few collisions occur.

One of those functions is called DJBX33A. It is used in the hash table implementation of Python (although as of 3.4 only for very short inputs) and in many other products. It's a very simple algorithm:

hash_0 = 5381
hash_i+1 = hash_i * 33 + c_i

Where c_i is the current input byte and all operations are performed on 32-bit integers that wrap around on overflow. This is so simple, effective, and fast it almost makes you cry. Even a trivial implementation on a modest compiler will probably yield decent performance.
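The recurrence above translates almost directly into C. A minimal sketch (not the hashx4 code itself), using unsigned arithmetic so the wraparound is well-defined:

```c
#include <stddef.h>
#include <stdint.h>

/* DJBX33A as described above:
 * hash_0 = 5381, hash_{i+1} = hash_i * 33 + c_i,
 * with 32-bit unsigned wraparound arithmetic. */
static uint32_t djbx33a(const unsigned char *data, size_t len)
{
    uint32_t hash = 5381;
    for (size_t i = 0; i < len; ++i)
        hash = hash * 33 + data[i];
    return hash;
}
```

Note that `hash * 33` is typically compiled down to a shift and an add (`(hash << 5) + hash`), which is part of why the algorithm is so cheap.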

It has only one flaw: it is hard to vectorize on MMX/SSE/NEON/... . This is due to its iterative nature, where each round depends on the output of the previous round. The auto-vectorization engines of modern C/C++ compilers can do many magical things, but this is still out of reach.

So what can be done? There is a simple mechanism that can be used here: just run the algorithm four times in parallel and distribute the input bytes round-robin over the four hash states. This effectively splits the input byte stream into four separate streams that can be hashed in parallel with vector instructions. The output is 128 bits, which can then be hashed down to a 64- or 32-bit output using regular DJBX33A.
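In scalar C, the four-way split might look like the following sketch (hypothetical function name; the real hashx4 code may differ). Byte i of the input always goes to state i % 4:

```c
#include <stddef.h>
#include <stdint.h>

/* Four interleaved DJBX33A states; byte i feeds state i % 4.
 * The four 32-bit results form the 128-bit intermediate output. */
static void x4djbx33a(const unsigned char *data, size_t len,
                      uint32_t state[4])
{
    state[0] = state[1] = state[2] = state[3] = 5381;
    size_t i = 0;
    /* main loop: one byte per state, four bytes per iteration */
    for (; i + 4 <= len; i += 4) {
        state[0] = state[0] * 33 + data[i + 0];
        state[1] = state[1] * 33 + data[i + 1];
        state[2] = state[2] * 33 + data[i + 2];
        state[3] = state[3] * 33 + data[i + 3];
    }
    /* tail: remaining 0-3 bytes continue the round-robin */
    for (size_t j = 0; i < len; ++i, ++j)
        state[j] = state[j] * 33 + data[i];
}
```

The four multiply-add chains are now independent, which is exactly what a 4-lane SIMD unit (or an out-of-order scalar core) can exploit.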

I hereby call this algorithm X4DJBX33A.

I had some time during the last days and implemented X4DJBX33A using Intel intrinsics in several instruction sets and variants. Here are some benchmark results on an Intel Core i7-4960HQ 2.6 GHz CPU with gcc-4.8 on linux/amd64:

  • DJBX33A reference C: 1198 MiB/s
  • DJBX33A optimized C: 1614 MiB/s
  • X4DJBX33A reference C: 1039 MiB/s
  • X4DJBX33A optimized C: 3119 MiB/s
  • X4DJBX33A mmx vectorized: 3484 MiB/s
  • X4DJBX33A sse2 vectorized: 6196 MiB/s
  • X4DJBX33A ssse3 vectorized: 6658 MiB/s

Not bad, huh?

I published the code and the benchmarks on github: cleeus/hashx4. I have also started looking into other hash functions that are used in important hash table implementations (namely siphash24, which is used in Python 3.4 and later).

Looking at the benchmark results, there are also some important lessons to be learned:

  • If it's not an embarrassingly parallel algorithm, make it one (and then vectorize).

The DJBX33A algorithm can be implemented to run very fast, but it is hard to vectorize. The modified version, which I call X4DJBX33A, seems well suited for 64-bit or 128-bit wide vector registers. When it is not vectorized, however, it is slower than a well-implemented vanilla DJBX33A.

  • The SSSE3 version is an experiment.

SSSE3 has a number of new integer intrinsics, among them the mighty _mm_shuffle_epi8, which can reorder all 16 bytes in a 128-bit xmm register into an arbitrary permutation. Using this opcode leads to an alternative implementation that avoids the two _mm_unpack instructions. On some CPUs this seems to be faster; on most, especially slower ones, it is not.
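To make the unpack-based approach concrete, here is a sketch (again, not the hashx4 code itself) of an SSE2 inner loop: the four hash states live in one xmm register, the *33 is done as (h << 5) + h, and the two _mm_unpack instructions zero-extend four input bytes into four 32-bit lanes. The SSSE3 variant would replace the two unpacks with a single _mm_shuffle_epi8:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void x4djbx33a_sse2(const unsigned char *data, size_t len,
                           uint32_t out[4])
{
    __m128i h = _mm_set1_epi32(5381);   /* four states in one register */
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        uint32_t chunk;
        memcpy(&chunk, data + i, 4);    /* 4 input bytes, any alignment */
        __m128i b = _mm_cvtsi32_si128((int)chunk);
        b = _mm_unpacklo_epi8(b, zero);  /* bytes  -> 16-bit lanes */
        b = _mm_unpacklo_epi16(b, zero); /* 16-bit -> 32-bit lanes */
        /* h = h * 33 + b, per lane: h*33 == (h << 5) + h */
        h = _mm_add_epi32(_mm_add_epi32(_mm_slli_epi32(h, 5), h), b);
    }
    uint32_t s[4];
    _mm_storeu_si128((__m128i *)s, h);
    /* tail bytes continue the round-robin on the scalar states */
    for (size_t j = 0; i < len; ++i, ++j)
        s[j] = s[j] * 33 + data[i];
    memcpy(out, s, sizeof s);
}
```

On little-endian x86 the unpacks place data[i+k] in lane k, so this produces the same four states as the scalar four-way version.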

  • Opcode scheduling is important, so use intrinsics.

Benchmarks of -O0 and -O1 builds have shown that even the MMX and SSE implementations get noticeably slower when not optimized. A look at the disassembled binaries of the -O2 and -O3 builds shows that the compiler reorders instructions. It probably does this to increase instruction-level parallelism and provide the CPU with useful work while it is still waiting for a memory access to complete. Using intrinsics instead of raw assembly lets the developer leverage the wisdom of the compiler.

  • The C implementation performance varies widely with the compiler.

MSVC and GCC seem to produce very different code from the same C implementations. This is not surprising, as research on auto-vectorization and code generation is still ongoing. The performance of the SSE2 version seems to be much more stable across compilers and platforms.

  • Know your instruction set.

I struggled quite a bit with the SSE2 instruction set and initially failed to produce a vectorized version that was faster than the scalar one, simply due to insufficient knowledge of the instruction set. In the end, learning all the intrinsics and their performance characteristics is what enables a developer to find a good solution.

  • Alignment matters.

Besides the reference C implementations, I produced optimized (but still plain C) versions of DJBX33A and X4DJBX33A. A major optimization was to hash the input with a simple byte-wise loop until a 16-byte alignment boundary in the input memory block was reached. Then the compiler gets a hint that the pointer is aligned to a 16-byte boundary. After the hint, an inner loop hashes 16-byte chunks while an outer loop iterates the inner loop, which preserves the alignment assumption. This assumption allows the compiler to use opcodes that rely on alignment and enables auto-vectorization.
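The alignment technique described above can be sketched like this, assuming a GCC/Clang-style compiler where __builtin_assume_aligned carries the hint (MSVC would use a different mechanism):

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t djbx33a_aligned(const unsigned char *data, size_t len)
{
    uint32_t h = 5381;
    size_t i = 0;
    /* prologue: hash byte-wise until a 16-byte boundary is reached */
    while (i < len && ((uintptr_t)(data + i) & 15u) != 0)
        h = h * 33 + data[i++];
    /* hint: from here on, the pointer is 16-byte aligned */
    const unsigned char *p =
        (const unsigned char *)__builtin_assume_aligned(data + i, 16);
    size_t n = len - i;
    size_t blocks = n / 16;
    /* outer loop iterates the inner loop; the inner loop hashes one
     * 16-byte chunk, so every chunk starts on a 16-byte boundary */
    for (size_t b = 0; b < blocks; ++b)
        for (size_t k = 0; k < 16; ++k)
            h = h * 33 + p[b * 16 + k];
    /* epilogue: remaining tail bytes */
    for (size_t k = blocks * 16; k < n; ++k)
        h = h * 33 + p[k];
    return h;
}
```

All bytes are still consumed in order, so the result is identical to the naive loop; only the compiler's view of the pointer changes.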

  • SSE2 is everywhere. Use it.

If you are on a 64-bit x86 processor, you are guaranteed to have SSE2. On 32-bit x86, every processor sold in the last 10 years has it. From an economic point of view you can probably ignore non-SSE2 x86 CPUs, or just provide a single fallback C implementation for them.
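If you do want to keep a fallback, runtime dispatch is cheap. A hypothetical sketch using GCC/Clang's __builtin_cpu_supports (the SSE2 implementation is stood in for by the C fallback here, to keep the example self-contained):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*hash_fn)(const unsigned char *, size_t);

/* plain C fallback */
static uint32_t djbx33a_c(const unsigned char *d, size_t n)
{
    uint32_t h = 5381;
    for (size_t i = 0; i < n; ++i)
        h = h * 33 + d[i];
    return h;
}

/* pick an implementation once, at startup or first use */
static hash_fn select_hash(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return djbx33a_c; /* would return the SSE2 version here */
    return djbx33a_c;     /* pre-SSE2 fallback */
}
```

In practice the selected function pointer would be cached so the CPU check runs only once.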