
A hobby of mine seems to be authentication protocols. Getting an authentication protocol right is surprisingly hard both in theory and practice. And I like pulling them apart and understanding them.

The Authentication Mechanism Trilemma

As I already wrote in an older blog post about CRAM-MD5, an authentication protocol should fulfill the following non-functional requirements:

  1. It does not reveal any secret on the communication channel.
  2. The server stores nothing that would allow an attacker to recover a secret.
  3. It must be simple to implement.

As said in my CRAM-MD5 post, it seems to be a "choose two" triangle situation (sometimes called a trilemma). Complicated protocols like Kerberos might be good at 1. and 2., but you won't implement them in a day. Asymmetric cryptography is also pretty complicated and requires the user to manage key material.

Plaintext authentication like HTTP BasicAuth or SSH password authentication allows the server to store only strongly hashed secrets (2.) and is simple to implement (3.), but of course the plaintext password goes over the wire (if the underlying encrypted channel is protected, that may be ok, though).

CRAM-MD5, on the other hand, is good at 1. and 3. - it's a challenge-response based mechanism that can be implemented on top of only an existing MD5 implementation in about a day - it's not much more than an HMAC over a nonce. But it requires the server to store a secret that can be used to create the authentication proofs. That stored secret is only an unsalted MD5 hash of the password - which can easily be cracked nowadays. Using a different hash function won't fix that problem because the specification has no room for salts.

SCRAM

Recently I stumbled upon SCRAM, which is a big improvement over CRAM. SCRAM stands for "Salted Challenge Response Authentication Mechanism". It's specified in RFC5802 for SASL and RFC7804 for HTTP. The complexity is low enough that you can reasonably implement it in a few days. All you need is a hash function plus HMAC and PBKDF2 implementations on top of it. HMAC and PBKDF2 are simple enough to implement on your own, if you have to.

Through a few nice crypto tricks SCRAM solves the trilemma (not perfectly, but to some degree).

Key Setup

When the user registers at the server or otherwise sets up a new password, the server derives a client specific StoredKey and puts it together with some parameters into a database as ServerRecord.

SaltedPassword  := Hi(password, salt, ic)
ClientKey       := HMAC(SaltedPassword, "Client Key")
ServerKey       := HMAC(SaltedPassword, "Server Key")
StoredKey       := H(ClientKey)

ServerRecord    := StoredKey,ServerKey,salt,ic

Let's go through these steps, one by one:

First we calculate SaltedPassword by passing the password through an iterative password hashing function Hi. The specific function specified here is PBKDF2 with HMAC as the underlying hash function. It uses the parameters salt which is a user specific random string and ic which is the iteration count.

Then we derive the child keys ClientKey and ServerKey by applying two fixed one-way functions on the parent key SaltedPassword. These one-way functions are simple HMACs with the known "secrets" "Client Key" and "Server Key". Everyone who knows SaltedPassword can recreate ClientKey and ServerKey. But you cannot recover the parent key SaltedPassword from either of the child keys. This, by the way, is one neat crypto trick to remember. :)

ServerKey is later used by the server to prove to the client that the server knows about the same password as the client. The ServerRecord is what the server needs to put into its database so that it can later check proofs from clients and issue proofs to clients.

The structure of the key setup is a little key-tree where each branch is a key derivation with a one-way function.

         password    salt,ic
             |         |
             +----+----+
                  |
                  v
       +----SaltedPassword----+
       |                      |
       v                      v
   ClientKey              ServerKey
       |
       v
   StoredKey

Notice how the server only stores one part of the root (salt,ic) and the leaves (StoredKey and ServerKey).
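The key setup can be sketched with Python's standard hashlib and hmac modules. This is a simplified illustration, not an RFC-conformant implementation: I'm assuming SHA-256 as the hash function (as in SCRAM-SHA-256; RFC5802 uses SHA-1), and the function name scram_server_record is made up for this post.

```python
import hashlib
import hmac

def scram_server_record(password: bytes, salt: bytes, ic: int):
    # SaltedPassword := Hi(password, salt, ic) -- Hi is PBKDF2 with HMAC
    salted_password = hashlib.pbkdf2_hmac("sha256", password, salt, ic)
    # derive the child keys with fixed one-way functions
    client_key = hmac.new(salted_password, b"Client Key", hashlib.sha256).digest()
    server_key = hmac.new(salted_password, b"Server Key", hashlib.sha256).digest()
    # the server stores only H(ClientKey), never ClientKey itself
    stored_key = hashlib.sha256(client_key).digest()
    # ServerRecord := StoredKey,ServerKey,salt,ic
    return stored_key, server_key, salt, ic
```

Note that the password itself never needs to be stored, and recreating ClientKey from the StoredKey would require inverting the hash.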

Challenge Response

The next step is to construct a challenge that will be "signed" by both server and client. To create the challenge, the client first sends the server its requested username and a random nonce. The server responds with the salt, the iteration count ic, and another random nonce. All of this together forms the AuthMessage:

AuthMessage     := username,client-nonce,salt,ic,server-nonce

Then the client creates the proof for the server:

ClientSignature := HMAC(StoredKey, AuthMessage)
ClientProof     := ClientKey XOR ClientSignature

Or as an ascii art drawing:

server  client                                                                                      
 nonce   nonce   username  salt,ic   password
    |      |         |      |  |       |
    +------+----+----+------+  +---+---+
                |                  |
                |                  v
                |            SaltedPassword
                |                  |
                |                  v
                |              ClientKey--+
                |                  |      |
                v                  v      |
           AuthMessage----+----StoredKey  |
                          |               |
                          v               |
                   ClientSignature-------XOR
                                          |
                                          v
                                    ClientProof

The client can recalculate the StoredKey using the salt and ic from the AuthMessage and the original password from the user. ClientProof is what the client sends to the server. When the server receives a proof from a client as ClientProof', it needs to check it. It does so by also calculating ClientSignature and extracting ClientKey' from the XOR:

ClientKey'      := ClientProof' XOR ClientSignature

and then

StoredKey'      := H(ClientKey')

Now it can compare StoredKey' to StoredKey and if they match, it knows the client proof is valid.

The brilliant detail here is the construction of ClientProof. As you can see from the definition above, it is calculated from ClientKey. But ClientKey must never go over the wire, because an observing attacker could otherwise impersonate the client. To prevent that, it is encrypted using ClientSignature as a one-time pad. Since ClientSignature is a pseudo-random value from an HMAC calculation that depends on (at least) the client nonce, this should be a perfectly safe encryption. But why can't we just send ClientSignature in the first place? Because an attacker who obtains a ServerRecord from the server knows the StoredKey and could then impersonate the client. By having the client prove knowledge of the ClientKey, obtaining the StoredKey from the server database is not enough to impersonate the client.
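A minimal Python sketch of the proof construction and its verification, assuming SHA-256 and treating AuthMessage as an opaque byte string (the function names client_proof and server_verify are made up for illustration):

```python
import hashlib
import hmac

def client_proof(salted_password: bytes, auth_message: bytes) -> bytes:
    # the client re-derives ClientKey and StoredKey from the password
    client_key = hmac.new(salted_password, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()
    client_signature = hmac.new(stored_key, auth_message, hashlib.sha256).digest()
    # ClientKey XOR ClientSignature -- the signature acts as a one-time pad
    return bytes(a ^ b for a, b in zip(client_key, client_signature))

def server_verify(stored_key: bytes, auth_message: bytes, proof: bytes) -> bool:
    # the server recomputes ClientSignature and unwraps ClientKey'
    client_signature = hmac.new(stored_key, auth_message, hashlib.sha256).digest()
    client_key = bytes(a ^ b for a, b in zip(proof, client_signature))
    # valid iff H(ClientKey') matches the StoredKey in the ServerRecord
    return hmac.compare_digest(hashlib.sha256(client_key).digest(), stored_key)
```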

Another way to understand the mechanism is to think of it as follows:

Like in plaintext authentication, the client shows the password in the form of the directly derived ClientKey to the server. This allows the server to only store a password hash. But instead of transmitting the ClientKey over the wire in plaintext, it is encrypted under a session key (ClientSignature) that is derived from the nonces and the password hash.

Server Proof

For completeness we should continue with the protocol, as the server-to-client authentication is still open. When the client has proven to the server that it is legitimate, the server must send a ServerSignature to the client:

ServerSignature := HMAC(ServerKey, AuthMessage)

The client can check the ServerSignature by also calculating it from the user's password (and salt and ic) and comparing the results. This can prevent bogus login pages from attackers, where the evil server doesn't really check the user's password but keeps interacting with the user to trick them into further actions.
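A matching Python sketch, under the same SHA-256 assumption (function names made up): the server signs the AuthMessage with its stored ServerKey, and the client checks it with a constant-time comparison.

```python
import hashlib
import hmac

def server_signature(server_key: bytes, auth_message: bytes) -> bytes:
    # ServerSignature := HMAC(ServerKey, AuthMessage)
    return hmac.new(server_key, auth_message, hashlib.sha256).digest()

def client_check_server(server_key: bytes, auth_message: bytes,
                        received: bytes) -> bool:
    # the client re-derives ServerKey from the password and compares
    expected = server_signature(server_key, auth_message)
    return hmac.compare_digest(expected, received)
```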

';--you have been pwned?

So what can an attacker do with a stolen ServerRecord? Sadly there is one weakness in the protocol that can easily be spotted (and is also pointed out in the RFCs): If an attacker obtains the StoredKey from the server and learns the AuthMessage and ClientProof from an authentication exchange, he can calculate the ClientSignature and subsequently the ClientKey. Knowing the ClientKey, he can then impersonate the client to the server.

Implementation Remarks

I have simplified a few things that are not essential for understanding the basic mechanism but important for getting the implementation conforming and precise. So if you head out to implement SCRAM, go read the RFCs. There are also two issues that are important when implementing authentication protocols in general: timing side-channels and crypto hygiene.

Timing Side-Channel Analysis

If you don't know what a timing side-channel attack is, please read the Wikipedia article before going any further. It will be a revelation.

The hash function, HMAC and Hi/PBKDF2 implementations should be constant time. That's usually the case even for naive implementations.

Another usual weak point is comparisons (memcmp, strcmp) during proof verification. In particular, when the server compares StoredKey' and StoredKey for equality, a timing side-channel could leak information about the StoredKey. But for that to work, the attacker has to control StoredKey'. Since StoredKey' is the hash H(ClientProof' XOR ClientSignature), the attacker would have to know ClientSignature to extract any usable information from the timing side-channel. ClientSignature in turn depends on StoredKey and is unknown to the attacker.

As you can see, SCRAM is so well designed that it is hard to mess up, and I think it may even be completely resistant to timing side-channel attacks. I would nonetheless make the server-side verification comparison constant time, just to be on the safe side. The usual technique of XORing both values and checking that the result is all zeroes works well here, since the compared strings have constant length.
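The XOR-accumulate technique looks like this as a Python sketch (in practice, Python's standard library already offers hmac.compare_digest for exactly this purpose):

```python
def constant_time_eq(a: bytes, b: bytes) -> bool:
    # length mismatch can leak, but lengths are public constants here
    if len(a) != len(b):
        return False
    acc = 0
    for x, y in zip(a, b):
        acc |= x ^ y  # any differing bit sticks in acc; no early exit
    return acc == 0
```

The loop always runs over the full length, so the running time does not depend on where the first difference occurs.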

Crypto Hygiene

Another best practice in crypto implementations is to overwrite secrets with zeroes as soon as they are no longer needed. This protects against local attackers who may be able to observe memory. In the days of Spectre attacks, this is more relevant than ever.

Specifically, the ClientKey' that is calculated on the server during verification should be deleted as quickly as possible. While you are at it, know your tools - a simple memset is not enough. Overwriting memory that is not used afterwards is not trivial, because compilers may consider such writes useless and optimize them away; use a dedicated primitive like memset_s or explicit_bzero where available.

Recovering from a Database Breach

Another thing SCRAM makes possible is recovering from a database breach. When an attacker has stolen the ServerRecords, that doesn't mean the users' passwords need to be changed. It only means that the ClientKey and ServerKey need to be replaced.

In order to do that, you need multiple ServerRecords per user and store all but one in a safe place (like an offline/write-only database). So if you have a breach and have cleaned up the mess, you can pull out one of the spare ServerRecords for each user from the offline database, and you and your users are safe again without anyone having to change their password. For this to be safe, you have to make sure to use a different salt for every user and every ServerRecord.

The Authentication Trilemma Solution

So how does SCRAM solve the "choose two" trilemma from above? I think you could say that it trades a little bit of 1. for a lot of 2. By transferring the encrypted ClientKey to the server, the server learns the ClientKey, and in case of a previous database breach an attacker can learn it too. In return, the server can store strongly hashed passwords in the form of the StoredKey, which requires a costly brute-force attack to crack. I think this is a good trade-off.


Regular expressions are handy. Who hasn't written a quick and dirty "parser" here or there that just throws a regex at some input and checks whether it matches? In most modern languages there are built-in regex tools with excellent features and of good quality. In C/C++, things are a little more difficult.

You can use the shiny new regex library from the C++11 standard if your compiler supports it, or you can use Boost.Regex. Both are ... let's call it "surprisingly complex". And on top of that, the runtimes are of varying quality.

Also, sometimes you can only use C, which then forces you to use yet another external library.

But there is an alternative. If you have worked with regexes more deeply, you will know that they basically are tiny programs in a weird programming language. The regex engine is the interpreter. Some regex engines are even interpreters with an optimizing JIT compiler. But ... if you can compile stuff at runtime, you can also compile it ahead of time - provided you know the program at compile time.

So if you have a fixed regex in your program, there should be tools that can compile your regex into executable code. One such tool is re2c.

re2c is a code generator that takes a bunch of regular expressions and generates a C function that you can compile and call. This can be used for writing the lexer stage of a parser/compiler, and thus re2c is also sometimes called a lexer generator. But that is nothing else than a fancier regex compiler. So let's have a look at an example.

Given this C++ program with re2c annotations:

enum Token {
  TK_None,
  TK_PositiveDecimal,
  TK_NegativeDecimal
};

Token regex_match_decimal(const char *YYCURSOR) {
    const char *YYMARKER;

    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;

        end = "\x00";
        dec = [0-9]+;
        sign = "-";

                   * { return TK_None; }
             dec end { return TK_PositiveDecimal; }
        sign dec end { return TK_NegativeDecimal; }
    */
}

re2c will generate this lexer:

/* Generated by re2c 1.0.2 on Mon Oct  9 20:41:14 2017 */
enum Token {
  TK_None,
  TK_PositiveDecimal,
  TK_NegativeDecimal
};


Token regex_match_decimal(const char *YYCURSOR) {
    const char *YYMARKER;


{
  char yych;
  yych = *YYCURSOR;
  switch (yych) {
  case '-': goto yy4;
  case '0':
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9': goto yy5;
  default:  goto yy2;
  }
yy2:
  ++YYCURSOR;
yy3:
  { return TK_None; }
yy4:
  yych = *(YYMARKER = ++YYCURSOR);
  switch (yych) {
  case '0':
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9': goto yy6;
  default:  goto yy3;
  }
yy5:
  yych = *(YYMARKER = ++YYCURSOR);
  switch (yych) {
  case 0x00:    goto yy9;
  case '0':
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9': goto yy11;
  default:  goto yy3;
  }
yy6:
  yych = *++YYCURSOR;
  switch (yych) {
  case 0x00:    goto yy13;
  case '0':
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9': goto yy6;
  default:  goto yy8;
  }
yy8:
  YYCURSOR = YYMARKER;
  goto yy3;
yy9:
  ++YYCURSOR;
  { return TK_PositiveDecimal; }
yy11:
  yych = *++YYCURSOR;
  switch (yych) {
  case 0x00:    goto yy9;
  case '0':
  case '1':
  case '2':
  case '3':
  case '4':
  case '5':
  case '6':
  case '7':
  case '8':
  case '9': goto yy11;
  default:  goto yy8;
  }
yy13:
  ++YYCURSOR;
  { return TK_NegativeDecimal; }
}

}

What you can see is that:

  1. The generated code doesn't need a runtime.
  2. The generated code is just raw C.
  3. It's a simple statemachine implementation.

If you like that, have a look at the manual and examples on re2c.org. In the latest versions it even supports utf-8, push matchers and a bunch of compiler extensions to generate even faster matchers.

As a C++ dev working on bigger projects, you probably want an IDE with semantic code highlighting and navigation, a nice visual debugger and a good file/project overview to navigate source code.

On Mac there is Xcode (which I have mixed feelings about, but it does the job). On Windows there is no way around MSVC (which is probably the best heavy-weight C++ IDE around). On Linux there is a lot of vi and emacs and gdb and ddd and (not exactly IDEs) ... ugh ... and Kdevelop5 :) . Kdev5 is a KDE based IDE written in C++, so it's quite snappy. The previous version 4 was a bit brittle because it used a custom C++ parser and a custom CMake interpreter. Version 5 uses the llvm clang frontend as the C++ parser and CMake itself as the CMake interpreter. So Kdev5 is a huge improvement.

At work I like to use the latest LTS version of Ubuntu for convenience reasons. Most of the time Ubuntu LTS works really well and doesn't need full system upgrades every few months. Unfortunately Ubuntu LTS 16.04 still ships Kdev4, and I'm not too keen on adding PPAs for Kdev5 and then spending a day repairing a broken system.

The devs of Kdevelop probably recognized this problem and create x86_64 Linux builds bundled with all dependencies in an executable container format called AppImage. You can just go to kdevelop.org/download, grab the .AppImage file, mark it as executable and run it. This works really well and gets you up and running for C++ development very quickly, without having to mess with your system. Thank you Kdev5 devs, this saved me some time and trouble! :)

PS: Of course there are other IDEs on Linux: Qt Creator, Eclipse CDT, Code::Blocks, CLion (proprietary), Anjuta... If you don't like Kdev5, have a look at those.

PPS: Yes there are security issues with upstream packaging and built-in dependencies. But an IDE is not exactly a thing that I expect to process untrusted input.

When working in a large C++ codebase, compile times become a major worry. One reason for slow compile times is transitive header dependencies. Every header file should include what it needs. The preprocessor will then collect all the files and build a huge source file (the compilation unit) that will be fed to the actual compiler. This creates lots of disk I/O and a ton of C++ code which must be parsed by the compiler frontend, for which code will be generated, and most of which will finally be eliminated by the linker again.

One way to reduce the includes of a header is the PIMPL idiom. Instead of putting the implementation details of a class in the header, you move them into a source file (compilation unit).

Let's look at a simple example class that wraps stdio.h file handling:

/** \file File.h */

#include <stdio.h>
#include <string>

class File {
public:
  File();
  ~File();

  bool open(std::string path, std::string openmode);

private:
  std::string m_path;
  std::string m_openmode;
  FILE *m_handle;
};

The open(...) and close() methods would be implemented in the .cpp file something like this:

/** \file File.cpp */

#include "File.h"

bool File::open(std::string path, std::string openmode) {
  if(m_handle != NULL) {
    //close();
  }

  FILE* f = fopen(path.c_str(), openmode.c_str());

  if(f != NULL) {
    m_handle = f;
    m_path = path;
    m_openmode = openmode;
    return true;
  }
  return false;
}

This imports the stdio.h header, and everything that includes your beautiful new File class header will now get its global namespace polluted with C symbols. In this specific case a simple forward declaration of the FILE struct could help, but forward declarations are not always possible or a good solution.

A common pattern to completely hide the implementation of a class is the PIMPL ("private implementation" or "pointer to implementation") idiom. When using it, you put a shallow wrapper class in the header file. It has all the methods of the real class, but only one member: a pointer to a forward-declared class. There are variations around the exact implementation technique; I like the following C++11 std::unique_ptr based version the most. For brevity, I will omit the required copy- and move-constructors and operator= signatures. Since a PIMPL class essentially becomes a container for a single object, you have to add these - see also The Rule of Three/Five. As a side-effect, this also makes it a trivially moveable object, giving you fast and exception-safe move semantics.

/** \file File.h */

#include <string>
#include <memory>

class File {
public:
  File();
  ~File();
  //copy- and move- ctors/operator= methods here

  bool open(std::string path, std::string openmode);

private:
  class FilePIMPL;
  std::unique_ptr<FilePIMPL> m_impl;
};

The complete implementation will be moved into a compilation unit.

/** \file File.cpp */

#include <File.h>
#include <stdio.h>

class File::FilePIMPL {
public:
  FilePIMPL() :
    m_handle(NULL)
  {}

  bool open(std::string path, std::string openmode) {
    //... implementation here
  }

private:
  std::string m_path;
  std::string m_openmode;
  FILE *m_handle;
};

File::File() :
  m_impl(new FilePIMPL())
{}

File::~File() {
}

bool File::open(std::string path, std::string openmode) {
  return m_impl->open(path, openmode);
}

This solved the problem of hiding the complete implementation but introduced a new one: an additional allocation (new FilePIMPL()) of the PIMPL class on the heap in the default constructor File::File(). Sometimes this is not a big deal, especially when dealing with plumbing code or business logic. But allocations are not free. They take precious clock cycles, may grab a lock on the heap and thus limit parallelization, or can fail and throw exceptions. When writing performance sensitive code, an additional allocation may be prohibitively expensive.

But there is a solution in one of the more dangerous but magically enabling features of C++: placement new. Instead of constructing the PIMPL class on the heap with the regular new operator, we use the placement new operator and create the object in an embedded buffer inside the File class. There are two important details here: The buffer where our implementation class will be created must be

  1. big enough to hold the object, and
  2. properly aligned.

Since the compiler cannot see the implementation, it can know neither the size nor the proper alignment, so we must choose both manually.

/** \file File.h */

#include <string>
#include <cstddef>
#include <type_traits>

class File {
public:
  File();
  ~File();

  bool open(std::string path, std::string openmode);

private:
  static constexpr std::size_t c_impl_size = 128;
  static constexpr std::size_t c_impl_alignment = std::alignment_of<std::max_align_t>::value;
  typedef std::aligned_storage<c_impl_size, c_impl_alignment>::type aligned_storage_type;
  aligned_storage_type m_impl;
};

To make working with placement new a little more pleasant, I will add a few templated wrapper functions:

/** \file placement_new.h */

#include <cstddef>

///create an object of type T at a given address
template<typename T>
void placement_new(void *buffer, std::size_t buffer_size) {
  new(buffer) T();
}

///cast a given address to a pointer to type T
template<typename T>
T* placement_cast(void *buffer) {
  return reinterpret_cast<T*>(buffer);
}

///call the destructor of type T at a given address
template<typename T>
void placement_delete(void *buffer) {
  placement_cast<T>(buffer)->~T();
}

We will now use the placement_* functions to construct and destroy the object inside the compilation unit:

/** \file File.cpp */

#include <File.h>
#include <stdio.h>
#include <placement_new.h>

class FilePIMPL {
  //full implementation in this class
};


File::File() {
  //construct a FilePIMPL instance in m_impl
  placement_new<FilePIMPL>(&m_impl, sizeof(m_impl));
}

File::~File() {
  //destruct the FilePIMPL
  placement_delete<FilePIMPL>(&m_impl);      
}

bool File::open(std::string path, std::string openmode) {
  return placement_cast<FilePIMPL>(&m_impl)->open(path, openmode);
}

This is the basic mechanics of what I call the "Static PIMPL" idiom or pattern. I have not seen it used or called like this anywhere else, so I declare myself the inventor. But I'm probably wrong and it's already been used in a lot of places, maybe even inside STL implementations. If it has another name, please drop me a line.

(Update: Turns out Herb Sutter actually posted something like this as GotW#028 and calls it "Fast PIMPL" and if you google for Fast PIMPL, you will find multiple other authors describing more or less exactly what I'm lining out here)

As said above, the compiler cannot know the size and alignment requirements, so we must choose them manually. For the alignment, you can choose the platform's maximum alignment: std::alignment_of<std::max_align_t>::value. This is safe, but may cost a few bytes through over-alignment. For the size, you can measure the size of the implementation classes on the platforms you support and write some preprocessor logic to define a constant per platform. What also tends to work reasonably well is to measure the size on one platform, then divide it by sizeof(void*) and express the size constant as a multiple of sizeof(void*).

When size and/or alignment are wrong, you are doomed and your code will fail in mysterious ways. Effects range from slight performance penalties up to broken atomics (and thus invisible race conditions). Getting alignment wrong is also a common source of CPU exceptions when the compiler generates vectorized code. And if it doesn't happen today, it can happen any time in the future with a new compiler. So it is essential to guarantee size and alignment.

To guarantee that, you have to add a few asserts and static_asserts.

/** \file placement_new.h */

#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

template<typename T>
void placement_new(void *buffer, std::size_t buffer_size) {
  //check the size of the buffer (at runtime, in debug builds)
  assert(sizeof(T) <= buffer_size);
  //check the alignment of the buffer (at runtime, in debug builds)
  assert(std::align(std::alignment_of<T>::value, sizeof(T), buffer, buffer_size) == buffer);
  new(buffer) T();
}

/** \file File.cpp */
File::File() {

  static_assert(sizeof(m_impl) >= sizeof(FilePIMPL),
    "buffer not big enough to hold FilePIMPL");

  static_assert(
    std::alignment_of<aligned_storage_type>::value
    >=
    std::alignment_of<FilePIMPL>::value,
    "alignment requirements of FilePIMPL not fulfilled");

  //construct a FilePIMPL instance in m_impl
  placement_new<FilePIMPL>(&m_impl, sizeof(m_impl));
}

This is not a technique to use lightly. Only use it when you really have performance requirements that trump maintainability concerns. If your implementation is big or changes often, a classic PIMPL might be more appropriate as adjusting the buffer sizes will become a tedious activity. I used it successfully to implement Mutex classes that are used throughout the codebase and are implemented using each platforms native facilities (Win32 vs. pthread).

Hash functions are an essential part of computer science. They map arbitrary-length inputs to fixed-length outputs. There are general purpose and cryptographic hash functions. Cryptographic hash functions are one-way functions that provide certain guarantees on the complexity of finding collisions. Non-cryptographic or general purpose hash functions do not provide such guarantees and are used for implementing hashtables. Their output should be evenly distributed and look random so that few collisions occur.

One of those functions is called DJBX33A. It is used in the hash table implementation of Python (although as of 3.4 only for very short inputs) and many other products. It's a very simple algorithm:

hash_i+1 = hash_i * 33 + c_i
hash_0 = 5381

Where c_i is the current input byte and all operations are on 32-bit integers. This is so simple, effective and fast it almost makes you cry. Even a trivial implementation on a modest compiler will probably produce decent performance.
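The recurrence is a one-liner in any language. Here is a straight-line Python version, using unsigned 32-bit wraparound as common C implementations do:

```python
def djbx33a(data: bytes) -> int:
    h = 5381
    for c in data:
        # hash_i+1 = hash_i * 33 + c_i, wrapped to 32 bits
        h = (h * 33 + c) & 0xFFFFFFFF
    return h
```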

It has only one flaw: It is hard to vectorize on MMX/SSE/Neon/... . This is due to the iterative nature where each round depends on the output of the round before. The auto-vectorization engines of modern C/C++ compilers can do many magical things but this is still out of reach.

So what can be done? There is a simple mechanism that can be used here: Just run the algorithm four times and distribute the input bytes round-robin over the four hash states. This effectively splits the input byte stream into four separate streams that can be hashed in parallel with vector instructions. The output is 128 bits, which can then be hashed down into a 64- or 32-bit output using regular DJBX33A.
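A scalar Python sketch of the four-lane scheme (the speedup of course only materializes with vector instructions in C, and the exact folding step here is my own choice for illustration): bytes are dealt round-robin to four independent DJBX33A states, which are finally folded down with a plain DJBX33A pass over the 16-byte combined state.

```python
def x4djbx33a(data: bytes) -> int:
    def djbx33a_rounds(h: int, chunk: bytes) -> int:
        for c in chunk:
            h = (h * 33 + c) & 0xFFFFFFFF
        return h

    # four independent hash states; input byte i feeds lane i % 4
    lanes = [5381, 5381, 5381, 5381]
    for i, c in enumerate(data):
        k = i & 3
        lanes[k] = (lanes[k] * 33 + c) & 0xFFFFFFFF

    # fold the 128-bit state down to 32 bits with regular DJBX33A
    state = b"".join(lane.to_bytes(4, "little") for lane in lanes)
    return djbx33a_rounds(5381, state)
```

Each lane sees exactly the bytes at positions congruent to its index mod 4, in order, so lane k ends up with the plain DJBX33A hash of that substream.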

I hereby call this algorithm X4DJBX33A.

I had some time during the last days and implemented X4DJBX33A using Intel intrinsics in several instructions sets and variants. Here are some benchmark results using an Intel Core-i7 4960HQ 2.6GHz CPU with gcc-4.8 on linux/amd64:

  • DJBX33A reference C: 1198 MiB/s
  • DJBX33A optimized C: 1614 MiB/s
  • X4DJBX33A reference C: 1039 MiB/s
  • X4DJBX33A optimized C: 3119 MiB/s
  • X4DJBX33A mmx vectorized: 3484 MiB/s
  • X4DJBX33A sse2 vectorized: 6196 MiB/s
  • X4DJBX33A ssse3 vectorized: 6658 MiB/s

Not bad, huh?

I published the code and the benchmarks on github: cleeus/hashx4. I have also started looking into other hash functions that are used in important hashtable implementations (namely siphash24, which is used in Python 3.4 and later).

Looking at the benchmark results, there are also some important lessons to be learned:

  • If it's not an embarrassingly parallel algorithm, make it one (and then vectorize).

The DJBX33A algorithm can be implemented very fast but it is hard to vectorize. The modified version which I call X4DJBX33A seems to be well suited for running on 128bit or 64bit wide vector registers. When it is not vectorized, it is slower than a well implemented vanilla DJBX33A algorithm.

  • The SSSE3 version is an experiment.

SSSE3 has a number of new integer intrinsics. Among them is the mighty _mm_shuffle_epi8, which can reorder all 16 bytes in a 128-bit xmm register into an arbitrary permutation. Using this opcode leads to an alternative implementation that doesn't need the two _mm_unpack instructions. On some CPUs this seems to be faster; on most slower CPUs it is not.

  • Opcode scheduling is important, so use intrinsics.

Benchmarks with -O0 and -O1 builds have shown that even the MMX and SSE implementations get noticeably slower when not optimized. A look at the disassembled binaries of the -O2 and -O3 optimized builds shows that the compiler reorders instructions. It probably does this to enhance the instruction level parallelism and provide the CPU with useful instructions while it is still waiting for some RAM I/O to complete. Using intrinsics instead of raw assembler code allows the developer to leverage the wisdom of the compiler.

  • The C implementation performance varies widely with the compiler.

MSVC and GCC seem to produce very different code from the C implementations. This is not surprising as the research on auto-vectorization and codegen is still ongoing. The SSE2 version seems to be much more stable across compilers and platforms.

  • Know your instruction set.

I struggled quite a bit with the SSE2 instruction set and initially failed to produce a vectorized version that was faster than the scalar one. This was due to insufficient knowledge of the instruction set. In the end learning all the intrinsics and their performance characteristics is what enables a developer to find a good solution.

  • Alignment matters.

Besides the reference C implementations, I produced optimized (but still C-based) versions of DJBX33A and X4DJBX33A. A major optimization was to hash the input with a simple implementation until a 16-byte alignment boundary in the input memory block was reached. Then the compiler gets a hint to assume that the pointer is aligned to a 16-byte boundary. After the hint, an inner loop hashes 16-byte chunks while an outer loop drives the inner loop, which preserves the alignment assumption. This assumption allows the compiler to use opcodes that rely on alignment and enables auto-vectorization.

  • SSE2 is everywhere. Use it.

If you are on a 64bit X86 processor, you are guaranteed to have SSE2. On 32bit X86, every processor sold in the last 10 years has SSE2. From an economic point of view you can probably ignore non-SSE2 x86 CPUs or just provide one C implementation.