As most of you may know, at my job I'm not coding in one of those fancy new languages but in old-school, hardcore C++. Now some may scream: OMFG, C/C++ - it causes security bugs, it's verbose, it's not cross-platform, it's so 90's - your job must be terrible. But after more than a year, I can only say: I love it. In this and probably some of the next posts, I'll try to show some of the things I like.
Admittedly, C++ is not for the faint-hearted. You have to be prepared to engage with the language and the machine. Once you've got that, you will see that:
- C++ is one of the most portable languages
- C++ (if not written like C) seldom causes security bugs (no more than, say, Java).
- You get rewarded with unmatched performance.
Of course, you always have to choose the right tool for the job. If I had to make a bunch of half-educated "IT specialists" code some random business application, I'd probably not choose C++, because they wouldn't have a clue why it crashed all the time (Java could be a good tool here). And if I wanted to write a tiny tool to do some text processing, I'd just use Python ;). But with clever coders and high performance requirements, C++ is still an excellent choice.
And the performance gains can be dramatic. During my diploma thesis (together with Daniel Käs), I did some heavy statistics calculations on vast amounts of data. One of them was to calculate a distance matrix. That is, you have a number of points (vectors) with N dimensions (we're talking N > 20000 here) and want to know the distance of each point to every other point. That's a distance matrix. Our distance function was the cosine similarity measure, which is a little more demanding, but not much.
The first implementation was pure Python. It took ages; we were effectively unable to repeat the calculations more than once a day. Then I ported just the distance function to C (using Cython). That already improved performance by a factor of 2-3. But the real gain came when I also ported the inner for-loop to C (which, from the Python point of view, reduced the complexity from O(N x N) to O(N), and from the C point of view increased it from O(1) to O(N)). That was another speedup of about 10x.
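For illustration, here is roughly what that kernel looks like in C++. This is a sketch from memory, not the actual thesis code; the function names are mine, and cosine similarity serves as the "distance" here, as it did in the thesis:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two N-dimensional points: the normalized
// dot product dot(a,b) / (|a| * |b|).
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        dot += a[k] * b[k];
        na  += a[k] * a[k];
        nb  += b[k] * b[k];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// The O(N x N) loop around it: every point against every other point.
// Since the measure is symmetric, only the upper triangle is computed.
std::vector<std::vector<double> >
distance_matrix(const std::vector<std::vector<double> >& points) {
    const std::size_t n = points.size();
    std::vector<std::vector<double> > d(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i; j < n; ++j)
            d[i][j] = d[j][i] = cosine_similarity(points[i], points[j]);
    return d;
}
```

With N > 20000 points, that inner product runs hundreds of millions of times, which is why moving it out of the interpreter paid off so dramatically.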
Now that was C, but C++ can be equally fast. I recently discovered what a good optimizer can do with template code and careful inlining. Let's take a look at the "workhorse class of the STL" (Stephan T. Lavavej): std::vector. Some even more old-school and hardcore C coders worry that using vectors instead of plain old arrays is considerably slower. But this is not really true. A vector is a dynamically growable array-like structure. It has the subscript operator [] overloaded, which means you can use it just like an array. But it also has comfortable features like a .size() method and a .push_back(...) method. Let's say we have a class Foo, make a vector of Foos and assign it some values:
#include <vector>

class Foo {
public:
    int get_bar() const { return bar; }
    Foo& operator=(int val) { bar = val; return *this; }
private:
    int bar;
};

int main(int, char**) {
    std::vector<Foo> foos;
    foos.resize(100);
    for (std::vector<Foo>::size_type i = 0; i < foos.size(); i++) {
        foos[i] = i;
    }
    return 0;
}
std::vector<Foo>::size_type is the correct type for indexing foos; you could also say size_t here. Assigning i to foos[i] works because the assignment operator = in Foo is overloaded.
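The example fills the vector via resize() plus indexing, but the push_back() method mentioned earlier works just as well. A small sketch, using a plain vector of ints for brevity (make_foos is just an illustrative name):

```cpp
#include <vector>

// Build the same 100-element sequence with push_back() instead of
// resize(): reserve() preallocates the memory up front, so the
// push_back() calls in the loop never have to reallocate and copy.
std::vector<int> make_foos() {
    std::vector<int> foos;
    foos.reserve(100);
    for (int i = 0; i < 100; ++i)
        foos.push_back(i);
    return foos;
}
```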
Is that assignment loop slower than a plain C loop over an array of ints? Let's find out and ask ourselves: what would gcc do? First, it needs to know the definition of the subscript operator [] of the vector template. The following implementation is from libstdc++ 4.4:
reference operator[](size_type __n) {
    return *(this->_M_impl._M_start + __n);
}
Modern compilers like gcc decide for themselves when it's a good idea to inline code. Inlining pays off when the inlined function is only a few lines of code, and the subscript operator certainly is a good candidate. So let's do that and see what our loop becomes:
for (std::vector<Foo>::size_type i = 0; i < foos.size(); i++) {
    *(foos._M_impl._M_start + i) = i;
}
Looks like good old pointer math. But that = operator is still a method. It's also small and will certainly be inlined too. Let's inline that one as well (for brevity, only the assignment statement; the loop stays untouched):
(*(foos._M_impl._M_start + i)).bar = i;
The next step is to resolve the . member-access operators. All they do is add the offset of the member to the base address (the this pointer) of the object. Now, a good compiler performs the 0-offset optimization: there is no additional data between the this pointer and the first class member (unless the class has virtual functions, in which case the compiler has to add a vtable pointer). Our Foo class only has the bar member, which is its first member. So the offset from the this pointer is 0 and can be eliminated:
*((foos._M_impl._M_start + i)+0) = i;
which is obviously equal to:
*(foos._M_impl._M_start + i) = i;
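You can check the 0-offset claim yourself. The sketch below (struct and function names are mine) compares the offset of the first data member in a plain class and in one with a virtual destructor; the exact numbers are ABI-specific, and the comments assume the usual gcc/Itanium layout:

```cpp
#include <cstddef>

// Without virtual functions the first member sits directly at the
// object's base address; with a virtual function, a vtable pointer
// is placed in front of it.
struct Plain   { int bar; };
struct Dynamic { virtual ~Dynamic() {} int bar; };

std::ptrdiff_t offset_of_bar_plain() {
    Plain p;
    return (char*)&p.bar - (char*)&p;      // 0 on typical ABIs
}

std::ptrdiff_t offset_of_bar_dynamic() {
    Dynamic d;
    return (char*)&d.bar - (char*)&d;      // sizeof(void*) on typical ABIs
}
```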
Now we have to do the same with those strange _M_impl and _M_start members in the STL vector implementation. _M_impl is an intentionally uglified name for the internal proxy object that manages the vector's memory. It just happens to be the case that _M_impl is the first data member in vector. It's of type _Vector_impl, which is a nested class in _Vector_base. _M_start is a raw pointer (Foo*) to the memory block inside the _Vector_impl and also happens to be the first data member in _Vector_impl. Our optimization can go on and eliminate _M_impl and _M_start:
*(*(int**)((char*)&foos + 0 + 0) + i) = i;
which is equal to:
(*(int**)&foos)[i] = i;
Note the extra dereference: _M_start is a pointer stored at offset 0 inside the vector object, so we first load that pointer and then index through it.
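If you don't want to take the pointer arithmetic on faith, you can verify the contiguity directly. A small sketch (is_contiguous is my own helper name; since C++11 you could also spell &foos[0] as foos.data()):

```cpp
#include <vector>

// Verify that the vector's elements really are one contiguous block,
// just like an int[]: element i sits exactly i slots after element 0.
bool is_contiguous(std::vector<int>& foos) {
    int* raw = &foos[0];
    for (std::vector<int>::size_type i = 0; i < foos.size(); ++i)
        if (&foos[i] != raw + i)
            return false;
    return true;
}
```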
This looks just like flat memory access on a C array of ints. And for the non-believers, we'll now have a look at the emitted assembler code. First the C version (compiled with gcc 4.4.5, on x86_64, with -O2):
int foos[100];
int i;
for (i = 0; i < sizeof(foos)/sizeof(int); i++) {
    foos[i] = i;
}
which compiles and disassembles to:
#0000000000000830 <main>:
# ... snip ... (function prologue)
# i=0 (eax is our i)
xor %eax,%eax
# foos start address (rsp) -> rbp, rbx and rdx
mov %rsp,%rbp
mov %rsp,%rbx
mov %rsp,%rdx
# some NOP / prefetch instruction
nopl (%rax)
# foos[i] = i
mov %eax,(%rdx) #loop begin / 0x858
# i++
add $0x1,%eax
add $0x4,%rdx
# test i<100 ?
cmp $0x64,%eax
# goto loop-start
jne 858 <main+0x28> #loop begin
Pretty clever, gcc, huh? ;) It notices that we're accessing foos with the loop counter. It then allocates a register (rdx) and increments this register by 4 (the size of an int) in each loop iteration. This way it doesn't actually have to add i to the start address of foos.
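Written back out as source code, the strength-reduced loop gcc emits looks roughly like this (fill_array is just an illustrative name):

```cpp
// The pointer-bump form of the fill loop: instead of computing
// foos + i every iteration, a pointer marches through the array.
void fill_array(int* foos, int n) {
    int* p = foos;                 // mov %rsp,%rdx
    for (int i = 0; i < n; ++i) {  // eax is i
        *p = i;                    // mov %eax,(%rdx)
        ++p;                       // add $0x4,%rdx
    }
}
```

This transformation is called strength reduction: the multiplication hidden in foos[i] (i times sizeof(int)) is replaced by a cheap running addition.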
Well let's see what it can do with the C++ code from above.
#0000000000000a10 <main>:
# ... snip ...
# 100 -> edx
mov $0x64,%edx
# ... snip ...
#foos.size() inlined -> rdx
mov 0x10(%rsp),%rdi
mov 0x18(%rsp),%rdx
sub %rdi,%rdx
sar $0x2,%rdx
#initial test on size()
test %rdx,%rdx
je a84 <main+0x74> #a84 is behind loop
# i = 0
xor %eax,%eax
# some NOP / prefetch instruction
nopl (%rax)
# foos[i] = i
mov %eax,(%rdi,%rax,4) #loop begin / 0xa78
# i++
add $0x1,%rax
# test i<100 ?
cmp %rdx,%rax
# goto loop begin
jb a78 <main+0x68> #loop begin
Quite different, I have to admit. But it's only logical. In the C version, gcc can prove that the size to compare i with is always greater than 0, so it can skip the initial check. In the C++ version it cannot, because .size() could return 0 at runtime - the vector might be empty (see a6e). But we can see another optimization here: .size() normally is a method, yet it's getting inlined too (no call, just memory access). And not only that, it's also hoisted out of the loop, with the result kept in register rdx (see a5d). The only other major difference is the operation used to move the value of i into memory: the C++ version uses an indexed addressing mode, where the CPU has to do an additional internal calculation. This might be a bit slower than the C version.
The C version might be faster, but not by a whole order of magnitude or anything like that. It's just a matter of loop-invariant analysis and the choice of one opcode over another.
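And you don't have to rely on the compiler's loop-invariant analysis: the classic idiom hoists .size() out of the loop by hand. A small sketch (fill_vector is my own name for it):

```cpp
#include <vector>

// Compute .size() once and hand the optimizer the loop bound
// explicitly, instead of re-reading it every iteration.
void fill_vector(std::vector<int>& foos) {
    for (std::vector<int>::size_type i = 0, n = foos.size(); i != n; ++i)
        foos[i] = static_cast<int>(i);
}
```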
But don't forget what you get for that tiny little bit of performance impact: you're getting all the benefits of a fully functional dynamic-size storage. In debug mode, you usually even get bounds checking. You get exception safety (at least the basic guarantee: no resource leaks). And in release mode, you have that nice and clean near-C speed.
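The bounds checking isn't limited to debug mode, either: vector also offers the .at() method, which always checks the index and throws std::out_of_range instead of scribbling over random memory. A small sketch (safe_read is my own helper name):

```cpp
#include <stdexcept>
#include <vector>

// .at() is the always-checked cousin of operator[]: an out-of-range
// index throws std::out_of_range instead of invoking undefined behavior.
bool safe_read(const std::vector<int>& v,
               std::vector<int>::size_type i, int& out) {
    try {
        out = v.at(i);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```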
C++ is a type-safe, high-level, abstract language that still lets you write amazingly fast code. And this is only one of the reasons why C++ is worth being loved :)