
Today we take memory for granted. As a developer you often don't think about it anymore and lavishly reserve buffers for all kinds of stuff. That XML file? Just throw it into a DOM parser. That ZIP file? Just decompress it into a memory buffer and write that to disk. That email attachment? Just decode it in memory and see what it is. This is perfectly ok, because if you thought for half an hour about every chunk of memory you were about to reserve, you would never get anything done on time.

But, there is always a but. The day will come when that XML file is a database dump with gazillions of nodes, or when it contains a huge chunk of encoded binary, and your new / malloc just cannot get you fresh memory anymore. The day will come when some email user attaches a 9GB ISO image to an email and tries to send it through your filtering SMTP server (I've heard chief executives really like to do that).

Yes, new can actually fail. On 32-bit architectures you have a theoretical raw address space of 4GB. But your process layout, stack, libraries and operating system restrict that considerably, so that only about 2GB (at least on Windows) are left for use. Now the heap, where most of your long-lived memory chunks will be allocated, is like a hard disk: it can become fragmented. That means if you have sprinkled small objects all over your heap, chances are your memory manager (new/malloc) won't find a large contiguous chunk anymore. And that's when it will fail and throw an exception.

When that happens, and you really need to solve the problem, it's like standing with your back to the wall. You have two choices:

  1. Offload large chunks to the harddisk, or
  2. rewrite everything to be a streaming architecture.

Well, 1. is kind of obvious: go to the places where your code wants huge chunks of memory and replace them with read/write access to temporary files. That's ugly, but it works, and it's often the only option you have in an existing codebase.

Now what does 2. mean? A streaming architecture?

In a streaming architecture you can only read and write fixed-size chunks at a time. That is, your program reserves a fixed buffer and then repeatedly reads into it, runs some processing on it and finally writes the buffer content to some sink - repeat. Your read sources and write sinks could be files, sockets, devices, whatever. The thing is: your overall design has to account for this requirement up front. Otherwise you've lost - converting an existing step-by-step design into an on-the-fly streaming architecture is close to a full rewrite.

Creating such an architecture can be hard. It might require some dirty hacks when the output format needs calculated values (like checksums) at the beginning, which are only known after fully running through the input once. Or it might require seekable input stream semantics because you need to know the size of your input before you have fully processed it. That's why designing clever output formats is so difficult.

But it's worth it in the long run. What would you think if tar -xz suddenly crashed with an out-of-memory error?

So, next time you process a file: Do it in chunks! :)

One part of learning a language is getting to know the standard runtime / library environment. All modern languages come with huge standard libraries, and they make up a big part of what the language feels like and how efficient you are as a developer in your day-to-day job.

In some languages these libraries are closely interwoven with the language itself. Take Java and its String class. It's part of the library, located in the package java.lang, but it's so close to the language that every string literal is converted to a String instance.

Or take Python and its exception classes. Even when basic language statements like import package fail (e.g. because package cannot be found), an exception from the built-in exceptions module is thrown.

C++ takes a somewhat different approach.

C++ is rigorously divided into core language features and a standard library that is implemented using these language features. C++ core language features are things like:

  • basic data types, arrays, pointers
  • functions
  • classes, inheritance, polymorphism
  • templates
  • exceptions
  • runtime type information

The standard library in C++ provides a large number of mostly generic templates (classes and functions); this generic part is commonly called the Standard Template Library - STL.

The basic building blocks of the STL are containers, iterators and algorithms. Containers are the standard computer science data structures: vectors, linked lists, priority queues, (hash) maps, sets and so on.

When using them, you will sometimes miss some helpful member functions. E.g. there is no .sort(), no .reverse(), no .find() member function in the std::vector class. At first you'll find that really annoying. But like most things in C++, there is a deeper meaning behind it.

As said above, there is a collection of algorithms in the <algorithm> header. These standard algorithms like sort, replace, copy, for_each work on all these standard containers. Now why are they decoupled? The argument is simple and is called the NxM problem:

Given N containers and M algorithms, you have to maintain NxM implementations of these algorithms.

This is costly and error prone. The inventors of the STL felt the need to solve this problem and invented a glue, a unified interface that can expose some of the properties of containers so that generic algorithms can work on them. This glue is iterators.

An iterator is like a fancy reference/pointer to an element of a container. The weakest type of iterator supports the ++ increment operator and the * dereference operator. That means you can make the iterator point to the next element in the container and access what it's pointing to. The most powerful iterators expose a random access subscript operator []. Building on this abstraction, you can implement algorithm function templates like std::copy that work equally well on raw C arrays and doubly linked lists. Yes, you can use the STL algorithms on raw C arrays.

Try this:

#include <algorithm>
#include <cstddef>   // size_t
#include <iostream>

int main(int, char**) {
  int foo[42];
  const size_t foo_size = sizeof(foo)/sizeof(foo[0]);
  int bar[foo_size];

  for(size_t i=0; i<foo_size; i++) {
    foo[i] = i;
  }

  std::copy(foo, foo+foo_size, bar);

  for(size_t i=0; i<foo_size; i++) {
    std::cout << bar[i] << " ";
  }

  return 0;
}

And next time, we'll take a look at the assembly :)

great music, great people, love everywhere :)

bookmark: Der Fall Böse - Raus

As most of you may know, at my job I'm not coding in one of those fancy new languages but in old-school, hardcore C++. Now some may scream: OMFG, C/C++ - it causes security bugs, it's verbose, it's not cross-platform, it's so 90's - your job must be terrible. But after more than a year, I can only say: I love it. In this and probably some of the next posts, I'll try to show some of the things I like.

Admittedly, C++ is not for the faint-hearted. You have to be prepared to really engage with the language and the machine. Once you have, you will see that:

  • C++ is one of the most portable languages.
  • C++ (as opposed to plain C) seldom causes security bugs (no more than, say, Java).
  • You get rewarded with unmatched performance.

Of course you always have to choose the right tool for the job. If I had to make a bunch of half-educated "IT specialists" code some random business application, I'd probably not choose C++, because they wouldn't have a clue why it crashed all the time (Java could be a good tool here). Or if I wanted to write a tiny tool to do some text processing, I'd just use Python ;). But with clever coders and high performance requirements, C++ is still an excellent choice.

And the performance gains can be dramatic. During my diploma thesis (together with Daniel Käs), I did some heavy statistical calculations on vast amounts of data. One of them was computing a distance matrix: you have a number of points (vectors) with N dimensions (talking about N > 20000 here) and want to know the distance of each point to every other point. That's a distance matrix. Our distance function was the cosine similarity measure, which is a little more demanding, but not much.

The first implementation was pure Python. It took ages; we were effectively unable to repeat the calculations more than once a day. Then I ported just the distance function to C (using Cython). That alone improved performance by a factor of 2-3. But the real gain came when I also moved the inner for-loop into C (which, from the Python point of view, reduced complexity from O(NxN) to O(N), and from the C point of view increased it from O(1) to O(N)). That was another speedup of roughly 10x.

Now that was C, but C++ can be equally fast. I just recently saw what a good optimizer can do with some template code and careful inlining. Let's take a look at the "workhorse class of the STL" (Stephan T. Lavavej): std::vector. Some even more old-school and hardcore C coders worry that using vectors instead of plain old arrays is considerably slower. But this is not really true. A vector is a dynamically growable array-like structure. It overloads the subscript operator [], which means you can use it just like an array. And it has comfortable features like a .size() method and a .push_back(...) method. Let's say we have a class Foo, make a vector of Foos and assign it some values:

#include <vector>

class Foo {
public:
  int get_bar() const { return bar; }
  Foo& operator=(int val) { bar = val; return *this; }
private:
  int bar;
};


int main(int, char**) {
  std::vector<Foo> foos;
  foos.resize(100);

  for(std::vector<Foo>::size_type i=0; i<foos.size(); i++) {
    foos[i] = i;
  }

  return 0;
}

std::vector<Foo>::size_type is the correct type for indexing foos; you could also say size_t here. Assigning i to foos[i] works because the assignment operator = in Foo is overloaded.

Is that assignment loop slower than a plain C loop over an array of ints? Let's find out and ask ourselves: what would gcc do? First it needs the definition of the subscript operator [] on the vector template. The following implementation is from libstdc++ 4.4:

reference operator[](size_type __n) {
    return *(this->_M_impl._M_start + __n);
}

Modern compilers like gcc decide for themselves when it's a good idea to inline code. Inlining pays off when the inlined function is only a few lines long, and the subscript operator certainly is a good candidate. So let's do that and see what our loop becomes:

for(std::vector<Foo>::size_type i=0; i<foos.size(); i++) {
    *(foos._M_impl._M_start + i) = i;
}

Looks like some good old pointer math. But that = operator is still a member function. It's also small and will certainly be inlined, too. Let's inline that one as well (for brevity, only the assignment statement; the loop stays untouched):

(*(foos._M_impl._M_start + i)).bar = i;

The next step is to resolve the . member access operators. All they do is add the offset of the member to the base address (the this pointer) of the object. Now a good compiler performs the 0-offset optimization: there is no additional data between the this pointer and the first class member (unless the class has virtual functions, in which case the compiler has to add a vtable pointer). Our Foo class only has the .bar member, which is its first member. So the offset from the this pointer is 0 and the addition can be eliminated:

*((foos._M_impl._M_start + i)+0) = i;

which is obviously equal to:

*(foos._M_impl._M_start + i) = i;

Now we do the same with those strange _M_impl and _M_start members in the STL vector implementation. _M_impl is an intentionally uglified name for the internal proxy object that manages the vector's memory, and it just so happens that _M_impl is the first data member of vector. It's of type _Vector_impl, a nested class of _Vector_base. _M_start is a raw pointer (Foo*) to the memory block inside the _Vector_impl and also happens to be its first data member. So the optimizer can go on and eliminate _M_impl and _M_start:

*((int*)(&foos + 0 + 0) + i) = i;

which is equal to:

((int*)&foos)[i] = i;

This looks exactly like flat memory access to an int[] C array. And for the non-believers, we'll now take a look at the emitted assembly. First the C version (compiled with gcc 4.4.5 on x86_64 with -O2):

int foos[100];
int i;
for(i=0; i<sizeof(foos)/sizeof(int); i++) {
  foos[i] = i;
}

which compiles and disassembles to:

#0000000000000830 <main>:
# ... snip ... (function prologue)
# i=0 (eax is our i)
xor   %eax,%eax
# foos start address (rsp) -> rbp, rbx and rdx
mov   %rsp,%rbp
mov   %rsp,%rbx
mov   %rsp,%rdx
# some NOP / prefetch instruction
nopl   (%rax)
# foos[i] = i
mov    %eax,(%rdx) #loop begin / 0x858
# i++
add    $0x1,%eax
add    $0x4,%rdx
# test i<100 ?
cmp    $0x64,%eax    
# goto loop-start
jne    858 <main+0x28> #loop begin

Pretty clever, gcc, huh? ;) It notices that we're accessing foos with the loop counter. It then allocates a register (rdx) and increments it by 4 (the size of an int) on each iteration. This way it doesn't actually have to add i to the start address of foos in every round. Well, let's see what it can do with the C++ code from above.

#0000000000000a10 <main>:
# ... snip ...
# 100 -> edx
mov   $0x64,%edx
# ... snip ...
#foos.size() inlined -> rdx
mov   0x10(%rsp),%rdi
mov   0x18(%rsp),%rdx
sub   %rdi,%rdx
sar   $0x2,%rdx
#initial test on size() 
test  %rdx,%rdx
je    a84 <main+0x74>  #a84 is behind loop
# i = 0
xor   %eax,%eax
# some NOP / prefetch instruction
nopl  (%rax)
# foos[i] = i
mov   %eax,(%rdi,%rax,4) #loop begin / 0xa78
# i++
add   $0x1,%rax
# test i<100 ?
cmp   %rdx,%rax
# goto loop begin
jb    a78 <main+0x68> #loop begin

Quite different, I have to admit. But it's only logical. In the C version, gcc can assume that the size i is compared against is always greater than 0, so it can skip the initial check. In the C++ version it cannot, because .size() could be 0 - the vector might be empty (see the test/je pair before the loop). But we can see another optimization here. .size() is normally a method call, yet it's inlined too (no call, just memory access). And not only that: it's hoisted out of the loop and its result kept in register rdx (see the sub/sar sequence at the top). The only other major difference is the instruction used to move the value of i into memory. The C++ version uses a scaled addressing mode, where the CPU has to do an additional internal calculation. This might be a tiny bit slower than the C version.

The C version might be slightly faster, but we're not talking about an order of magnitude here. It's just a matter of loop-invariant analysis and choosing one opcode over another.

But don't forget what you get for that tiny bit of performance impact: all the benefits of a fully functional, dynamically sized container. In debug mode you usually even get bounds checking. You get exception safety (at least the weak guarantee: no resource leaks). And in release mode you get that nice, clean, near-C speed.

C++ is a type-safe, high-level and abstract language, and it still lets you write code that is amazingly fast. And this is only one of the reasons why C++ is worth being loved :)

A new star is born.

Well ... stars ... there are many, and most of them are not that bright either. But I was looking for something new to build this website with. The old horrible Joomla/PHP thing broke apart after some PHP update (I don't remember which one), and it was way too much for my needs anyway.

So I was looking for something fresh. Since I had already uploaded static snapshots of the old PHP CMS (via httrack and some sed+curl scripts), a static site compiler seemed like a very appealing idea.

A static site compiler does something really simple: it takes a number of input and configuration files, runs them through its processing machinery and spits out a collection of HTML and auxiliary files (aka a website ... you know, like websites from the 90's). You can then simply upload that site and serve it statically (or even publish it with your favourite VCS).

No PHP. No database. No security bugs. Simple, fast, static file serving. Such a site might even survive that Slashdot link on your average shared web host.

I tried a few existing ones, but none of them was what I needed. When I have to install a dozen additional libraries (for which my distribution doesn't even have packages yet) before I can get a tool up and running, something is wrong. And something is also wrong when a tool pulls in some huge template engine for which I first have to learn a new language before I can get started.

So none of the tools I tried had all of the following properties:

  • simple to use
  • simple to install
  • few dependencies
  • supports blogging, rss feeds and simple pages
  • low maintenance

I had an itch.

Now, over a few weekends, I hacked together a few hundred SLOC of Python scriptiness. It's far from done, it's completely undocumented and it's not (yet) unit-tested. But it only needs a Python runtime, and it can already put together this blog :)

Stay tuned, there's more to come.