Today we take memory for granted. As a developer you often don't think about it anymore and lavishly reserve buffers for all kinds of stuff. That XML file? Just throw it into a DOM parser. That ZIP file? Just decompress it into a memory buffer and write that to disk. That email attachment? Just decode it in memory and see what it is. This is perfectly OK, because if you thought for half an hour about every chunk of memory you were about to reserve, you would get nothing done on time.
But, there is always a but.
The day will come when that XML file is a database dump with gazillions of nodes, or it contains a huge chunk of encoded binary, and your new/malloc just cannot get you fresh memory anymore.
The day will come when some email user attaches a 9GB ISO image to an email and tries to send it through your filtering SMTP server (I've heard chief executives really like to do that).
Yes, new can actually fail. On 32-bit architectures you have a theoretical raw address space of 4GB. But your process layout, stack, libraries and operating system restrict that considerably, so that only about 2GB (at least on Windows) are left for use.
Now the heap, where most of your long-living memory chunks will be allocated, is like a hard disk: it can become fragmented. That means, if you have sprinkled small objects all over your heap, chances are your memory manager (new/malloc) won't find a large contiguous chunk anymore. And that's when it will fail and throw an exception.
When that happens, and you really need to solve the problem, it's like standing with your back to the wall. You have two choices:
- Offload large chunks to the hard disk, or
- rewrite everything to be a streaming architecture.
Well, 1. is kinda obvious: Go to the places where your code wants to get huge chunks of memory and replace them with read/write access to temporary files. That's ugly, but it works, and it's often the only option you have in an existing codebase.
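A minimal sketch of that pattern: instead of slurping an unbounded input into one giant buffer, spill it to an anonymous temporary file and hand that to the next processing step (the 64KB staging buffer is an arbitrary choice):

```cpp
#include <cstddef>
#include <cstdio>
#include <iostream>

// Spill an unbounded input into a temporary file instead of a memory buffer.
// std::tmpfile gives us an anonymous file that is removed automatically when
// it is closed.
std::FILE* spill_to_tempfile(std::istream& in) {
    std::FILE* tmp = std::tmpfile();
    char buf[64 * 1024];                       // small, fixed-size staging buffer
    while (in) {
        in.read(buf, sizeof(buf));
        std::streamsize n = in.gcount();       // how much we actually got
        if (n > 0)
            std::fwrite(buf, 1, static_cast<std::size_t>(n), tmp);
    }
    std::rewind(tmp);                          // the next consumer reads from the start
    return tmp;                                // caller fcloses it when done
}

int main() {
    std::FILE* tmp = spill_to_tempfile(std::cin);
    // ... process tmp chunk by chunk here, instead of a huge in-memory buffer
    std::fclose(tmp);
    return 0;
}
```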
Now what does 2. mean? A streaming architecture?
In a streaming architecture you can only read and write fixed-size chunks at a time. That is, your program reserves a fixed buffer and then repeatedly reads into it, runs some processing on it, and finally writes the buffer's content to some sink; repeat. Your read sources and write sinks could be files, sockets, devices, whatever. The thing is: your overall design has to consider this requirement up front. Otherwise you've lost. Converting an existing step-by-step design into an on-the-fly streaming architecture is close to a full rewrite.
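A minimal sketch of that loop, with a trivial uppercase transform standing in for the real processing step:

```cpp
#include <cctype>
#include <iostream>

// Read, process and write in fixed-size chunks: memory use stays constant
// no matter how big the input is.
int main() {
    char buf[64 * 1024];                        // the only buffer we ever reserve
    while (std::cin) {
        std::cin.read(buf, sizeof(buf));
        std::streamsize n = std::cin.gcount();  // how much we actually got
        for (std::streamsize i = 0; i < n; ++i) // "processing": uppercase every byte
            buf[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(buf[i])));
        std::cout.write(buf, n);                // hand the chunk to the sink
    }
    return 0;
}
```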
Creating such an architecture can be hard. It might require some dirty hacks when the output needs calculated values (like checksums) at the beginning, even though they are only known after fully running through the input once. Or it might require seekable input stream semantics when you need to know the size of your input before you have fully processed it. That's why designing clever output formats is so difficult.
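One common workaround for the checksum case, assuming your sink is seekable (a file is, a socket is not): write a placeholder header, stream the payload while updating the checksum, then seek back and patch the header. The file names and the toy additive checksum below are just for illustration:

```cpp
#include <cstdint>
#include <fstream>

// Assumed output layout: a 4-byte checksum header followed by the payload.
int main() {
    std::ifstream in("input.bin", std::ios::binary);
    std::ofstream out("output.bin", std::ios::binary);

    std::uint32_t checksum = 0;
    out.write("\0\0\0\0", 4);                  // placeholder for the header

    char buf[64 * 1024];
    while (in) {
        in.read(buf, sizeof(buf));
        std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i)
            checksum += static_cast<unsigned char>(buf[i]);   // toy checksum
        out.write(buf, n);
    }

    out.seekp(0);                              // go back and patch the header
    out.write(reinterpret_cast<const char*>(&checksum), sizeof(checksum));
    return 0;
}
```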
But it's worth it in the long run. What would you think if tar -xz suddenly crashed with an out-of-memory exception?
So, next time you process a file: Do it in chunks! :)