How many cities does the world have?
So this is what I did this weekend: for my diploma thesis I needed a list of all the cities in the world (to filter text collections). Corpora like the “Getty Thesaurus of Geographic Names” already exist, but they are not free and go far beyond my needs - I just want a list.
Who could have such a list? First try: the IATA airport list from Wikipedia. Parsed. Erroneous (typos, problems with parsing…). Second try: openstreetmap.org.
And here is where the story begins. Openstreetmap.org collects geographical knowledge in a huge database and publishes it under a free license (CC-BY-SA). The database dump can be downloaded as a huge XML file, which is currently about 5GB with bzip2 compression. The expected compression ratio is about 10:1, so this is a fricking 50GB XML file. The structure is fairly simple. There are nodes, ways and relations, and these can have tags. The tags then encode information like “this is a place of the kind city” or “its name is Footown”. To get this information I wrote a little Python script that iterates over every line and extracts the relevant parts with some regular expressions. A run on the 5MB bzip2ed relations file took 17s. Well … that was too slow. So I removed the regexps and did some dirty by-hand parsing. 8s. Better. But still too slow.
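For illustration, the line-by-line idea could look roughly like the sketch below. This is not the original script, just a minimal reconstruction under some assumptions: every `<tag k="…" v="…"/>` element sits on its own line (as in the planet dumps), and only the place kinds city, town and village are of interest. The names `TAG_RE`, `PLACE_KINDS` and `iter_places` are made up for this example.

```python
import re
from xml.sax.saxutils import unescape

# Assumption: each <tag .../> element is on a single line of the dump.
TAG_RE = re.compile(r'<tag k="([^"]*)" v="([^"]*)"')
# Assumption: only these place kinds are wanted.
PLACE_KINDS = {"city", "town", "village"}
ELEMENTS_OPEN = ("<node", "<way", "<relation")
ELEMENTS_CLOSE = ("</node", "</way", "</relation")


def iter_places(lines):
    """Yield (kind, name) for every element tagged as a named place."""
    tags = {}
    for line in lines:
        stripped = line.strip()
        match = TAG_RE.search(stripped)
        if match:
            # Remember every tag of the element we are currently inside.
            tags[match.group(1)] = match.group(2)
        elif stripped.startswith(ELEMENTS_CLOSE):
            # End of an element: emit it if it is a named place.
            kind = tags.get("place")
            name = tags.get("name")
            if kind in PLACE_KINDS and name:
                yield kind, unescape(name, {"&quot;": '"', "&apos;": "'"})
            tags = {}
        elif stripped.startswith(ELEMENTS_OPEN):
            # A new element starts: forget tags of the previous one.
            tags = {}
```

Feeding it `sys.stdin` and piping the output of bzcat into the script avoids ever writing the decompressed 50GB file to disk.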
So, next step: C.
The evil thing!
But, you have to admit - there’s nothing else when it comes to performance. About six hours of coding later I got the first results: 3.5s for a 1,170,000-line XML file. Good. After some further improvements I got it down to 2.6s. Yeah! :)
On a 1.6GHz Core2 Duo the combination of bzcat and grepplaces.c (both running on one core) gives around 1.2MB/s reading speed on the bzip2ed planet-file. So a complete scan over the planet-file now takes about 70 minutes.
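The 70 minutes can be checked with a quick back-of-the-envelope calculation. The sketch below assumes the rounded figures from the text (a 5GB compressed file, read at 1.2MB/s) and takes 1GB as 1024MB; the function name `scan_minutes` is made up for this example.

```python
def scan_minutes(size_mb, rate_mb_per_s):
    """Time in minutes to read size_mb at a constant rate_mb_per_s."""
    return size_mb / rate_mb_per_s / 60


# 5GB of compressed input at 1.2MB/s: a bit over 71 minutes.
print(round(scan_minutes(5 * 1024, 1.2), 1))
```

With 1GB taken as 1000MB instead, the result is a bit under 70 minutes, so “about 70 minutes” fits either way.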
So, here’s the code: http://cleeus.de/grepplaces/
The extracted corpus will follow, as soon as it’s ready.
Until then, some statistics:
$ grep "^city" places_planet.txt | wc -l
4179
$ grep "^town" places_planet.txt | wc -l
29401
$ grep "^village" places_planet.txt | wc -l
249716
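The three grep runs each read the whole file. If that ever becomes annoying, the same counts can be collected in a single pass. The sketch below assumes, as the grep patterns suggest, that every line of places_planet.txt starts with the place kind; the function name `count_kinds` is made up for this example.

```python
from collections import Counter


def count_kinds(lines, kinds=("city", "town", "village")):
    """Count lines by leading place kind, like grep "^kind" | wc -l."""
    counts = Counter()
    for line in lines:
        for kind in kinds:
            if line.startswith(kind):
                counts[kind] += 1
                break
    return counts
```

It can be used like this: `with open("places_planet.txt", encoding="utf-8") as f: print(count_kinds(f))`.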
“Language Level Experience
C/C++ Wizard 8 years”
Do you count the years between now and your first hello world at this point, or the years you actually coded a lot of C? Looking at the grepplaces code, it seems more like the first option to me. SCNR.
Haha, yeah - it’s something in between. My first HelloWorld was … wait … in the year 1999. I think I’ll have to change that, or better, remove it. There was a time when I thought of myself as a C/C++ wizard, but that was before I had seen any good code. Grepplaces is quick and dirty, as is most of my C. I have a lot of lines of C/C++ behind me, but that doesn’t automatically make me a good C/C++ coder.