Welcome to Arkanis Development

Measurements of system call performance and overhead


I was kind of bored for once and figured I could clean up an old microbenchmark and kick the results out of the door. This is gonna be another very technical post for programmers. Be prepared. ;)

A while ago I came across a paper about "exokernels", very light-weight operating system kernels that leave most of the work up to libraries (Exokernel: An Operating System Architecture for Application-Level Resource Management). The paper is from 1995 and among other things they implemented high performance system calls.

I always heard system calls are expensive but I never saw any real numbers or measurements. Only the stuff the C standard library was doing with fread() and fwrite(): Batching I/O to minimize system calls. But system calls evolved a lot since the x86 32 bit days and today we have a dedicated syscall instruction for them.

This got me wondering: How fast are system calls on Linux today compared to normal function calls? In the paper they compared a function call with no arguments to the the getpid() system call. That system call just does the minimum of work: Switch into kernel mode, read one variable and return its value. Sounds simple, so I wrote a set of small microbenchmarks and did the same on my Linux notebook.

When I asked my brother for some old CPUs he gave me this…
…an entire stack of CPU chips.

Back when reading the paper I just ran the microbenchmarks on my own notebook (an old Intel Core 2 Duo T6600 from 2009). But right now I'm with my brother and he owns quite a few computers and I asked him to run the microbenchmarks on some of them. We also run the benchmarks on some older CPUs (see the photo :)). Anyway, it gave me a nice perspective of how the performance of system calls evolved over time.

Mind you, I just took whatever PC was close by (no idea how fast or slow) as well as a few older CPUs. I was curious how system call performance changed over the years on x86_64 and don't really care about what CPU is faster or slower. Don't take these numbers and say that CPU X is faster than CPU Y. It just doesn't make sense. System calls are one of many factors that make up the performance of a CPU and software but by no means a defining one. So don't take the numbers to serious. This is just for fun after all. ;)

Before we get to the numbers I have to explain a bit about how Linux optimizes some system calls: On x86 64 bit system calls are usually made through the syscall instruction. You put the arguments into the proper registers and the syscall instruction then transitions into kernel mode and calls the proper kernel function. This incurs some overhead since the CPU has to switch into a different address space and this might need updates to the TLB, etc.

For some system calls the Linux kernel provides optimized versions via the vDSO. In some cases these optimized versions don't need to transition into kernel space and thus avoid the overhead. By default the C runtime on Linux (glibc) uses these optimized functions automatically whenever there is one available. I'm interested in how both of these mechanisms perform.

On to the numbers. Just for a bit of context: Reading a value from the main memory takes about 100ns, see Latency Numbers Every Programmer Should Know.

Syscalls incur quite a heavy overhead compared to normal function calls. That got better with time. The vDSO implementation of the getpid() system call is pretty good at mitigating the system call overhead and is almost as fast as a normal function call. On the Intel Celeron D 341 from 2004 the a system call via the syscall instruction was about 25 times slower than a system call via the vDSO. On the Intel Core i7-4790K from 2014 it's only about 12 times slower. For me I'll use 10 times slower as rule of thumb for modern CPUs and 25 times for older CPUs.

In case you're wondering about the details:

  • All benchmarks were run on Linux Mint booted via USB stick (Linux 4.4.0-21-generic)
  • The benchmarks execute the function call or syscall 10 million times in a loop.
  • The benchmark is run 10 times. The average of those runs is then divided by 10 million to get the time per function or system call.
  • The function call benchmark is compiled with GCC (I also did it by hand in assembler but the result was almost the same).
  • The syscall benchmark is programmed by hand in assembler using yasm.
  • The vDSO benchmark is a small C program compiled with GCC. When you use the "normal" C functions for system calls glibc automatically uses the vDSO.
  • I've put the source code and results of the microbenchmarks into a repository. Feel free to look around in case you have any questions. You can also find a lot more graphs there.

Performance of fread() and fwrite() I/O batching

I was also wondering about the I/O batching the standard C library does with fread() and fwrite(). Is it really worth the effort? Or are that many function calls wasted since system calls can be pretty fast, too?

A few microbenchmarks answered that question, too. This time I measured 1 million 4 byte reads and writes. Again directly via the syscall instruction and via the vDSO. But this time also via fread() and fwrite() instead of using system calls. Here's what I got:

So, yeah, in this scenario the I/O batching definitely is worth it. It's about 10 to 5 times faster.

But keep in mind, these were 4 byte reads and writes. I took a look with strace and they got batched into 4 KiByte reads and writes. So when you're reading or writing a few KiBytes the speedup is likely not that large.

It's also interesting that the vDSO doesn't help much for the read() and write() system calls. But usually it doesn't hurt either. So maybe these system calls can't be optimized in that way.

If you're still with me: Thanks for reading. I hope my strange little experiment was interesting. :)

Math 3D - A simple vector and matrix library for C


If you've done a bit of graphics programming with OpenGL you're probably familiar with vector and matrix math. Every point or direction in 3D space can be represented as a vector and most manipulations of those can be represented as matrices (moving them around, rotating them, projecting them on the screen, etc.).

I wasn't happy with the C math libraries I found so I wrote a new one (surprise!). Read on if you want to know why. But here is what using it looks like:

mat4_t projection = m4_perspective(60, 800.0 / 600.0, 1, 10);
vec3_t from = vec3(0, 0.5, 2), to = vec3(0, 0, 0), up = vec3(0, 1, 0);
mat4_t transform = m4_look_at(from, to, up);

vec3_t world_space = vec3(1, 1, -1);
mat4_t world_to_screen_space = m4_mul(projection, transform);
vec3_t screen_space = m4_mul_pos(world_to_screen_space, world_space);

OpenGL provides a lot of vector and matrix math for shaders that run on the GPU but on the CPU you have to do it yourself. I like it this way because it allows you to choose what kind of 3D math library you want: One that tries to optimize every last bit of performance but is complicated to use or one that is easy to use but lax on performance. Depending on the project you can pick what you need.

I do pretty much all of my low level graphics programming in C (I know, I'm weird, but OpenGL is a C API after all). And I haven't found a 3D math library I'm happy with for C. I don't know all of them but the ones I've looked at were complicated to use and vague on their semantics. Where they written with the OpenGL or Direct3D conventions in mind? How are matrices laid out in memory? Can I pass them into OpenGL directly or do I have to transpose them before hand?

There are good C++ math libraries that perfectly fit my purpose but I'm an awful C++ programmer. I spend way more time fiddling around with the language than I spend on solving the problem. So I stayed with C. Maybe someone else will find non-templated math code similarly appealing.

All this drove me to write my own small 3D math library for C. Just the basics, nothing fancy. A friend of mine joined in and together we started out from scratch. It was quite a nice learning experience. We spend a lot of time on the whiteboard, calculated a lot of the math by hand and wrote a lot of tests until we understood what the math should actually mean. It was just there that we realized that the perspective divide step in OpenGL is just an fixed function hack to fit a perspective projection into a 4x4 matrix. Kind of pointless with shaders...

Anyway, if you ever do OpenGL stuff in C and need a simple to use math library you can pick it up here: math_3d.h. It's a single header-file library in the style of stb_image.h so it's easy to integrate into projects.

It covers basic 3D vector math, transformation and camera matrices (translation, rotation, scaling and look at), projection matrices (orthographic and perspective) as well as basic matrix math (matrix-matrix, matrix-point and matrix-direction multiplication, inversion of affine transformations). The documentation is in the file itself so take a look if you're interested.

Happy programming. :)

Build your own DynDNS


During the last few weeks I wrote MiniDynDNS to build my own dynamic DNS service. Something like DynDNS but all by myself. This post explains the basic steps needed to wire MiniDynDNS into the worldwide DNS system.

I'm using it to create DNS names that point to devices at home I want to access via the internet. This is pretty nice with IPv6 since every device gets its own public IPv6 address. But please make sure only the services you want to have available are actually listening on the public IPv6 address. Or configure your firewalls accordingly.

To build your own DynDNS you'll need a few bits and pieces:

  • A server with a static IP address. Here we'll use as a placeholder for that IP.
  • A registered domain of your own. example.com shall proudly serve as a placeholder for that domain.
  • Access to the nameserver or DNS records of that domain. We'll need to add some DNS records there.

The bigger picture

The whole idea of this operation is to create a subdomain that is managed by a program running on your server. Here we'll use dyn.example.com but it can be anything as long as it's a subdomain of your registered domain. Whenever someone on the world resolves a name like weather.dyn.example.com they're going to ask that program on your server to get the current IP of that name.

For that we first need a program running on your server that can answer DNS requests and allows us to update these IPs when they change. Obviously we're going to use MiniDynDNS for that. That's why I wrote it.

Second we need to tell the global DNS system that the program running on your server is responsible ("authoritative") for dyn.example.com subdomain. This is called "delegating" a subdomain. When you bought your own domain you also bought the right to delegate subdomains to whoever you deem worthy. With that in place whenever someone resolves a name in dyn.example.com they'll ask MiniDynDNS on your server.

Note that you can only delegate a subdomain to a host, e.g. ns.example.com. This host then has to resolve to You can delegate to whatever host you want but in the end this host has to resolve to your public IP. Here we'll use ns.example.com as a placeholder for that.

The final part is a script running on whatever device or computer you want to have a dynamic domain name for. That script will periodically report its current IP to your MiniDynDNS.

If everything works correctly you can add any devices you want to your dyn.example.com subdomain and access them from everywhere on the world. pi.dyn.example.com, weather-station.dyn.example.com, tv.dyn.example.com, touchtable.dyn.example.com, spidercam.dyn.example.com or whatever. Get creative.

So lets get to it.

1. Run MiniDynDNS on your server

Download or clone MiniDynDNS from GitHub and do the "installation". Basically that's renaming config.example.yml to config.yml and setting the proper values for your setup. The domain, soanameserver and soamail are the important ones.

For MiniDynDNS to answer incoming DNS requests it has to listen on port 53. That's where other servers or clients expect to get DNS requests answered. Changing that will probably break things.

Per default it will use port 80 for a simple HTTP API with which we can update DNS records. In case this port is already used by a webserver like Apache you can change it to something else like 8080. We only need it for the scripts that periodically report the IPs of your devices to the server.

You can tell your server system to start MiniDynDNS on server startup. For me it's just a funny hobby so I leave it running in a screen terminal. You might also need to tell your servers firewall to open port 53 and 80 for incoming traffic (or whatever port you use for the HTTP interface). Otherwise the firewall will block everything and you'll just get timeouts.

Now your basic DynDNS server should already be up and running. To test it you can fire up a terminal and try this command:

nslookup foo.dyn.example.com

This tells nslookup to resolve foo.dyn.example.com by asking the DNS server If everything works well it should tell you that foo.dyn.example.com has the IP address

This assumes that you use the default database (just renamed db.example.yml to db.yml). If you already changed your DB you have to change the domain name in the command accordingly.

2. Delegate dyn.example.com to the MiniDynDNS server

Now on to tell the rest of the world that your server manages the dyn.example.com by itself. For this you have to add two records to your normal nameserver. In my case the company where I registered my domain provides a webinterface for this task.

You have to add these two DNS records to your example.com nameserver:

dyn  NS  ns.example.com
ns   A

The first record tells the DNS system that the dyn.example.com subdomain is delegated to ns.example.com. The second record says that ns.example.com has the IPv4 address Please remember to replace the domain name and the IP with your own values. If your server also has an IPv6 address you should add an AAAA record for that, too.

3. Update your IPs periodically

This differs greatly between IPv4 and IPv6.

With IPv4 only your router has a public IP address. For every service you have to create a port forwarding to the appropriate internal computer or device. So in any way there's only one public IP to report to the DNS server. Many routers already have build-in support for this. Usually it's hidden somewhere in the webinterface called "dynamic DNS", "DynDNS" or something like that.

In the case of my FritzBox router I have to enter the domain (foo.dyn.example.com), a user name (just foo), the password (bar) and an update URL (http://ns.example.com/?myip=<ipaddr>). The router will replace "<ipaddr>" with its current public IPv4 address. It seems like it then just fires this HTTP request every 30 minutes. Again these values are based on the default db.yml file.

The required steps are probably quite different for other routers. You might even have to look into the manual or search a bit to figure out how to do this.

With IPv6 the situation is a bit simpler. Each device gets its own public IPv6 address. A script that runs every few minutes can simply report that IP to the DNS server. In case of a RaspberryPi running Raspbian this script will do the job:

IP=$( \
    ip addr show dev eth0 scope global | \
    grep --perl-regexp --only-matching '(?<=inet6 )2003:[0-9a-f:]+' | \
    head --lines 1 \
curl -s http://foo:bar@ns.example.com/?myip=$IP

Again, "foo", "bar" and "ns.example.com" are values from the default db.yml. Replace them with the values for your setup. In case you changed the port of the webinterface you also have to add a port to the HTTP request (something like http://foo:bar@ns.example.com:8080/?myip=$IP for port 8080).

Save it to /home/pi/update_ip.sh and make it executable with chmod u+x update_ip.sh. When you run it (./update-ip.sh) you should see something like "Your IP has been updated".

To execute the script every 5 minutes you just need to add a line to your crontab. The command crontab -e should show you an text editor and by adding this line at the end you should be set:

*/5  *  *  *  *  /home/pi/update_ip.sh 2>&1 >> /home/pi/update_ip.log

The "*/5" at the beginning means "every 5 minutes". If you want to run the script every 30 minutes its "*/30".


Phew, this was more text than I expected. When you run the command nslookup foo.dyn.example.com you should now see as a result (again, default database values). Note that this command asks the nameserver provided by your environment (e.g. by your ISP). Thanks to the domain delegation this DNS request should end up at your nameserver which can then answer it. When you have a webserver running on one of your devices you can even use these domain names to access websites.

Anyway, with that setup you should be able to manage your own subdomains. The MiniDynDNS readme has some more information and useful commands so better take a look there, too.

Have fun with MiniDynDNS. :)



During the last few weeks I've been working on a small, simple and self-contained dynamic DNS server: MiniDynDNS. A friend of mine wanted to bring one of his computers online but didn't want to use an 3rd party service like DynDNS.

Well, you probably know the typical hacking answer: "Why not build something like that ourselves?". How hard can it be? After all we only need a hash table or dictionary for a simple DNS server. Throw in some code to parse and generate DNS packets and your're done.

After a bit of hacking and testing it actually worked. An HTTP interface would be nice so I can tell my router to report changed IPs by itself. So in it went.

You can probably imagine that the result was a funny but scary bit of code. I did some cleanup and ended up with about 600 lines of Ruby. Some more refactoring into several classes resulted into a bit to much glue code for my taste so I reverted back to simple procedural style. The code isn't pretty or fancy but it's easier to understand this way I hope.

So if you want to role your own little DynDNS for your private stuff, take a look. Maybe it's what you want. But remember: It's only build for small stuff, not high-traffic sites. :)

When does the smallest gap between two floats reaches 1.0?


Programmers might know this question: Can I use a float variable for that? Or should I use double or some integer? During my time as a programmer I came across this question on a few occasions. Usually I wanted to use floats for coordinates, like the x, y and z position of an object in space.

But floats have a problem: The larger the number becomes the larger the gap between two representable floats. For example the next representable float after 512.0 is 512.000061. So the gap between the two numbers is 0.000061. After 65536 the next representable number is 65536.007812 so the gap is 0.007812 large. A float variable simply can't represent anything between any of those two values. It doesn't have enough bits to do so. This is also called the precision of the float: The higher the value the lower the precision.

Some time ago I used floats for the x and y coordinates of a large map with mostly images on it. But I was wondering how large that map (that is the x or y coordinate) can become before before the gap between two float values is larger than one pixel (1.0). If a coordinate exceeds this value I can no longer properly position images on the map and can no longer use floats to calculate the position of individual pixels.

Usually I would guess based on what I read and picked up over the years. But this time I wrote a small C program that prints a table of float values and the size of the gap to the next larger value.

#include <stdio.h>
#include <math.h>

int main() {
    for (float x = 1.0f, precision; precision < 1.0f; x *= 2.0f) {
        precision = nextafterf(x, INFINITY) - x;
        printf("precision of %f at value %.0f\n", precision, x);

    for (double x = 1.0, precision; precision < 1.0; x *= 2.0) {
        precision = nextafter(x, INFINITY) - x;
        printf("precision of %lf at value %.0lf\n", precision, x);

    return 0;

It uses the nextafter() function of C99 and for completenesses sake I also added a table for double. You can compile it with gcc -std=c99 float_precision_table.c -lm.

This is a bit of the programs output:

precision of 0.003906 at value 32768
precision of 0.007812 at value 65536
precision of 0.015625 at value 131072
precision of 0.031250 at value 262144
precision of 0.062500 at value 524288
precision of 0.125000 at value 1048576
precision of 0.250000 at value 2097152
precision of 0.500000 at value 4194304
precision of 1.000000 at value 8388608

At the float value of about 8388608 the gaps are about 1.0 in size. So once the x or y coordinate exceeds this value strange errors will happen in my program. Fortunately this value is quite a bit larger than I expected it to be.

Even if I would use float for coordinates in a 3D world and wanted 1cm precision the world can reach from about -65536m to about +65536m in each direction. That's a cube with a side length of about 130km. Which is again way larger than I originally thought. :)

As a little contrast here are some values for doubles:

precision of 0.007812 at value 35184372088832
precision of 0.015625 at value 70368744177664
precision of 0.031250 at value 140737488355328
precision of 0.062500 at value 281474976710656
precision of 0.125000 at value 562949953421312
precision of 0.250000 at value 1125899906842624
precision of 0.500000 at value 2251799813685248
precision of 1.000000 at value 4503599627370496

You can make some quite large worlds with double coordinates. In the same situation as above it would be a cube with a side length of 470 AU and still something like 1cm precision at the outer regions. :)

I hope this small program can provide some point of reference should you ever find yourself pondering the same questions.

A simple one-file image gallery


First a small disclaimer: This stuff is only useful if you have some webspace lying around and don't want to use a cloud thing or if a full blown gallery is to annoying or complex for your needs.

For the last few years I used a small PHP image gallery to share images (primarily Kerbal Space Program screenshots). I uploaded images via SSH and the script would automatically show all images and create thumbnails when necessary. While this isn't useful for most people it worked quite well for me.

A few days ago I was bored and started to build a new image gallery. This time based on some browser APIs I wanted to tinker with for some time now: Drag & drop and AJAX file uploads. My goal was to drag & drop images directly into the gallery, encode them as JPEGs, create thumbnails and then upload the stuff.

  • Drag & drop because it's simple and it fits quite well with the way I chose which images I want to upload.
  • Because Elite: Dangerous creates screenshots as bitmaps I wanted to convert them to JPEGs before uploading them. This way the browser can also create the thumbnails. No need for any GD library (image processing) on the server side.
  • Finally I wanted to have an efficient upload: Just sending the raw data directly as the request body. No form encoding and without loading the entire file into memory. But this part of the experiment was actually for a different project (uploading large video files into an archive).

I pushed the result up to github. A simple image gallery that's just one PHP file. Drop it into a directory and that directory becomes the gallery (given the webserver has write permission for that directory to store new images there). It's not really production grade software: The browser might hang when you drop a large image and the thumbnails might look jagged with some images. But it's good enough for me at the moment.

Tinkering around with the drag & drop, binary upload and upload progress stuff was quite interesting. It seems to work more or less but I'm still confused about how progress events are triggered (files seem ok but array buffers are a different topic). I was quite glad that Chrome and Firefox could actually upload large files without completely loading them into memory. While useless for the image gallery that will come in handy for a different project.

On the down side I was somewhat annoyed by other aspects of the implementations. Reading a file as a data URL seems to block the browser (at least Firefox) as does rendering the image into a canvas to get the pixel values for the JPEG encoder. I would have liked to put that stuff into a webworker (another thread) but they can't use DOM stuff like image or canvas elements. So no solution there. For normal images it doesn't matter much but for larger images it's quite noticeable.

Sometimes it kind of strikes me that when working with browser I have to switch into "try to creatively abuse a scripted document viewer" mode. To solve a problem you have to gather together and combine components in sometimes funny ways. Usually I like that aspect of web development, it's almost like playing a puzzle game. But sometimes it just becomes smelly. For example to decode an image (get its pixel values) I had to convert it into a data URL (base64 encoded string), create an img element with that URL as its src attribute, draw that image element onto a canvas and finally get the pixel data from there.

There's a good reason browser are like this. After all they're for viewing websites, not offering a complete or sane programming environment. But even in plain C the above task is simple compared to the browser: You can go for a straight solution, just load the image with a library like stb_image. Done. Even resizing (stb_image_resize) and encoding (stb_image_write) is simple.

On the other hand GUI stuff and drag & drop probably isn't as simple in C as it is in browsers. And since it's all about an image gallery comparing it with C is kind of stupid. But a few times I was tempted to just compile this stuff into JavaScript so I can put it into a webworker and stop the browser from hanging. Funny like the old stuff makes me think straight and the new stuff think in puzzles.

Oh, that became quite a detour from the original topic. Shouldn't write blog posts when I'm way to tired… :P

Minimal OpenCL development on Windows


I've been doing some Windows OpenCL stuff recently. One of the things that kind of annoyed me a lot was all the time spend to set things up. Installing Visual Studio (which takes quite some time), finding the proper SDK and praying that everything kind of works. Not realy my kind of fun. And all I wanted to do was to create a program that uses the systems opencl.dll.

So how hard can that be? What do I really need for that? Common sense linking would suggest you only need the header files. With that the compiler knows how to generate code to call the DLL functions. So I gave it a try and it actually worked! (Which honestly surprises me most of the time.)

  • I downloaded MinGW-w64 as standalone compiler. You can use the installer but a 50 MiB archive with a small .bat or .cmd script to add it's bin directory to the %PATH% is just my type of thing. No external dependencies or funny setup stuff. Extract and go. As long as Microsoft doesn't break backward compatibility (unlikely) it will continue to work out of the box. No trouble if I want to pick it up again later on.
  • All I really wanted from the OpenCL SDKs were the OpenCL header files. Well, you can get those directly from the Kronos Groups website. I actually only needed cl.h and cl_platform.h. The others (e.g. cl_egl.h) are optional addons and are not included by cl_platform.h.
  • The OpenCL runtime (opencl.dll) is provided by recent AMD or nVidia drivers. No need for any SDK.

All that remains is to add the OpenCL headers to the compilers include path, compile an OpenCL program and link it against the systems opencl.dll:

gcc -I. main.c C:\Windows\System32\OpenCL.dll -o main.exe

I've put the OpenCL header files into an cl subdirectory. Then added the current directory to the include path. So the compiler will find #include <cl/cl.h>.

And that's it! MinGW-w64, two header files and a recent graphics driver is all you need. :)

I've tested this with OpenCL 1.1 on nVidia (one notebook) and AMD drivers (two different computers). There's so much interesting stuff in OpenCL 1.1. alone (e.g. combining out-of-order GPU and CPU command queues) that I haven't tested newer versions yet.

If you're curious you can find the minimal setup on GitHub. It's a small OpenCL application that takes the first GPU and transforms a hand full of characters on the GPU. Total waste of any processing time but enough to see if it works.

iwatch - run a command when a file changes


iwatch is a small linux command line tool that automatically runs a command every time a file is changed. I've written it for myself several years ago (can't remember when exactly) and have been using it ever since. Actually for a rather lot of stuff. So maybe it's also useful for someone out there too. You can download the C source on GitHub.

Compile it with make iwatch. Or if you don't like make you can directly use GCC in C99 mode: gcc -std=c99 iwatch.c -o iwatch. It's a rather simple tool so it's just one source file (~80 LoC).

Basic usage is something like this:

iwatch files... command

As soon as one of the files is changes command is executed. That's it, no fancy other stuff.

Most of the time I use it to run a test suite as soon as some source files are saved in the text editor. Something like this runs in a terminal somewhere where I can see it:

iwatch tree.c tree.h "make tests/tree_test && ./tests/tree_test"

Whenever tree.c or tree.h are saved in the text editor the test suite is run. When all is fine a green line shows up, otherwise some compiler errors or a red line (and some other stuff) smile at me.

It's rather nice to get immediate feedback during development. Kind of makes testing way more smooth and a normal part of writing code. I grew attached to this "zen testing" stuff in the Ruby world. And hey, why not do it in C, too? Then you have some kind of static type checking (well, C) and testing. It actually makes fun to write C code this way (still sounds weird…).

Apart from programming I also use iwatch for "office" stuff. Well, actually for converting markdown text to a PDF as soon as the markdown file is saved.

iwatch paper.md paper.css "markdown paper.md && prince -s paper.css paper.html"

As soon as the markdown or CSS file is saved the markdown is converted to HTML and then to a PDF (with Prince and a few lines of CSS). I usually view the PDF with the default GNOME PDF viewer (Evince) and it does update the view every time the PDF changes. This gives pretty much instant view of the finished PDF every time the markdown file is saved. Kind of the live preview of some LaTeX editors.

These are the two situations where I use iwatch regularly. I use it on some other occasions, e.g. keeping it running in a screen to update statistics every time a website creates a file in a directory. Simple tool but can be combined with a lot of different stuff. :)

Oh, about the name. It uses inotify to watch for file changes so "iwatch" it is. This piece of C code is rather old, from a time before fanotify was known. So it has nothing to do with something declared as "smart" or any real watches.

I actually wanted to extend the tool before publishing it. Adding a way to insert the name of the changed file into the command, maybe using fanotify, etc. However it already does it's job quite well for me. So if someone needs a bit more feel free to throw in some lines of C code.

Fast line iteration in PHP


It's been a long time since I wrote anything here. During the last 2 years of study I haven't had much time to finish anything. Well, now I have the time. One of the first things on the list is an optimized mail parser completely written in PHP (so you don't need to install any extensions). Since mails are a line based format the parser will have to iterate over lines quite often. So it's worth to do some microbenchmarks on that.

I'll focus on two use cases:

  • Iterate over lines from a PHP stream. For example an opened file (MBox, mail dir, …), a network connection (HTTP, IMAP, NNTP, …) or any other PHP stream.
  • Iterate over lines of a memory buffer. The data is already loaded into one large string. This can be useful to reduce method call overhead. The parser can be called only once for the entire buffer instead of once for each line.

Please be aware that this is a microbenchmark! Results can very quite a bit depending on the rest of your code. The values here are a starting point for your own function or library microbenchmarks not a definitive answer.

Test setup

All tests were performed with PHP 5.4.6 on Linux. The test data was a 508 MiByte dump of the digitalmars.D newsgroup with CRLF ("\r\n") line breaks and 12 930 435 lines (on average 41.2 Bytes per line). The NNTP, IMAP and HTTP protocols also use these line breaks.

I measured the throughput and memory usage of different line iteration approaches. Most of them were already listed on Stack Overflow. Throughput is the time it took to crunch through the 508 MiByte dump divided by the dump size. The time was measured with the microtime() function. On the memory side I measured the peak memory usage with memory_get_peak_usage() after the iteration was done. Not perfect but I also tested the memory usage during iteration: practically the same values but the measurement degraded the throughput significantly. So we stick to the peak memory usage after iteration.

All tests were performed 11 times and the result is the average of test runs 2 to 11. The first run only served to load the entire file into the kernel read cache. Therefore the result of the first run is discarded. I tried to eliminate IO factors here since they can vary extremely if your data comes from a normal hard disk, an SSD or directly from a network connection.

All tests are online at GitHub. Take a look for any details not mentioned here.

Facts first

The results are grouped by the use case. First the results of file iteration then the results from memory buffer iteration.

Performance of iterating over the lines of a file

Details on the different approaches:

I subtracted the buffer size from the peak memory usage in the memory buffer tests. This leaves only the memory usage of the iteration (that's what's interesting, right?).

Performance of memory buffer iteration

Details on the different approaches:


  • If you don't care if the line break is still at the end of the line fgets() is the clear winner. No matter if you read directly from a file or a memory buffer.
  • If you can't tolerate the line break stream_get_line() seems to be quite fast but has some quirks.
  • Other approaches excel in some special cases:
    • fscanf() is fast for simple line based data. Configuration files, simple YAML, headers, that kind of stuff.
    • strtok() is fast when you don't need blank likes (it skips them).
    • strpos() consumes almost no memory. Useful when working with very large buffers.
    • preg_split() excels at consuming memory. :)

Source and details of different approaches

fgets() from file

$fd = fopen(…, 'rb');
while( ($line = fgets($fd)) !== false ) {
    // do something with $line
  • Works with "\n" or "\r\n" line breaks
  • Line breaks still appended to $line
  • Using rtrim() to get rid of line breaks reduces throughput to about 47% (see "fgets with rtrim" in chart above). In that case it's faster to use stream_get_line().

stream_get_line() from file

$fd = fopen(…, 'rb');
while( ($line = stream_get_line($fd, 0, "\r\n")) !== false ) {
    // do something with $line
  • Works only with the specified kind of line break (above "\r\n"). Most protocols define exactly one line break so not a problem in many situations. When using local files it might be a good idea to detect what line break is used on the first line.
  • Reads only lines up to a maximal line length (second argument). 0 represents the default value of 8192 Bytes. Many line based protocols also define a maximal line length. For mails RFC 5322 defines it to be 1000 Bytes.
  • You can search for more complex "line endings" with this function. For example when parsing a mail you can search for the next MIME boundary instead of iterating all the lines until you found it. fgets() can't do that. But the maximal line length might spoil the fun here.

fscanf() from file

$fd = fopen(..., 'rb');
$size = fstat($fd)['size'];
    $matches = fscanf($fd, "%[^\r\n]");
    if ($matches === false)
    $line = $matches[0];
    // do something with $line
  • $line will be null for empty lines.

Not really useful for raw line iteration but fscanf() can be useful for simple line based text data like configuration files. For example take a simple config file:

key: "value"
key: "value" # comment
# comment

Kind of simplified YAML without nesting. fscanf() parsed it with 77 MiByte/s (see file_conf_fscanf.php). fgets() followed by a simple preg_match() got about 42 MiByte/s (see file_conf_regexpr.php). Both with a peak memory usage of 0.25 MiB.

For simple data fscanf() can be quite fast. Note that you'll not find useful documentation in PHPs manual. The only useful documentation of fscanf() I know of is the Linux man page or the C language specification.

fgets() from memory buffer

$fd = fopen('php://memory', 'r+');
fwrite($fd, $data);
while( ($line = fgets($fd)) !== false ) {
    // do something with $line

strtok() from memory buffer

$data = file_get_contents('../mails.txt');
$line = strtok($data, "\r\n");
while ($line !== false) {
    $line = strtok("\r\n");
    // do something with $line
  • Works with "\n" and "\r\n" line breaks.
  • Empty lines are ignored (strtok() skipps empty tokens). That pretty much eliminates this option for protocols where empty lines are important (e.g. HTTP, mail).


$fd = fopen('php://memory', 'r+');
fwrite($fd, $data);
while( ($line = stream_get_line($fd, 0, "\r\n")) !== false ) {
    // do something with $line

strpos() from memory buffer

$data = file_get_contents(...);
$pos = 0;
while ( ($line_end_pos = strpos($data, "\r\n", $pos)) !== false ) {
    $line = substr($data, $pos, $line_end_pos - $pos);
    // do something with $line
    $pos = $line_end_pos + 2;
  • Works only with one kind of delimiter or line break. But can be easily extended to any kind of delimiter.
  • Almost no memory overhead. If you can't afford to duplicate the memory footprint of the buffer this is the way to go.

preg_split with foreach from memory buffer

$data = file_get_contents(...);
foreach( preg_split('/\r?\n/', $data) as $line ) {
    // do something with $line
  • Easy for small data sets but the worst you can do for large ones.

Closing remarks

Again this is just a microbenchmark. Results can vary greatly depending on what you're actually doing. Most problems aren't as simple as pure line iteration… especially since the benchmark doesn't do anything with the lines. A very clever compiler would optimize the entire thing away.

Remember that the stream based approaches (e.g. fgets() or stream_get_line()) can use the full power of PHPs streams. They can read data from pretty much everything and you can combine them with filters to shovel e.g. base64 decoding or decompression into internal PHP routines. Haven't tested that though.

At the end just a small thing to put the throughput of PHP into perspective. The fastest value I measured was 261 MiByte/s. The line count command pv test-data.txt | wc -l got a throughput of 1.6 to 1.8 GiByte/s (pv is like cat but displays a progress bar and some statistics). So if you really need performance don't bother with PHP. Grab C, C++ or D and get going. :)

Touch table blob detection - normalized spider algorithm


It has been quite a while since I wrote anything technical. Or in general about the stuff I actually do. Well, this post is about stuff I'm currently working on: a blob detection program for an old touch table of my university. We also have a MS Surface table here but I don't really do windows programming for fun any more. Maybe I should try booting the MS Surface table from a Linux live USB stick…

A, well, the old touch table was assembled a few years ago from students I unfortunately never met (before my time). In there are infrared LEDs that flood the table with infrared light as well as a camera that permanently films the surface of the table (from below that is). When a finger touches the table the surface and the finger reflect additional infrared light. This light is then captured by the camera. The purpose of the program I'm currently working on is to detect the infrared light blobs that belong to fingers.

Touch table from below without and with a hand on it

The small figure shows what the touch table looks like for the camera. The left side shows the "background". Basically this picture is what the camera always sees. No mater what is above the table each pixel is at least as bright as in this "background". It's actually the mean image of 60 consecutive images. This way the noise of the camera is filtered out. The right side shows a normal input image with 5 fingers touching the surface and a lot of noise in it.

Just in case someone notices: These images were taken when the table was partially disassembled. I played around with different cameras (another story) and this was a more or less comfortable way to do it. That's the reason why the beamer is almost in the center of the image and only a part of the actual surface is visible. But this was enough to test and develop by so I used a few hundred captured frames as sample material.

The ultimate goal is to throw the right picture into the detector program and getting the coordinates of the 5 fingers in return. Well, for a human this is an easy task. Unfortunately I can't afford to hire someone always watching the table from below and entering the coordinates into a computer with about 17ms delay. So I have to teach computers to do it for me. Turns out our brain does a lot of magic when we see the 5 fingers and it's a lot harder to teach a computer to do something similar. It took me the better part of the last 1½ weeks.

I already tried the same thing on the same table about 1½ years ago. Back then with more "established" approaches. That is filtering the image with a Gaussian blur or doing some temporal filtering (e.g. only count fingers that are there for 5 consecutive frames). These techniques do work but bring in their own set of problems. I never got the Gaussian blur fast enough on one core to filter 30 frames per second on a 640×480 video. And the temporal filtering makes fast movements and short touches almost impossible to detect.

But back then I primarily failed because of two problems:

  • The touch table is not uniformly lit and not uniformly sensitive. When someone touches the table the camera see different brightness values depending on where you touch it. This makes it very hard to use a simple brightness threshold to detect a finger. For example in one area pixels only get brighter than 140 if someone touches the table. In other areas this value can be quite a bit off (e.g. 100 or 190).
  • The arm itself is actually quite as bright as the fingers. Looking for the center of bright areas will also detect the arm as a finger, albeit a gigantic one. In the worst case it almost absorbs the fingers.

Since this first failed attempt I had some ideas to tackle at least the first problem. I thought I could avoid the second problem I wasn't that lucky. So I had to solve it. I tried quite a few different ideas. For example filtering for specific brightness increases followed by matching brightness decreases (I think one can call that the second spatial derivative). And about 2 hours ago I finally succeeded in that endevour. So here is how it works right now…


I need to determine some characteristics of the table before actually detecting stuff. You need to know your enemy, right? On my first attempt this only involved two steps:

  • Capture a background image
  • Figure out what pixel on the camera image corresponds to what position of the table surface (haven't done that yet)
The background, the touch map and the differnce between them

To get hold of the differences in sensitivity on the table I added a third characteristic: A map of how strong the brightness changes when someone touches the table. I simply called it the "touch map" (well, naming things is difficult). Basically it's recording the maximal brightness of each pixel while you touch the table every where. With that data we know the brightness value when no one touches the table (the background) and when someone touches it (the touch map). And these values can be different for each pixel.

With these two per pixel values we can "normalize" the brightness of any pixel into a uniform one. If a pixel is a bright as the background the value will be 0 and if it is as bright as the touch map it will be 1 (or 255 for 8 bit integers). Any further code can work on this "normalized" brightness and is independent on the actual sensitivity of the table.

And the sensitivity can vary quite a bit. The second figure shows the background and the touch map as well as the difference between them. I only touched half the visible surface and you can see that area nicely on the rightmost image. Some variations in the sensitivity come from my lazy "touching the table every where". I did this a bit to often so it's not the best touch map but it works.

As extra fine tuning the touch map is blurred slightly. Each pixel is taken as the mean value of a 3×3 pixel square with the pixel in the center. This is fast and gives a nice 1 pixel "safety margin" around especially sharp edges in the touch map. E.g. areas surrounding the white "blind spot" of the table reduce the sensitivity around them. This prevents the normalization from amplifying noise around these areas into insanely bright spots.

I also added a "touch range threshold". A pixel needs a meaningful difference between its background and touch map value. If this difference is below e.g. 10 we don't do any further stuff with this pixel. This prevents the normalization from amplifying noise in regions that were never touched.

Background with noise

After many noise filtering experiments (simple blurring, eliminating specific 1 and 2 pixel patterns, etc.) I settled for a somewhat contra intuitive approach. Usually the background is captured by calculating the mean brightness of several consecutive frames for each pixel. In this mean brightness the camera noise eliminates itself (it's evenly distributed in time). This gives a nice and noise free background.

However when you subtract this mean background from an input frame you get the difference from the background… and a lot of noise. I found myself spending more and more time on filtering that noise. Out of curiosity I started to capture minimal and maximal noise values for the background as well. For each pixel that is the smallest (min) and largest (max) brightness observed while capturing the background. As it turns out this noise is not evenly distributed over the entire table (spatial). And anyway I'm only interested in pixels that are above the noise level of the camera. Otherwise it's very difficult to differ between noise and a valid increase in brightness.

End of story: Right now I'm just capturing the maximal observed brightness when creating the background. This way the background also contains the maximum noise level. Now when we subtract the background from an incoming frame we get a simple noise filter for free. It's not great but actually does the job without resorting to temporal filtering and is as fast as it can get.

The actual blob detection

Stages of the detection pipeline

Whenever the camera sends a new frame it runs though the following pipeline:

  • Original input frame: The stuff from the camera.
  • Subtract background: Remove the brightness we know is always there as well as the noise. Note that the brightness in this frame of the figure is pushed by 32 to make the actual changes more visible.
  • Normalize difference: Use the touch map to map the difference to a brightness uniform across the entire table.
  • Simple blur: Do a simple 3×3 mean blur on the uniform brightness values. This step actually served a different purpose for another idea (smoothing the brightness gradients so slope, high point and low point detection would work better). However it also serves well to suppress some hardcore 1 and 2 pixel noise as well as occasional 1 pixel spikes from the normalization (pretty rare thanks to the same blur on the touch map itself).
  • "Spider" amplification… sorry for the lack of a better name. This step detects pixels that belong to fingers and suppresses pixels that belong to other stuff (e.g. arms).

The figure shows the frame after each of these steps. The order of the images in the figure is left to right and top to bottom.

The "spider" amplification is the part I just finished several hours ago. For each pixel the program looks into 4 different directions. Like the legs of a spider, hence the name. Ok, a spider has 8 legs but this is what I thought of when first visualizing the algorithm. We go e.g. 8 pixels in each direction and compare the brightness value there with the brightness of our own pixel. If the brightness difference between the outer and center pixel is above a threshold (e.g. 75) the "leg" is counted. If it's below the "leg" is ignored.

The new brightness value of the center pixel is then calculated as the mean value of all counted "legs". This alone gets rid of almost the entire arm and leaves the fingers very bright. However it also generates some "ghost" artifacts. Like 4 different versions of the image just a bit darker and offset into each of the 4 directions. This however can be suppressed by only counting center pixels with at least 3 "legs" since the ghosts are created by center pixels with only one intact "leg".

The distance or "leg" length has to be configured to match the maximum blob size. Right now I got the best results with a value of 8 but this is very dependent on the size of the table and resolution of the camera. The last image (lower left) in the pipeline figure shows the results of the spider amplification with a distance value of 8.

Further stuff

All parts of the detector program are quite simple right now (code wise). Therefore I have good hopes to achieve 60 fps on a 640×480 video in realtime. The algorithms themselves at least look quite well suited for SIMD or GPU parallelization. However I would prefer to have it run on a single CPU core. Less data transfer and keeps the GPU and most parts of the CPU free for the game. Yes, the entire thing is only to make the touch table usable for a specific game I have in mind. Let's hope it works out. :)

Now the sun has risen again. It's time to go to bed.

Thanks for scrolling down all the way, it can get quite lonely here…
Anyway, looking for older entries? Want to know more? Take a look at the archive.