Measurements of system call performance and overhead

January 5th, 2017 operating-systems, performance, programming

I was kind of bored for once and figured I could clean up an old microbenchmark and kick the results out of the door. This is gonna be another very technical post for programmers. Be prepared. 😉

A while ago I came across a paper about "exokernels", very light-weight operating system kernels that leave most of the work up to libraries (Exokernel: An Operating System Architecture for Application-Level Resource Management). The paper is from 1995 and among other things they implemented high performance system calls.

I always heard system calls are expensive but I never saw any real numbers or measurements. Only the stuff the C standard library was doing with fread() and fwrite(): Batching I/O to minimize system calls. But system calls evolved a lot since the x86 32 bit days and today we have a dedicated syscall instruction for them.

This got me wondering: How fast are system calls on Linux today compared to normal function calls? In the paper they compared a function call with no arguments to the getpid() system call. That system call just does the minimum of work: Switch into kernel mode, read one variable and return its value. Sounds simple, so I wrote a set of small microbenchmarks and did the same on my Linux notebook.

…an entire stack of CPU chips. — When I asked my brother for some old CPUs he gave me this…

Back when reading the paper I just ran the microbenchmarks on my own notebook (an old Intel Core 2 Duo T6600 from 2009). But right now I'm with my brother and he owns quite a few computers and I asked him to run the microbenchmarks on some of them. We also run the benchmarks on some older CPUs (see the photo :)). Anyway, it gave me a nice perspective of how the performance of system calls evolved over time.

Mind you, I just took whatever PC was close by (no idea how fast or slow) as well as a few older CPUs. I was curious how system call performance changed over the years on x86_64 and don't really care about what CPU is faster or slower. Don't take these numbers and say that CPU X is faster than CPU Y. It just doesn't make sense. System calls are one of many factors that make up the performance of a CPU and software but by no means a defining one. So don't take the numbers to serious. This is just for fun after all. 😉

Before we get to the numbers I have to explain a bit about how Linux optimizes some system calls: On x86 64 bit system calls are usually made through the syscall instruction. You put the arguments into the proper registers and the syscall instruction then transitions into kernel mode and calls the proper kernel function. This incurs some overhead since the CPU has to switch into a different address space and this might need updates to the TLB, etc.

For some system calls the Linux kernel provides optimized versions via the vDSO. In some cases these optimized versions don't need to transition into kernel space and thus avoid the overhead. By default the C runtime on Linux (glibc) uses these optimized functions automatically whenever there is one available. I'm interested in how both of these mechanisms perform.

On to the numbers. Just for a bit of context: Reading a value from the main memory takes about 100ns, see Latency Numbers Every Programmer Should Know.

Syscalls incur quite a heavy overhead compared to normal function calls. That got better with time. The vDSO implementation of the getpid() system call is pretty good at mitigating the system call overhead and is almost as fast as a normal function call. On the Intel Celeron D 341 from 2004 the a system call via the syscall instruction was about 25 times slower than a system call via the vDSO. On the Intel Core i7-4790K from 2014 it's only about 12 times slower. For me I'll use 10 times slower as rule of thumb for modern CPUs and 25 times for older CPUs.

In case you're wondering about the details:

All benchmarks were run on Linux Mint booted via USB stick (Linux 4.4.0-21-generic)
The benchmarks execute the function call or syscall 10 million times in a loop.
The benchmark is run 10 times. The average of those runs is then divided by 10 million to get the time per function or system call.
The function call benchmark is compiled with GCC (I also did it by hand in assembler but the result was almost the same).
The syscall benchmark is programmed by hand in assembler using yasm.
The vDSO benchmark is a small C program compiled with GCC. When you use the "normal" C functions for system calls glibc automatically uses the vDSO.
I've put the source code and results of the microbenchmarks into a repository. Feel free to look around in case you have any questions. You can also find a lot more graphs there.

Performance of `fread()` and `fwrite()` I/O batching ¶

I was also wondering about the I/O batching the standard C library does with fread() and fwrite(). Is it really worth the effort? Or are that many function calls wasted since system calls can be pretty fast, too?

A few microbenchmarks answered that question, too. This time I measured 1 million 4 byte reads and writes. Again directly via the syscall instruction and via the vDSO. But this time also via fread() and fwrite() instead of using system calls. Here's what I got:

So, yeah, in this scenario the I/O batching definitely is worth it. It's about 10 to 5 times faster.

But keep in mind, these were 4 byte reads and writes. I took a look with strace and they got batched into 4 KiByte reads and writes. So when you're reading or writing a few KiBytes the speedup is likely not that large.

It's also interesting that the vDSO doesn't help much for the read() and write() system calls. But usually it doesn't hurt either. So maybe these system calls can't be optimized in that way.

If you're still with me: Thanks for reading. I hope my strange little experiment was interesting. 🙂

Comments

Newsfeed

#1 by Francesco October 20th, 2017 21:45

Great benchmark thanks, interesting results!
react

nice meh bad surprised confused agree disagree
- 1
#2 by fg November 26th, 2017 23:08

Thanks for sharing your strange little experiment. It was indeed an interesting read.
react

nice meh bad surprised confused agree disagree
#3 by R November 29th, 2017 15:50

Thanks for posting your results, very interesting indeed.
react

nice meh bad surprised confused agree disagree
#4 by cam January 2nd, 2018 9:41

This is fantastic, never finish learning. Thanks.
react

nice meh bad surprised confused agree disagree
#5 by liqiaoping January 6th, 2018 12:50

Great experiment and very lucky to read!
react

nice meh bad surprised confused agree disagree
#6 by Rob Bradford January 26th, 2018 16:22

Thank you for your research.

Not that it invalidates your results but getpid() is not in the vDSO but instead cached by glibc (not that it would lead to any difference in performance.)

However for anybody looking at reproducing your results they should be aware that getpid() is no longer cached by glibc starting with 2.25.

Some explanation can be found in the RH bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1469670

The list of functions in the vDSO are architecture dependent but can be found in the man page.
react

nice meh bad surprised confused agree disagree
- #7 by Stephan reply to #6 January 29th, 2018 15:12
  
  Thanks for the heads up and the link Rob! I liberated the idea to use getpid() as a test from the Exokernel paper and blindly assumed Linux would do the same. Kind of stupid from me. I should have read the getpid() man page... they even state it there.
  
  Well, that calls for some revising and a new round of benchmarking. 🙂 I think I'll use time() or gettimeofday() next as they're listed in the vDSO man page for x86-64 systems. I don't expect it to change much but that remains to be seen. With a bit of luck I can do a new benchmark in a few months time. Maybe even with some new CPUs...
  react
  
  nice meh bad surprised confused agree disagree
#8 by oussama April 4th, 2018 13:32

thank you for those information. i was really in need to read this and get some idea about what happen with hardware test
react

nice meh bad surprised confused agree disagree
#9 by Anonymus December 30th, 2018 22:16

Good experiment! I'm trying to learn about operating systems, and my most recent topic was system call latency. It was great to see some real numbers about it!
react

nice meh bad surprised confused agree disagree
#10 by Basile Starynkevitch May 30th, 2019 15:22

Do you have any mmap related benchmark? Or malloc related ones? Or pthread_mutex_lock related ones?

A big thanks for your benchmark. I am coding RefPerSys and that is why I needed them.

Basile Starynkevitch http://starynkevitch.net/Basile/
Bourg La Reine, France
basile@starynkevitch.net
react

nice meh bad surprised confused agree disagree
- #11 by Stephan reply to #10 June 1st, 2019 18:49
  
  Hi Basile,
  
  Please note that the current benchmark numbers are not what they appear to be (see comment #6 by Rob Bradford). Unfortunately I haven't had the time since then to write up a new article.
  
  Regarding mmap() or malloc(): Quite difficult. What would a meaningful benchmark of those system calls look like? The performance of malloc() depends heavily on the implementation (does it use linked lists, memory pools, etc.). If you can you should do a micro-benchmark for your target system and situation. With mmap() I guess it depends on how many memory is available right now, how many memory maps there are, and so on. I don't know the implementation though.
  
  I've read the refpersys project page. From that I guess you're primarily interested in garbage collection. In that case you might be interested in the madvise() system call. Specifically the MADV_DONTNEED flag. With that you can allocate one large memory area and invalidate parts of it after garbage collection. MADV_SEQUENTIAL and MADV_NORMAL might also be of interest. Before scanning a GC space you can switch to MADV_SEQUENTIAL (in case you do sequential scanning) and when done you can switch back to MADV_NORMAL. But I haven't uses those two flags myself yet. I've only ever written a Baker GC for a small Lisp interpreter so I don't know how well this applies to other GCs.
  
  pthread_mutex_lock() is another beast entirely. I suppose the performance will depend heavily on the contention of the mutex. As far as I know it's implemented with futexes so it should be fast if only one thread uses it. But it will get slow if many threads try to lock it. There should be some good futex benchmarks out there (I hope).
  
  I've also got some questions regarding refpersys: If every mutable object has its own lock how do you avoid deadlocks? Primarily with more complex objects that consists of many atomic objects. From the "persistent values" section I also got the impression that the language operates on a global memory model. Meaning every variable can be accessed from every tasklet. This will be quite the garbage collection challenge. Not the mention the chaotic memory access patterns and fragmentation (entailing false sharing of cache lines, etc.). Have you considered keeping most variables local to tasklets and only explicitly exchange variables with other tasklets (by transfering ownership, via message passing, etc.)? That way each agenda could run its own thread-local garbage collector and that's way simpler than a multi-threaded one. Also you could do pretty cool stuff like running a tasklet, collect its survivor variables when it's done and store those in the tasklets own private memory area. This would compact the memory for each tasklet and lower the overhread of not running tasklets. But well, I'm dreaming right now…
  
  This got a bit longer than intended. Sorry about that. It's not often that I can talk about those things. 😄
  
  Happy programming
  Stephan
  react
  
  nice meh bad surprised confused agree disagree
#12 by Michael Witten July 22nd, 2021 18:47

Thanks for sharing your knowledge.
react

nice meh bad surprised confused agree disagree
#13 by Paul Falke August 23rd, 2021 17:16

Thanks for your work! To the old write() versus fwrite() discussion. You tell that fwrite() uses a 4kByte buffer. This increases throughput, but increases latency, too. Therefore write() is the low latency solution and fwrite() is the high throughput solution. On the next level of detail you can use fflush() together with fwrite() to have the best of both worlds. On the last level of detail you have to remember that next to the userland buffers of glibc there are kernel buffers. To flush these buffers you use fsync() if it is disk io or ioctl() for network connections. Still, your work is a nice introduction in the topic.
react

nice meh bad surprised confused agree disagree
- #14 by Stephan reply to #13 August 30th, 2021 1:24
  
  That's a nice way to look at it. Thanks for the compact description Paul. 🙂
  react
  
  nice meh bad surprised confused agree disagree
#15 by Alex Petrov May 26th, 2023 10:26

Интересно было бы глянуть те же тесты на современных amd/intel/arm64.
react

nice meh bad surprised confused agree disagree
#16 by Optional June 14th, 2023 9:23

The read/write results say that the vdso doesn't really help anything (except for one case, which is worth an investigation) for those cases. So it's reallly only useful as a unified break-out for the few syscalls that can do without the context switch part*.This stands to reason because on x86_64 vdso will call syscall anyway.

Not so on 32 bit x86: There the interface is to use int 0x80, except where you can break out, or where you can use sysenter or syscall, depending on manufacturer. Which then ought to speed up all syscalls. Except that you can't really because the interfaces are muddled and yucky, after figuring out whether you're on intel or amd. That's not a job for a program, so linux does it through a vdso. Though that too is rather complicated and poorly documented if libc doesn't do it for you. And that means a non-static binary and linking your assembly with libc again. Sigh.

So it would be more interesting to benchmark on 32bit linux: Does it pay to use vdso from assembly rather than just use int 0x80?

* Which is one or two, a small handful at most. The one use case that comes to mind is from a solaris(!) report where financial software was found to call time() a few millions of times per second, so speeding that up sped up the software considerably. I consider it a defect of design: Should've just taken one time() value, run a batch of transactions, and take another time() value to confirm the batch was completed within one second, then record that value for all transactions. Rather than several time() values for each transaction, which being single-second granular, will be the same value for the bulk of the transactions anyway.
react

nice meh bad surprised confused agree disagree

Performance of fread() and fwrite() I/O batching ¶

Comments

Performance of `fread()` and `fwrite()` I/O batching ¶