It's been a long time since I wrote anything here. During the last two years of study I haven't had much time to finish anything. Well, now I have the time. One of the first things on the list is an optimized mail parser written completely in PHP (so you don't need to install any extensions). Since mail is a line-based format the parser will have to iterate over lines quite often, so it's worth doing some microbenchmarks on that.
I'll focus on two use cases:
- Iterate over lines from a PHP stream. For example an opened file (MBox, mail dir, …), a network connection (HTTP, IMAP, NNTP, …) or any other PHP stream.
- Iterate over lines of a memory buffer. The data is already loaded into one large string. This can be useful to reduce method call overhead. The parser can be called only once for the entire buffer instead of once for each line.
Please be aware that this is a microbenchmark! Results can vary quite a bit depending on the rest of your code. The values here are a starting point for your own function or library microbenchmarks, not a definitive answer.
Test setup
All tests were performed with PHP 5.4.6 on Linux. The test data was a 508 MiByte dump of the digitalmars.D newsgroup with CRLF ("\r\n") line breaks and 12 930 435 lines (on average 41.2 Bytes per line). The NNTP, IMAP and HTTP protocols also use these line breaks.
I measured the throughput and memory usage of different line iteration approaches. Most of them were already listed on Stack Overflow. Throughput is the dump size (508 MiByte) divided by the time it took to crunch through the dump. The time was measured with the microtime() function. On the memory side I measured the peak memory usage with memory_get_peak_usage() after the iteration was done. Not perfect, but I also tested the memory usage during iteration: practically the same values, but the measurement degraded the throughput significantly. So we stick to the peak memory usage after iteration.
All tests were performed 11 times and the result is the average of test runs 2 to 11. The first run only served to load the entire file into the kernel read cache. Therefore the result of the first run is discarded. I tried to eliminate IO factors here since they can vary extremely if your data comes from a normal hard disk, an SSD or directly from a network connection.
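In case you want to reproduce the numbers, the skeleton of each run looked roughly like this. A minimal sketch with names of my own choosing, not the actual benchmark code from the repository:

$start = microtime(true);
// ... iterate over all lines of test-data.txt here ...
$duration = microtime(true) - $start;
// throughput: dump size in MiByte divided by the time it took
$throughput = filesize('test-data.txt') / (1024 * 1024) / $duration;
$memory = memory_get_peak_usage() / (1024 * 1024);
printf("%.1f MiByte/s, %.2f MiB peak memory\n", $throughput, $memory);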
All tests are online at GitHub. Take a look for any details not mentioned here.
Facts first
The results are grouped by use case: first the results of file iteration, then the results from memory buffer iteration. I subtracted the buffer size from the peak memory usage in the memory buffer tests. This leaves only the memory usage of the iteration itself (that's what's interesting, right?). Details on the different approaches follow below.
Results
- If you don't care whether the line break is still at the end of the line, fgets() is the clear winner. No matter if you read directly from a file or from a memory buffer.
- If you can't tolerate the line break, stream_get_line() seems to be quite fast but has some quirks.
- Other approaches excel in some special cases:
  - fscanf() is fast for simple line-based data. Configuration files, simple YAML, headers, that kind of stuff.
  - strtok() is fast when you don't need blank lines (it skips them).
  - strpos() consumes almost no memory. Useful when working with very large buffers.
  - preg_split() excels at consuming memory. :)
Source and details of different approaches
fgets() from file
$fd = fopen(…, 'rb');
while ( ($line = fgets($fd)) !== false ) {
    // do something with $line
}
fclose($fd);
- Works with "\n" or "\r\n" line breaks.
- Line breaks are still appended to $line.
- Using rtrim() to get rid of the line breaks reduces throughput to about 47% (see "fgets with rtrim" in the chart above, and the short variant below). In that case it's faster to use stream_get_line().
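For reference, the "fgets with rtrim" variant from the chart boils down to this (a sketch; the exact benchmark code is in the GitHub repository):

$fd = fopen(…, 'rb');
while ( ($line = fgets($fd)) !== false ) {
    // strip the trailing "\n" or "\r\n"; this costs about half the throughput
    $line = rtrim($line, "\r\n");
    // do something with $line
}
fclose($fd);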
stream_get_line() from file
$fd = fopen(…, 'rb');
while ( ($line = stream_get_line($fd, 0, "\r\n")) !== false ) {
    // do something with $line
}
fclose($fd);
- Works only with the specified kind of line break ("\r\n" above). Most protocols define exactly one line break, so this isn't a problem in many situations. When using local files it might be a good idea to detect the line break used on the first line (see the sketch after this list).
- Reads only lines up to a maximal line length (second argument). 0 represents the default value of 8192 Bytes. Many line-based protocols also define a maximal line length. For mails RFC 5322 defines it to be 1000 Bytes.
- You can search for more complex "line endings" with this function. For example when parsing a mail you can search for the next MIME boundary instead of iterating over all the lines until you find it. fgets() can't do that. But the maximal line length might spoil the fun here.
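A possible way to do the line break detection mentioned above (my own sketch, not from the test repository): read the first line with fgets(), which copes with both kinds of line breaks, and look at its end.

$fd = fopen('local.txt', 'rb');  // hypothetical local file
$first_line = fgets($fd);
// fgets() keeps the line break, so the last bytes tell us what the file uses
$line_break = (substr($first_line, -2) === "\r\n") ? "\r\n" : "\n";
rewind($fd);
while ( ($line = stream_get_line($fd, 0, $line_break)) !== false ) {
    // do something with $line
}
fclose($fd);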
fscanf() from file
$fd = fopen(..., 'rb');
while (true) {
    $matches = fscanf($fd, "%[^\r\n]");
    if ($matches === false)
        break;
    $line = $matches[0];
    // do something with $line
}
fclose($fd);
$line will be null for empty lines.
Not really useful for raw line iteration, but fscanf() can be handy for simple line-based text data like configuration files. For example take a simple config file:
key: "value"
key: "value" # comment
# comment
Kind of simplified YAML without nesting. fscanf() parsed it with 77 MiByte/s (see file_conf_fscanf.php). fgets() followed by a simple preg_match() got about 42 MiByte/s (see file_conf_regexpr.php). Both with a peak memory usage of 0.25 MiB.
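The fscanf() format for such a file could look like this (a sketch under my own assumptions, with a hypothetical file name; the measured code is in file_conf_fscanf.php):

$fd = fopen('simple.conf', 'rb');  // hypothetical file name
while ( ($matches = fscanf($fd, "%[^:]: \"%[^\"]\"")) !== false ) {
    list($key, $value) = $matches;
    // comment and blank lines don't match the full format and yield nulls
    if ($key !== null && $value !== null) {
        // do something with $key and $value
    }
}
fclose($fd);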
For simple data fscanf() can be quite fast. Note that you won't find useful documentation in PHP's manual. The only useful documentation of fscanf() I know of is the Linux man page or the C language specification.
fgets() from memory buffer
$fd = fopen('php://memory', 'r+');
fwrite($fd, $data);
rewind($fd);
while ( ($line = fgets($fd)) !== false ) {
    // do something with $line
}
fclose($fd);
strtok() from memory buffer
$data = file_get_contents('../mails.txt');
$line = strtok($data, "\r\n");
while ($line !== false) {
    // do something with $line
    $line = strtok("\r\n");
}
- Works with "\n" and "\r\n" line breaks.
- Empty lines are ignored (strtok() skips empty tokens). That pretty much eliminates this option for protocols where empty lines are important (e.g. HTTP, mail).
stream_get_line() from memory buffer
$fd = fopen('php://memory', 'r+');
fwrite($fd, $data);
rewind($fd);
while ( ($line = stream_get_line($fd, 0, "\r\n")) !== false ) {
    // do something with $line
}
fclose($fd);
strpos() from memory buffer
$data = file_get_contents(...);
$pos = 0;
while ( ($line_end_pos = strpos($data, "\r\n", $pos)) !== false ) {
    $line = substr($data, $pos, $line_end_pos - $pos);
    // do something with $line
    $pos = $line_end_pos + 2;
}
- Works only with one kind of delimiter or line break, but can easily be extended to any kind of delimiter (see the sketch after this list). Note that anything after the last "\r\n" (an unterminated final line) is not visited.
- Almost no memory overhead. If you can't afford to duplicate the memory footprint of the buffer this is the way to go.
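One way to extend the loop to handle both "\n" and "\r\n" (my own sketch, not benchmarked): search for "\n" only and strip an optional "\r" from the end of each line.

$data = file_get_contents(...);
$pos = 0;
while ( ($line_end_pos = strpos($data, "\n", $pos)) !== false ) {
    $line = substr($data, $pos, $line_end_pos - $pos);
    // remove the "\r" of a "\r\n" line break if there is one
    if (substr($line, -1) === "\r")
        $line = substr($line, 0, -1);
    // do something with $line
    $pos = $line_end_pos + 1;
}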
preg_split with foreach from memory buffer
$data = file_get_contents(...);
foreach ( preg_split('/\r?\n/', $data) as $line ) {
    // do something with $line
}
- Easy for small data sets but the worst you can do for large ones.
Closing remarks
Again this is just a microbenchmark. Results can vary greatly depending on what you're actually doing. Most problems aren't as simple as pure line iteration… especially since the benchmark doesn't do anything with the lines. A very clever compiler would optimize the entire thing away.
Remember that the stream based approaches (e.g. fgets() or stream_get_line()) can use the full power of PHP's streams. They can read data from pretty much everything, and you can combine them with filters to offload e.g. base64 decoding or decompression to internal PHP routines. I haven't tested that though.
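As an untested illustration, hooking PHP's built-in convert.base64-decode filter into the fgets() loop could look like this (the file name is hypothetical):

$fd = fopen('encoded.txt', 'rb');  // hypothetical base64-encoded input
// let an internal PHP routine decode the stream while we read it
stream_filter_append($fd, 'convert.base64-decode', STREAM_FILTER_READ);
while ( ($line = fgets($fd)) !== false ) {
    // do something with the decoded $line
}
fclose($fd);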
Finally, a small thing to put the throughput of PHP into perspective. The fastest value I measured was 261 MiByte/s. The line count command pv test-data.txt | wc -l got a throughput of 1.6 to 1.8 GiByte/s (pv is like cat but displays a progress bar and some statistics). So if you really need performance don't bother with PHP. Grab C, C++ or D and get going. :)