A friend of mine recently worked on a search for a system that used a triplestore.
The problem was a simple one: Find all the stuff with a given string in it, even if the string isn't a whole word.
They ended up with about 5 million (id, text) pairs, used Elasticsearch, and the queries usually finish in under 1 second.
I don't know the details of what else is going on (ranking, sorting, etc.) so please don't take this as representative performance of Elasticsearch. ;)
But I asked myself:
How fast would that be if we brute-force it with grep?
What if we write all the (id, text) pairs out to a file, one pair per line,
and then use the grep command line program to get all lines that contain a specific substring?
From those matching lines we can extract the IDs. A few lines of code should do the trick, right?
For how much data would this be a good sweet spot of complexity vs. performance (read: The right tool for the job)?
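As a sketch of what that idea could look like (the function name and line format here are made up for illustration — I'm assuming each line starts with the ID, followed by whitespace and the text):

```python
import subprocess

def grep_ids(path, term):
    """Return the IDs of all lines in the file at `path` that contain `term`.
    Assumes each line starts with an ID followed by whitespace and the text."""
    # -F: treat the term as a fixed string, not a regular expression.
    result = subprocess.run(["grep", "-F", term, path],
                            capture_output=True, text=True)
    if result.returncode > 1:  # 1 just means "no matches", >1 is a real error
        raise RuntimeError(result.stderr)
    # Keep only the first whitespace-separated field (the ID) of each match.
    return [line.split(None, 1)[0] for line in result.stdout.splitlines()]
```

That really is the whole search backend: one external grep call plus a line of ID extraction.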
So I wrote a microbenchmark. Because what else do you do over Christmas? :)
In case you couldn't tell already: This is going to be something only programmers can obsess over. You have been warned. ;)
The overview
Just in case you only want the gist of it and don't care about the details:
- Tested a file with 10 million lines (1.4 GiByte), each line about 150 bytes. Also tested a few other sizes but the performance didn't change much.
- The tests were run on a notebook with an Intel i7-8550U @1.80GHz with 32 GByte RAM running Linux 5.3.0
- The entire file fits into memory (is kept in the kernel page cache) so we don't have to wait for the SSD to read the data.
Now on to the data itself:
grep can handle 2.3 GiByte/s. If you have just 1 million searchable records, each about 150 bytes long, that would be about 145 MiByte of data.
grep will chew through that in about 63ms. Pretty fast, isn't it?
sed, awk and mawk are slower, but still pretty fast.
I wanted to know how fast a naive implementation of a search would be in different languages.
Please don't take this as a performance comparison.
There are many different reasons to choose a language for a project, and performance of the language itself usually isn't a terribly important one.
I usually work with different languages at the same time (use the right tool for the job) and I wanted to know what performance I can expect in a given environment. :)
- The PHP, Python and Ruby programs that search each line individually are somewhat slower. But still in the hundreds of MiBytes/s on the test system.
- The last three programs load the entire file into one string (e.g. with mmap) and search it for matches. This eliminates the line-iteration overhead.
A little more complicated but quite a bit faster.
If a search can take 1 second you can chew through about 1 GiByte of data in that time.
I also wrote C programs that do the same as above.
And since multi-threading in C is easier than in many interpreted languages, I threw some threads at the problem.
Note the different scale.
- C can chew through a lot of data per second. :)
- But honestly I don't think I'll have a project anytime soon where I would sacrifice several GiByte of RAM just to keep a search index (the text file) in memory.
And as soon as the search index isn't in memory the search will only be as fast as the disk can provide the data (1.64 GiByte/s on the test system's SSD).
Note that the Linux kernel only keeps the search index in memory (in the page cache) if there is nothing more useful it could do with the RAM.
If nobody uses the search the kernel won't waste memory for it.
So much for the "short" version. If you're still interested in the details feel free to read on. :)
The test setup
All the programs, scripts and commands used are available in a GitHub repository.
I also put the raw benchmark results in there (this includes the versions of everything used).
The benchmarks were run with the command ./bench.sh > debug.log 2> results.txt.
To get a reproducible data set I wrote a small Ruby script to generate id: text lines.
The script uses a fixed seed so it always generates the same lines. A small sample:
http://example.com/3073701/Ybyld6r9vNai95TYXiQH4FuQV+l/CHMwP4i3jQ== qO76BFQLlj2g
http://example.com/3874850/zjQdPR3iPXe/f28d0x6D2fQONt8x j5gSmB0dQhRpZ5yjnMBU9la1Dl00wzcD2C7J
http://example.com/4659995/zLtSsvjmP6o6vYwQQm0VRTtgxM0xUbY= QvzHvFPq7/VrKDJbjoQ9xnzgUUEHP2F/RW2WbIOfnq0wgS5xNCCwF4w=
http://example.com/9674460/OGuKOpNt22+aPUfoW3XTBDOfh/q/lc8ysJpil+M= WURbI/R069glGkYLjfOizaobX5gJRUjF71pe
Pretty much base64 encoded random garbage.
Each line is on average about 150 bytes long, some a lot longer, some a lot shorter.
The 150 byte length doesn't have any deeper meaning. I just picked something.
You might have way fewer documents with a lot more text in them or way more things with just short text.
That's why I give the performance in MiByte/s. It doesn't matter that way.
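The actual generator script is in the repository; a minimal sketch of the idea (function name and exact line format are simplified here) could look like this:

```python
import base64
import random

def generate_lines(count, seed=42):
    """Yield reproducible lines of base64-encoded garbage, varying in length."""
    rng = random.Random(seed)  # fixed seed -> identical lines on every run
    for _ in range(count):
        prefix = "http://example.com/%d/" % rng.randrange(10_000_000)
        # Vary the payload length so some lines are a lot longer, some shorter.
        payload = bytes(rng.randrange(256) for _ in range(rng.randrange(30, 150)))
        yield prefix + base64.b64encode(payload).decode()
```

Because the seed is fixed, every run produces exactly the same file, so all tools search identical data.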
lines-100k.txt | 14.4 MiByte
lines-1m.txt | 144.9 MiByte
lines-10m.txt | 1.4 GiByte
lines-50m.txt | 7.2 GiByte
lines-100m.txt | 14.3 GiByte
I generated a few of those files with 100 thousand, 1, 10, 50 and 100 million lines. Just to see if the programs would
behave differently.
The last step was to pick some substring from those lines (e.g. "VrKDJbjoQ9xnzg") and search for it with different tools.
Each program was executed multiple times and the results were averaged: from 10 times for the small files like lines-1m.txt down to 2 times for lines-100m.txt.
You can look at bench.sh for the details.
Just like grep, each tool is supposed to search a text file for lines that contain the given substring.
No matter where it is or if it's an entire word or just part of a word.
I just tested the tools I had already installed and what struck my fancy.
So don't be surprised if it doesn't include your favorite tool.
I also wanted to know how fast naive implementations of such a search would be in different languages.
That's how the PHP, Ruby, Python and C programs found their way into the benchmark.
Again, not to compare performance but rather to know what to expect in a given environment.
Measuring execution time
/usr/bin/time was used to measure how long it took for a program to go through an entire file (logged the wall time and some other stuff).
This isn't the time builtin of shells like bash or zsh but an extra program.
A small caveat here: The time program only reports times with 10ms precision (e.g. 0.21s).
Therefore I didn't do runs with 100k lines. Some programs were simply too fast and I got division
by zero errors. And I was too lazy to look for a tool to properly read the timing information from
the Linux kernel or write a script for it.
All that stuff is available under /proc/[pid] in stat, io, smaps and more, see the proc man page.
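If you do want better than 10ms precision without digging through /proc, a small sketch using only Python's standard library (Unix-only; the function name is made up) could stand in for /usr/bin/time:

```python
import resource
import subprocess
import time

def time_command(argv):
    """Run a command and return (wall, user, sys) seconds with sub-10ms precision."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # getrusage() accumulates over all finished children, so take the difference.
    return wall, after.ru_utime - before.ru_utime, after.ru_stime - before.ru_stime
```

perf_counter() gives nanosecond-resolution wall time, and getrusage() reports the CPU time the child actually used.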
The test systems
Two systems had the honor of not running away fast enough when I had the idea for the benchmark.
So they had to search text files for substrings for the next two days. :)
One PC (can't blame it, it can't run away):
Linux 4.15.0-91-generic #92-Ubuntu SMP x86_64
CPU: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
RAM: 15.6 GiByte
Memory bandwidth: 15.96 GiByte/s (reported by `mbw 2048`)
Disk: 4TB WD Red "WDC WD40EFRX-68WT0N0 (82.00A82)"
Read speed: 103 MiByte/s (measured with `wc -l`)
And one notebook (no excuses here):
Linux 5.3.0-46-generic #38~18.04.1-Ubuntu SMP x86_64
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
RAM: 31.1 GiByte
Memory bandwidth: 13.8 GiByte/s (reported by `mbw 2048`)
SSD: 1 TB Samsung "MZVLB1T0HALR-000L7"
Read speed: 1.64 GiByte/s (measured with `wc -l`)
The rest of the article only shows the results of the lines-10m.txt run (10 million lines, 1.4 GiByte) on the notebook.
It's pretty representative of the other runs. No need to show more data that doesn't say anything interesting.
Funnily enough both CPUs performed pretty much the same. Well, the task at hand isn't exactly black magic, even for older CPUs.
Anyway, the full data of all runs is in the repository.
Assumption: All the data fits into memory
Or rather in the page cache. In case this is new for you:
When you read data from a file the operating system does something nifty.
If it has unused memory left it keeps the data in there.
The next time a program wants to read the same chunk ("page") of the same file the operating system takes it right out of the page cache.
That way it doesn't have to wait for the disk again to get the data.
Well, free memory doesn't help anyone, so might as well use it to cache some data that might help someone in the future. :)
In the benchmarks below I read the file once with wc -l.
From then on it's in the page cache and all consecutive programs got their data straight from there.
Thanks to that the benchmarks are not bound by the read performance of the disk.
As long as the file fits into the page cache, that is.
If the file is for example larger than the memory, your program is always limited by how fast your disk
can provide data (e.g. 1.64 GiByte/s on the notebook's SSD).
Results and findings
Ok, that's it for the "preliminaries". Now on to the real meat: The results and what surprised me.
Funny thing: grep is faster with longer search strings
While testing out grep I noticed that longer search strings seem to perform better.
Here are the individual commands:
short, 2310 MiByte/s:
$ grep VrKDJbjoQ9xnzg lines-10m.txt
medium, 4177 MiByte/s:
$ grep V6TGqGjjnmhlHRcYEq1IJCgzUNSx09bCkwJnEK lines-10m.txt
long, 5207 MiByte/s:
$ grep FMzaoLmVeEoJ3PAPERDFSO2RFEo5/mO17YTQrXz4jr0Ud9w0854q6/rcRu11AocX3vzl4q7O0f6c lines-10m.txt
At first I was a bit startled by that.
Especially given that grep doesn't only search for simple substrings but can also search for regular expressions.
But then I realized that the longer the match, the fewer positions in a line where a match could even start.
In other words: If we search for a string that is 20 chars long we can skip the last 19 chars of each line.
They can't possibly contain a match. At least that's what I think is happening here.
This might also be a general optimization for regular expressions in grep. I haven't looked into it.
Update 2020-12-30: It's probably the Boyer–Moore string-search algorithm.
Forgot about that one. :D
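For illustration, here is a minimal sketch of the Horspool variant of that algorithm (not grep's actual implementation): the skip table lets longer needles jump further through the text, which matches the behaviour above.

```python
def horspool_find(haystack, needle):
    """Return the index of the first occurrence of needle in haystack, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # How far we may shift when this character lines up with the needle's end.
    skip = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Characters not in the needle allow a full jump of m positions,
        # so a longer needle means fewer comparisons overall.
        pos += skip.get(haystack[pos + m - 1], m)
    return -1
```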
In the rest of the article I quoted the performance for the shortest substring.
In my experience long substring searches are somewhat rare. But in the end that depends on your use case.
The awk side of things
I also wanted to test out awk a bit.
If you've never heard about it: awk is a command line tool for line based processing.
Perfect for processing logs, CSVs, statistics and stuff like that.
It basically executes scripted actions before processing a file, then on each line that matches a pattern and after processing a file.
It can do much more than just search each line for a substring.
And what's really cool: awk reads the input line by line.
It doesn't have to load the entire file into memory.
So you can happily crunch gigabytes or terabytes of data all day long without worrying about running out of memory.
Well, not that this is reflected in the benchmark since the entire file was in memory when those benchmarks were run. :P
I threw sed into the chart as well because it had nowhere else to go and felt lonely.
GNU awk seems to be the default version, at least on Ubuntu-based systems.
The performance is pretty decent and it could handle about 350 MiByte/s on the test system.
mawk is another implementation and uses a bytecode interpreter.
I had it flagged down for a while because it seemed to be quite a bit faster than GNU awk.
Looks like it actually is. :)
Also kind of funny: When I used the index() function of mawk to look for a substring instead
of a regular expression it got quite a bit slower (the "mawk index" bar).
Naive line-by-line search in different languages
As mentioned before: I wanted to know what performance I can expect from a naive line-by-line search in different languages.
This might be useful to decide if I go with a simple low-effort brute-force solution.
Or if I have to spend a bit more time on evaluating alternatives (a database, library or whatever).
It all depends on the requirements and environment of a project.
The performance numbers given here are just data points that help to figure out up to which data size a simple brute-force line-by-line search is fast enough.
Again: Please don't take this as a performance comparison. Because the world isn't that simple.
I hope you don't choose the programming language you use for a project based on its naive find-substring-in-line performance. ;)
With all that out of the way, on to the programs. The code is simple enough, here for example in Ruby:
File.open(filename) do |f|
  f.each_line do |line|
    puts line if line.include? search_term
  end
end
I wrote the same in PHP, Python and C as well. But we'll look at the C programs later.
The nice thing of this approach is that we only need the memory for one line at a time.
No matter how large the entire file is we won't run out of memory.
As long as the entire file isn't just one big line that is. :D
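The Python version isn't shown above; a sketch of the same line-by-line approach (the function name is made up; reading bytes avoids per-line decoding overhead) could look like:

```python
def search_lines(path, term):
    """Return all lines of the file at `path` that contain `term` (both bytes)."""
    matches = []
    with open(path, "rb") as f:
        for line in f:  # only one line is held in memory at a time
            if term in line:
                matches.append(line)
    return matches
```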
Pretty decent performance for just a few lines of code.
Probably fast enough for a few hundred MiBytes of data.
But for larger data sets I would probably execute a grep command if the code only needs to run on Linux.
Search the whole file instead of each line
While playing around with the strstr() function in C I noticed that the function by itself is pretty darn fast.
So wouldn't it be great if we could eliminate the line-iteration overhead and our code could spend all its time in strstr()?
What if we had the entire file in a single string?
Then we could search this one large string with consecutive strstr() calls until we've found all the matches and reached the end of the string.
This is more like searching in a file with a text editor than searching each line individually.
Each time we find a match we could search for the start and end of the current line and print the entire line (just like grep does).
Then our code wouldn't have to loop over each line, wasting instructions and CPU time doing so.
Might be faster but it would no longer be easy to search for multiple substrings in the same line.
Also the code is more complicated than before, e.g. see ruby2-all-index.rb.
Another slight problem is that we would need the entire contents of the text file in one huge string variable to do that.
But reading several hundred MiBytes kind of sucks: We would go over the entire file just to load it into memory and then go over that memory again to search it.
Well, somewhat obscure (depending on your background) operating system knowledge to the rescue! :)
Operating systems can "map" files into memory.
Basically the operating system reserves a part of your memory (address space to be more exact) that is as large as the target file.
When you access that memory at a given place the operating system automatically reads the corresponding chunk of data directly from the file.
Meaning you can read data from a file just by looking at the right memory address.
And big surprise: A string is just another (usually small) part of your memory.
So with memory mapping we can use the entire file as one large string without having to read it first.
The operating system reads the parts that are used as soon as we access them.
There's more to memory maps than that.
For example the operating system can easily evict seldomly used data chunks and use the memory for more important purposes.
If need be it can always read the data chunks from the disk again.
This can be useful when some other programs quickly need more memory (e.g. when you open a particularly "modern" website in your browser).
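A sketch of the whole-file search in Python (function name made up; mmap gives us the entire file as one searchable bytes-like object, so nothing has to be read up front):

```python
import mmap

def search_mapped(path, term):
    """Find all lines containing `term` by searching the memory-mapped file."""
    matches = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            pos = data.find(term)
            while pos != -1:
                # Expand the match to the enclosing line, like grep prints it.
                start = data.rfind(b"\n", 0, pos) + 1  # -1 + 1 == 0 at file start
                end = data.find(b"\n", pos)
                if end == -1:
                    end = len(data)
                matches.append(data[start:end])
                pos = data.find(term, end)  # skip the rest of this line
    return matches
```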
Anyway, I gave it a spin and lo and behold now we're talking. :)
PHP and Ruby really shift up a gear here. Easily achieving several times their previous performance.
Depending on the use case this might be complexity well invested.
Python is pretty interesting here: It has an explicit mmap() function that maps a file into memory.
The 2nd Python program uses that but otherwise does a line-by-line search just as before.
Just with that it's a bit faster, but this is probably not because of the memory mapping itself.
Usually reading lines from a file in Python returns an str for each line.
Those are encoding-aware and might do some extra work for that.
Reading lines from a memory-mapped file though returns bytes. Those are just raw data without any encoding.
When adding mmap() I also changed the search from str to bytes and I suspect that this causes the difference in processing speed.
Another note about PHP and Ruby:
Both languages don't expose an explicit way to memory map a file.
Of course there are extensions for that but I wanted to keep it simple.
Instead they both have functions to read the entire file into a string (file_get_contents() and File.read()).
I hoped those use the C function fopen() internally because that in turn uses mmap() and maps the file into memory instead of reading it chunk by chunk.
This worked out in my micro-benchmark here but I wouldn't count on that behaviour for a real project without further research.
Especially if the file is larger than the system's memory. ;)
Going faster - C and multithreading
It might surprise some people but I still write quite a bit of code in C.
Some tasks are simply easier done there than in other languages.
Naturally I also wanted to know how fast the above programs would be when written in C.
Oddly enough threading is something I find way easier in C than e.g. in Python or Ruby.
No need to worry about the global interpreter lock (aka GIL), garbage collector and so on.
We just let each thread search a slice of the data, e.g. 4 slices for 4 threads.
All previous benchmark results were still quite a bit away from the system's memory bandwidth of 13.8 GiByte/s (as reported by mbw 2048).
That is a very rough estimate of how fast the CPU can read from and write to RAM.
Maybe multithreading will help us get closer to that limit.
Sorry, unfortunately the sky isn't the limit for a CPU. :P
Well, it turns out that the i7-8550U can read more than 13.8 GiByte/s. :D Incidentally the PC system peaked at 4 threads with 16.8 GiByte/s.
Turns out that if you really need performance, C is the way to go. By a wide margin.
With 4 threads the test system could search through 1 GiByte of data in just 65ms.
I've seen a lot of SQL queries that operated on a lot less data and took a lot longer.
But if that makes sense depends on your project and environment (as usual).
The downside is the complexity of the task in C.
You have to be careful to properly slice the data and to put a zero-terminator at the end of each slice (strstr() needs one).
Not as easy as writing a handful of lines in Ruby. But not particularly difficult if you're fit in C.
You can look at c5-threaded-strstr.c for more details or feel free to just ask.
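The core difficulty is slicing so that no line gets cut in half between two threads. Here is a sketch of just that boundary logic, written in Python for clarity (the helper name is made up; each resulting range is what one thread would then search):

```python
def slice_at_newlines(data, num_slices):
    """Split `data` (bytes ending in a newline) into up to `num_slices` ranges,
    each ending on a newline so no line is ever split between two workers."""
    step = max(1, len(data) // num_slices)
    slices, start = [], 0
    while start < len(data):
        end = min(start + step, len(data))
        # Extend the slice to the next newline (or to the end of the data).
        nl = data.find(b"\n", end - 1)
        end = len(data) if nl == -1 else nl + 1
        slices.append((start, end))
        start = end
    return slices
```

In the C version each such slice additionally gets its own zero-terminator so strstr() knows where to stop.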
Putting the pedal to the metal
Ok, I didn't.
To utilize the maximum memory bandwith of modern CPUs you have to use vector instructions to fetch data from memory into vector registers.
Not to mention that we can use vector instructions to scan a lot of characters in parallel.
More details about that (and much more) are in Agner's optimization manuals.
But fair warning: this is pretty advanced and detailed stuff over there. ;)
As much fun as it would be to write an AVX2 assembler version, my motivation didn't carry me that far… yet.
Honestly I can envision myself using and maintaining the multithreaded C version in some circumstances.
But an AVX2 version… not very much. But well, who knows what the future will bring. :D
When the data is a tight fit
In all previous runs the entire data easily fit into memory.
I wanted to see how far I could take this until we get limited by the read bandwidth of the drive on which the file is stored.
My PC has 15.6 GiByte RAM (as reported by the free program), so I did a run on it with 100 million lines (14.3 GiByte).
I didn't use the notebook because I was too lazy to generate a 31 GiByte test file. :P
This run only contains some of the programs that can handle files that are larger than the memory.
Either by iterating line-by-line or by explicitly memory mapping the file.
While all the data might still fit into memory I didn't want to risk thrashing.
grep and the search-by-line C program are straight out limited by the read speed of the disk (104 MiByte/s).
Where we previously had hundreds of MiByte/s or even several GiByte/s we're now down to whatever the disk can provide.
Remember, this one chart shows benchmark results from the PC.
There the file was stored on a HDD (104 MiByte/s), not an SSD like on the notebook (1.64 GiByte/s).
It starts to get interesting when the first program runs that explicitly uses memory mapping.
The first time it still reads the entire data from disk.
But after that it and all the following programs (also using memory mapping) get the data from the page cache.
I'm not sure why that is. But given that this is a corner case I didn't dig any deeper.
I can't really see myself wanting to use pretty much all of my memory as a cache for a substring search. ;)
A bit of perspective
As much fun as it is to optimize the brute-force approach… at some point it will no longer be the right tool to solve the problem.
If your data is just a few MiBytes in size, or even a few hundred MiBytes, the brute-force approach is quite attractive.
You only need to write the file with all the data and execute a grep command when someone does a search.
All in all just a few lines of code.
Pretty nice complexity to performance ratio I think :)
Personally pretty much all the projects I did for customers were in that category (e.g. entire database SQL dump of a few hundred MiBytes).
So for me those performance numbers are actually kind of relevant. :o
But sometimes you have a lot more data, and at some point brute-force just isn't viable anymore.
Either because the text file gets too large or you don't want to waste the memory and I/O bandwidth.
For those situations other tools are better suited to get the job done.
Databases have indices for a reason (but I don't know a good index structure for substring search).
In that situation my first approach would be:
Maintain a list of unique words in the searchable data.
Use that for an autocomplete on the client side so the server only ever gets full words (not substrings).
Then use a word based search on the server (a simple Python dict lookup or a hash table, a text index in SQL, Elasticsearch, whatever).
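A minimal sketch of such a word index (names are made up; a plain dict stands in for the fancier options):

```python
from collections import defaultdict

def build_word_index(records):
    """Map each unique lowercased word to the set of record IDs containing it."""
    index = defaultdict(set)
    for rec_id, text in records:
        for word in text.split():
            index[word.lower()].add(rec_id)
    return index
```

A search for a full word is then a single dictionary lookup instead of a scan over all records.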
But again: Only if the naive approach would be too slow or too large.
After all, a dozen lines of code are quickly thrown away.
And sometimes the naive approach might carry you further than you think.
In the end it isn't about which solution is the "best", but instead about which solution is a better fit for a given problem in a specific environment.
And I hope this blog post at least somewhat hinted at the boundaries of the brute-force search method.
That it showed the complexity-to-performance sweet spot for this approach: When it can be used and when it stops being useful.
If anyone is reading this far: Wow, I'm amazed at your endurance. Take a cookie and enjoy the holidays. :)