Can you host Reddit on a Raspberry Pi?

A few years ago I read somewhere that you can probably run a Reddit-sized forum on a Raspberry Pi. It was a Reddit comment and I had my doubts about that. But a few weeks ago I watched What if you tried to print Wikipedia? and it mentions that the English Wikipedia receives about 100 edits per minute (1.7 edits/sec). For some reason that really surprised me (I expected way more) and brought me back to that Reddit on a Pi question from years back. So I gave the "spherical cow in a vacuum" version of that scenario a go.

To my surprise it actually somewhat checked out. At least when you ignore little details like images, videos and everything else except reading and posting posts and comments. Also I assumed you would use some read-cache like Varnish in front, and our Raspberry Pi only gets the requests not in that cache.

Under those (and a lot more) assumptions I got about 6000 - 7000 req/sec on a Raspberry Pi 4b with about 150% - 200% CPU load. Yes, those are thousands of requests per second. 1 in 100 requests inserted a new post (basically executing an INSERT statement), the other 99 in 100 queried all the posts of a random topic (a SELECT statement). The whole thing was kind of funny and maybe even interesting enough to write about. So here we go.

The Raspberry Pi I used for the experiment. I just wanted an image here.

Setting expectations

This is just me having some fun here. Don't expect nice graphs or accurate measurements. I just wanted to know what order of magnitude of requests per second a Raspberry Pi 4b could handle. So my highly professional measurement procedure was "look at htop for 10 seconds and write down the average". Well, that and measuring requests per second with a benchmarking tool, because that's hard to eyeball. That's the level of rigor for this post. You have been warned. ;)

Like any good grown-up software developer I started by picking the toys I wanted to play with: C and SQLite. And, as is customary today, I also came up with a good story to justify my decision:

Software spends a lot of time (and code) on shuffling data around between various components. HTTP implementation, webserver, backend logic, database, etc. all have their interfaces and someone needs to glue that stuff together. Sometimes this is just calling a few functions or repackaging data into other data structures. But in the case of most databases this actually involves serializing queries and data and shipping all of that to and from another process (the database). That is, if the database even runs on the same machine. Usually there's also a lot of management involved to keep track of everything in flight and not go crazy while doing that. How about we just don't do any of that this time?

We want to process HTTP GET and POST requests. HTTP is just a (somewhat) simple TCP protocol (at least the basics, "spherical cow in a vacuum", remember?). And TCP is handled via the operating system's socket API. This is a C API, so that means we'll code in C.

For the database we want something that doesn't do serialization. In fact it should run directly in our own process and do any I/O right there. SQLite does that, so SQLite it is. Mind you, I had never used the SQLite C API before.

Neat story, no?

Step 1: Looking for a benchmark tool

I think I'm too lazy to write much about this part. Long story short: I ended up using wrk. It was the fastest of the tools I looked at, and easy to compile and use. It also has LuaJIT support to customize the requests you send, which I only realized later but which came in quite handy. And the source code looks adequately state-machiny, which to me makes it look like someone actually knew what they were doing.

To get a baseline I wrote 3 very simple "Hello World!" webservers and took the benchmark tool that could squeeze the most requests per second from those servers.

Not something I would call a web server under normal circumstances, but "spherical cow in a vacuum" again. ;)

I did that on my PC, an 8-core AMD Ryzen 7 5700X running Linux Mint 21.1 (kernel 6.5.0-21-generic x86_64). No Raspberry Pi stuff yet. But still some numbers to give you an idea what a single-threaded "Hello World!" webserver can handle:

Server                      req/sec   Traffic      Server CPU util.
node.js                     ~45k      ~7.6 MB/s    100%
consecutive accept() loop   ~40k      ~5.8 MB/s    ~25% - ~80%
eventloop with epoll        ~60k      ~5.5 MB/s    100%

Note that 100% CPU utilization means that just 1 CPU core runs at full utilization. Even though those are just dummies to select a benchmark tool, I was pretty impressed by the node.js performance. Especially since it's the only one that actually parses the HTTP stuff. The other two just throw a string back at you and basically only masquerade as HTTP servers.

On the other hand the ~40k req/sec of a very very simple endless loop of blocking accept(), read(), write(), close() calls also surprised me. Yeah, it's wonky, but you can write it in just a few minutes (that includes the time to look stuff up in the man pages). That complexity-to-performance ratio is hard to beat.
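
Just to illustrate, here is a minimal sketch of such a blocking accept() loop (simplified from memory, not the exact code I benchmarked: the port, the canned response and the complete lack of error handling and HTTP parsing are placeholders):

// Simplest possible "HTTP" server: one endless loop of blocking
// accept(), read(), write(), close() calls on a listening socket.
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main() {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    int reuse = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(8000) };
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(server_fd, (struct sockaddr*)&addr, sizeof(addr));
    listen(server_fd, SOMAXCONN);

    const char response[] = "HTTP/1.0 200 OK\r\nContent-Length: 13\r\n\r\nHello World!\n";
    char buffer[4096];
    while (1) {
        int conn_fd = accept(server_fd, NULL, NULL);
        read(conn_fd, buffer, sizeof(buffer));          // read (and ignore) the request
        write(conn_fd, response, sizeof(response) - 1);  // throw a canned answer back
        close(conn_fd);
    }
}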

Misc. details:

Step 2: Fooling around with SQLite

With the HTTP and benchmarking side covered, it's time to switch toys to SQLite. I simply started with SQLite In 5 Minutes Or Less in the documentation's "Programming Interfaces" part and went from there. And I have to admit, the SQLite documentation is quite excellent. Concise and well written.

First stop: Create a database and write a program that fills it with some test data (100k posts will do). That's the extent of our "spherical cow in a vacuum" version of Reddit:

CREATE TABLE posts (
    id       INTEGER PRIMARY KEY,
    topic_id INTEGER,
    body     TEXT
);
CREATE INDEX posts_topic_index ON posts (topic_id);

Every topic is filled up with 5 - 1000 posts. For the post text I took the first 2772 characters from "Lorem ipsum" and each post takes the first 57 - 2772 bytes from that. Not very original and the distribution is probably totally off, but I just wanted some data to play around with. Incidentally that came to about 175 MB of data (but I might remember that wrong).

Generating those posts took around 2 seconds, so ~50k inserts/sec. But all in one big transaction. And that transaction part worried me, but more on that later.
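
The insert side of the SQLite C API boils down to one prepared statement that gets bound, stepped and reset in a loop, all inside a single transaction. A rough sketch (simplified from memory, not the actual generator: the dummy topic distribution and post text are placeholders and error handling is omitted):

// Sketch: fill the posts table in one big transaction with a reused prepared statement.
#include "sqlite3.h"

int main() {
    sqlite3* db = NULL;
    sqlite3_open("posts.db", &db);

    sqlite3_stmt* insert = NULL;
    sqlite3_prepare_v2(db, "INSERT INTO posts (topic_id, body) VALUES (?, ?)", -1, &insert, NULL);

    static const char lorem[] = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";

    sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
    for (int i = 0; i < 100000; i++) {
        int topic_id = i / 500 + 1;  // dummy distribution, the real data used 5 - 1000 random posts per topic
        sqlite3_bind_int(insert, 1, topic_id);
        sqlite3_bind_text(insert, 2, lorem, -1, SQLITE_STATIC);  // the real body was a random slice of "Lorem ipsum"
        sqlite3_step(insert);
        sqlite3_reset(insert);
    }
    sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);

    sqlite3_finalize(insert);
    sqlite3_close(db);
}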

Serious thoughts aside, the most complicated part of that program was choosing the PCG random number generator I wanted. Of course I went with the "Minimal C Implementation" and copied the whole ~7 lines of it. Well, that, and using the neat ldexp() trick from Generating doubles to get random numbers in the range I wanted.
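
Those ~7 lines plus the ldexp() trick look roughly like this (the PCG part is the "Minimal C Implementation" from pcg-random.org, the range helper at the end is just my illustration):

#include <math.h>
#include <stdint.h>

// PCG "Minimal C Implementation" (pcg-random.org)
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

uint32_t pcg32_random_r(pcg32_random_t* rng) {
    uint64_t oldstate = rng->state;
    rng->state = oldstate * 6364136223846793005ULL + rng->inc;
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

// ldexp() trick from "Generating doubles": a uniform double in [0, 1).
double pcg32_random_double(pcg32_random_t* rng) {
    return ldexp(pcg32_random_r(rng), -32);
}

// Map that to an integer range, e.g. 5 - 1000 posts per topic.
int pcg32_random_in_range(pcg32_random_t* rng, int min, int max) {
    return min + (int)(pcg32_random_double(rng) * (max - min + 1));
}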

Now we have some test data and a rough understanding of how many INSERTs we can throw at SQLite with a single thread. Time to figure out how many SELECT queries we can throw at it.

The next program just executed 213 SELECT queries, each selecting all posts of one topic. All executed one after the other and the output is printed via printf(). In the end we want to return JSON to some hypothetical browser, and that gave me the bright idea to make SQLite do all that JSON work for me (escaping, string concatenation, etc.). Also that way the CPU can stay within the SQLite bytecode interpreter all the time, but I'm not sure that actually makes any difference. Hence I tested two queries:

SELECT id, body FROM posts WHERE topic_id = ?

SELECT json_group_array(json_object('id', id, 'body', body))
FROM posts WHERE topic_id = ?

The first just fetches the data as-is, the second accumulates everything into a JSON array and gives us that as a string. Here are the time measurements (with output redirected to /dev/null of course):

Query                    walltime   queries/sec
Simple SELECT            0.099s     213 queries / 0.099 sec = ~2150 queries/sec
JSON aggregated SELECT   0.168s     213 queries / 0.168 sec = ~1250 queries/sec

Using JSON takes about 1.7x longer here. Maybe that's worth it, or maybe we could create our own format and let clients decode that. I mean ~2150 vs. ~1250 queries per second is a hefty performance difference, maybe worth optimizing a bit. But of course we would use our own format, so from now on I'll ignore JSON.

The queries are actually relatively slow compared to the ~40k req/sec we were playing around with before. But keep in mind that the average topic has ~500 posts, which is way too much. This was my first inkling that my test data distribution was probably garbage. But I didn't want to generate new test data and just rolled with it.
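
For reference, the query loop follows the same prepare/bind/step pattern as the inserts. A sketch (again simplified: the real program picked 213 random topic IDs, here they're just sequential, and error handling is omitted):

// Sketch: run the simple SELECT for a bunch of topics and print the rows.
#include <stdio.h>
#include "sqlite3.h"

int main() {
    sqlite3* db = NULL;
    sqlite3_open("posts.db", &db);

    sqlite3_stmt* query = NULL;
    sqlite3_prepare_v2(db, "SELECT id, body FROM posts WHERE topic_id = ?", -1, &query, NULL);

    for (int topic_id = 1; topic_id <= 213; topic_id++) {
        sqlite3_bind_int(query, 1, topic_id);
        while (sqlite3_step(query) == SQLITE_ROW) {
            long long id = sqlite3_column_int64(query, 0);
            const char* body = (const char*)sqlite3_column_text(query, 1);
            printf("%lld %s\n", id, body);
        }
        sqlite3_reset(query);
    }

    sqlite3_finalize(query);
    sqlite3_close(db);
}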

Step 3: Bolting HTTP and SQLite together

Now, let's just take our epoll HTTP server and SQLite and bolt them together. Well, that was the plan. Reality said "no". Turns out SQLite doesn't have an async API. At least I couldn't find one. In fact I think it's pretty much built for synchronous operation. I looked at a few bindings for various async systems (e.g. node.js) but they all seem to use a threadpool to make it look async. Not this time, not for this experiment. I almost scrapped the entire experiment at that point, basically because two of my toys didn't fit together.

Prefork to the rescue

Instead I dumped epoll and went with the 2nd most "Hello World" style server design: prefork (but with threads). The basic idea is to open a server socket and then spawn a predefined number of worker threads (e.g. 8). Each thread then calls accept() to wait for an incoming connection. When a connection comes in, it is handled with good old synchronous blocking I/O calls, and when done the connection is closed. Then the thread calls accept() again and the cycle repeats. It's basically the simplest single-threaded webserver you can imagine, but run several times in parallel.

That server design relies on the operating system to distribute incoming connections to the worker threads waiting in accept(). Also the CPU load can be somewhat erratic and finding the right number of worker threads can be a bit finicky. At least that's what I've read somewhere. But: It's simple and fits the bill for this experiment. So into the pod it goes.
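
In code the whole design is not much more than this (a sketch, not the real server: port, buffer size, worker count and the canned response are placeholders, the SQLite part is only hinted at in the comments, and error handling is missing):

// Prefork with threads: one listening socket, N worker threads,
// each looping over blocking accept()/read()/write()/close().
#include <pthread.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define WORKER_COUNT 8

void* worker_main(void* arg) {
    int server_fd = *(int*)arg;
    // In the real server each worker would also open its own SQLite connection here.
    char buffer[64 * 1024];
    while (1) {
        int conn_fd = accept(server_fd, NULL, NULL);
        ssize_t len = read(conn_fd, buffer, sizeof(buffer));
        if (len > 0) {
            // Parse the request, SELECT or INSERT via SQLite, build the response ...
            const char response[] = "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
            write(conn_fd, response, sizeof(response) - 1);
        }
        close(conn_fd);
    }
    return NULL;
}

int main() {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    int reuse = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(8000) };
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(server_fd, (struct sockaddr*)&addr, sizeof(addr));
    listen(server_fd, SOMAXCONN);

    pthread_t workers[WORKER_COUNT];
    for (int i = 0; i < WORKER_COUNT; i++)
        pthread_create(&workers[i], NULL, worker_main, &server_fd);
    for (int i = 0; i < WORKER_COUNT; i++)
        pthread_join(workers[i], NULL);
}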

Funny aside: I read about that server design maybe 20 years ago in the Apache 2 documentation. But I never used it… until now. Apache 2 uses worker processes instead of threads and child processes are created by fork(), hence the name "prefork". That gives you better isolation between the workers and you can easily recover when one worker crashes. But we don't need that here. None of that code will ever see production anyway (famous last words…).

Write-Ahead Logging

While I was searching the SQLite documentation for async stuff I also stumbled across something very interesting: Write-Ahead Logging (aka WAL mode).

I had an admittedly very outdated mental image of transactions in SQLite: Each transaction creates a new file and when that transaction is done that file is cleaned up. Many transactions = many file operations = lots of overhead = slow.

That is why I was worried about transactions when generating the posts. For this experiment I wanted every 100th request to insert a new post. Meaning 99 of 100 requests query all the posts of a random topic, 1 of 100 inserts a new post into a random topic. With that there would be a lot of transactions with INSERT statements going on. Maybe enough to become a bottleneck.

With WAL mode this isn't much of a problem anymore. To greatly oversimplify:

Every thread can have its own database connection doing its own reading and writing, without stepping on each other's toes. Only while doing a checkpoint do writers get blocked for a bit (or not even then, see SQLITE_CHECKPOINT_PASSIVE). And if I remember correctly it's only one fsync() syscall per checkpoint (instead of one per transaction).

Seriously, this is awesome! Kudos to whoever thought of this and implemented it.

To top it off you can even take control of the checkpointing and decide for yourself when and how to do it. So I did (hey, we're playing around here and I just found a new toy). Usually each of the database connections (worker threads in our case) would occasionally push the WAL log above the threshold (of e.g. 4 MB) and do a checkpoint. Instead I just spawned another thread that does a checkpoint every 1 second. So this dedicated "checkpointing" thread can do all the slow real I/O work while the readers and writers live it up. Does it make any sense to do it that way? I don't know, and didn't try anything else. "Spherical cow in a vacuum", remember?

Getting checkpointing to work that way required some attentive reading of the documentation, though. In the end I used the SQLITE_DEFAULT_WAL_SYNCHRONOUS=1 compile time option and the dedicated checkpointer thread periodically calls sqlite3_wal_checkpoint_v2() with mode SQLITE_CHECKPOINT_FULL.
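
Sketched out, that checkpointer thread looks something like this (a simplified reconstruction, not the exact code; disabling the automatic checkpoints per connection is my assumption about how to keep the workers from checkpointing themselves):

// Dedicated checkpointer: its own connection, one full checkpoint per second.
// Spawned from main() with e.g. pthread_create(&t, NULL, checkpointer_main, "posts.db").
#include <unistd.h>
#include "sqlite3.h"

void* checkpointer_main(void* arg) {
    const char* db_path = arg;
    sqlite3* db = NULL;
    sqlite3_open(db_path, &db);
    sqlite3_exec(db, "PRAGMA journal_mode = WAL", NULL, NULL, NULL);  // WAL mode is persistent, setting it once is enough
    sqlite3_wal_autocheckpoint(db, 0);  // 0 disables the automatic checkpoints (presumably the workers do the same)

    while (1) {
        sleep(1);
        int wal_frames = 0, checkpointed_frames = 0;
        sqlite3_wal_checkpoint_v2(db, "main", SQLITE_CHECKPOINT_FULL, &wal_frames, &checkpointed_frames);
    }
    return NULL;
}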

A little before that I also discovered that I stored the whole thing on an old HDD in my PC. I was hearing a lot of faint HDD seeking sounds while running the tests (or maybe fsync() noises, don't know). And with the above configuration I just got one faint HDD seeking sound per second. Auditory I/O feedback and debugging, now that is something an SSD can't do. :D

Just for completeness' sake I ran the whole thing on an SSD. It didn't make a significant difference. But then the whole database was only ~200 MB, meaning it was completely in the kernel's page cache anyway. All normal I/O thus just became in-memory operations and only the occasional fsync() did any real I/O. When the whole database doesn't fit into memory you'll likely get a completely different picture.

Compiling and source code

Oh, just for reference: I was using the SQLite v3.46.0 amalgamation with the recommended compile-time options, gcc v11.4.0 and without optimization. The amalgamation packs the entire SQLite source code into a single C file (9 MB or so).

It takes 2 or 3 seconds to compile a program with that, because I was too lazy to add a line to my Makefile to compile SQLite into its own object file just once. Those were some of the longest compile times I have experienced in years. Usually when I do my little experiments in C everything compiles in a fraction of a second. One of the reasons I like C. On second thought, maybe I should have spent those few seconds to add a line to my Makefile…

Another fun anomaly: For some reason compiling with -O2 or even -O1 made the programs quite a bit slower (from ~4.6k req/sec down to ~830 req/sec on my PC). Maybe some funny combination of SQLite and gcc versions. Or gcc doesn't like 9 MB of C code in one large piece. Since I was I/O bound anyway I didn't investigate it and just didn't compile with optimization.

In the end the whole server program came down to ~450 lines of C code (including comments, blank lines, etc.). If you're interested in the code, just ask. I just don't want some random person taking some crappy throw-away code from somewhere on the internet and putting it into some mission critical system. That would be like taking the mad ramblings of some person, putting them in a book (or social network) and selling them as the truth. And our society doesn't have the best track record in that regard. There is good documentation on all the parts I played around with, no reason to add my crappy code to that.

Running the server on the Raspberry Pi

Ok, now on to the real meat. The test setup:

As mentioned before 1 in 100 is a POST request inserting a new post. 99 in 100 are GET requests querying all posts of a random topic. I used a wrk Lua script to do the POST request with a 1 in 100 chance (math.random(1, 100) == 1). This is the full wrk command:

./wrk -t 12 -c 400 -d 10 --script wrk_script.lua http://raspberrypi:8000/

At first I used my original test data with that pretty messed up distribution. Because I'm lazy. Posts have between 57 - 2772 characters (avg. around 1400) and each topic has between 5 - 1000 posts (avg. around 500). Way too many posts per topic. Anyway, here are the results (3 runs):

Requests/sec:  161.43    162.39    163.83
Transfer/sec:  111.36MB  111.41MB  111.72MB
CPU usage: 80% - 100% (of 400% because 4 cores)

And the full wrk output for the 2nd run (just as an example):

Running 10s test @ http://raspberrypi:8001/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   304.81ms  311.84ms   1.97s    87.31%
    Req/Sec    19.01     14.06   110.00     84.91%
  1638 requests in 10.09s, 1.10GB read
  Socket errors: connect 0, read 0, write 0, timeout 46
Requests/sec:    162.39
Transfer/sec:    111.41MB

The observant reader will notice that ~112 MB/s is the maximum transfer speed for one 1 GBit/s Ethernet connection. Also we only utilize about a fourth of the CPU. So yeah, totally I/O bound on the network connection.

I was hoping for more, but in retrospect this shouldn't be surprising with my messed up test data. 99 of 100 requests fetch all posts of a topic. Ignoring any HTTP and encoding overhead, that comes to an average of:

500 posts/topic * 1400 bytes/post = 700000 bytes/topic = 700 KByte/topic

With a 1 GBit/s Ethernet connection this can give us:

112 MB/s / 0.7 MB/topic = 160 topics/s

Those are all just back of the envelope estimates. But they match up with the measurements and confirm that we're I/O bound because of our data.

At this point I could have started to think about compression (hey, 3 of 4 cores are idle!), pre-compressing entire topics and only updating them when changed, etc. But the theme of this experiment is to be lazy and avoid difficult or complex things. So I just changed my data distribution!

Using different test data

I looked around for a bit, but the only thing I could find was How the average comment length compares between subreddits by tigeer on r/dataisbeautiful. From what I've read, the source was comments posted in 2019, gathered using the pushshift.io API. This gives us at least a little bit of information to cook up a new distribution:

Again, highly professionally eyeballing it here. How does the Raspberry Pi handle test data with that distribution? 4 runs this time:

Requests/sec:  7204.74    6049.65    7177.06    6087.99
Transfer/sec:    32.68MB    28.30MB    34.79MB    31.41MB
CPU usage: 150% - 200%

And one of the wrk outputs:

Running 30s test @ http://raspberrypi:8001/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    17.96ms   69.00ms   1.67s    95.44%
    Req/Sec   512.65    185.41     1.63k    73.87%
  183166 requests in 30.09s, 0.92GB read
  Socket errors: connect 0, read 0, write 0, timeout 21
Requests/sec:   6087.99
Transfer/sec:     31.41MB

It's interesting that we're not limited by the network bandwidth (only ~30 MB/s of the ~112 MB/s available). But we're also not limited by the CPU, since it doesn't reach 400%.

This is probably the point where the prefork server model breaks down. To test this assumption I ran it again with 16 worker threads instead of 8:

Requests/sec:  7363.26    5899.52    5907.11    6461.15
Transfer/sec:    40.19MB    33.00MB    33.87MB    39.12MB
CPU usage: 270% - 300%

It's hard to tell, but it doesn't make much of a difference in req/sec. And this is a common theme of thread-based synchronous I/O systems: You throw a lot more resources at them but at some point it barely makes a difference (diminishing returns). There's so much blocking, waiting, task switching, etc. going on that you can't really utilize the hardware resources efficiently.

Eventloops and async I/O are better in that regard, but all in all they're a lot more complex. We already get 6k - 7k req/sec with what amounts to an almost "Hello World!" server bolted together with SQLite. That gives it a hell of a complexity-to-performance ratio.

By the way: All of this I/O was going to the SD card of the Pi. Not sure how much that matters since all the data would be in the kernel's page cache anyway, but fsync() probably has something to say about that.

Running it on the PC

Just for the lols I ran the same benchmark on my PC (8-core AMD Ryzen 7 5700X) with 8 worker threads. This time everything runs on the same machine, so we won't be limited by the network bandwidth:

Requests/sec:  51716.15    42627.91    32928.37    27801.58
Transfer/sec:    265.33MB    270.44MB    240.95MB    245.43MB
CPU usage of server process:  420% - 580%
CPU usage of wrk process:    ~250%

Full CPU utilization would be 1600% because of hyper-threading. Honestly, those numbers make me shudder to think what a more server-like system with a Threadripper, 256 GB of RAM and several 10 GBit/s network ports could handle.

Very rough estimates of Reddit read/write ops

Up until now I did what every (professional) (modern) (web-)developer does: completely ignore the purpose of the whole thing and just play around with my favorite tools. I sincerely hope this is irony on my part, but I'll let you be the judge of that. Someone once told me that "professional" only means you get paid for what you do, not that you're doing something in a proficient manner.

Anyway, I tried to get some numbers about Reddit. After all I wanted to know if one could host a greatly over-simplified Reddit-like thing on a Raspberry Pi ("spherical cow in a vacuum", remember?). Specifically how many new posts and comments come in every second. Well, I didn't find much. In the end the best I found was Reddit User and Growth Stats, which seems to be based on the Reddit SEC filings and Statista. I tried to look at the SEC filings but couldn't make heads or tails of them. And the Statista data is behind a paywall.

For whatever reason they mostly care about active users, revenue and other unimportant stuff. But thankfully the page listed posts and comments by year and I only cared about 2024:

 550 million posts    in 2024
2.72 billion comments in 2024

In keeping with the theme of this experiment that estimate is good enough.

The numbers sound awfully impressive. Nice, large numbers for which at least I don't have any reference point. Except maybe the world population or something equally abstract. But then, that stuff was made to impress investors so they give you money.

So… let's do some really sophisticated analysis and divide those yearly numbers by the number of seconds in a year. For reference: A year has more or less 365 * 24 * 60 * 60 = 31 536 000 = 31.5 million seconds.

 550 million posts    =   550 000 000 / (365*24*60*60) = 17.5 posts/s
2.72 billion comments = 2 720 000 000 / (365*24*60*60) = 86.25 comments/s

Let's call it ~100 inserts/sec in total.

For some reason I expected that number to be much much higher. Sure, this is an average and there will be spikes. And the distribution across different subreddits will be very uneven. But even then, what's going to prevent you from giving each bigger subreddit-alike its own Raspberry Pi?

Closing words

In a kind-of funny way this number really took the wind out of the entire experiment. Realistically nothing I'll ever build will come close to needing the capacity a single Raspberry Pi can provide. My motivation to try different configurations (e.g. with checkpointing) just evaporated. And mind you, this was just a few days of fooling around, avoiding complexity and maybe gluing stuff together in a slightly less stupid way than usual. At least in regards to requests per second.

Bandwidth, caching, storing accumulated data, images, videos, and so on are matters outside this experiment's particular "spherical cow in a vacuum". And I hope you keep that in mind.

Anyway, the next time someone wants to use <insert your favorite overcomplicated stack here> because we need to "scale" or "that's how everyone else does it" I can ask if we need to scale beyond one Raspberry Pi. And can back that up with some numbers. :D Which honestly don't matter anyway in such discussions, it's just about favorite toys. Might as well throw in my own as well.

I would much rather support the Raspberry Pi foundation with my money, not AWS. But I have to give the sales people of cloud providers credit, they trained web developers well. Pavlov would be proud. :D
