This is an overview of the technical side of handling the Reddit effect brought on by a Reddit blog post. For an analysis of the blog post’s effects from the point of view of community metrics and social media, please refer to this analysis by Elliot Volkman at Play This Magazine. For a lay explanation of the technologies discussed below, refer to our blog post introducing each of them.
Anastas, LLC donates the hosting for the University of Reddit project and manages and optimizes the server so that it can handle high load. At first, the website ran on a small Linode 512 with a vanilla Apache, MySQL, and PHP stack on Arch Linux. The first time the site received an influx of traffic, we realized that such surges could arrive suddenly and that the site had to be able to scale more easily. On Monday, August 20, 2012, a Reddit blog post about the University of Reddit resulted in exactly such an influx, averaging 1,200 simultaneous active users (as reported by Google Analytics) for hours. Here is how we successfully handled it.
The first steps to prepare for growth were taken months before the above-mentioned blog post. The easiest was to upgrade to a Linode 784 instance, as Apache ate all the available memory after just a small burst. It was obvious that this wouldn’t be enough, however; Apache simply isn’t built in a way that lets it directly and solely handle a large burst of traffic, and vanilla PHP can be slow. For that reason, we migrated to nginx, with php-fpm handling the PHP execution. We also installed an opcode cache, APC, to help speed up execution. We configured nginx to serve static files directly and to pass requests for PHP files to a server pool with just one server: localhost.
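The split described above, with static files served directly and PHP requests handed to a backend pool, can be modelled as a small routing function. This is an illustrative Python sketch and not nginx configuration; the names `PHP_POOL` and `route`, the port 9000, and the rule of deciding by file extension are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import Tuple

# Hypothetical backend pool; at this stage it holds a single php-fpm server.
PHP_POOL = ["127.0.0.1:9000"]  # assumed address and port


def route(path: str) -> Tuple[str, str]:
    """Return (handler, target) for a request path.

    Paths ending in .php go to the only backend in the pool;
    everything else is treated as a static file served from disk.
    """
    if PurePosixPath(path).suffix == ".php":
        return ("php", PHP_POOL[0])
    return ("static", path)


print(route("/index.php"))
print(route("/css/site.css"))
```

Keeping the PHP backends in a list, even a list of one, is what makes it cheap to add a second server later.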
Lastly, perhaps the easiest speed increase came from memcached, an in-memory cache: keeping the results of expensive work in memory means they don’t have to be recomputed on every request.
The speed increase from these changes was immediately obvious. A static HTML file was served in full in under 150 ms on average. Another burst of traffic from Reddit soon came, and it became clear that there were issues with the site’s code itself. This prompted a code rewrite, the core of which we split off into our open-source Backbone project. The more efficient code proved to scale much better.
This configuration worked well. But then the Reddit effect came. The site took over 15 seconds to load – when nginx wasn’t returning a 502 Bad Gateway error, that is, which was most of the time. It was clear that this single server simply couldn’t handle the load.
It was obvious that something had to be done, but we couldn’t very well migrate to a more flexible service such as Amazon Web Services and wait for DNS propagation while being frontpaged on Reddit. That left spreading the load across a second Linode instance.
We purchased a Linode 512 instance and gave it the hostname Basil, after the painter in the Wilde novel (the main server being named Oscar). A quick glance at htop showed that MySQL was using 35% of the CPU resources. Since the majority of the traffic was coming to the index page, and the index page was very database-heavy, our first guess was that the first step should be to move MySQL to Basil. This was easy (install MySQL, dump and migrate the existing data, comment out the skip-networking directive, bind to the right IP address, allow access only from Oscar) and was done quickly.
But disappointment set in. Oscar stayed at 100% CPU utilization with a load average of about 22, even after we raised php-fpm to 24 worker processes, while Basil was unfazed. memcached was doing its job better than we thought. Clearly, the issue lay with computation, and php-fpm had to be spread across the two servers. The first instinct was to install php-fpm on Basil, scp over Oscar’s public_html directory along with the php-fpm.conf and php.ini files, and add Basil to the nginx server pool. However, the directory structure on Oscar held many leftover files amounting to gigabytes, and that transfer certainly wasn’t going to happen quickly under this sort of load; weeding out the large files would have taken too long with 1,200 active users constantly getting 502 errors.
Luckily, we manage a Beanstalk account that UReddit uses to deploy git commits automatically to the production environment. We added another deployment environment, Basil, to the account and pushed a manual deployment in order to get the minimally necessary files over to Basil. This took only a couple of minutes. We updated the nginx configuration, only to find that Oscar’s CPU utilization remained almost unchanged while Basil’s jumped to 100% as well!
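To illustrate what adding a second server to the pool does, here is a small Python simulation of round-robin distribution, which is nginx's default balancing method for an upstream pool. Whether the pool here used the default is an assumption, as are the port numbers and the request count; only the hostnames come from the account above.

```python
from collections import Counter
from itertools import cycle

# Hypothetical pool: hostnames are from the article, ports are assumed.
pool = ["oscar:9000", "basil:9000"]
next_backend = cycle(pool)  # round-robin: hand out backends in turn

# Distribute 1,000 simulated PHP requests across the pool.
hits = Counter(next(next_backend) for _ in range(1000))
print(hits)
```

Each backend ends up with half of the requests, so if the incoming load is more than one machine can absorb, both can be kept fully busy at once.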
All was well, however. It turned out that these two servers were just enough to handle the full Reddit effect. Over the course of a little over 24 hours, UReddit garnered half a million pageviews from over 100,000 unique visitors, doubled its userbase size, had all pages loading in 1-2 seconds (3-4 for the index page), and threw no 502 errors after Basil’s deployment.
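As a rough sense of scale, the figures above can be turned into average rates. The calculation below uses only the rounded numbers quoted in this post (so "over 100,000" is treated as a lower bound and "a little over 24 hours" as exactly 24); averages like these say nothing about the peaks, which were considerably higher.

```python
pageviews = 500_000        # "half a million pageviews" (from the article)
unique_visitors = 100_000  # "over 100,000 unique visitors" (lower bound)
hours = 24                 # "a little over 24 hours" (rounded down)

per_second = pageviews / (hours * 3600)    # average pageviews per second
per_visitor = pageviews / unique_visitors  # upper bound, since visitors is a lower bound

print(f"about {per_second:.1f} pageviews/s on average")
print(f"at most {per_visitor:.1f} pageviews per visitor")
```

That works out to roughly 5.8 pageviews per second averaged over the day, and no more than about 5 pageviews per visitor.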
Of course, once the load let off, we killed Basil. The job was done and the Reddit effect had been survived.