A lay explanation of the tools used to handle the Reddit effect for UReddit

In our previous blog post, we explained how we prepared for and handled the Reddit effect for the University of Reddit, which brought over half a million visits and over 100,000 unique visitors in a little over 24 hours, with an average of approximately 1,200 simultaneously active users for hours at a time. That post presupposed familiarity with all the tools we mentioned; this post introduces each of them.

Hosting: The first thing needed to host a public website is a server that the domain name points to and that serves files from its hard drive to internet visitors. There are three main options for hosting: shared, virtualized, and dedicated. In shared hosting, a hosting company hosts several customers’ websites on a single server and each customer is given access only to his own directory on the server. This is the cheapest and simplest option, but shared hosting generally cannot handle larger amounts of traffic, and the customer cannot change the server configuration or install new software as necessary.

The other extreme is dedicated hosting, in which one buys or rents a server that is dedicated entirely to a single website and over which the customer has complete control. The middle ground is virtualized hosting: a virtualized server has a hypervisor installed, which essentially simulates having several operating systems installed on a single machine and running simultaneously. The customer then has the illusion of having an entire server to himself and can configure it as he sees fit. Linode offers virtualized hosting, which is what the UReddit website runs on.

Coding: A programming language in which the website is to be written must be chosen from the outset. The initial version was written in PHP, so our update to the website, which will be discussed later, entailed a rewrite in PHP.

Software stack: A server needs several main elements for a basic website: an HTTP server to serve content, a programming language to execute code and generate dynamic content, and a database system to persistently store data.

An HTTP server is the software that reads a requested file, such as a web page or an image, and serves it to a web user’s browser. There are several options; two of the popular ones are Apache and nginx. In its traditional configuration, Apache dedicates a worker process or thread to each connection, and that worker then handles serving the content. Apache integrates with other tools through a system of modules; each worker works with the loaded modules (such as the PHP module, which executes PHP code to generate a dynamic page), executes any code that needs to be executed, and serves the result to the user. Because every worker carries the loaded modules with it, Apache can be very memory intensive, and this resource requirement is the reason we migrated the UReddit server to nginx.

Nginx is a more lightweight and scalable web server that largely functions as a reverse proxy, which essentially means that it accepts a user request, gets the prepared content from the right place based on its configuration, and relays it to the user. It does not integrate with PHP or other languages in the way Apache does. In order to be able to serve dynamic content generated by PHP, we installed php-fpm.

php-fpm is essentially a specialized server: it listens for a request, executes the appropriate PHP file, and returns the output. nginx functions as a reverse proxy, passing execution of PHP scripts to php-fpm and then relaying the output to the user. nginx is lightweight and php-fpm is fast and scalable, with the ability to spawn more worker processes dedicated to code execution as necessary, so nginx+php-fpm was a marked improvement over Apache.
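The division of labour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how nginx or php-fpm are implemented: the names `front_server`, `script_backend`, and `STATIC_FILES` are hypothetical, and the "script execution" is faked with a string.

```python
# Hypothetical stand-ins: these names are illustrative, not nginx or php-fpm APIs.
STATIC_FILES = {"/logo.png": b"<image bytes>", "/style.css": b"body { margin: 0 }"}


def script_backend(path: str) -> bytes:
    """Plays the role of php-fpm: 'runs' the script at `path` and returns its output."""
    return f"<html>generated by {path}</html>".encode()


def front_server(path: str) -> tuple[int, bytes]:
    """Plays the role of nginx: decides where the content comes from and relays it."""
    if path.endswith(".php"):
        # Dynamic content: hand execution to the backend, then relay its output.
        return 200, script_backend(path)
    if path in STATIC_FILES:
        # Static content: serve the file directly, with no script execution involved.
        return 200, STATIC_FILES[path]
    return 404, b"not found"


status, body = front_server("/index.php")
```

The front server never executes any script itself; it only chooses a source and passes the bytes along, which is why it can stay small while the backend does the expensive work.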

Furthermore, since nginx acts as a reverse proxy, we were able to set up an additional server with its own php-fpm instance, so that nginx had 50% of pageviews executed by each of the two servers we set up to handle the load from the Reddit blog post.
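An even split like this is what plain round-robin selection produces, which is nginx's default behaviour for a pool of backends. The Python sketch below simulates that rotation; the backend addresses are made-up placeholders, and `make_balancer` is a hypothetical helper rather than anything nginx exposes.

```python
from collections import Counter
from itertools import cycle

# Hypothetical backend addresses; the post does not give the real ones.
BACKENDS = ["127.0.0.1:9000", "10.0.0.2:9000"]


def make_balancer(backends):
    """Return a function that picks the next backend in strict rotation."""
    rotation = cycle(backends)
    return lambda: next(rotation)


pick_backend = make_balancer(BACKENDS)

# Simulate 1,000 page views and count how many each backend receives.
counts = Counter(pick_backend() for _ in range(1000))
```

With two backends in the pool, each one ends up with exactly 500 of the 1,000 simulated page views. Adding a third address to `BACKENDS` would spread the same traffic three ways without touching the rest of the code.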

Database: The original database was in MySQL, so we decided to keep MySQL rather than migrating to another database system such as PostgreSQL. For an introduction to MySQL aimed at complete laymen, there is a UReddit class on that subject.

Caching: Proper caching is one of the most effective ways to reduce load on a server. One of the most popular solutions in this area is memcached. When a website receives a lot of traffic and serves the same content over and over, and that content doesn’t change very quickly, running a full database query that reads the information from the hard drive on every request is very costly. memcached is essentially a database that lives entirely in memory; in our rewrite of the UReddit code, every database query stores its results in memory using memcached so that they can be served much more quickly the next time they are requested, and this data is removed from memory whenever the underlying data changes in MySQL itself. memcached is very effective and increased our server’s performance significantly.
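The read-through-and-invalidate pattern described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the UReddit code: two plain Python dictionaries stand in for MySQL and memcached, and the key and value are invented for the example.

```python
# In-memory stand-ins: `database` plays MySQL, `cache` plays memcached.
# The key and value below are invented for illustration.
database = {"class:1": "Intro to MySQL"}
cache = {}
stats = {"db_reads": 0}  # counts how often the slow path is taken


def read(key):
    """Return the value for `key`, touching the database only on a cache miss."""
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    value = database[key]
    cache[key] = value  # remember the result for the next request
    return value


def write(key, value):
    """Update the database and drop the now-stale cached copy."""
    database[key] = value
    cache.pop(key, None)


read("class:1")                       # miss: goes to the database
read("class:1")                       # hit: served from memory
write("class:1", "Intro to MySQL, 2nd edition")
latest = read("class:1")              # miss again, because the write invalidated it
```

Three reads cost only two database lookups here, and the read after the write still returns the fresh value. On a busy page the ratio is far more lopsided, since thousands of reads can share one lookup between changes.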

A less commonly discussed use of caching is PHP code execution. Again, if the same code is executed again and again, why not save the intermediate computations for reuse in order to increase performance? APC does just that with PHP opcodes, and it also brought a significant performance increase.
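The same idea can be shown by analogy in Python, which also compiles source text into an intermediate form before running it. The sketch below is not APC and makes simplifying assumptions: `run_script` is a hypothetical helper, and it keys the cache on the source text itself rather than on a file path.

```python
compiled_cache = {}       # maps source text -> compiled code object
stats = {"compiles": 0}   # counts how often the expensive compile step runs


def run_script(source: str, variables: dict) -> dict:
    """Execute `source`, compiling it only the first time it is seen."""
    code = compiled_cache.get(source)
    if code is None:
        stats["compiles"] += 1
        code = compile(source, "<script>", "exec")
        compiled_cache[source] = code
    namespace = dict(variables)  # fresh variables for every run
    exec(code, namespace)
    return namespace


SCRIPT = "greeting = 'Hello, ' + name"
first = run_script(SCRIPT, {"name": "Alice"})
second = run_script(SCRIPT, {"name": "Bob"})
```

Both runs produce their own output from their own inputs, but the compile step happens only once. Only the translation from source to executable form is shared; the results of each execution are not.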

Code structure: Lastly, the way in which the web application is written makes a world of difference. Unoptimized code, unnecessary queries, and inefficient algorithms can make even the best-configured server slow. Several months ago, it was clear that the database structure and code of the UReddit website were poor, so we created a new, efficient core around which the rest of the site functionality was built. We found this core very useful and used it in other projects, so we named it Backbone and open-sourced it; if you are interested, there is a brief explanation on its webpage.


3 Responses to A lay explanation of the tools used to handle the Reddit effect for UReddit

  1. Pingback: Handling the Reddit Effect for UReddit | Anastas, LLC Blog

  2. Dee says:

    Actually I think the problem is because you and I installed from source. I recently came across the same issue. From what I could find out, the configuration file fastcgi_params, if installed from source, does not contain the line you mentioned. For instance, I have PHP-FPM + Dokuwiki, and if you want to enable redirection to HTTPS based on URL (e.g. login) the location part looks like this:

        location ~\.php$ {
            include fastcgi_params;
            fastcgi_param HTTPS $php_https; # DW checks $_SERVER['HTTPS']
            fastcgi_pass 127.0.0.1:9000;

    In my case I used tcpdump -s0 -l -w -i any port 9000 | strings in order to be sure Nginx was forwarding the requests to PHP-FPM. If you try this you will see the HTTP headers, among them SCRIPT_FILENAME and SCRIPT_NAME, and the status code as well. I normally work with Debian; if you install the package from the repositories you will notice the difference in the file I mentioned. In any case, I like to get it from source, we just keep in mind to add the missing line. Great post!!

    • anastasllc says:

      We did not install from source. We run on Arch Linux and manage everything using its package manager, pacman, unless there is a need for source code modification (such as the time we integrated the PM and @ureddit.com email system so that a user checking his email would mark a PM in his website user account inbox as read).

      In any case, I’m not sure what “problem” you are addressing. We had no problems distributing load across the two servers by adding the second server’s IP/port to the server pool. Indeed, seeing a new server whose IP is not associated with any domain suddenly shoot to nearly 100% CPU usage after /etc/rc.d/nginx reload, combined with the drastic decrease in response time and the disappearance of 502 errors, made it clear that the new configuration was working; no tcpdump was necessary. On that note, perhaps tcpdump | strings is not the best way to check for expected traffic under very high load, given the amount of text that would be scrolling across the screen; netstat | grep 9000 seems as though it would be easier.

      If you meant that creating a server pool to which IP addresses can be added and having the website that needs to scale pass requests via FastCGI to a server pool is not a part of the standard nginx.conf, then that is to be expected. Any server configuration file should be tuned to the server it is configuring in order to optimize performance and standard configuration files are aimed at the lowest common denominator.
