In our previous blog post, we explained how we prepared for and handled the Reddit effect for the University of Reddit, which brought over half a million visits and more than 100,000 unique visitors in a little over 24 hours, sustaining an average of roughly 1,200 simultaneously active users for hours. That post presupposed familiarity with all the tools we mentioned; this post introduces each of them.
Hosting: The first thing necessary to host a public website is a server that the domain name points to and that serves files from its hard drive to internet visitors. There are three main options for hosting: shared, virtualized, and dedicated. In shared hosting, a hosting company hosts several customers’ websites on a single server, and each customer is given access only to his own directory on that server. This is the cheapest and simplest option, but shared hosting generally cannot handle large amounts of traffic, and the customer has no ability to change the server configuration or install new software as necessary.
The other extreme is dedicated hosting, in which one buys or rents a server that is dedicated entirely to a website and over which the customer has complete control. The middle ground is virtualized hosting: a virtualized server is one with a hypervisor installed, which essentially simulates several operating systems installed on a single machine and running simultaneously. The customer then has the illusion of having an entire server to himself and can configure it as he sees fit. Linode offers virtualized hosting, which is what the UReddit website runs on.
Coding: A programming language in which the website is to be written must be chosen from the outset. Since the initial version was written in PHP, our update to the website, which will be discussed later, entailed a rewrite in PHP.
Software stack: A server needs several main elements for a basic website: an HTTP server to serve content, a programming language to execute code and generate dynamic content, and a database system to persistently store data.
An HTTP server is the software that reads a requested file, such as a web page or an image, and serves it to a web user’s browser. There are several options; two of the most popular are Apache and nginx. Apache spawns a worker process or thread for each connection, and each worker then handles serving the content. Apache integrates with other tools through a system of modules; each worker works with the loaded modules (such as the PHP module, which executes PHP code to generate a dynamic page), runs whatever code needs to be run, and serves the result to the user. Because of this structure, Apache can be very memory intensive, and this resource requirement is the reason we migrated the UReddit server to nginx.
Nginx is a more lightweight and scalable web server that largely functions as a reverse proxy: it accepts a user’s request, fetches the prepared content from the right place based on its configuration, and relays it to the user. It does not integrate with PHP or other languages the way Apache does, so in order to serve dynamic content generated by PHP, we installed php-fpm.
php-fpm is essentially a specialized server: it listens for a request, executes the appropriate PHP file, and returns the output. nginx functions as a reverse proxy, passing execution of PHP scripts to php-fpm and then relaying the output to the user. nginx is lightweight and php-fpm is fast and scalable, with the ability to spawn more worker processes dedicated to code execution as necessary, so nginx+php-fpm was a marked improvement over Apache.
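As a sketch of how the two pieces fit together (the domain, paths, and listening address here are illustrative, not our exact configuration), an nginx server block that serves static files itself and hands .php requests to a php-fpm listener over FastCGI might look like:

```nginx
server {
    listen 80;
    server_name example.com;       # illustrative domain
    root /var/www/site;            # illustrative document root

    # Static files are served directly by nginx.
    location / {
        try_files $uri $uri/ /index.php;
    }

    # PHP requests are relayed to the php-fpm process via FastCGI.
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_pass 127.0.0.1:9000;   # address php-fpm listens on
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    }
}
```

The key line is fastcgi_pass: nginx never runs PHP itself; it only forwards the request and relays whatever php-fpm returns.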
Furthermore, since nginx acts as a reverse proxy, we were able to set up an additional server with its own php-fpm instance and have nginx send 50% of pageviews to each of the two servers, splitting the load from the Reddit blog post between them.
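Splitting work across two backends this way is purely a configuration matter in nginx. A minimal sketch (the backend addresses are hypothetical) uses an upstream block, which round-robins requests by default, so two equally weighted servers each receive roughly half the traffic:

```nginx
# Two php-fpm backends of equal weight; nginx alternates between them,
# so each executes roughly 50% of PHP requests.
upstream php_backends {
    server 10.0.0.1:9000;   # first php-fpm server (hypothetical address)
    server 10.0.0.2:9000;   # second php-fpm server (hypothetical address)
}

server {
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_pass php_backends;   # refers to the upstream group above
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    }
}
```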
Database: The original database was in MySQL, so we decided to keep MySQL rather than migrating to another software such as PostgreSQL. For an introduction to MySQL aimed at complete laymen, there is a UReddit class on that subject.
Caching: Proper caching is one of the most effective ways to reduce load on a server, and one of the most popular solutions in this area is memcached. When a website receives a lot of traffic and serves the same slowly-changing content over and over, running a full database query that reads from the hard drive on every request is very costly. memcached is essentially a database that lives entirely in memory; in our rewrite of the UReddit code, every database query stores its results in memcached so that they can be served much more quickly the next time they are requested, and this data is removed from memory whenever the corresponding data changes in MySQL itself. memcached is very effective and increased our server’s performance significantly.
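The pattern just described is often called cache-aside: read from the cache first, fall back to the database on a miss, and invalidate the cached entry on writes. It is language-agnostic; our code is PHP, but here is a sketch in Python, with a plain dict standing in for a real memcached client and hypothetical query/key names:

```python
# In-memory stand-in for a memcached client; a real deployment would use
# a memcached client library, but the control flow is identical.
cache = {}

def fetch_user(user_id, run_query):
    """Return query results, consulting the cache first (cache-aside read)."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]           # cache hit: no database round trip
    result = run_query(user_id)     # cache miss: hit the database
    cache[key] = result             # store for subsequent requests
    return result

def update_user(user_id, write_query, new_data):
    """Write to the database, then invalidate the now-stale cache entry."""
    write_query(user_id, new_data)
    cache.pop(f"user:{user_id}", None)   # next read repopulates the cache

# Usage: count how often the "database" is actually queried.
calls = []
def fake_query(user_id):
    calls.append(user_id)
    return {"id": user_id, "name": "alice"}

fetch_user(1, fake_query)    # miss: queries the database
fetch_user(1, fake_query)    # hit: served from memory
update_user(1, lambda uid, data: None, {"name": "bob"})
fetch_user(1, fake_query)    # miss again after invalidation
```

Of the three reads above, only two reach the database; under real traffic, where the same pages are requested thousands of times between changes, the hit rate is far higher.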
A less commonly discussed scenario for caching is PHP code execution itself. Again, if the same code is executed again and again, why not save the intermediate computations for reuse? APC does just that by caching the compiled opcodes of PHP scripts, so they do not have to be parsed and compiled on every request, and it also brought a significant performance increase.
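Enabling APC is mostly a matter of installing the extension and setting a few php.ini directives; an illustrative fragment (the values are examples, not our production settings):

```ini
extension=apc.so    ; load the APC extension
apc.enabled=1       ; turn the opcode cache on
apc.shm_size=64M    ; shared memory reserved for cached opcodes
apc.stat=1          ; re-check file mtimes so edited scripts are recompiled
```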
Code structure: Lastly, the way in which the web application is written makes a world of difference. Unoptimized code, unnecessary queries, and inefficient algorithms can make the best-configured server work slowly. Several months ago, it was clear that the database structure and code of the UReddit website were poor, so we created a new, efficient core around which the rest of the site’s functionality was built. We found this core very useful and used it in other projects, so we named it Backbone and open-sourced it; if you are interested, there is a brief explanation on its webpage.