Part 2: An Architecture Overview — Apache, MySQL, Memcached, SQLite
Written By: Justin Leider
Web Architecture, Web Technology
In my last post I mentioned the numerous technologies which were on tap for the upcoming version of CitySquares. This installment will continue to define an overview of the underlying architecture and begin to dig a little deeper into the actual implementation of the technologies. The idea and focus of this new architecture is aimed at creating a much more stable and scalable platform for us to work with. Before I get into the details you'll see Ive provided a graphic representation of how the architecture will be laid out.
Bear with me as I explain the work flow behind this graphic as it is not 100% clear from the visual representation. First off, I run Ubuntu Linux which is great for just about everything I need, except for creating any sort of graphics, so I apologize in advance for the lackluster graphic. As you can see, there are a few different layers: users, HA Proxy, Apache, Memcached, SQLite and finally MySQL labeled as databases.
First and foremost are our beloved users, which whom without we would have no need for a website. Starting from the beginning, the users request a page from CitySquares, from there their request is passed through one of two HA Proxy servers. The sole purpose of these two machines is to load balance the incoming requests among all our Apache web servers and serve as a failsafe for one another. Once the user's request has been accepted and forwarded along to Apache we actually begin to process the request.
The Apache servers run PHP and XCache modules. The PHP part I feel is fairly straight forward and out of the scope of this post so I will skip that part of the architecture. XCache however, is used in conjunction with and is an enhancement to PHP. More specifically XCache is an opcode optimizer and cache. It works by removing the compilation time of PHP scripts by caching the compiled and optimized state of the PHP scripts directly in the shared memory of the Apache server. This compiled version can increase page generation times by up to 500%, speeding up overall response time and reducing server load.
Just as with all dynamic websites most if not all the actual data is stored in databases. Gone are the days of flat files with near zero processing required. Databases are the new workhorses of the web world and as such usually become the bottle neck of the overall system. CitySquares is in a somewhat unique position, nearly all our page loads have quite a bit of location and distance based processing and nearly all of this is done in our MySQL database. So while our Apache servers are sitting idle waiting for responses from their queries, the DB is preforming the brunt of the work calculating distances between objects and the like.
We can reduce this bottleneck in a couple of different ways, the first of which is object caching. We will use Memcached to cache objects returned from the database. Say for example, we know the distance between two businesses. We know with a fair amount of certainty that those two businesses are going to be in the same place they were an hour ago, just as they were a week ago and as they will be a day from now. So we can cache this information with an expiration time of a couple days, thus saving ourselves the expense of calculating the distance between them on every page load. Of course if a user comes by and changes the location of one of these businesses, we can expire the object in cache and replace it with a newly calculated object straight from the database on the subsequent page load. These expensive queries require large table scans and mathematical formulas calculations on every row. These query results can be cached to free up the database and allow it to do what it does best. Store and retrieve data.
In the case where we cant find the data in Memcached, either because it doesn't yet exist or has expired we will turn to our databases. We must first query a SQLite instance which is the gate keeper between Apache and the numerous databases we have. By having a separate lookup table we can essentially divide and parcel out our data sets on a table by table basis even down to an entry by entry basis. Depending on the type of data we are requesting SQLite will provide us with the location of one database or another to query for our data.
One could argue that this just adds another layer of latency and they would be correct. However, as scalability becomes an issue you will find that adding database replication generally results in diminishing returns. As new servers are brought online the overhead associated with replicating writes across all the replicated servers becomes choking and creates its own bottleneck. On the other hand, with a lookup table and a horizontal database architecture we don't have to worry about database replication nearly as much. You can just as easily divide your data sets into different databases. Now how you go about this varies greatly depending on your data. For CitySquares the solution turns out to be rather simple. Everything we do is location specific so it only makes sense that each data set is only as big as its parent city. Theoretically every city and all the data related to said city could reside in its own database. As you can probably guess we are only performance limited by the biggest cities, Manhattan, Brooklyn, etc. In these few cases we can always fall back to bigger and better servers and or replication if necessary.
Just as our database has become a bottleneck in our current site, our search engine is also one as well, just to a lesser extent. We can take the lessons learned from our horizontal database architecture and apply it to the search engine architecture as well. By dividing our data sets into logical partitions we can keep our data from getting too large and unwieldy; And with these smaller data sets we can reduce or remove all together the overhead associated with replicating data over multiple machines.
While this solution sounds great, it won't be worth the effort if every time a programmer wanted to access some data they would be required to check Memcached, then SQLite and then finally MySQL for every query. In order for this to be feasible from a programmers standpoint the programmer should never have to think about this underlying architecture. This of course I will discuss in greater detail in the upcoming installments. Stay Tuned.
Related Posts
- Is Amazon's EC2 right for you? I've been asked this and similar questions quite a bit lately. But before I delve into the answer to this I want to lay the foundation and ask you a...
- SEO: Taking control of search In my experience the majority of web agencies and developers still do not take search seriously enough. Most businesses have very simple requests, "How do I show up for keyword...
- There and Back Again, an EC2 MySQL Cluster Limitations of EC2 as a web platform: Price- An m1.xlarge instance will run you ~$600 with data transfer costs. Managed hosting solutions run cheaper especially if you plan on purchasing...
Tags: apache, architecture, citysquares, horizontal architecture, horizontal database, memcached, mysql, PHP, scalability, SOLR, sqlite, Web Technology, xcache

August 5th, 2008 at 7:47 am
Keep posting! I would like to see this evolving. Thanks for the insights already.
August 13th, 2008 at 10:29 am
I found your site on technorati and read a few of your other posts. Keep up the good work. I just added your RSS feed to my Google News Reader. Looking forward to reading more from you down the road!
August 26th, 2008 at 12:11 am
Your analysis of the scaling problem and solution are excellent. I did a talk on this kind of architecture, but using a more avant-garde approach last Nov. [http://www.jroller.com/owentaylor/date/20080407]
I call it Transparent Partitioning & Collocation.
To implement such a solution, I prefer to use memory more agressively. Are you aware of GigaSpaces? – it is an in-memory partitioning solution that also runs on EC2. From what I can tell it may suit your needs perfectly.
Let me know if you would like to hack something together.
Owen.