<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Derivante &#187; apache</title>
	<atom:link href="http://www.derivante.com/tag/apache/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.derivante.com</link>
	<description>to obtain or receive from a source</description>
	<lastBuildDate>Mon, 26 Apr 2010 18:44:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>SOLR Performance Benchmarks – Single vs. Multi-core Index Shards</title>
		<link>http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/</link>
		<comments>http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/#comments</comments>
		<pubDate>Tue, 05 May 2009 22:23:13 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[SOLR]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Shards]]></category>
		<category><![CDATA[Throughput]]></category>

		<guid isPermaLink="false">http://www.derivante.com/?p=350</guid>
		<description><![CDATA[Single vs. multi-core sharded index. Which one is the right one? There is not a whole lot of information out there, especially when it comes to hard numbers and comparisons. There are a couple reasons for this. The first one that comes to mind is the multi-core functionality offered by Apache SOLR is very nascent. [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-415" title="solr_fc" src="http://www.derivante.com/wp-content/uploads/2009/05/solr_fc.jpg" alt="solr_fc" width="170" height="94" />Single vs. <a title="SOLR multi-core indexing" href="http://wiki.apache.org/solr/CoreAdmin" target="_blank">multi-core sharded index</a>. Which one is the right one? There is not a whole lot of information out there, especially when it comes to hard numbers and comparisons. There are a couple of reasons for this. The first is that the multi-core functionality offered by <a title="SOLR Search Engine" href="http://lucene.apache.org/solr/" target="_blank">Apache SOLR</a> is very new. It was introduced in SOLR v1.3 and hasn't had much time to be adopted by the SOLR community. Second, the results depend on your schema, index size, query types and user load, all of which can produce varying performance results. As the following benchmarks show, a multi-core SOLR index can speed up individual queries in your application, or it can cut throughput and scalability to roughly the inverse of the number of cores.</p>
<p style="margin-bottom: 0in; padding-left: 30px;">i.e. for n cores, the maximum throughput is roughly 1/n of that of a single index.</p>
<p style="margin-bottom: 0in;">With multi-core sharded indexes, the underlying assumption is that search performance improves when you split your index into smaller chunks. These smaller shards are faster and more efficient to search and index. However, you never get anything for free: the performance increase comes at the cost of higher CPU utilization. Breaking the index into multiple smaller pieces makes searching and indexing each piece faster, but you need to search every core for every query. Whereas a single index runs one slightly slower query, a multi-core sharded query runs n queries in parallel and then combines the results.</p>
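<p style="margin-bottom: 0in;">The scatter-and-gather pattern described above can be sketched roughly as follows. This is only an illustration, not how SOLR implements it: the in-memory <code>SHARDS</code> data and the function names are hypothetical stand-ins for real cores.</p>

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards": each maps a document ID to a score.
SHARDS = [
    {"doc1": 0.9, "doc4": 0.4},
    {"doc2": 0.8, "doc3": 0.7},
]


def search_shard(shard, rows):
    """Return the top `rows` (score, doc_id) pairs from one shard."""
    return heapq.nlargest(rows, ((score, doc_id) for doc_id, score in shard.items()))


def sharded_search(shards, rows):
    """Run one query per shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(search_shard, shards, [rows] * len(shards)))
    # Every shard had to do work for this single request, which is why
    # total throughput drops as the number of shards grows.
    merged = heapq.nlargest(rows, (hit for part in partials for hit in part))
    return [doc_id for _score, doc_id in merged]


print(sharded_search(SHARDS, rows=3))
```

<p style="margin-bottom: 0in;">Note that each incoming request occupies one worker per shard, which is the intuition behind the roughly 1/n throughput ceiling.</p>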
<p><span id="more-350"></span></p>
<p style="margin-bottom: 0in;">There is one problem that still needs to be worked out with the multi-core sharded index: there is no distributed IDF (inverse document frequency). That is to say, if your documents are not spread evenly across all shards, you risk a result set that is improperly ordered with respect to your sorts, query boosts, etc. This happens because the documents are scored within each individual core, using only that core's term statistics, before the results are combined and the query returned.</p>
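<p style="margin-bottom: 0in;">To make the IDF problem concrete, here is a small sketch. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the shard names and document counts are made up for illustration.</p>

```python
import math


def idf(num_docs, doc_freq):
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for one term: (documents in shard, documents containing the term).
shard_stats = {"shard_a": (1000, 10), "shard_b": (1000, 500)}

# Each shard scores with its own local statistics.
local_idf = {name: idf(n, df) for name, (n, df) in shard_stats.items()}

# What a single combined index would use instead.
total_docs = sum(n for n, _df in shard_stats.values())
total_df = sum(df for _n, df in shard_stats.values())
global_idf = idf(total_docs, total_df)

for name, value in local_idf.items():
    print(name, round(value, 3))
print("global", round(global_idf, 3))
```

<p style="margin-bottom: 0in;">Because the term is rare in one shard and common in the other, the same match gets a very different weight depending on where the document lives, so the merged ordering can differ from what a single index would return.</p>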
<p style="margin-bottom: 0in;">Ideally, a multi-core index is great if you need to increase the performance of your queries and can afford to sacrifice some scalability and throughput to see it through.</p>
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;">Below are some charts of benchmarks that I have compiled on the CitySquares SOLR index. The specifications of the machine and indexes are as follows:</p>
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;"><strong>Testing machine - Dell r900:</strong></p>
<ul>
<li>4x Quad Core Intel(R) Xeon(R) CPU 		E7340 @ 2.40GHz (16 physical cores)</li>
<li>24GB RAM</li>
<li>3x 15k RPM drives in RAID 0</li>
<li>Gig-Ethernet on a local LAN</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Index Stats:</strong></p>
<ul>
<li>14.5 Million Documents</li>
<li>13 GB total size</li>
<li> 56 fields (indexed and/or stored 	w/ various amounts of processing)</li>
<li>Fully optimized index</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Benchmarks:</strong></p>
<ul>
<li>Used Apache Bench for testing purposes from another machine on the same LAN over Gig-E.</li>
</ul>
<pre class="bash">&nbsp;
<span style="color: #808080; font-style: italic;">#!/bin/bash</span>
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;&quot;</span> &gt; solr_results.log
<span style="color: #000000; font-weight: bold;">for</span> C <span style="color: #000000; font-weight: bold;">in</span> <span style="color: #000000;">2</span> <span style="color: #000000;">4</span> <span style="color: #000000;">8</span> <span style="color: #000000;">16</span> <span style="color: #000000;">32</span> <span style="color: #000000;">64</span> <span style="color: #000000;">128</span> <span style="color: #000000;">256</span> <span style="color: #000000;">512</span>
<span style="color: #000000; font-weight: bold;">do</span>
<span style="color: #007800;">N=</span>$<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #007800;">$C</span>*<span style="color: #000000;">1000</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;ab -n$N -c$C&quot;</span> &gt;&gt; solr_results.log
ab -n<span style="color: #007800;">$N</span> -c<span style="color: #007800;">$C</span> <span style="color: #ff0000;">'http://solr:8080/solr/select?q=&lt;ID&gt;&amp;qf=&lt;FIELD&gt;&amp;fq=&lt;FIELD&gt;:&lt;ID&gt;&amp;start=0&amp;rows=20'</span> &gt;&gt; solr_results.log
<span style="color: #000000; font-weight: bold;">done</span>
&nbsp;</pre>
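<p style="margin-bottom: 0in;">A log produced by a script like the one above can be summarized with a few lines of Python. This is a sketch that assumes the log contains the echoed <code>ab -n... -c...</code> header lines followed by Apache Bench's usual <code>Requests per second:</code> line; the sample numbers below are invented, not results from these benchmarks.</p>

```python
import re


def parse_ab_log(text):
    """Map each concurrency level to its mean requests per second."""
    results = {}
    concurrency = None
    for line in text.splitlines():
        header = re.match(r"ab -n(\d+) -c(\d+)", line)
        if header:
            concurrency = int(header.group(2))
            continue
        rps = re.match(r"Requests per second:\s+([\d.]+)", line)
        if rps and concurrency is not None:
            results[concurrency] = float(rps.group(1))
    return results


# Invented sample in the shape of solr_results.log.
SAMPLE_LOG = """
ab -n2000 -c2
Requests per second:    150.25 [#/sec] (mean)
ab -n4000 -c4
Requests per second:    290.50 [#/sec] (mean)
ab -n512000 -c512
"""

print(parse_ab_log(SAMPLE_LOG))
```

<p style="margin-bottom: 0in;">A run that never completed, like the last one in the sample, simply has no entry in the result.</p>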
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;"><strong>For the trends in red the lower the number the better.<br />
For the trends in blue the higher the number the better.</strong></p>
<p style="margin-bottom: 0in;">
<div class="mceTemp">
<dl id="attachment_356" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Single index with no caching enabled <img class="size-full wp-image-356" title="single-index-no-cache" src="http://www.derivante.com/wp-content/uploads/2009/04/single-index-no-cache.jpg" alt="Single index with no caching enabled" width="500" height="400" /></dt>
</dl>
</div>
<div class="mceTemp">
<dl id="attachment_355" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Single index with filterCache enabled<img class="size-full wp-image-355" title="single-index-cache" src="http://www.derivante.com/wp-content/uploads/2009/04/single-index-cache.jpg" alt="Single index with filterCache enabled" width="500" height="400" /></dt>
</dl>
</div>
<p>The graph above shows no results for the 512 concurrency test because the Apache Tomcat server deadlocked. The maximum number of connections was set to 512 with an overflow of 100. This is the cause of every case where the 512 test has no results. Ironically, the single index without the cache managed to finish, but the test with filterCache enabled failed.</p>
<div class="mceTemp">
<dl id="attachment_353" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Multicore Index (2 Cores) with no caching enabled<img class="size-full wp-image-353" title="multicore-no-cache" src="http://www.derivante.com/wp-content/uploads/2009/04/multicore-no-cache.jpg" alt="Multicore Index (2 Cores) with no caching enabled" width="500" height="400" /></dt>
</dl>
</div>
<div class="mceTemp">
<dl id="attachment_352" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Multicore Index (2 Cores) with filterCaching enabled<img class="size-full wp-image-352" title="multicore-cache" src="http://www.derivante.com/wp-content/uploads/2009/04/multicore-cache.jpg" alt="Multicore Index (2 Cores) with filterCaching enabled" width="500" height="400" /></dt>
</dl>
</div>
<p><strong>The higher the better in the following chart.</strong></p>
<div class="mceTemp">
<dl id="attachment_354" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Requests per second across all benchmarks<img class="size-full wp-image-354" title="requests-per-second" src="http://www.derivante.com/wp-content/uploads/2009/04/requests-per-second.jpg" alt="Requests per second across all benchmarks" width="500" height="400" /></dt>
</dl>
</div>
<p><strong>The lower the better in the following charts.</strong></p>
<div class="mceTemp">
<dl id="attachment_357" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Time per request across all benchmarks<img class="size-full wp-image-357" title="time-per-request" src="http://www.derivante.com/wp-content/uploads/2009/04/time-per-request.jpg" alt="Time per request across all benchmarks" width="500" height="400" /></dt>
</dl>
</div>
<p>The above graph shows the only test to finish successfully with 512 concurrent connections was the single index with caching disabled.</p>
<div class="mceTemp">
<dl id="attachment_362" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Time per request across all benchmarks (truncated view)<img class="size-full wp-image-362" title="time-per-request-zoom" src="http://www.derivante.com/wp-content/uploads/2009/04/time-per-request-zoom.jpg" alt="Time per request across all benchmarks (truncated view)" width="500" height="400" /></dt>
</dl>
</div>
<p>This graph is the same as the previous one, minus the last two concurrency levels, so you can see what's going on at the beginning of the benchmark. It's still hard to see, but the multi-core sharded indexes are a bit lower than the single indexes. It's clear, however, that at the higher concurrencies the single indexes beat the multi-core ones hands down.</p>
<p>I've attached a <a title="SOLR Benchmarks" href="http://www.derivante.com/wp-content/uploads/2009/04/solr-blog-benchmarks.xls" target="_blank">spreadsheet</a> with actual numbers from the benchmarks since some of the charts are hard to read.</p>
<p>So there it is, take it as you will. There are definitely benefits to moving from a single index to a distributed multi-core sharded index. However, whether it works for your dataset and application is up in the air. After these benchmarks we decided that the multi-core index that had served us well on <a title="Limitations of scaling with EC2" href="http://www.derivante.com/2008/10/08/the-limitations-of-scaling-with-ec2/" target="_blank">Amazon's EC2</a> no longer worked well for us on our new managed hosting. We are currently running a single index at <a title="CitySquares Online -- Hyper Local Neighborhood Search" href="http://citysquares.com" target="_blank">CitySquares</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Nuances of EC2 and RightScale</title>
		<link>http://www.derivante.com/2008/09/05/nuances-of-ec2-and-rightscale/</link>
		<comments>http://www.derivante.com/2008/09/05/nuances-of-ec2-and-rightscale/#comments</comments>
		<pubDate>Fri, 05 Sep 2008 15:25:07 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[citysquares]]></category>
		<category><![CDATA[Development Environment]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[rightscale]]></category>
		<category><![CDATA[s3]]></category>
		<category><![CDATA[Server Infrastructure]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=36</guid>
		<description><![CDATA[So here it is, about two weeks have passed since CitySquares officially migrated its server infrastructure over to EC2 and RightScale. All in all, everything went relatively well. There were a few hiccups on the cut over day that left users with some error pages. Most of these issues were related to the DNS changeover [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom:0;">So here it is: about two weeks have passed since CitySquares officially migrated its server infrastructure over to EC2 and RightScale. All in all, everything went relatively well. There were a few hiccups on cutover day that left users with some error pages. Most of these issues were related to the DNS changeover and a little confusion over whether to set up the DNS records with Amazon's internal IPs or the elastic external IPs. Common sense said to set the DNS to the external IPs, but it turns out we were supposed to use the internal IPs (10.0.0.0/8, not the elastic IPs in 75.0.0.0/8) when referencing machines within the Amazon network. Oops.</p>
<p style="margin-bottom:0;">Other than that, I've spent the last couple of weeks smoothing everything out and getting things working at 100%. A few bugs cropped up at first, mainly IT stuff: Apache configs, htaccess issues, HAProxy issues, and making sure MySQL and our NFS server were backing up correctly. All these things took precedence, but lately I've been working on increasing performance. At the moment I'm not entirely sure why, but our MySQL database is running queries extremely slowly. It could be anything from network latency, to slow machines, to an improperly tuned config. However, MySQL performance tuning is out of the scope of this post and will be the topic of a future entry. (If a MySQL DBA is reading this and would like the opportunity to play around with EC2 and RightScale, please get in touch with me.)</p>
<p style="margin-bottom:0;">In preparation for the tuning, not only of the MySQL server but of the Apache servers as well, I have been setting up a separate development environment that is identical to our production environment. With RightScale's clone feature I was able to easily duplicate everything from one deployment to the other. That said, let me make it clear that it will copy Everything. After changing all the necessary script inputs for the dev deployment I figured I was ready to start launching the new servers... WRONG. After booting the dev master DB server as well as our dev load balancer and dev NFS server, I realized that they had stolen all the IPs from our production deployment! Bad News! Needless to say, CitySquares was down for the count for the few minutes it took me to figure out what had happened, fix the mistake and then wait for Amazon to reassign the elastic IPs. So here is a friendly reminder: check the server info tab before launching and make sure it isn't going to clobber your existing elastic IPs.</p>
<p style="margin-bottom:0;">Another somewhat annoying issue I ran into, while trying to copy our MySQL S3 backup from the production bucket to the development bucket, was the lack of a decent copy function. RightScale provides copy and move functionality on a somewhat basic level. You can move or copy files either one or many at a time. However, there is a limitation. Each file you copy appends its location to the URL, and each directory path is somewhat long. Eventually you reach the maximum URL length and all the effort you put into selecting the files is for nothing. Not only do you have to select every file you want to copy, you also have to manually assign it to the new location. This means lots of copying and pasting. If you have a directory with hundreds of files in it, good luck. You are better off just uploading it to a new bucket. Either way, a copy bucket or copy directory option would have easily solved the problem.</p>
<p style="margin-bottom:0;">While these few things are annoying, they aren't show stoppers, but they are definitely things to keep in mind when using these services. I'd like to end on a positive note, so I'll mention the exceptional monitoring services that are installed and configured by default on every server image we have used so far. I am extremely impressed with the out-of-the-box functionality of the graphs, and they definitely make up for the other shortcomings. They have everything I could ever want to look at and then some, from standard CPU load to the number of I/Os per second, with yearly, quarterly, monthly, daily and hourly time frames in three sizes: small, medium and large. All are browsable via up-to-date thumbnail previews.</p>
<p style="margin-bottom:0;">If you are considering cloud computing, I would recommend taking a look at RightScale and Amazon's web services.</p>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/09/05/nuances-of-ec2-and-rightscale/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Digging into HAProxy</title>
		<link>http://www.derivante.com/2008/08/13/digging-into-haproxy/</link>
		<comments>http://www.derivante.com/2008/08/13/digging-into-haproxy/#comments</comments>
		<pubDate>Wed, 13 Aug 2008 22:59:08 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[free]]></category>
		<category><![CDATA[HAProxy]]></category>
		<category><![CDATA[high availability]]></category>
		<category><![CDATA[Load Balancing]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[reliability]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=19</guid>
		<description><![CDATA[Well, it's been a few weeks since my last posting here and there is certainly a good reason for that. Every once in a while I just need to completely unplug from technology. So it only made sense for me to go away on vacation to the middle of nowhere up in Maine's great [...]]]></description>
			<content:encoded><![CDATA[<p>Well, it's been a few weeks since my last posting here and there is certainly a good reason for that. Every once in a while I just need to completely unplug from technology. So it only made sense for me to go away on vacation to the middle of nowhere up in Maine's great north woods for a couple of weeks. No computers, no cellphones, no towns, no people, just dirt logging roads, lakes, rivers, wildlife and trees. Now that I'm back and caught up, I will start posting regularly again.</p>
<p style="margin-bottom:0;">Getting back to reality, as the title states, this post will focus on the reasons behind using <a title="HA Proxy -- Load Balancing " href="http://haproxy.1wt.eu/">HAProxy</a> as well as a little bit on <a title="Hyper-Local Search Portal" href="http://citysquares.com">CitySquares'</a> implementation of the load balancer. Let me start by quoting a description of HAProxy from their website:</p>
<blockquote>
<p style="margin-bottom:0;">“HAProxy is a free, <em><strong>very</strong></em> fast and reliable solution offering <a href="http://en.wikipedia.org/wiki/High_availability">high availability</a>, <a href="http://en.wikipedia.org/wiki/Load_balancer">load balancing</a>, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting <strong>tens of thousands</strong> of connections is clearly realistic with todays hardware. “</p>
</blockquote>
<p style="margin-bottom:0;">While the high availability aspect of HAProxy is all well and good, everything is expected to be highly available these days. Any sort of downtime has become unacceptable, even in the middle of the night. This is especially true when relying on search engine driven traffic. I've noticed that search engines like Google and Yahoo, to name a couple, really ramp up their crawl rate in the wee hours of the morning. The crawl rate is boosted even more on weekend nights, when even fewer people are searching the web and the search engines can allocate more of their resources to web crawls. CitySquares has certainly been subject to DoS attacks by GoogleBot on Friday nights.</p>
<p style="margin-bottom:0;">This is where the load balancing aspect of HAProxy comes into play; it is one of the main reasons for choosing it as our front facing service. With just a couple of HAProxy servers we can maintain redundancy while having a nearly unlimited pool of Apache web servers to hand off requests to. We don't need any special front facing load balancing hardware that would act as a single point of failure. We can also keep some money in our pocket by using a software solution. Luckily, HAProxy is open source and free to the world, licensed under the <a title="GPL v2 License Terms" href="http://www.opensource.org/licenses/gpl-2.0.php">GPL v2</a>.</p>
<p style="margin-bottom:0;">Not only does HAProxy handle our load balancing but it also serves as a central access point for DNS purposes. This solution is certainly much better than our current DNS round robin which is limited in its own right. Is this common sense? Probably, but I figured it was worth pointing out.</p>
<p style="margin-bottom:0;">Lastly, security is always a concern for heavily trafficked and high profile sites. The developer behind HAProxy has been very proactive with the program architecture and coding practices and as such HAProxy can claim it's never had a single known vulnerability in over five years. Since all front facing applications are subject to attacks from so many different sources these days, having a stable and secure application is a godsend when it comes to any sort of security related IT maintenance.</p>
<p style="margin-bottom:0;">As far as implementation goes, I suspect that eventually we might need to move the HAProxy instances onto their own dedicated servers as traffic increases. In the meantime, with EC2, we are running them in parallel with Apache on the same servers. This is purely a cost savings measure as every server instance  started with EC2 results in more cash out the door. As it is, HAProxy is incredibly fast and lean and really doesn't consume much in the way of system resources, either CPU load or memory utilization.</p>
<p style="margin-bottom:0;">There are certainly other reasons for choosing HAProxy but they are beyond the scope of this post. I encourage everyone to take a serious look at HAProxy when spec'ing out a load balancer or proxy.</p>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/08/13/digging-into-haproxy/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Part 2: An Architecture Overview &#8212; Apache, MySQL, Memcached, SQLite</title>
		<link>http://www.derivante.com/2008/07/24/an-architecture-overview-apache-mysql-memcached-sqlite/</link>
		<comments>http://www.derivante.com/2008/07/24/an-architecture-overview-apache-mysql-memcached-sqlite/#comments</comments>
		<pubDate>Thu, 24 Jul 2008 19:56:41 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[citysquares]]></category>
		<category><![CDATA[horizontal architecture]]></category>
		<category><![CDATA[horizontal database]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[SOLR]]></category>
		<category><![CDATA[sqlite]]></category>
		<category><![CDATA[xcache]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=11</guid>
		<description><![CDATA[In my last post I mentioned the numerous technologies which were on tap for the upcoming version of CitySquares. This installment will continue to define an overview of the underlying architecture and begin to dig a little deeper into the actual implementation of the technologies. The idea and focus of this new architecture is aimed [...]]]></description>
			<content:encoded><![CDATA[
<p style="margin-bottom:0;">In my last post I mentioned the numerous technologies which were on tap for the upcoming version of <a title="CitySquares Online -- Hyper Local Neighborhood Search" href="http://citysquares.com" target="_blank">CitySquares</a>. This installment continues the overview of the underlying architecture and begins to dig a little deeper into the actual implementation of the technologies. The focus of this new architecture is creating a much more stable and scalable platform for us to work with. Before I get into the details, I've provided a graphic representation of how the architecture will be laid out.</p>
<p style="margin-bottom:0;">
<div id="attachment_12" class="wp-caption aligncenter" style="width: 430px"><a href="http://justinleider.files.wordpress.com/2008/07/architecture-overview.jpg"><img class="size-full wp-image-12" src="http://justinleider.files.wordpress.com/2008/07/architecture-overview.jpg" alt="A visual representation of a horizontal web architecture." width="420" height="300" /></a><p class="wp-caption-text">A visual representation of a horizontal web architecture.</p></div>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">Bear with me as I explain the work flow behind this graphic as it is not 100% clear from the visual representation. First off, I run <a title="Ubuntu Linux" href="http://www.ubuntu.com/" target="_blank">Ubuntu Linux</a> which is great for just about everything I need, except for creating any sort of graphics, so I apologize in advance for the lackluster graphic. As you can see, there are a few different layers: users, <a title="HA Proxy -- Load Balancing " href="http://haproxy.1wt.eu/" target="_blank">HA Proxy</a>, Apache, <a title="High performance caching system" href="http://www.danga.com/memcached/" target="_blank">Memcached</a>, <a title="SQLite -- A small fast file based database" href="http://www.sqlite.org/" target="_blank">SQLite</a> and finally MySQL labeled as databases.</p>
<p style="margin-bottom:0;">First and foremost are our beloved users, without whom we would have no need for a website. Starting from the beginning, the users request a page from CitySquares, and their request is passed through one of two HA Proxy servers. The sole purpose of these two machines is to load balance the incoming requests among all our Apache web servers and serve as a failsafe for one another. Once the user's request has been accepted and forwarded along to Apache, we actually begin to process the request.</p>
<p style="margin-bottom:0;">The Apache servers run the PHP and XCache modules. The PHP part is fairly straightforward and out of the scope of this post, so I will skip that part of the architecture. XCache, however, is used in conjunction with PHP as an enhancement to it. More specifically, XCache is an opcode optimizer and cache. It removes the compilation time of PHP scripts by caching their compiled and optimized state directly in the shared memory of the Apache server. This compiled version can speed up page generation by up to 500%, improving overall response time and reducing server load.</p>
<p style="margin-bottom:0;">As with all dynamic websites, most if not all of the actual data is stored in databases. Gone are the days of flat files with near zero processing required. Databases are the new workhorses of the web world and as such usually become the bottleneck of the overall system. CitySquares is in a somewhat unique position: nearly all our page loads involve quite a bit of location and distance based processing, and nearly all of this is done in our MySQL database. So while our Apache servers are sitting idle waiting for responses to their queries, the DB is performing the brunt of the work, calculating distances between objects and the like.</p>
<p style="margin-bottom:0;">We can reduce this bottleneck in a couple of different ways, the first of which is object caching. We will use Memcached to cache objects returned from the database. Say, for example, we know the distance between two businesses. We know with a fair amount of certainty that those two businesses are going to be in the same place they were an hour ago, just as they were a week ago and as they will be a day from now. So we can cache this information with an expiration time of a couple of days, saving ourselves the expense of calculating the distance between them on every page load. Of course, if a user comes by and changes the location of one of these businesses, we can expire the object in the cache and replace it with a newly calculated object straight from the database on the subsequent page load. These expensive queries require large table scans and mathematical calculations on every row. Caching their results frees up the database and allows it to do what it does best: store and retrieve data.</p>
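<p style="margin-bottom:0;">As a rough sketch of that caching flow, the Python below uses a tiny in-memory class as a stand-in for Memcached and a haversine calculation as a stand-in for the expensive database query. The class, the business names and the coordinates are all hypothetical; the real site would talk to Memcached and MySQL from PHP.</p>

```python
import math
import time


class TTLCache:
    """Tiny in-memory stand-in for Memcached: get, set with expiry, delete."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, time.time() + ttl_seconds)

    def delete(self, key):
        self._items.pop(key, None)


# Hypothetical business locations as (latitude, longitude).
LOCATIONS = {"cafe": (42.3601, -71.0589), "bakery": (42.3736, -71.1097)}
TWO_DAYS = 2 * 24 * 60 * 60
cache = TTLCache()


def haversine_km(a, b):
    """Great-circle distance; stands in for the expensive database query."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cache_key(first, second):
    return "distance:" + ":".join(sorted((first, second)))


def distance_between(first, second):
    """Check the cache first; compute and store the value only on a miss."""
    key = cache_key(first, second)
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = haversine_km(LOCATIONS[first], LOCATIONS[second])
    cache.set(key, value, TWO_DAYS)
    return value


def move_business(name, new_location):
    """Update a location and expire every cached distance that involved it."""
    LOCATIONS[name] = new_location
    for other in LOCATIONS:
        if other != name:
            cache.delete(cache_key(name, other))
```

<p style="margin-bottom:0;">Sorting the two names when building the key means the distance from A to B and from B to A share one cache entry.</p>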
<p style="margin-bottom:0;">In the case where we can't find the data in Memcached, either because it doesn't yet exist or because it has expired, we turn to our databases. We must first query a SQLite instance, which is the gatekeeper between Apache and the numerous databases we have. By having a separate lookup table we can divide and parcel out our data sets on a table-by-table basis, or even an entry-by-entry basis. Depending on the type of data we are requesting, SQLite provides us with the location of the database to query for our data.</p>
<p style="margin-bottom:0;">One could argue that this just adds another layer of latency, and they would be correct. However, as scalability becomes an issue you will find that adding database replication generally results in diminishing returns. As new servers are brought online, the overhead of replicating writes across all the replicated servers becomes stifling and creates its own bottleneck. On the other hand, with a lookup table and a horizontal database architecture we don't have to worry about database replication nearly as much. You can just as easily divide your data sets into different databases. How you go about this varies greatly depending on your data. For CitySquares the solution turns out to be rather simple. Everything we do is location specific, so it only makes sense that each data set is only as big as its parent city. Theoretically, every city and all the data related to that city could reside in its own database. As you can probably guess, our performance is limited only by the biggest cities: <a title="Manhattan on CitySquares" href="http://ny.citysquares.com/manhattan" target="_blank">Manhattan</a>, <a title="Brooklyn on CitySquares" href="http://ny.citysquares.com/brooklyn" target="_blank">Brooklyn</a>, etc. In these few cases we can always fall back to bigger and better servers and/or replication if necessary.</p>
<p style="margin-bottom:0;">Just as our database has become a bottleneck in our current site, so has our search engine, though to a lesser extent. We can take the lessons learned from our horizontal database architecture and apply them to the search engine architecture as well. By dividing our data sets into logical partitions we can keep our data from getting too large and unwieldy. And with these smaller data sets we can reduce, or remove altogether, the overhead associated with replicating data over multiple machines.</p>
<p style="margin-bottom:0;">While this solution sounds great, it won't be worth the effort if a programmer has to check Memcached, then SQLite and then finally MySQL for every query. For this to be feasible from a programmer's standpoint, the programmer should never have to think about the underlying architecture. I will discuss this in greater detail in the upcoming installments. Stay tuned.</p>
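<p style="margin-bottom:0;">A minimal sketch of such a lookup table, using Python's built-in <code>sqlite3</code> module, might look like the following. The table name, column names and host names are hypothetical and only illustrate the idea of routing each city to its own database.</p>

```python
import sqlite3

# In-memory SQLite database acting as the gatekeeper lookup table.
lookup = sqlite3.connect(":memory:")
lookup.execute(
    "CREATE TABLE city_shards (city TEXT PRIMARY KEY, db_host TEXT NOT NULL)"
)
lookup.executemany(
    "INSERT INTO city_shards (city, db_host) VALUES (?, ?)",
    [
        ("manhattan", "db-ny-1.example.internal"),
        ("brooklyn", "db-ny-2.example.internal"),
        ("boston", "db-ma-1.example.internal"),
    ],
)
lookup.commit()


def database_for(city):
    """Return the host of the database that holds the given city's data."""
    row = lookup.execute(
        "SELECT db_host FROM city_shards WHERE city = ?", (city,)
    ).fetchone()
    if row is None:
        raise KeyError("no shard registered for city: " + city)
    return row[0]


print(database_for("brooklyn"))
```

<p style="margin-bottom:0;">Moving a city to a bigger server then becomes a single-row update in the lookup table rather than a change to application code.</p>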
<p style="margin-bottom:0;">
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/07/24/an-architecture-overview-apache-mysql-memcached-sqlite/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>