<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Derivante &#187; Justin Leider</title>
	<atom:link href="http://www.derivante.com/author/justin/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.derivante.com</link>
	<description>to obtain or receive from a source</description>
	<lastBuildDate>Mon, 26 Apr 2010 18:44:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>Extensible PHP Caching Library</title>
		<link>http://www.derivante.com/2010/04/23/extensible-php-caching-library/</link>
		<comments>http://www.derivante.com/2010/04/23/extensible-php-caching-library/#comments</comments>
		<pubDate>Fri, 23 Apr 2010 17:31:43 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[Cache]]></category>
		<category><![CDATA[Class]]></category>
		<category><![CDATA[Drupal]]></category>
		<category><![CDATA[Flexibility]]></category>
		<category><![CDATA[Library]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[mysql]]></category>

		<guid isPermaLink="false">http://www.derivante.com/?p=765</guid>
		<description><![CDATA[Everyone has probably already seen every caching class there ever was and ever will be. However, when I was searching for a class that could easily be switched from one data store to the next I couldn't find a thing. (&#8230;)</p><p><a href="http://www.derivante.com/2010/04/23/extensible-php-caching-library/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>Everyone has probably already seen every caching class there ever was and ever will be. However, when I was searching for a class that could easily be switched from one data store to the next I couldn't find a thing. Every caching class I came across was written for a single back end, with its own predefined static keys or column names. Well, those silly restrictions have come to an end!</p>
<p>I present to you the <a title="PHP Caching Library" href="http://github.com/jleider/extensible-php-caching-library" target="_blank">Extensible PHP Caching Library</a> (hosted at GitHub). This collection of classes makes it easy to customize key or column names to your needs as well as switch from one data store to another. This extensibility is baked into the core of the library via an abstract class. This abstract Cache class takes an associative array of keys/columns and determines how to use those keys based on the type of cache back end you are using. For example, suppose you are using an RDBMS such as MySQL. In this case the associative array is parsed and the query built such that each key is a column name and each value is what you want to query on. However, if you choose a NoSQL key/value store such as Memcached, the associative array is sorted and imploded to create a single string as the key.</p>
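<p>To make the two strategies concrete, here is a minimal sketch of how one associative array could be turned into either a key string or a WHERE clause. The function names, the separator characters and the use of bound parameters are my own assumptions for illustration, not the library's actual methods.</p>
<pre class="php">// Hypothetical helpers for illustration only, not the library's real API.

// Key/value stores: sort by key name so the same pairs always produce
// the same string, then join everything into a single key.
function buildKeyString(array $keys): string {
    ksort($keys);
    $parts = array();
    foreach ($keys as $name =&gt; $value) {
        $parts[] = $name . '=' . $value;
    }
    return implode('|', $parts);
}

// SQL stores: each array key becomes a column name and each value a
// bound parameter. Column names are assumed to come from trusted code.
function buildWhereClause(array $keys): array {
    $conditions = array();
    foreach (array_keys($keys) as $column) {
        $conditions[] = $column . ' = ?';
    }
    return array(implode(' AND ', $conditions), array_values($keys));
}

$keys = array('column2' =&gt; 'blah', 'column1' =&gt; 123);

print buildKeyString($keys) . "\n";   // column1=123|column2=blah

list($where, $params) = buildWhereClause($keys);
print $where . "\n";                  // column2 = ? AND column1 = ?
</pre>
<p>Sorting before imploding matters for the key/value case: without it, the same pairs passed in a different order would produce a different key and miss the cache.</p>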
<p>Since we always specify the key as an associative array, switching between different data stores is as simple as changing the Class name from say MCache to SQLCache in your code. Nothing else is required to change data stores. Keys, expiration dates, and data processing all stay the same and all functions are called with the same arguments. This functionality is attributed to the abstract base Cache class which guides and regulates the inheriting classes.</p>
<p>Let's get to some examples:</p>
<p>In the following example we create a new SQLCache object and pass it our associative array of keys and values. We check whether there is cached data; if there isn't, we do something to generate the data and then cache it. Notice how you don't have to pass the key in again when setting the cache. The key is stored within the object, so you can get and set as many times as you need without having to set the key every time. Lastly, we delete the cache.</p>
<pre class="php"><span style="color: #808080; font-style: italic;">// Get Cache</span>
<span style="color: #0000ff;">$cache</span> = <span style="color: #000000; font-weight: bold;">new</span> SQLCache<span style="color: #66cc66;">&#40;</span><a style="text-decoration: none;" href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'column1'</span> =&gt; <span style="color: #cc66cc;">123</span>, <span style="color: #ff0000;">'column2'</span> =&gt; <span style="color: #ff0000;">'blah'</span>, … <span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">$output</span> = <span style="color: #0000ff;">$cache</span>-&gt;<span style="color: #006600;">getCache</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">if</span><span style="color: #66cc66;">&#40;</span>!<span style="color: #0000ff;">$output</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
  <span style="color: #808080; font-style: italic;">// Do something to generate data</span>
  <span style="color: #0000ff;">$output</span> = <span style="color: #ff0000;">'some datas'</span>;
  <span style="color: #808080; font-style: italic;">// Set Cache</span>
  <span style="color: #b1b100;">if</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$cacheNow</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
    <span style="color: #808080; font-style: italic;">// Force a write to cache now</span>
    <span style="color: #0000ff;">$cache</span>-&gt;<span style="color: #006600;">setCache</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$output</span>, <span style="color: #ff0000;">'+1 day'</span>, <span style="color: #000000; font-weight: bold;">true</span><span style="color: #66cc66;">&#41;</span>;
  <span style="color: #66cc66;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #66cc66;">&#123;</span>
    <span style="color: #808080; font-style: italic;">// Setting no time defaults to time() + 86400, one day from now.</span>
    <span style="color: #0000ff;">$cache</span>-&gt;<span style="color: #006600;">setCache</span><span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$output</span><span style="color: #66cc66;">&#41;</span>;
  <span style="color: #66cc66;">&#125;</span>
<span style="color: #66cc66;">&#125;</span>
&nbsp;
<a style="text-decoration: none;" href="http://www.php.net/print"><span style="color: #000066;">print</span></a> <span style="color: #0000ff;">$output</span>;
&nbsp;
<span style="color: #808080; font-style: italic;">// Delete the cache</span>
<span style="color: #0000ff;">$cache</span>-&gt;<span style="color: #006600;">deleteCache</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;</pre>
<p>If you ever need to move this caching strategy to Memcached, the only change is to replace SQLCache with MCache:</p>
<pre class="php"><span style="color: #0000ff;">$cache</span> = <span style="color: #000000; font-weight: bold;">new</span> SQLCache<span style="color: #66cc66;">&#40;</span><a style="text-decoration: none;" href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'column1'</span> =&gt; <span style="color: #cc66cc;">123</span>, <span style="color: #ff0000;">'column2'</span> =&gt; <span style="color: #ff0000;">'blah'</span>, … <span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
to
<span style="color: #0000ff;">$cache</span> = <span style="color: #000000; font-weight: bold;">new</span> MCache<span style="color: #66cc66;">&#40;</span><a style="text-decoration: none;" href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'key1'</span> =&gt; <span style="color: #cc66cc;">123</span>, <span style="color: #ff0000;">'key2'</span> =&gt; <span style="color: #ff0000;">'blah'</span>, … <span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;</pre>
<p>The column and key names are arbitrary and may be set to anything you want to name it. For SQL caches make sure you create a <a title="Example Cache Table Schema" href="http://github.com/jleider/extensible-php-caching-library" target="_blank">cache table</a> that has the corresponding column names and that they are indexed optimally.</p>
<p>While I have only created classes for SQL (MySQL, since there is a LIMIT 1), Memcached and file-based caching, the base class can be extended to include any key/value store or any database with columns (MongoDB, CouchDB, Tokyo, PostgreSQL, Oracle, etc). Just update the back end calls and you are good to go.</p>
<p>For anyone who updates or adds functionality please let me know so I can give credit where credit is due. This library is available under the <a title="GNU Lesser General Public License" href="http://www.gnu.org/licenses/lgpl.html" target="_blank">LGPLv3</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2010/04/23/extensible-php-caching-library/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>SOLR Filtering Performance Increase</title>
		<link>http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/</link>
		<comments>http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 15:55:44 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[SOLR]]></category>
		<category><![CDATA[Performance]]></category>

		<guid isPermaLink="false">http://www.derivante.com/?p=689</guid>
		<description><![CDATA[A couple months ago I wrote about the terrible performance and a work around for SOLR / Lucene search engine. I discovered that performance would drop off a cliff while using filter queries to narrow search results for search queries (&#8230;)</p><p><a href="http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>A couple months ago I wrote about the <a title="100x Increase in SOLR Performance and Throughput" href="http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/">terrible performance and a work around</a> for the SOLR / Lucene search engine. I discovered that performance would drop off a cliff when using filter queries to narrow results for searches on common terms in large indexes. It looks like the issue has been addressed in some of the latest nightly <a title="SOLR Search Engine" href="http://lucene.apache.org/solr/" target="_blank">SOLR</a> builds and is scheduled for official release with SOLR v1.4. Prior to this new version the filter queries were applied after the main query ran. This is all well and good but it doesn't speed your query up like you think it should. The new version <a href="http://www.lucidimagination.com/blog/2009/05/27/filtered-query-performance-increases-for-solr-14/" target="_blank">applies the filters in parallel</a> to the main query, speeding up searches with common queries and query filters by 30% to 80% along with a 40% smaller memory footprint.</p>
<p>However, even with this speed improvement you still should consider how you structure your queries. There is no need to do a query across every field if you know you really want to filter everything down with a single filter query. Try moving that filter query (fq) into the actual query (q) as &lt;field&gt;:&lt;filter&gt;. You might be <a title="100x Increase in SOLR Performance and Throughput" href="http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/">surprised by the results</a>...</p>
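<p>As a rough sketch of that restructuring, the snippet below builds both forms of the request with PHP's http_build_query. The field name and value are placeholders I made up, and the match-all q.alt form is only one way the "before" query might look.</p>
<pre class="php">// Placeholder field and value; substitute names from your own schema.
$field = 'city_id';
$value = '42';

// Before: a match-all query that is narrowed afterwards by a filter query.
$before = http_build_query(array(
    'q.alt' =&gt; '*:*',
    'fq'    =&gt; $field . ':' . $value,
));

// After: the former filter becomes the main query itself.
$after = http_build_query(array(
    'q' =&gt; $field . ':' . $value,
));

print 'select?' . $before . "\n";
print 'select?' . $after . "\n";
</pre>
<p>Both requests describe the same set of documents; the difference is how much of the index SOLR has to touch to find them.</p>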
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SOLR Performance Benchmarks – Single vs. Multi-core Index Shards</title>
		<link>http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/</link>
		<comments>http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/#comments</comments>
		<pubDate>Tue, 05 May 2009 22:23:13 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[SOLR]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Shards]]></category>
		<category><![CDATA[Throughput]]></category>

		<guid isPermaLink="false">http://www.derivante.com/?p=350</guid>
		<description><![CDATA[Single vs. multi-core sharded index. Which one is the right one? There is not a whole lot of information out there, especially when it comes to hard numbers and comparisons. There are a couple reasons for this. The first one (&#8230;)</p><p><a href="http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-415" title="solr_fc" src="http://www.derivante.com/wp-content/uploads/2009/05/solr_fc.jpg" alt="solr_fc" width="170" height="94" />Single vs. <a title="SOLR multi-core indexing" href="http://wiki.apache.org/solr/CoreAdmin" target="_blank">multi-core sharded index</a>. Which one is the right one? There is not a whole lot of information out there, especially when it comes to hard numbers and comparisons. There are a couple reasons for this. The first one that comes to mind is that the multi-core functionality offered by <a title="SOLR Search Engine" href="http://lucene.apache.org/solr/" target="_blank">Apache SOLR</a> is still very new. It was recently introduced with the latest SOLR v1.3 and hasn't had much time to be adopted by the SOLR community. Second, the results are dependent on your schema, index size, query types and user load. These factors can account for varying performance results. As the following benchmarks show, a multi-core SOLR index has the potential either to speed up your application or to cut its throughput and scalability to roughly the inverse of the number of cores.</p>
<p style="margin-bottom: 0in; padding-left: 30px;">i.e. For n cores the maximum throughput is roughly 1/n vs. a single index.</p>
<p style="margin-bottom: 0in;">With multi-core sharded indexes the underlying assumption is that search performance improves by splitting your index into smaller chunks. These smaller shards are then faster and more efficient to search and index. However, you never get anything for free: the performance increase comes at the cost of higher CPU utilization. Breaking the index into multiple smaller pieces makes searching and indexing each piece faster, but you'll need to search each core individually for every query. Whereas a single index runs one slightly slower query, a multi-core sharded query runs n queries in parallel and then combines the results.</p>
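<p style="margin-bottom: 0in;">The "combine" step can be pictured with a small sketch. The per-core result lists below are invented, and the merge function is only an illustration of the idea, not how SOLR implements distributed search internally.</p>
<pre class="php">// Hypothetical per-core results: each core returns its own top hits,
// already sorted by descending score.
$coreResults = array(
    array(array('id' =&gt; 'a', 'score' =&gt; 2.4), array('id' =&gt; 'b', 'score' =&gt; 1.1)),
    array(array('id' =&gt; 'c', 'score' =&gt; 1.9), array('id' =&gt; 'd', 'score' =&gt; 0.7)),
);

// Pool the hits from every core, re-sort by score, keep the requested rows.
function mergeShardResults(array $coreResults, int $rows): array {
    $all = array();
    foreach ($coreResults as $hits) {
        $all = array_merge($all, $hits);
    }
    usort($all, function ($x, $y) {
        if ($x['score'] == $y['score']) {
            return 0;
        }
        return ($x['score'] &gt; $y['score']) ? -1 : 1;
    });
    return array_slice($all, 0, $rows);
}

foreach (mergeShardResults($coreResults, 3) as $hit) {
    print $hit['id'] . ' ' . $hit['score'] . "\n";
}
</pre>
<p style="margin-bottom: 0in;">Every core has to do its own search before this merge can happen, which is where the extra CPU cost per request comes from.</p>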
<p><span id="more-350"></span></p>
<p style="margin-bottom: 0in;">There is one problem which still needs to be worked out with the multi-core sharded index: there is no distributed IDF (inverse document frequency). That is to say, if your documents are not spread evenly across all shards then you risk a result set that is improperly ordered based on your sorts, query boosts, etc. This happens with a distributed multi-core index because the scoring of the documents takes place within each individual core before the results are combined and the query returned.</p>
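<p style="margin-bottom: 0in;">A quick illustration of why this matters, assuming the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)). The document counts are made up: the same term is common on one core and rare on the other.</p>
<pre class="php">// Assumed formula (classic Lucene-style idf); counts below are invented.
function idf(int $numDocs, int $docFreq): float {
    return 1.0 + log($numDocs / ($docFreq + 1));
}

// Two cores of equal size where the term is unevenly distributed.
$coreA  = idf(1000000, 50000);   // term is common on core A
$coreB  = idf(1000000, 500);     // term is rare on core B
$global = idf(2000000, 50500);   // what a single combined index would use

printf("core A: %.2f\ncore B: %.2f\nglobal: %.2f\n", $coreA, $coreB, $global);
</pre>
<p style="margin-bottom: 0in;">With these numbers a match on core B is weighted far more heavily than the same match on core A, so the merged ordering can differ from what a single index would return.</p>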
<p style="margin-bottom: 0in;">Ideally, a multi-core index is great if you need to increase the performance of your queries and can afford to sacrifice some scalability and throughput to see it through.</p>
<p style="margin-bottom: 0in;">Below are some charts of benchmarks that I have compiled on the CitySquares SOLR index. The specifications of the machine and indexes are as follows:</p>
<p style="margin-bottom: 0in;"><strong>Testing machine - Dell r900:</strong></p>
<ul>
<li>4x Quad Core Intel(R) Xeon(R) CPU E7340 @ 2.40GHz (16 physical cores)</li>
<li>24GB RAM</li>
<li>3x 15k RPM drives in RAID 0</li>
<li>Gig-Ethernet on a local LAN</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Index Stats:</strong></p>
<ul>
<li>14.5 Million Documents</li>
<li>13 GB total size</li>
<li>56 fields (indexed and/or stored w/ various amounts of processing)</li>
<li>Fully optimized index</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Benchmarks:</strong></p>
<ul>
<li>Used Apache Bench for testing purposes from another machine on the same LAN over Gig-E.</li>
</ul>
<pre class="bash">&nbsp;
<span style="color: #808080; font-style: italic;">#!/bin/bash</span>
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;&quot;</span> &gt; solr_results.log
<span style="color: #000000; font-weight: bold;">for</span> C <span style="color: #000000; font-weight: bold;">in</span> <span style="color: #000000;">2</span> <span style="color: #000000;">4</span> <span style="color: #000000;">8</span> <span style="color: #000000;">16</span> <span style="color: #000000;">32</span> <span style="color: #000000;">64</span> <span style="color: #000000;">128</span> <span style="color: #000000;">256</span> <span style="color: #000000;">512</span>
<span style="color: #000000; font-weight: bold;">do</span>
<span style="color: #007800;">N=</span>$<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #007800;">$C</span>*<span style="color: #000000;">1000</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;ab -n$N -c$C&quot;</span> &gt;&gt; solr_results.log
ab -n<span style="color: #007800;">$N</span> -c<span style="color: #007800;">$C</span> <span style="color: #ff0000;">'http://solr:8080/solr/select?q=&lt;ID&gt;&amp;qf=&lt;FIELD&gt;&amp;fq=&lt;FIELD&gt;:&lt;ID&gt;&amp;start=0&amp;rows=20'</span> &gt;&gt; solr_results.log
<span style="color: #000000; font-weight: bold;">done</span>
&nbsp;</pre>
<p style="margin-bottom: 0in;"><strong>For the trends in red the lower the number the better.<br />
For the trends in blue the higher the number the better.</strong></p>
<div class="mceTemp">
<dl id="attachment_356" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Single index with no caching enabled <img class="size-full wp-image-356" title="single-index-no-cache" src="http://www.derivante.com/wp-content/uploads/2009/04/single-index-no-cache.jpg" alt="Single index with no caching enabled" width="500" height="400" /></dt>
</dl>
</div>
<div class="mceTemp">
<dl id="attachment_355" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Single index with filterCache enabled<img class="size-full wp-image-355" title="single-index-cache" src="http://www.derivante.com/wp-content/uploads/2009/04/single-index-cache.jpg" alt="Single index with filterCache enabled" width="500" height="400" /></dt>
</dl>
</div>
<p>The graph above has no results for the 512 concurrency test because the Apache Tomcat server deadlocked. The max number of connections was set to 512 with an overflow of 100. This is the cause of every case where there are no results for the 512 test. Ironically, the single index without the cache managed to finish, but the test with filterCache on failed.</p>
<div class="mceTemp">
<dl id="attachment_353" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Multicore Index (2 Cores) with no caching enabled<img class="size-full wp-image-353" title="multicore-no-cache" src="http://www.derivante.com/wp-content/uploads/2009/04/multicore-no-cache.jpg" alt="Multicore Index (2 Cores) with no caching enabled" width="500" height="400" /></dt>
</dl>
</div>
<div class="mceTemp">
<dl id="attachment_352" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Multicore Index (2 Cores) with filterCaching enabled<img class="size-full wp-image-352" title="multicore-cache" src="http://www.derivante.com/wp-content/uploads/2009/04/multicore-cache.jpg" alt="Multicore Index (2 Cores) with filterCaching enabled" width="500" height="400" /></dt>
</dl>
</div>
<p><strong>The higher the better in the following chart.</strong></p>
<div class="mceTemp">
<dl id="attachment_354" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Requests per second across all benchmarks<img class="size-full wp-image-354" title="requests-per-second" src="http://www.derivante.com/wp-content/uploads/2009/04/requests-per-second.jpg" alt="Requests per second across all benchmarks" width="500" height="400" /></dt>
</dl>
</div>
<p><strong>The lower the better in the following charts.</strong></p>
<div class="mceTemp">
<dl id="attachment_357" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Time per request across all benchmarks<img class="size-full wp-image-357" title="time-per-request" src="http://www.derivante.com/wp-content/uploads/2009/04/time-per-request.jpg" alt="Time per request across all benchmarks" width="500" height="400" /></dt>
</dl>
</div>
<p>The above graph shows the only test to finish successfully with 512 concurrent connections was the single index with caching disabled.</p>
<div class="mceTemp">
<dl id="attachment_362" class="wp-caption alignnone" style="width: 510px;">
<dt class="wp-caption-dt">Time per request across all benchmarks (truncated view)<img class="size-full wp-image-362" title="time-per-request-zoom" src="http://www.derivante.com/wp-content/uploads/2009/04/time-per-request-zoom.jpg" alt="Time per request across all benchmarks (truncated view)" width="500" height="400" /></dt>
</dl>
</div>
<p>This graph is the same as the one before, without the last two concurrency levels, so you can see what's going on at the beginning of the benchmark. It's still hard to see, but the multi-core sharded indexes are a bit lower than the single indexes. It's clear, however, that at the higher concurrencies the single indexes beat the multi-core ones hands down.</p>
<p>I've attached a <a title="SOLR Benchmarks" href="http://www.derivante.com/wp-content/uploads/2009/04/solr-blog-benchmarks.xls" target="_blank">spreadsheet</a> with actual numbers from the benchmarks since some of the charts are hard to read.</p>
<p>So there it is, take it as you will. There are definitely benefits to moving from a single index to a distributed multi-core sharded index. However, whether it works for your dataset and application is up in the air. After these benchmarks we decided that the multi-core index that had served us well on <a title="Limitations of scaling with EC2" href="http://www.derivante.com/2008/10/08/the-limitations-of-scaling-with-ec2/" target="_blank">Amazon's EC2</a> no longer worked well for us on our new managed hosting. We are currently running a single index at <a title="CitySquares Online -- Hyper Local Neighborhood Search" href="http://citysquares.com" target="_blank">CitySquares</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>100x Increase in SOLR Performance and Throughput</title>
		<link>http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/</link>
		<comments>http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/#comments</comments>
		<pubDate>Mon, 27 Apr 2009 20:28:27 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[SOLR]]></category>
		<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://www.derivante.com/?p=341</guid>
		<description><![CDATA[Is your SOLR installation running slower than you think it should? Performance, throughput and scalability not what you are expecting or hoping? Do you constantly see that others have much higher SOLR query performance and scalability than you do? All (&#8230;)</p><p><a href="http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft" style="margin: 5px;" title="SOLR" src="http://www.derivante.com/wp-content/uploads/2009/05/solr_fc.jpg" alt="" width="170" height="94" />Is your SOLR installation running slower than you think it should? Performance, throughput and scalability not what you are expecting or hoping? Do you constantly see that others have much higher SOLR query performance and scalability than you do? All it might take to fix your woes is a simple schema or query change.</p>
<p>The scenario I am about to describe is proof positive that you should always take the time to understand the underlying functionality of whatever operating system, programming language or application you are using. Let my oversight and 'quick fix solution' be a lesson to you: it is almost always worth the upfront cost of doing something right the first time so you don't have to keep revisiting the same issue.<br><br />
<span id="more-341"></span><br />
Before I delve into the nuances of SOLR let me first give you some background on what took place over the last half year at CitySquares. Back in the fall of last year the CitySquares website began experiencing an exponential growth in traffic. This growth was due to an expansion of its IYP (Internet Yellow Page) services into the New England and Metro New York areas. Prior to and during the beginning of the first wave of traffic growth, every business listing was powered by very large MySQL queries including a couple joins. The queries themselves weren't all that complex but they were big and unwieldy with joins on very large tables and lots of columns in the result sets. In some of the larger cities covered at the time (Manhattan, Bronx, Queens, Boston, etc) there were up to 100,000 rows of data that needed to be sorted before returning a rather small subset (20-40 rows) for each business listing page load. While this wasn't a big deal when CitySquares was still a niche Boston-centric destination, it quickly became a huge burden on the MySQL servers. Some of these queries were so big the servers would run out of memory trying to crunch through a 3GB temp table and start thrashing the disks to serve a request for Manhattan. We needed a better solution, and quickly.</p>
<p>Luckily for us we had already implemented a SOLR search engine with all the necessary data indexed from our database initially with the sole intent that search result sets shouldn't have to query the database. This worked to our advantage since it was very easy for us to modify the code base to query SOLR instead of MySQL. Both result sets were formatted as an object with the same field names and all. It was a perfect drop in replacement.</p>
<p>The SOLR solution we implemented utilized SOLR's wild card q.alt=*:* parameter to select all documents while applying a filter query (fq) on that set to get all documents related to our filter. It was a huge win for us at the time. Not only were the queries faster than the MySQL ones, but the SOLR servers could handle more of these queries without even coming close to exhausting the server's resources. This quick and dirty solution was satisfactory for the next few months until CitySquares' next round of expansion began, where again, the queries became a burden. The second time around we didn't have another seemingly quick fix. I spent a couple days trying to figure out a better way to implement the q.alt=*:* query, but to no avail, so I gave up and moved on to other performance optimizations.</p>
<p>Unfortunately, I didn't take the time to understand the code behind the query and I didn't understand exactly how SOLR was implementing the query in its back end process. Since I didn't understand the basis of the problem I couldn't possibly know the query could be easily re-factored. After a few weeks of high loads, 20+ on our 8 core servers, I struck up a conversation with Michael, the developer who wrote the query. We discussed how the query worked and what it needed to do and after five minutes we had discovered a much better way to structure the query. It took me only about a minute or two to re-factor the original query to produce the exact same result set. This new query was incredibly fast! I benchmarked it to be about 100x faster than the previous query and on top of that it was a simple drop in replacement!</p>
<p>From what I've deduced, the original query passed a blank query string with a filter query to SOLR, which in turn defaulted to the q.alt catch-all first and then applied the filter to the catch-all results. This is exactly the opposite of what we were expecting SOLR to do. We believed that the filter was applied first and then the q.alt was applied. However, that was not the case. While this misunderstanding wasn't ideal, it wasn't too slow either with only 1.4 million documents to parse over. However, once CitySquares hit the 14.5 million mark this query became unmanageable. Basically SOLR parsed over every single document in the index before applying the query filter we were using. To rectify this and regain performance and throughput on our servers I simply moved the filter query statement to the query statement and specified the query field to be the same as the original filter field.</p>
<p>i.e.</p>
<p>Original query passed a blank query string with a filter query:</p>
<ul>
<li>select?q=+&amp;fq=&lt;FIELD&gt;:&lt;ID&gt;</li>
</ul>
<p>The updated query now passes the id as the query string and specifies the former filter field:</p>
<ul>
<li>select?q=&lt;ID&gt;&amp;qf=&lt;FIELD&gt;</li>
</ul>
<p>Instead of taking advantage of the strength SOLR shares with every other search engine, O(1) search time, we were at the mercy of its worst case scenario, O(n) scan time. This simple misunderstanding of how SOLR processes queries in the back end caused massive performance and throughput bottlenecks. These bottlenecks affected our short and long term infrastructure plans, and were the root cause of many performance headaches for our users, customers and IT department.</p>
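<p>The difference between the two access patterns can be shown with a toy example. This is only a sketch on a four-document array of my own invention; a real Lucene index is far more involved, but the contrast between visiting every document and looking a term up directly is the same.</p>
<pre class="php">// Toy data: document id =&gt; value of the field we filter on.
$docs = array(1 =&gt; 'boston', 2 =&gt; 'queens', 3 =&gt; 'boston', 4 =&gt; 'bronx');

// Scan: visit every document and test it. Work grows with index size.
$scanHits = array();
foreach ($docs as $id =&gt; $city) {
    if ($city === 'boston') {
        $scanHits[] = $id;
    }
}

// Inverted index: built once at index time, term =&gt; list of document ids.
$index = array();
foreach ($docs as $id =&gt; $city) {
    $index[$city][] = $id;
}

// Lookup: a single hash access, no matter how many documents do not match.
$indexHits = isset($index['boston']) ? $index['boston'] : array();

print implode(',', $scanHits) . "\n";    // 1,3
print implode(',', $indexHits) . "\n";   // 1,3
</pre>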
<p>If this isn't proof positive that you should always take the time to understand the underlying functionality of whatever operating system, programming language or application you are using I don't know what is.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2009/04/27/100x-increase-in-solr-performance-and-throughput/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Is Amazon&#8217;s EC2 right for you?</title>
		<link>http://www.derivante.com/2009/01/26/is-amazons-ec2-right-for-you/</link>
		<comments>http://www.derivante.com/2009/01/26/is-amazons-ec2-right-for-you/#comments</comments>
		<pubDate>Mon, 26 Jan 2009 20:50:11 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Amazon]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[horizontal architecture]]></category>
		<category><![CDATA[horizontal database]]></category>
		<category><![CDATA[IT Infrastructure]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[Site Architecture]]></category>

		<guid isPermaLink="false">http://justinleider.com/?p=49</guid>
		<description><![CDATA[I've been asked this and similar questions quite a bit lately. But before I delve into the answer to this I want to lay the foundation and ask you a question. This one question should play a large part in (&#8230;)</p><p><a href="http://www.derivante.com/2009/01/26/is-amazons-ec2-right-for-you/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[
<p style="margin-bottom:0;">I've been asked this and similar questions quite a bit lately. But before I delve into the answer to this I want to lay the foundation and ask you a question. This one question should play a large part in your final assessment to go with EC2 or not. The question you should ask yourself is:</p>
<p style="margin-bottom:0;"><strong>How quickly do you actually need to scale either up or down? </strong></p>
<p style="margin-bottom:0;">The answer to this will likely influence the correct solution to your problems. The following list is how I classify levels of scalability. Each one comes with its own pros and cons, but generally the quicker you need something, the more expensive it is going to be.</p>
<ul>
<li><strong>Immediate</strong> - within minutes - EC2 or other cloud computing networks</li>
<li><strong>Fast</strong> - within days to a week - Managed Hosting, Rackspace, The Planet, etc</li>
<li><strong>Average</strong> - within weeks to a month - Own your own hardware, Dell, HP, IBM, etc</li>
<li><strong>Corporate</strong> - within months/years - Good Luck</li>
</ul>
<p style="margin-bottom:0;">With this in mind, everyone hears the hype of EC2, with its scalability, fully managed hardware and virtualization but there really aren't that many people out there describing their experiences with it. When we made the decision to go with EC2 we did our research and due diligence before making the switch. There wasn't much to go on but the few articles and blog posts we did read were all positive. I guess we all got caught up in the hype here as well.</p>
<p style="margin-bottom:0;">Even after all our research, it turns out that going with EC2 was one of the poorer IT decisions we have made. EC2 has turned out to be more expensive, more difficult to implement and poorer in performance than we had expected, even in our worst-case estimates. To top it all off, we didn't fully utilize the main benefit of going with EC2, which is immediate scalability. Our traffic is relatively predictable, grows or shrinks in manageable percentages and can be scaled for within days instead of minutes. We never have any massive spikes in our traffic, either up or down. Even if we did have spikes, we are limited by our MySQL cluster.</p>
<p style="margin-bottom:0;">While we had to rethink a lot of our architecture to create a more horizontal platform instead of the traditional vertical scaling, MySQL was by far our biggest bottleneck. The source of the problem is rooted in Amazon's preset machine sizes. While they have done an adequate job of offering different types of instances, with more memory in one line and more computational power in the other, you are still limited to what they are offering. With the large database we have and the latencies between the instances and their permanent storage, we were forced to keep as much of our database as possible cached in RAM. Now this shouldn't have been too big a deal. Just get a machine with a ton of RAM. Well, unfortunately Amazon's biggest instance only offered us a maximum of 15GB. Needless to say, this was not sufficient and forced us to adopt a cluster solution. This in and of itself is not ideal, especially when you should be able to run off a single box with 32GB of RAM and access to fast local disks. However, it took us twelve (12) m1.xlarge instances to reach the level of performance and availability we desired. The network I/O latency between node and disk storage, and from node to node, added insult to injury.</p>
<p style="margin-bottom:0;">While the speed and size of the cluster were not desirable, it worked. However, we had to completely forfeit any sort of scalability to achieve a working database. To my knowledge there is no way to quickly and easily boot up more instances of MySQL to supplement a live cluster. In order for us to add more capacity, we would have to perform a rolling reboot of every machine in the cluster. It's unfortunate that databases were not designed with EC2 in mind.</p>
<p style="margin-bottom:0;">However, there are companies that are trying to tap into this pain point. We were looking very intently at a company called Continuent, which produces a MySQL cluster monitoring and management tool. Unfortunately, as of Jan 2009 the product was still in private beta and was unavailable to us. This tool would have allowed us to add nodes to the cluster on the fly without having to take it down in the process. Even with this extra tool, which wasn't cheap, you still couldn't scale down the cluster without taking it off-line. As far as I am concerned, if you are already using the largest instance available to you (an m1.xlarge or c1.xlarge), there is no way to vertically scale up a database with EC2. Instead you are forced into a less than ideal environment for hosting a horizontal architecture, which could have serious consequences for your code base and SQL queries.</p>
<p style="margin-bottom:0;">To be honest, EC2 offers a lot of benefits that are hard to come by with other solutions. EC2 is great for companies doing lots of non-real-time activities such as batch and queued processing. Companies that have a small database that can be cached in RAM and replicated easily will also benefit from EC2: just boot up a bunch of instances and go to town. However, the bottom line is that if you have fairly consistent usage patterns and your applications are performance sensitive, then there are much faster and more cost-effective ways of abstracting your hardware requirements. We at CitySquares are in the process of moving off of EC2 and onto a managed hosting platform. We still enjoy the benefits of leased hardware like we had with EC2 and the ability to quickly add new hardware. Granted, more servers aren't available to us at the drop of a hat, but a couple of days' lead time to get another box up and running is more than sufficient for us. Not only that, but we also have a whole team of IT people working with us to help alleviate our burden of supporting the entire hardware/software stack. We can now focus on what we do best, which is our application.</p>
<p style="margin-bottom:0;">Keep in mind that there is no concrete answer as to whether EC2 or cloud computing in general will work for you or not. You need to determine if the capacity and latencies of the pre-determined instance sizes will meet your growing infrastructure needs. For us the bitter answer was a resounding no. We were able to spec out a solution in a fully managed hosting environment for about half the monthly cost of EC2 while increasing the performance of our application significantly.</p>
<p style="margin-bottom:0;">So, is Amazon's EC2 right for you?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2009/01/26/is-amazons-ec2-right-for-you/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Limitations of Scaling with EC2</title>
		<link>http://www.derivante.com/2008/10/08/the-limitations-of-scaling-with-ec2/</link>
		<comments>http://www.derivante.com/2008/10/08/the-limitations-of-scaling-with-ec2/#comments</comments>
		<pubDate>Wed, 08 Oct 2008 20:56:48 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[Amazon]]></category>
		<category><![CDATA[AWS]]></category>
		<category><![CDATA[AWS Limitations]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[EC2 Limitations]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[Scaling]]></category>
		<category><![CDATA[Scaling with EC2]]></category>
		<category><![CDATA[Web Architecture]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=45</guid>
		<description><![CDATA[Just as with any platform you choose, EC2 has its own limitations as well. These limitations are often different and harder to overcome than what you might find while running your own hardware. Without the proper planning and development, these (&#8230;)</p><p><a href="http://www.derivante.com/2008/10/08/the-limitations-of-scaling-with-ec2/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom:0;">Just as with any platform you choose, EC2 has its own limitations. These limitations are often different and harder to overcome than what you might find while running your own hardware. Without the proper planning and development, these limitations can wind up being extremely detrimental to the well-being and scalability of your website or service.</p>
<p style="margin-bottom:0;">There are quite a few blogs, articles and reviews out there that mention all the positive aspects of EC2 and I have written a few of them myself. However, I think users need to be informed of the negative aspects of a particular platform as well as the positive. I will be brief with this post as my next will focus on designing an architecture around these limitations.</p>
<p style="margin-bottom:0;">The biggest limitations of Amazon's <a href="http://aws.amazon.com/ec2" target="_blank">EC2</a> at the moment, as I have experienced them, are the latencies between instances, the latencies between instances and storage (local and EBS), and the lack of powerful instances with more than 15GB of RAM and 4 virtual CPUs.</p>
<p style="margin-bottom:0;">The latency issues can all be traced back to the same root cause: a shared LAN with thousands of non-localized instances all competing for bandwidth. Normally, one would think a LAN would be quick... and they generally are, especially when the servers are sitting right next to each other with a single switch in between them. However, Amazon's network is much more extensive than most local LANs, and chances are your packets are hitting multiple switches and routers on their way from one instance to another. Every extra node between instances adds another few milliseconds to the packet's round trip time. You can think of Amazon's LAN as a really small Internet. Its layout is very similar to that of the Internet: there is no cohesiveness or localization of instances in relation to one another. So lots of data has to go from one end of the LAN to the other, just like on the Internet. This leads to data traveling much farther than it needs to, and all the congestion problems that are found on the Internet can be found on Amazon's LAN.</p>
<p style="margin-bottom:0;">For computationally intensive tasks this really isn't too big a deal, but for those who rely on speedy database calls, every millisecond added per request really starts adding up if you have lots of requests per page. When the CitySquares site moved from our own local servers to EC2, we noticed a 4-10x increase in query times, which we attribute mainly to the high latency of the LAN. Since our servers are no longer within feet of each other, we have to contend with longer distances between instances and congestion on the LAN.</p>
<p style="margin-bottom:0;">Another thing to take into consideration is the network latency for Amazon's EBS. For applications that move around a lot of data, EBS is probably a godsend, as it has a high bandwidth capability. However, in CitySquares' case, we wind up doing a lot of small file transfers to and from our NFS server as well as EBS volumes. So while there is a lot of bandwidth available to us, we can't really take advantage of it, especially since we have to contend with the latency and overhead of transferring many small files. Not only are small files an issue for us, but we also run our MySQL database off of an EBS volume. Swapping to disk has always been a critical issue for databases, but the added overhead of network traffic can wreak havoc on your database load much more than normal disk swapping. You can think of the difference in access times between a local disk and a disk over a network as a book on a bookcase vs. a book somewhere down the hall in storage room B. Clearly the second option takes far longer when you need to find something, and that's what you have to work with if you want the peace of mind of persistent storage.</p>
<p style="margin-bottom:0;">The last and most important limitation for us at <a title="CitySquares Online -- Hyper Local Neighborhood Search" href="http://citysquares.com" target="_blank">CitySquares</a> was the lack of an all-powerful machine. The largest instance Amazon has to offer is one with just 15GB of RAM and 4 virtual CPUs. In a day and age where you can easily find machines with 64GB of RAM and 16 CPUs, you are definitely limited by Amazon. In our case, it would be much easier for us just to throw hardware at our database to scale up, but the only thing we have at our disposal is a paltry 15GB of RAM. How can this be the biggest machine they offer? Instead of dividing one of those machines in quarters, just give me the whole thing. It just seems ludicrous to me that the largest machine they offer is not much more powerful than the computer I'm using right now.</p>
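<p style="margin-bottom:0;">To show how quickly those milliseconds add up, here is a small back-of-the-envelope sketch in Python. The page with 50 queries at 2 ms each is a made-up example, not a CitySquares measurement; only the 4-10x range comes from what we observed. It also assumes the queries run one after another.</p>

```python
def added_page_time_ms(queries_per_page, local_query_ms, slowdown_factor):
    # Extra time per page when every query takes slowdown_factor times longer,
    # assuming the queries run sequentially.
    return queries_per_page * local_query_ms * (slowdown_factor - 1)

# Hypothetical page: 50 queries that took 2 ms each on local hardware.
print(added_page_time_ms(50, 2, 4))   # 300
print(added_page_time_ms(50, 2, 10))  # 900
```

<p style="margin-bottom:0;">Under those assumptions, a page that used to spend a tenth of a second on queries would pick up somewhere between 300 and 900 extra milliseconds.</p>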
<p style="margin-bottom:0;">Long story short, just because you start using Amazon's AWS doesn't mean you can scale. Make sure your architecture is tolerant of higher latencies and can scale with lots of little machines because that's all you have to work with.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/10/08/the-limitations-of-scaling-with-ec2/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Running your own hardware Vs EC2 and RightScale &#8212; Part 2</title>
		<link>http://www.derivante.com/2008/09/16/running-your-own-hardware-vs-ec2-and-rightscale-part-2/</link>
		<comments>http://www.derivante.com/2008/09/16/running-your-own-hardware-vs-ec2-and-rightscale-part-2/#comments</comments>
		<pubDate>Tue, 16 Sep 2008 14:33:24 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[Amazon EBS]]></category>
		<category><![CDATA[Amazon EC2]]></category>
		<category><![CDATA[CMS]]></category>
		<category><![CDATA[Drupal]]></category>
		<category><![CDATA[Elastic Block Storage]]></category>
		<category><![CDATA[Elastic Compute Cloud]]></category>
		<category><![CDATA[File Handling]]></category>
		<category><![CDATA[NFS]]></category>
		<category><![CDATA[Own Hardware]]></category>
		<category><![CDATA[rightscale]]></category>
		<category><![CDATA[Single Point of Failure]]></category>
		<category><![CDATA[Site Architecture]]></category>
		<category><![CDATA[Site Infrastructure]]></category>
		<category><![CDATA[Yahoo Best Practices]]></category>

		<guid isPermaLink="false">http://justinleider.com/?p=40</guid>
		<description><![CDATA[This week I've been reminded of a very important lesson... No matter how abstracted you are from your hardware, you still inherently rely on its smooth and consistent operation. This past week CitySquares' NFS server went down for the count (&#8230;)</p><p><a href="http://www.derivante.com/2008/09/16/running-your-own-hardware-vs-ec2-and-rightscale-part-2/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom:0;">This week I've been reminded of a very important lesson... No matter how <a title="Running your site on EC2 with RightScale vs running your own hardware." href="http://justinleider.com/2008/08/20/running-your-own-hardware-vs-ec2-and-rightscale/" target="_blank">abstracted</a> you are from your hardware, you still inherently rely on its smooth and consistent operation.</p>
<p style="margin-bottom:0;">This past week <a title="CitySquares Online -- Hyper Local Neighborhood Search" href="http://citysquares.com" target="_blank">CitySquares</a>' NFS server went down for the count and was completely unresponsive to any type of communication. In fact, the EC2 instance was so FUBAR we couldn't even terminate it from our RightScale dashboard. A post on Amazon's EC2 board was required to terminate it. Turns out the actual hardware our instance was running on had a catastrophic failure of some sort. Otherwise, at least so I'm told, server images are usually migrated off of machines running in a degraded state automatically.</p>
<p style="margin-bottom:0;">Needless to say, the very reasons for deciding against running our own hardware have come back to plague us. Granted, we weren't responsible for replacing the hardware, but we were still affected by the troublesome machine. We weren't just slightly affected by the loss of our NFS server either. Since we are running off of a heavily modified <a title="Drupal CMS" href="http://drupal.org" target="_blank">Drupal CMS</a>, our web servers depend on having a writable files directory. As it turned out, Apache just spun waiting for a response from the file system, and our web services ground to a halt waiting on a machine that was never going to respond... ever. Talk about a <a title="Reliability engineering" href="http://en.wikipedia.org/wiki/Single_point_of_failure" target="_blank">single point of failure</a>! A non-critical component, serving mainly images and photos, managed to take down our entire production deployment.</p>
<p style="margin-bottom:0;">This event has prompted us to move forward with a rewrite of Drupal's core file handling functionality. The rewrite will include automatically directing file uploads to a separate domain name like csimg.com or something similar. Yahoo goes into more detail with their <a title="Yahoo Developer's best practices for website performance." href="http://developer.yahoo.com/performance/rules.html" target="_blank">performance best practices</a>. However, editing the Drupal core is generally frowned upon and heavily discouraged, since it usually conflicts with the upgrade path and makes maintaining the Drupal core much more difficult. While we haven't stayed out of the Drupal core entirely, the changes we have made are minor and only for performance improvements. I believe it is possible to stay out of the core file handling by hooking into it with the nodeapi, but it seems like more trouble than it's worth.</p>
<p style="margin-bottom:0;">The idea behind the file handling rewrite is to serve our images and photos directly from our co-location while keeping a local files directory on each EC2 instance for non-user-committed things like CSS and JS aggregation caching, among other simple cache-related items coming from the Drupal core. This rewrite will allow us to run one less EC2 instance, saving us some money, as well as remove our dependence on a catastrophic single point of failure.</p>
<p style="margin-bottom:0;">For the time being we have set up another NFS server, this time based on Amazon's new EBS product. I spoke about this in a <a title="Amazon releases the much anticipated Elastic Block Store" href="http://justinleider.com/2008/08/21/amazons-ebs-elastic-block-store/" target="_blank">previous post</a>. One of the issues we had when the last NFS server went down was the loss of user-generated content. Once the instance went down, all the storage associated with that instance went down with it. There was no way to recover from the loss; it was just gone. This is just one of the many possible problems you can run into with the cloud. While on the pro side you don't have to worry about owning your own hardware, on the con side you can't recover from failures like you can with your own hardware. This is a very distinct difference and should be seriously considered before dumping your current architecture for the cloud.</p>
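<p style="margin-bottom:0;">The URL side of that idea can be sketched in a few lines. This is a Python illustration rather than actual Drupal code, and it is only an assumption about how such a rewrite might look: the /files/ prefix and the function name are hypothetical, and csimg.com is just the example domain mentioned above.</p>

```python
from urllib.parse import urlsplit, urlunsplit

STATIC_HOST = "csimg.com"  # example domain from the post; not a confirmed setup

def to_static_url(url, files_prefix="/files/"):
    # Rewrite only URLs whose path sits under the uploads directory.
    parts = urlsplit(url)
    if not parts.path.startswith(files_prefix):
        return url
    return urlunsplit((parts.scheme, STATIC_HOST, parts.path, parts.query, parts.fragment))

print(to_static_url("http://citysquares.com/files/photos/1.jpg"))
print(to_static_url("http://citysquares.com/node/12"))
```

<p style="margin-bottom:0;">Uploaded files get pointed at the separate domain, while every other URL is returned unchanged, so the web servers no longer sit in the path of image requests.</p>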
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/09/16/running-your-own-hardware-vs-ec2-and-rightscale-part-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Nuances of EC2 and RightScale</title>
		<link>http://www.derivante.com/2008/09/05/nuances-of-ec2-and-rightscale/</link>
		<comments>http://www.derivante.com/2008/09/05/nuances-of-ec2-and-rightscale/#comments</comments>
		<pubDate>Fri, 05 Sep 2008 15:25:07 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[citysquares]]></category>
		<category><![CDATA[Development Environment]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[rightscale]]></category>
		<category><![CDATA[s3]]></category>
		<category><![CDATA[Server Infrastructure]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=36</guid>
		<description><![CDATA[So here it is, about two weeks have passed since CitySquares officially migrated its server infrastructure over to EC2 and RightScale. All in all, everything went relatively well. There were a few hiccups on the cut over day that left (&#8230;)</p><p><a href="http://www.derivante.com/2008/09/05/nuances-of-ec2-and-rightscale/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom:0;">So here it is, about two weeks have passed since CitySquares officially migrated its server infrastructure over to EC2 and RightScale. All in all, everything went relatively well. There were a few hiccups on the cutover day that left users with some error pages. Most of these issues were related to the DNS changeover and a little confusion over whether to set up the DNS records with Amazon's internal IPs or the elastic external IPs. Common sense said to set the DNS to the external IPs, but it turns out we were supposed to use the internal IPs (10.0.0.0/8 and not the elastic IPs 75.0.0.0/8) when referencing machines that are within the Amazon networks. Oops.</p>
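<p style="margin-bottom:0;">A quick sanity check like the following would have caught our mistake. It is a minimal Python sketch using the standard ipaddress module; the function names and sample addresses are made up, and the only thing taken from our setup is that internal addresses fall in 10.0.0.0/8.</p>

```python
import ipaddress

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")

def is_internal(address):
    # True when the address falls inside the internal 10.0.0.0/8 range.
    return ipaddress.ip_address(address) in INTERNAL_NET

def pick_address(internal_ip, elastic_ip, caller_inside_ec2):
    # Machines inside the Amazon network should talk to the internal IP;
    # everything else uses the elastic (external) IP.
    return internal_ip if caller_inside_ec2 else elastic_ip

# Hypothetical addresses, for illustration only.
print(is_internal("10.1.2.3"))     # True
print(is_internal("75.101.1.1"))   # False
```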
<p style="margin-bottom:0;">Other than that, I've spent the last couple of weeks smoothing everything out and getting things working at 100%. There were a few bugs that cropped up at first, mainly IT stuff: Apache configs, htaccess issues, HAProxy issues, and making sure MySQL and our NFS server were backing up correctly. All these things took precedence, but lately I've been working on trying to increase performance. At this moment I'm not entirely sure why, but our MySQL database is running queries extremely slowly. At this point it could be anything from network latency, to slow machines, to an improperly tuned config. However, MySQL performance tuning is out of the scope of this post and will be the topic of a future entry. (If a MySQL DBA is reading this and would like the opportunity to play around with EC2 and RightScale, please get in touch with me.)</p>
<p style="margin-bottom:0;">In preparation for the tuning, not only for the MySQL server but the Apache servers as well, I have been setting up a separate development environment that is identical to our production deployment. With RightScale's clone feature I was able to easily duplicate everything from one deployment to the other. That said, let me make it clear that it will copy Everything. After changing all the necessary script inputs for the dev deployment, I figured I was ready to start launching the new servers... WRONG. After booting the dev master DB server as well as our dev load balancer and dev NFS server, I realized that they had stolen all the IPs from our production deployment! Bad News! Needless to say, CitySquares was down for the count for the few minutes it took me to figure out what had happened, fix the mistake and then wait for Amazon to reassign the elastic IPs. So here is a friendly reminder: check the server info tab before launching and make sure it isn't going to clobber your existing elastic IPs.</p>
<p style="margin-bottom:0;">Another somewhat annoying issue I ran into, while trying to copy over our MySQL S3 backup from the production bucket to the development bucket, was the lack of a decent copy function. RightScale has provided copy and move functionality on a somewhat basic level. You can move or copy files either one or many at a time. However, there is a limitation to this. Each file you copy will append its location to the URL, and each directory path is somewhat long. Eventually you reach the maximum URL string limit and all the effort you put into selecting the files is for nothing. Not only do you have to select every file you want to copy, you have to manually assign it to the new location. This means lots of copying and pasting. If you have a directory that has hundreds of files in it, good luck. You are better off just uploading it to a new bucket. Either way, this could have been easily solved by having a copy bucket or directory option. Problem solved.</p>
<p style="margin-bottom:0;">While these few things are annoying, they aren't show stoppers, but they are definitely things to keep in mind when using these services. I'd like to end on a positive note, so I'll mention the exceptional monitoring services that are installed and configured by default on every server image we have used so far. I am extremely impressed with the out-of-the-box functionality of the graphs, and they definitely make up for the other shortcomings. They have everything I could ever want to look at and then some, from standard CPU load to the number of I/Os per second, with yearly, quarterly, monthly, daily and hourly time frames in three sizes: small, medium and large. All are browsable via up-to-date thumbnail previews.</p>
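<p style="margin-bottom:0;">The pre-launch check I should have done can be sketched like this. It is a plain Python illustration, not a RightScale API: the dictionaries, server names and addresses are hypothetical stand-ins for whatever each deployment's server info tab shows.</p>

```python
def clashing_elastic_ips(production, development):
    # Each deployment maps a server name to its configured elastic IP (or None).
    # Returns the development servers that would take an IP production is using.
    prod_ips = {ip for ip in production.values() if ip}
    return {name: ip for name, ip in development.items() if ip in prod_ips}

# Hypothetical deployments using documentation-range addresses.
production = {"master-db": "203.0.113.10", "load-balancer": "203.0.113.11"}
development = {"dev-master-db": "203.0.113.10", "dev-nfs": None}
print(clashing_elastic_ips(production, development))  # {'dev-master-db': '203.0.113.10'}
```

<p style="margin-bottom:0;">Anything this returns is a server that still carries a production elastic IP after cloning and needs its inputs changed before it is booted.</p>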
<p style="margin-bottom:0;">If you are considering cloud computing, I would recommend taking a look at RightScale and Amazon's web services.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/09/05/nuances-of-ec2-and-rightscale/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Amazon&#8217;s EBS (Elastic Block Store)</title>
		<link>http://www.derivante.com/2008/08/21/amazons-ebs-elastic-block-store/</link>
		<comments>http://www.derivante.com/2008/08/21/amazons-ebs-elastic-block-store/#comments</comments>
		<pubDate>Thu, 21 Aug 2008 14:48:55 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[Amazon]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[EBS]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[Elastic Block Store]]></category>
		<category><![CDATA[Persistent Storage]]></category>
		<category><![CDATA[rightscale]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=28</guid>
		<description><![CDATA[I wrote just yesterday about running your own hardware vs. using EC2 and RightScale and one of the major issues I found with EC2 was the lack of a persistent storage medium. Well, I knew the folks over at Amazon (&#8230;)</p><p><a href="http://www.derivante.com/2008/08/21/amazons-ebs-elastic-block-store/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>I wrote just yesterday about <a title="Roll your own hardware or side with cloud computing?" href="http://justinleider.com/2008/08/20/running-your-own-hardware-vs-ec2-and-rightscale/" target="_blank">running your own hardware vs. using EC2 and RightScale</a>, and one of the major issues I found with EC2 was the lack of a persistent storage medium. Well, I knew the folks over at Amazon were hard at work on a new service that would allow persistent storage, and it turns out I received this email in my mailbox this morning:</p>
<blockquote><p>Dear AWS Developer,</p>
<p>We are pleased to announce the release of a significant new Amazon EC2 feature, Amazon Elastic Block Store (EBS), which provides persistent storage for your Amazon EC2 instances. With Amazon EBS, storage volumes can be programmatically created, attached to Amazon EC2 instances, and if even more durability is desired, can be backed with a snapshot to the Amazon Simple Storage Service (Amazon S3).</p>
<p>Prior to Amazon EBS, block storage within an Amazon EC2 instance was tied to the instance itself so that when the instance was terminated, the data within the instance was lost. Now with Amazon EBS, users can choose to allocate storage volumes that persist reliably and independently from Amazon EC2 instances. Amazon EBS volumes can be created in any size between 1 GB and 1 TB, and multiple volumes can be attached to a single instance. Additionally, for even more durable backups and an easy way to create new volumes, Amazon EBS provides the ability to create point-in-time, consistent snapshots of volumes that are then stored to Amazon S3.</p>
<p>Amazon EBS is well suited for databases, as well as many other applications that require running a file system or access to raw block-level storage. As Amazon EC2 instances are started and stopped, the information saved in your database or application is preserved in much the same way it is with traditional physical servers. Amazon EBS can be accessed through the latest Amazon EC2 APIs, and is now available in public beta.</p>
<p>We hope you enjoy this new feature and we look forward to your feedback.</p>
<p>Sincerely,</p>
<p>The Amazon EC2 team</p></blockquote>
<p>So this is indeed good news, and it removes the biggest con I mentioned about the EC2 platform!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/08/21/amazons-ebs-elastic-block-store/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Running your own hardware Vs. EC2 and RightScale</title>
		<link>http://www.derivante.com/2008/08/20/running-your-own-hardware-vs-ec2-and-rightscale/</link>
		<comments>http://www.derivante.com/2008/08/20/running-your-own-hardware-vs-ec2-and-rightscale/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 20:13:52 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[Amazon]]></category>
		<category><![CDATA[citysquares]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[Flexibility]]></category>
		<category><![CDATA[Gentoo]]></category>
		<category><![CDATA[IT Infrastructure]]></category>
		<category><![CDATA[rightscale]]></category>
		<category><![CDATA[s3]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[Scripting]]></category>
		<category><![CDATA[Server Hardware]]></category>
		<category><![CDATA[Servers]]></category>
		<category><![CDATA[Site Architecture]]></category>
		<category><![CDATA[Xen]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=21</guid>
		<description><![CDATA[A couple of weeks ago I began working with EC2 and RightScale in preparation for our big IT infrastructure changeover. I'll start by giving a brief overview of our hardware infrastructure. Currently we're running the CitySquares' website on our own (&#8230;)</p><p><a href="http://www.derivante.com/2008/08/20/running-your-own-hardware-vs-ec2-and-rightscale/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom:0;">A couple of weeks ago I began working with <a title="Amazon's Elastic Compute Cloud" href="http://aws.amazon.com/ec2" target="_blank">EC2</a> and <a title="RightScale" href="http://rightscale.com" target="_blank">RightScale</a> in preparation for our big IT infrastructure changeover. I'll start by giving a brief overview of our hardware infrastructure. Currently we're running the <a title="CitySquares Online -- Hyper Local Neighborhood Search" href="http://citysquares.com" target="_blank">CitySquares'</a> website on our own hardware in a <a title="Somerville Businesses" href="http://ma.citysquares.com/somerville" target="_blank">Somerville</a> co-location not too far from our headquarters in Boston's trendy <a title="Boston's trendy South End neighborhood businesses" href="http://ma.citysquares.com/boston/south-end" target="_blank">South End</a> neighborhood.</p>
<p style="margin-bottom:0;">From the very beginning our contract IT guy set us up with an extremely robust and flexible IT infrastructure. It consists of a few machines running <a title="Xen Hypervisor" href="http://www.xen.org/" target="_blank">Xen</a> Hypervisors with <a title="Gentoo Linux" href="http://www.gentoo.org/" target="_blank">Gentoo</a> as the main host OS. Running Gentoo allows us to be as efficient as possible by specifically optimizing and compiling only the things we need. While this is a good step, it is Xen that really makes the big difference. It allows us to trade resources around as we see fit: more memory here, more virtual CPUs there, all done on the fly. For a startup or any company with limited resources this is rather essential. You never know where you are going to need to allocate resources in the months to come.</p>
<p style="margin-bottom:0;">While this is all well and good, we are still limited when it comes to scaling with increasing traffic or adding resource-intensive features. We have a set amount of available hardware and adding more is an expensive upfront capital investment. Not only that, but to really take advantage of Xen and use it to its full potential we were presented with an expensive option: purchasing a <a title="SAN Storage Area Network" href="http://en.wikipedia.org/wiki/Storage_area_network" target="_blank">SAN</a> and more servers. For those in the industry I don't think I need to mention that these get expensive in a hurry. This would have been a huge upfront cost for us, one we didn't want to budget for. The second option, which is the one we eventually went with, was to drop our current hardware solution and take the plunge into cloud computing with Amazon's EC2.</p>
<p style="margin-bottom:0;">Here I am now, a couple of weeks into the switch with a lot of lessons learned. There are definitely pros and cons for each platform, whether you go with EC2 or roll your own architecture. Before I get into the details I want to make clear that there are many factors involved in choosing a technology platform. I am only going to scratch the surface, touching upon the major pros and cons as I see them, with the best interests of CitySquares in mind.</p>
<p style="margin-bottom:0;">Let me begin by starting with the pros for running your own hardware:</p>
<ul>
<li>
<p style="margin-bottom:0;">The biggest pro is most definitely persistence across reboots. I cannot stress the importance of this one enough. You really take for granted the ability to edit a file and expect it to be there the next time the machine is restarted.</p>
<ul>
<li>
<p style="margin-bottom:0;">You only need to configure the software once. Once it's running you don't really care what you did to make it work. It just works, every time you reboot.</p>
</li>
<li>UPDATE 8/21/08: <a title="Amazon releases the much anticipated Elastic Block Store" href="http://justinleider.com/2008/08/21/amazons-ebs-elastic-block-store/" target="_blank">Amazon releases persistent storage</a>.</li>
</ul>
</li>
<li>
<p style="margin-bottom:0;">Complete and utter control over everything that is running. This extends from the OS to the amount of RAM, CPU specs, hard drive specs, NICs, etc. Whether you have an economy or a performance server is all up to you.</p>
</li>
<li>
<p style="margin-bottom:0;">Rather stable and unchanging 	architecture. Server host keys stay the same, the same number of 	servers are running today as there were yesterday and as there will 	be tomorrow.</p>
</li>
<li>
<p style="margin-bottom:0;">Reboot times. For those times when 	something is just AFU you can hit the reset button and be back up 	and running in a few minutes.</p>
</li>
<li>
<p style="margin-bottom:0;">You can physically touch it... It's not just in the cloud somewhere.</p>
</li>
</ul>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">Some cons for running your own hardware:</p>
<ul>
<li>
<p style="margin-bottom:0;">Companies with limited resources 	usually end up with architectures that exhibit single points of 	failure.</p>
<ul>
<li>
<p style="margin-bottom:0;">As an aside, you can be plagued 		by hardware failures at any time. This usually is accompanied by 		angry emails, texts and calls at 3am on Saturday morning.</p>
</li>
</ul>
</li>
<li>
<p style="margin-bottom:0;">Limited scalability options. For a 	rapidly expanding and growing website, the couple weeks it takes to 	order and install new hardware can be detrimental to your potential 	traffic and revenue stream.</p>
</li>
<li>
<p style="margin-bottom:0;">Management of physical pieces of hardware. It's a royal pain to have to go to a co-location to upgrade or fix anything that might need maintenance. Not to mention the potential downtime.</p>
<ul>
<li>
<p style="margin-bottom:0;">Also, there are many hidden costs 		associated with IT maintenance.</p>
</li>
</ul>
</li>
<li>
<p style="margin-bottom:0;">Up front capital expenditures can 	be quite costly. This is especially true from a cash flow 	perspective.</p>
</li>
<li>
<p style="margin-bottom:0;">Servers and other supporting 	hardware are rendered obsolete every few years requiring the 	purchase of new equipment.</p>
</li>
</ul>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">These pros and cons for running your own hardware are pretty straightforward. Some people might mention managed hosting solutions, which would eliminate most of the cons related to server maintenance and hardware failures. However, this added service comes with an added price tag for the hosting. Whether it is right for you or your company is something to look into. We decided to skip this intermediary solution and go straight to the latest and greatest solution, which is cloud computing. To be specific, we sided with Amazon's EC2 (Elastic Compute Cloud) using RightScale as our management tool.</p>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">Some of the pros for using EC2 in conjunction with the RightScale dashboard are as follows:</p>
<ul>
<li>
<p style="margin-bottom:0;">Near infinite resources (Server 	instances, Amazon's S3 Storage, etc) available nearly 	instantaneously. No more Slashdot DoS attacks if everything is 	properly configured and set to introduce more servers automatically. 	(RightScale Benefit)</p>
</li>
<li>
<p style="margin-bottom:0;">No upfront costs, everything is usage based. In the middle of the night, if you are only utilizing one server, that's all you pay for. Likewise, if during peak hours you're running twenty servers, you pay for those twenty servers. (Amazon Benefit, RightScale is a monthly service)</p>
</li>
<li>
<p style="margin-bottom:0;">No hardware to think of. If fifty servers go down at Amazon we won't even know about it. No more angry calls at 3am. (Amazon Benefit)</p>
</li>
<li>
<p style="margin-bottom:0;">Multiple availability zones. This 	allows us to run our master database in one zone which is completely 	separate from our slave database. So if there is an actual fire or 	power outage in one zone the others will theoretically be 	unaffected. The single points of failure mentioned before are a 	thing of the past and this is just one example. (Amazon Benefit)</p>
</li>
<li>
<p style="margin-bottom:0;">Ability to clone whole deployments 	to create testing and development environments that exactly mirror 	the current production when you need them. (RightScale Benefit)</p>
</li>
<li>
<p style="margin-bottom:0;">Security updates are taken care of 	for the most part. RightScale provides base server images which are 	customized upon boot with the latest software updates. (RightScale 	Benefit)</p>
</li>
<li>
<p style="margin-bottom:0;">Monitoring and alerting tools are 	very good and highly customizable. (RightScale Benefit)</p>
</li>
</ul>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">Some of the cons for using EC2 and RightScale:</p>
<ul>
<li>
<p style="margin-bottom:0;">No persistence after reboot. I 	can't stress this one enough! All local changes will be wiped and 	you'll start with a blank slate!</p>
<ul>
<li>
<p style="margin-bottom:0;">All user contributed changes must 		be backed up to a persistent storage medium or they will be lost! 		We back up incrementally every 15 minutes with a full backup every 		night.</p>
</li>
<li>UPDATE 8/21/08: <a title="Amazon releases the much anticipated Elastic Block Store" href="http://justinleider.com/2008/08/21/amazons-ebs-elastic-block-store/" target="_blank">Amazon releases persistent storage</a>.</li>
</ul>
</li>
<li>
<p style="margin-bottom:0;">Writing scripts to configure 	everything upon boot is a time consuming and tedious process 	requiring a lot of trial and error.</p>
</li>
<li>
<p style="margin-bottom:0;">Every reboot takes approximately 10-20 minutes depending on the number and complexity of packages installed on boot, which makes the previous bullet point that much more painful.</p>
</li>
<li>
<p style="margin-bottom:0;">A few of the pre-configured scripts are written quite well. The one for MySQL is as good as they get. You upload a config file complete with special tags for easy, on-the-fly regular expression customization. The Apache scripts, on the other hand, are about as bad as they get. Everything must be configured after the fact.</p>
<ul>
<li>
<p style="margin-bottom:0;">With Apache, however, you'll be writing regular expressions to match other regular expressions. Needless to say, this is a royal pain and you usually end up with unreadable gibberish.</p>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">So there you have it, take it as you wish. For CitySquares, EC2 and RightScale were the best options. Together they allow us to scale nearly effortlessly once configured. They are also a much cheaper option up front, whereas owning your own hardware is generally cheaper in the long run. We did trade a lot of the pros of owning your own hardware to get the scalability and hardware abstraction of EC2. It was a tough decision for us to switch away from our current architecture, but in the end it will most likely be the best decision we've made. The flexibility and scalability of the EC2 and RightScale platform are by far the biggest advantages to switching, and in the end it's what <a title="CitySquares Online -- Hyper Local Neighborhood Search" href="http://citysquares.com" target="_blank">CitySquares</a> needs.</p>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/08/20/running-your-own-hardware-vs-ec2-and-rightscale/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Digging into HAProxy</title>
		<link>http://www.derivante.com/2008/08/13/digging-into-haproxy/</link>
		<comments>http://www.derivante.com/2008/08/13/digging-into-haproxy/#comments</comments>
		<pubDate>Wed, 13 Aug 2008 22:59:08 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[free]]></category>
		<category><![CDATA[HAProxy]]></category>
		<category><![CDATA[high availability]]></category>
		<category><![CDATA[Load Balancing]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[reliability]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=19</guid>
		<description><![CDATA[Well, it's been a few weeks since my last posting here and there is certainly a good reason for that. Every once in a while I just need to completely unplug from technology. So it only made sense for me (&#8230;)</p><p><a href="http://www.derivante.com/2008/08/13/digging-into-haproxy/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>Well, it's been a few weeks since my last posting here and there is certainly a good reason for that. Every once in a while I just need to completely unplug from technology. So it only made sense for me to go away on vacation to the middle of nowhere up in Maine's great north woods for a couple of weeks. No computers, no cellphones, no towns, no people, just dirt logging roads, lakes, rivers, wildlife and trees. Now that I'm back and caught up I will start posting regularly again.</p>
<p style="margin-bottom:0;">Getting back to reality, as the title states, this post will focus on the reasons behind using <a title="HA Proxy -- Load Balancing " href="http://haproxy.1wt.eu/">HAProxy</a> as well as a little bit on <a title="Hyper-Local Search Portal" href="http://citysquares.com">CitySquares'</a> implementation of the load balancer. Let me start by quoting a description of HAProxy from their website:</p>
<blockquote>
<p style="margin-bottom:0;">“HAProxy is a free, <em><strong>very</strong></em> fast and reliable solution offering <a href="http://en.wikipedia.org/wiki/High_availability">high availability</a>, <a href="http://en.wikipedia.org/wiki/Load_balancer">load balancing</a>, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting <strong>tens of thousands</strong> of connections is clearly realistic with todays hardware.”</p>
</blockquote>
<p style="margin-bottom:0;">The high availability aspect of HAProxy is all well and good, but everything is expected to be highly available these days. Any sort of downtime has become unacceptable, even in the middle of the night. This is especially true when relying on search engine driven traffic. I've noticed that search engines such as Google and Yahoo really ramp up their crawl rate in the wee hours of the morning. The crawl rate is boosted even more on weekend nights, when fewer people are searching the web and the search engines can allocate more of their resources to web crawls. CitySquares has certainly been subject to DoS attacks by GoogleBot on Friday nights.</p>
<p style="margin-bottom:0;">This is where the load balancing aspect of HAProxy comes into play, and it is one of the main reasons for choosing it as our front facing service. With just a couple of HAProxy servers we can maintain redundancy while having a nearly unlimited pool of Apache web servers to hand off requests to. We don't need any special front facing, load balancing hardware to act as a single point of failure. We can also keep some money in our pocket at the same time by utilizing a software solution. Luckily, HAProxy is open source and free to the world, licensed under the <a title="GPL v2 License Terms" href="http://www.opensource.org/licenses/gpl-2.0.php">GPL v2</a>.</p>
<p style="margin-bottom:0;">Not only does HAProxy handle our load balancing but it also serves as a central access point for DNS purposes. This solution is certainly much better than our current DNS round robin which is limited in its own right. Is this common sense? Probably, but I figured it was worth pointing out.</p>
<p style="margin-bottom:0;">Lastly, security is always a concern for heavily trafficked and high profile sites. The developer behind HAProxy has been very proactive with the program architecture and coding practices and as such HAProxy can claim it's never had a single known vulnerability in over five years. Since all front facing applications are subject to attacks from so many different sources these days, having a stable and secure application is a godsend when it comes to any sort of security related IT maintenance.</p>
<p style="margin-bottom:0;">As far as implementation goes, I suspect that eventually we might need to move the HAProxy instances onto their own dedicated servers as traffic increases. In the meantime, with EC2, we are running them in parallel with Apache on the same servers. This is purely a cost-saving measure, as every server instance started with EC2 results in more cash out the door. As it is, HAProxy is incredibly fast and lean and really doesn't consume much in the way of system resources, either CPU load or memory utilization.</p>
<p style="margin-bottom:0;">There are certainly other reasons for choosing HAProxy but they are beyond the scope of this post. I encourage everyone to take a serious look at HAProxy when spec'ing out a load balancer or proxy.</p>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/08/13/digging-into-haproxy/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Part 2: An Architecture Overview &#8212; Apache, MySQL, Memcached, SQLite</title>
		<link>http://www.derivante.com/2008/07/24/an-architecture-overview-apache-mysql-memcached-sqlite/</link>
		<comments>http://www.derivante.com/2008/07/24/an-architecture-overview-apache-mysql-memcached-sqlite/#comments</comments>
		<pubDate>Thu, 24 Jul 2008 19:56:41 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[citysquares]]></category>
		<category><![CDATA[horizontal architecture]]></category>
		<category><![CDATA[horizontal database]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[SOLR]]></category>
		<category><![CDATA[sqlite]]></category>
		<category><![CDATA[xcache]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=11</guid>
		<description><![CDATA[In my last post I mentioned the numerous technologies which were on tap for the upcoming version of CitySquares. This installment will continue to define an overview of the underlying architecture and begin to dig a little deeper into the (&#8230;)</p><p><a href="http://www.derivante.com/2008/07/24/an-architecture-overview-apache-mysql-memcached-sqlite/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[
<p style="margin-bottom:0;">In my last post I mentioned the numerous technologies which were on tap for the upcoming version of <a title="CitySquares Online -- Hyper Local Neighborhood Search" href="http://citysquares.com" target="_blank">CitySquares</a>. This installment will continue to define an overview of the underlying architecture and begin to dig a little deeper into the actual implementation of the technologies. The focus of this new architecture is on creating a much more stable and scalable platform for us to work with. Before I get into the details, I've provided a graphic representation of how the architecture will be laid out.</p>
<p style="margin-bottom:0;">
<div id="attachment_12" class="wp-caption aligncenter" style="width: 430px"><a href="http://justinleider.files.wordpress.com/2008/07/architecture-overview.jpg"><img class="size-full wp-image-12" src="http://justinleider.files.wordpress.com/2008/07/architecture-overview.jpg" alt="A visual representation of a horizontal web architecture." width="420" height="300" /></a><p class="wp-caption-text">A visual representation of a horizontal web architecture.</p></div>
<p style="margin-bottom:0;">
<p style="margin-bottom:0;">Bear with me as I explain the work flow behind this graphic, as it is not 100% clear from the visual representation. First off, I run <a title="Ubuntu Linux" href="http://www.ubuntu.com/" target="_blank">Ubuntu Linux</a>, which is great for just about everything I need except creating any sort of graphics, so I apologize in advance for the lackluster graphic. As you can see, there are a few different layers: users, <a title="HA Proxy -- Load Balancing " href="http://haproxy.1wt.eu/" target="_blank">HA Proxy</a>, Apache, <a title="High performance caching system" href="http://www.danga.com/memcached/" target="_blank">Memcached</a>, <a title="SQLite -- A small fast file based database" href="http://www.sqlite.org/" target="_blank">SQLite</a> and finally MySQL, labeled as databases.</p>
<p style="margin-bottom:0;">First and foremost are our beloved users, without whom we would have no need for a website. Starting from the beginning, a user requests a page from CitySquares, and that request is passed through one of two HA Proxy servers. The sole purpose of these two machines is to load balance the incoming requests among all our Apache web servers and serve as a failsafe for one another. Once the user's request has been accepted and forwarded along to Apache we actually begin to process the request.</p>
<p style="margin-bottom:0;">The Apache servers run the PHP and XCache modules. The PHP part I feel is fairly straightforward and out of the scope of this post, so I will skip that part of the architecture. XCache, however, is used in conjunction with PHP and is an enhancement to it. More specifically, XCache is an opcode optimizer and cache. It removes the compilation time of PHP scripts by caching their compiled and optimized state directly in the shared memory of the Apache server. Serving this compiled version can improve page generation speed by up to 500%, speeding up overall response time and reducing server load.</p>
<p style="margin-bottom:0;">As with all dynamic websites, most if not all of the actual data is stored in databases. Gone are the days of flat files with near zero processing required. Databases are the new workhorses of the web world and as such usually become the bottleneck of the overall system. CitySquares is in a somewhat unique position: nearly all our page loads involve quite a bit of location and distance based processing, and nearly all of this is done in our MySQL database. So while our Apache servers are sitting idle waiting for responses to their queries, the DB is performing the brunt of the work, calculating distances between objects and the like.</p>
<p style="margin-bottom:0;">We can reduce this bottleneck in a couple of different ways, the first of which is object caching. We will use Memcached to cache objects returned from the database. Say, for example, we know the distance between two businesses. We know with a fair amount of certainty that those two businesses are going to be in the same place they were an hour ago, just as they were a week ago and as they will be a day from now. So we can cache this information with an expiration time of a couple of days, thus saving ourselves the expense of calculating the distance between them on every page load. Of course, if a user comes by and changes the location of one of these businesses, we can expire the object in cache and replace it with a newly calculated object straight from the database on the subsequent page load. These expensive queries require large table scans and mathematical calculations on every row. Their results can be cached to free up the database and allow it to do what it does best: store and retrieve data.</p>
<p style="margin-bottom:0;">In the case where we can't find the data in Memcached, either because it doesn't yet exist or has expired, we will turn to our databases. We must first query a SQLite instance, which is the gatekeeper between Apache and the numerous databases we have. By having a separate lookup table we can essentially divide and parcel out our data sets on a table-by-table basis, even down to an entry-by-entry basis. Depending on the type of data we are requesting, SQLite will provide us with the location of one database or another to query for our data.</p>
<p style="margin-bottom:0;">One could argue that this just adds another layer of latency, and they would be correct. However, as scalability becomes an issue you will find that adding database replication generally results in diminishing returns. As new servers are brought online, the overhead associated with replicating writes across all the replicated servers becomes choking and creates its own bottleneck. On the other hand, with a lookup table and a horizontal database architecture we don't have to worry about database replication nearly as much. You can just as easily divide your data sets into different databases. How you go about this varies greatly depending on your data. For CitySquares the solution turns out to be rather simple. Everything we do is location specific, so it only makes sense that each data set is only as big as its parent city. Theoretically every city and all the data related to said city could reside in its own database. As you can probably guess, we are only performance limited by the biggest cities, <a title="Manhattan on CitySquares" href="http://ny.citysquares.com/manhattan" target="_blank">Manhattan</a>, <a title="Brooklyn on CitySquares" href="http://ny.citysquares.com/brooklyn" target="_blank">Brooklyn</a>, etc. In these few cases we can always fall back to bigger and better servers and/or replication if necessary.</p>
<p style="margin-bottom:0;">Just as our database has become a bottleneck in our current site, so has our search engine, just to a lesser extent. We can take the lessons learned from our horizontal database architecture and apply them to the search engine architecture as well. By dividing our data sets into logical partitions we can keep our data from getting too large and unwieldy, and with these smaller data sets we can reduce or remove altogether the overhead associated with replicating data over multiple machines.</p>
<p style="margin-bottom:0;">While this solution sounds great, it won't be worth the effort if a programmer has to check Memcached, then SQLite and then finally MySQL for every query. For this to be feasible from a programmer's standpoint, the programmer should never have to think about the underlying architecture. I will discuss this in greater detail in the upcoming installments. Stay tuned.</p>
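<p style="margin-bottom:0;">To make the caching idea concrete, here is a minimal sketch in Python (not our production PHP). It assumes a small in-memory class standing in for a real Memcached client, and the names <code>TTLCache</code>, <code>distance_key</code> and <code>cached_distance</code> are hypothetical, chosen only for illustration.</p>

```python
import time
from typing import Callable, Dict, Optional, Tuple

TWO_DAYS = 2 * 24 * 60 * 60  # "a couple of days", in seconds


class TTLCache:
    """In-memory stand-in for a Memcached client (hypothetical, illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, float]] = {}

    def get(self, key: str) -> Optional[float]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # The entry has expired: drop it and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: float, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


def distance_key(business_a: int, business_b: int) -> str:
    # Sort the ids so (a, b) and (b, a) share one cache entry.
    low, high = sorted((business_a, business_b))
    return f"distance:{low}:{high}"


def cached_distance(
    cache: TTLCache,
    business_a: int,
    business_b: int,
    compute: Callable[[int, int], float],
) -> float:
    """Return the cached distance, running the expensive query only on a miss."""
    key = distance_key(business_a, business_b)
    value = cache.get(key)
    if value is None:
        value = compute(business_a, business_b)  # the expensive database work
        cache.set(key, value, TWO_DAYS)
    return value
```

<p style="margin-bottom:0;">When a business moves, calling <code>cache.delete(distance_key(a, b))</code> for the affected pair forces a fresh calculation on the next page load. A real deployment would also need to expire every other cached pair that involves the moved business, which this sketch leaves out.</p>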
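<p style="margin-bottom:0;">The lookup step might look something like the following Python sketch, which uses the standard <code>sqlite3</code> module. The table name <code>shard_lookup</code>, its columns and the host names are all assumptions made up for this example, not our actual schema.</p>

```python
import sqlite3
from typing import Optional


def build_lookup(conn: sqlite3.Connection) -> None:
    """Create and fill a hypothetical city -> database host lookup table."""
    conn.execute(
        "CREATE TABLE shard_lookup ("
        " city TEXT PRIMARY KEY,"
        " db_host TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO shard_lookup (city, db_host) VALUES (?, ?)",
        [
            # Made-up host names; big cities get a database of their own.
            ("manhattan", "db-ny-1.example.internal"),
            ("brooklyn", "db-ny-2.example.internal"),
            ("somerville", "db-ma-1.example.internal"),
        ],
    )
    conn.commit()


def database_for_city(conn: sqlite3.Connection, city: str) -> Optional[str]:
    """Ask the gatekeeper which database holds this city's data."""
    row = conn.execute(
        "SELECT db_host FROM shard_lookup WHERE city = ?", (city,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    lookup = sqlite3.connect(":memory:")
    build_lookup(lookup)
    print(database_for_city(lookup, "brooklyn"))
```

<p style="margin-bottom:0;">Because the lookup is a single primary key read against a small local file, the extra hop should stay cheap compared with the MySQL query that follows it, and moving a city to a bigger server is a one row update.</p>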
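<p style="margin-bottom:0;">One way to hide the layers is a single data access object, roughly sketched below in Python. This is only an assumption about the shape such a wrapper could take: <code>DataAccess</code>, <code>locate</code> and <code>query</code> are hypothetical names, and a plain dictionary stands in for Memcached.</p>

```python
from typing import Callable, Dict


class DataAccess:
    """Hypothetical facade: callers ask for data and never see the layers."""

    def __init__(
        self,
        cache: Dict[str, object],
        locate: Callable[[str], str],
        query: Callable[[str, str], object],
    ) -> None:
        self._cache = cache    # stand-in for Memcached
        self._locate = locate  # stand-in for the SQLite lookup
        self._query = query    # stand-in for the MySQL query

    def fetch(self, city: str, key: str) -> object:
        cache_key = f"{city}:{key}"
        if cache_key in self._cache:
            return self._cache[cache_key]  # 1. cache hit, nothing else to do
        host = self._locate(city)          # 2. find the right database
        value = self._query(host, key)     # 3. run the real query there
        self._cache[cache_key] = value     # 4. remember it for next time
        return value
```

<p style="margin-bottom:0;">Application code then calls <code>fetch("brooklyn", "business:42")</code> and never needs to know which of the three layers answered.</p>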
<p style="margin-bottom:0;">
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/07/24/an-architecture-overview-apache-mysql-memcached-sqlite/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Part 1: A Technology Overview</title>
		<link>http://www.derivante.com/2008/07/21/a-technology-overview/</link>
		<comments>http://www.derivante.com/2008/07/21/a-technology-overview/#comments</comments>
		<pubDate>Mon, 21 Jul 2008 20:21:28 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Framework]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[mvc]]></category>
		<category><![CDATA[ORM]]></category>
		<category><![CDATA[rightscale]]></category>
		<category><![CDATA[s3]]></category>
		<category><![CDATA[Symfony]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=5</guid>
		<description><![CDATA[This will be the first post in a multi-part series. Each of the following installments will detail the technologies and implementations of the upcoming CitySquares revision. I hope to cover the entire page generation process, starting with the user's first (&#8230;)</p><p><a href="http://www.derivante.com/2008/07/21/a-technology-overview/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[
<p style="margin-bottom:0;">This will be the first post in a multi-part series. Each of the following installments will detail the technologies and implementations of the upcoming <a title="CitySquares Online" href="http://citysquares.com" target="_blank">CitySquares</a> revision. I hope to cover the entire page generation process, from the user's first request to the resulting dynamically generated HTML, CSS, JS, etc. Before I dive too deep into the technical aspect of things I would like to give a brief overview of what is to come.</p>
<p style="margin-bottom:0;">For starters, CitySquares currently owns and operates its own servers in a co-location not far from our headquarters. This will be the first thing to go as we switch to Amazon's EC2 and S3 in conjunction with <a title="RightScale" href="http://rightscale.com" target="_blank">RightScale</a>. By switching off of our own hardware we will rid ourselves of this oft troublesome and physically limiting layer. By using RightScale's server templates and management scripts we can control the precise number of servers in operation. Coping with increased or decreased load will be handled automatically through the RightScale interface: no more DoS by SlashDot and no more wasted cycles during off peak hours. Our server deployment will contain a few different server types, each one specially tuned and selected for its specific purpose. Without getting into too much detail here, our deployment will consist of the following:</p>
<ul>
<li>
<p style="margin-bottom:0;">HA Proxy for load balancing</p>
</li>
<li>
<p style="margin-bottom:0;">Apache with PHP and <a title="XCache Opcode cache and optimizer" href="http://xcache.lighttpd.net" target="_blank">XCache</a></p>
</li>
<li>
<p style="margin-bottom:0;"><a title="High performance caching system" href="http://www.danga.com/memcached/" target="_blank">Memcached</a></p>
</li>
<li>
<p style="margin-bottom:0;">MySQL master/slave configuration</p>
</li>
<li>
<p style="margin-bottom:0;">File server with automated revisioning, concatenation and minification of CSS, JS, etc.</p>
</li>
<li>
<p style="margin-bottom:0;">Tomcat with <a title="SOLR Search Engine" href="http://lucene.apache.org/solr/" target="_blank">SOLR</a> search engine</p>
</li>
</ul>
<p>Once set up, this deployment will remove most of the overhead associated with operating our own IT infrastructure.</p>
<p style="margin-bottom:0;">Not only will our IT situation improve, but our coding environment will change dramatically as we move away from Drupal's more primitive procedural coding style and towards <a title="Symfony Framework" href="http://www.symfony-project.org" target="_blank">Symfony</a>'s OOP style. Symfony is a PHP-based MVC (Model-View-Controller) framework, loosely based on RoR's (Ruby on Rails) best practices for codability and maintainability. We will be using it with the Doctrine ORM (Object-Relational Mapping) and the Smarty templating engine. These architectural and IT changes will promote cleaner, more efficient and more maintainable code. In the end, all of these disruptive changes will be justified by allowing us to focus on what we do best: providing users with <a title="An article describing hyper-local" href="http://searchenginewatch.com/showPage.html?page=3625971" target="_blank">hyper-local</a> search results.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/07/21/a-technology-overview/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Turning the page &#8212; PHP, Symfony, ORM</title>
		<link>http://www.derivante.com/2008/07/18/turning-the-page-php-symfony-orm/</link>
		<comments>http://www.derivante.com/2008/07/18/turning-the-page-php-symfony-orm/#comments</comments>
		<pubDate>Fri, 18 Jul 2008 18:23:42 +0000</pubDate>
		<dc:creator>Justin Leider</dc:creator>
				<category><![CDATA[Framework]]></category>
		<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[CMS]]></category>
		<category><![CDATA[codability]]></category>
		<category><![CDATA[Drupal]]></category>
		<category><![CDATA[horizontal architecture]]></category>
		<category><![CDATA[maintainability]]></category>
		<category><![CDATA[oop]]></category>
		<category><![CDATA[ORM]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[smarty]]></category>
		<category><![CDATA[Symfony]]></category>

		<guid isPermaLink="false">http://justinleider.wordpress.com/?p=3</guid>
		<description><![CDATA[I have come to the conclusion that I should be cataloging my work, thoughts, theories and activities for others to read and learn from my experiences as a web engineer. Let me begin by mentioning I work at a company (&#8230;)</p><p><a href="http://www.derivante.com/2008/07/18/turning-the-page-php-symfony-orm/">Read the rest of this entry &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-425" style="margin: 15px;" title="php-med-trans-light" src="http://www.derivante.com/wp-content/uploads/2009/05/php-med-trans-light.gif" alt="php-med-trans-light" width="95" height="51" />I have come to the conclusion that I should be cataloging my work, thoughts, theories and activities for others to read and learn from my experiences as a web engineer. Let me begin by mentioning I work at a company called <a title="CitySquares Online" href="http://citysquares.com" target="_blank">CitySquares</a> and for the last year I have been working diligently on the current CitySquares site.</p>
<p>This has been a great year for me, as I was given the opportunity to learn the inner workings of the Drupal CMS. While <a title="Drupal CMS" href="http://drupal.org" target="_blank">Drupal</a> is a great CMS/framework, it is inherently still a prepackaged CMS designed for what 99% of the community needs. CitySquares unfortunately falls within the other 1%. I must say that we have accomplished quite a bit using Drupal's community modules in conjunction with our own custom-written ones. However, there are plans in the works that we would like to implement but just can't within the Drupal framework.</p>
<p>All is not lost, however. With the current iteration running, stable and gaining traffic every week, I have the opportunity to turn the page and begin work on the next phase of development. This is an exciting time, and I will use this medium to convey the successes as well as the issues as development here continues.</p>
<p>That said, we have decided to scrap our Drupal-based architecture in favor of a more extensible framework, <a title="Symfony Framework" href="http://www.symfony-project.org" target="_blank">Symfony</a>. Symfony is a PHP-based OO framework that resembles Ruby on Rails. Not only will we gain the benefits of switching to an OO-style framework, but we will also be using Doctrine as our ORM and Smarty as our template engine.</p>
<p>The idea is that this combination of technologies will help us alleviate two of the major problems we have with Drupal, namely scalability and codability. I've been toying with some ideas to help eliminate these two thorns in our side. I will discuss them at a later time, so look forward to hearing my ideas on a full-stack horizontal architecture.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.derivante.com/2008/07/18/turning-the-page-php-symfony-orm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
