Is your SOLR installation running slower than you think it should? Are performance, throughput and scalability not what you were expecting or hoping for? Do you constantly see that others have much higher SOLR query performance and scalability than you do? All it might take to fix your woes is a simple schema or query change.
The scenario I am about to describe is proof positive that you should always take the time to understand the underlying functionality of whatever operating system, programming language or application you are using. Let my oversight and 'quick fix solution' be a lesson to you: it is almost always worth the upfront cost of doing something right the first time so you don't have to keep revisiting the same issue.
Before I delve into the nuances of SOLR, let me first give you some background on what took place over the last half year at CitySquares. Back in the fall of last year the CitySquares website began experiencing exponential growth in traffic. This growth was due to an expansion of its IYP (Internet Yellow Page) services into the New England and Metro New York areas. Prior to and during the beginning of the first wave of traffic growth, every business listing was powered by very large MySQL queries including a couple of joins. The queries themselves weren't all that complex, but they were big and unwieldy, with joins on very large tables and lots of columns in the result sets. In some of the larger cities covered at the time (Manhattan, Bronx, Queens, Boston, etc.) there were up to 100,000 rows of data that needed to be sorted before returning a rather small subset (20-40 rows) for each business listing page load. While this wasn't a big deal when CitySquares was still a niche Boston-centric destination, it quickly became a huge burden on the MySQL servers. Some of these queries were so big that the servers would run out of memory trying to crunch through a 3GB temp table and start thrashing the disks to serve a request for Manhattan. We needed a better solution, and quickly.
Luckily for us, we had already implemented a SOLR search engine with all the necessary data indexed from our database, initially with the sole intent that search result sets shouldn't have to query the database. This worked to our advantage, since it was very easy for us to modify the code base to query SOLR instead of MySQL. Both result sets were formatted as an object with the same field names and all. It was a perfect drop-in replacement.
The SOLR solution we implemented used SOLR's match-all q.alt=*:* parameter to select all documents, then applied a filter query (fq) to that set to get all documents related to our filter. It was a huge win for us at the time. Not only were the queries faster than the MySQL ones, but the SOLR servers could handle more of these queries without even coming close to exhausting the servers' resources. This quick and dirty solution was satisfactory for the next few months, until CitySquares' next round of expansion began and the queries again became a burden. The second time around we didn't have another seemingly quick fix. I spent a couple of days trying to figure out a better way to use the q.alt=*:* parameter, but to no avail, so I gave up and moved on to other performance optimizations.
Unfortunately, I didn't take the time to understand the code behind the query, and I didn't understand exactly how SOLR was implementing the query in its back end process. Since I didn't understand the basis of the problem, I couldn't possibly know the query could be easily refactored. After a few weeks of high loads (20+ on our 8-core servers), I struck up a conversation with Michael, the developer who wrote the query. We discussed how the query worked and what it needed to do, and after five minutes we had discovered a much better way to structure the query. It took me only a minute or two to refactor the original query to produce the exact same result set. This new query was incredibly fast! I benchmarked it to be about 100x faster than the previous query, and on top of that it was a simple drop-in replacement!
From what I've deduced, the original query passed a blank query string with a filter query to SOLR, which in turn defaulted to the q.alt catch-all first and then applied the filter to the results of the catch-all query. This is exactly the opposite of what we were expecting SOLR to do. We believed that the filter was applied first and then the q.alt was applied. That was not the case. While this misunderstanding wasn't ideal, it wasn't too slow either with only 1.4 million documents to parse over. However, once CitySquares hit the 14.5 million mark this query became unmanageable. Basically, SOLR parsed over every single document in the index before applying the filter query we were using. To rectify this and regain performance and throughput on our servers, I simply moved the filter query statement to the query statement and specified the query field to be the same as the original filter field.
i.e.
Original query passed a blank query string with a filter query:
- select?q=+&fq=<FIELD>:<ID>
The updated query now passes the id as the query string and specifies the former filter field:
- select?q=<ID>&qf=<FIELD>
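To make the two request shapes concrete, here is a minimal Python sketch that builds both query strings. The field name city_id and the ID 42 are hypothetical placeholders for <FIELD> and <ID>, and the function names are made up for illustration. It only assembles the parameters shown above and assumes the rest of the request handler configuration (such as the q.alt default) is already in place on the server.

```python
from urllib.parse import urlencode


def original_query(field, doc_id):
    """Blank q (a single space, encoded as '+') plus a filter query."""
    params = {"q": " ", "fq": f"{field}:{doc_id}"}
    # safe=":" keeps the field:value separator readable in the URL.
    return "select?" + urlencode(params, safe=":")


def updated_query(field, doc_id):
    """ID passed as the query string, former filter field passed as qf."""
    params = {"q": str(doc_id), "qf": field}
    return "select?" + urlencode(params)


# Hypothetical field name and ID, for illustration only.
print(original_query("city_id", 42))
print(updated_query("city_id", 42))
```

Both functions take the same arguments, which mirrors why the change was a drop-in replacement on the calling side.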
Instead of taking advantage of the strength of SOLR, and of every other search engine, which is O(1) search time, we were at the mercy of its worst-case O(n) scan time. This simple misunderstanding of how SOLR processes queries in the back end caused massive performance and throughput bottlenecks. These bottlenecks affected our short- and long-term infrastructure plans, and were the root cause of many performance headaches for our users, customers and IT department.
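The difference between the two access patterns can be sketched with a toy example in Python. This is not how SOLR is implemented internally; it is only an illustration of scanning every document versus looking a value up in a prebuilt index, using a made-up city_id field and synthetic documents.

```python
from collections import defaultdict

# Synthetic documents: 100,000 listings spread evenly over 1,000 cities.
docs = [{"id": i, "city_id": i % 1000} for i in range(100_000)]


def scan(documents, field, value):
    """O(n): visit every document and keep the ones that match."""
    return [d["id"] for d in documents if d[field] == value]


def build_index(documents, field):
    """Build a value -> list of document ids map once, up front."""
    index = defaultdict(list)
    for d in documents:
        index[d[field]].append(d["id"])
    return index


def lookup(index, value):
    """Roughly O(1) to find the posting list for a value."""
    return index.get(value, [])


city_index = build_index(docs, "city_id")

# Both approaches return the same ids; only the work done per query differs.
print(scan(docs, "city_id", 7) == lookup(city_index, 7))
```

The scan cost grows with the size of the whole collection, while the lookup cost depends mostly on how many documents actually match, which is the behavior the refactored query was able to exploit.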
If this isn't proof positive that you should always take the time to understand the underlying functionality of whatever operating system, programming language or application you are using, I don't know what is.
Jim,
I have benchmarked the performance differences between the old and new queries, their use of the fieldCache, and the differences between single indexes and multi-core indexes.
http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
Does your new query style take advantage of the filterCache as well as the old one did? I guess if overall perf is 100x better, who cares.
Nice, we do similar things with our queries. I’ll see if this applies in our case. +1 to understanding what's going on before optimizing!