That went well, what now?
by Clay vanSchalkwijk on March 18, 2009
It has been about a month now since the roll out and you can see the traffic trends rising since we started this process back in January. At the rate google is crawilng the data, the projection is that traffic will continue to rise well into the fall as everything is indexed.
With that said, we are about to surpass several sites on the way of traffic including reddit.com, fark.com, mcdonalds.com, and ibm.com to name a few. As a developer, seeing the metrics come back helps motivate and encourage the work that I've done. Even now we are still dealing with speed bumps along the way. None of which are noticeable as far as traffic is concerned but this maintained scalability is certainly a huge task. Using Drupal as a back end has proven that there are several challenges with how we proceed going forward. We've decided to scrap the MySQL Master/Master replication due to Drupal's sequences tables and duplicate key problems. An issue easily fixed if only auto increment was used... but alas without rewriting a good chunk of the code base going forward we must adapt to Master/Slave Read/Write splitting. It seems a week does not go by without encountering a scaling/replication pitfall. Drupal's general compatibility attitude torwards their framework makes it very difficult to leverage any perticular technology like MySQL to it's maximum because the database layer is written with several database backends in mind. A word of caution going for other developers that when they plan on creating a high traffic web site, there is a point where an up front investment in the infrastructure and backend will pay off huge. I believe we're reaching that point.
The unfortunate part with rapid growth is if the team is capable of adjusting at the same pace. While there is only but so much that can be planned ahead, now more than ever it is important that issues are indetified long before the become customer facing because the stakes are so much higher. Despite a successful launch, there is still a lot more ahead. How much time do we invest into new features, maintenance, and re-writes? What takes a higher priority, growth or consumer experience? Do we have the resources to invest in research and development?
At the end of every milestone, I find it necessary everyone pats themselves on the back, take deep breath, regroup as a team, and the cycle begins all over again. The gaps in between the end of one project and a start of another is the most important time for management and development to be in step with each other so everyone can move forward rowing in the same direction. Revisit company values, mission statements, and have meaningfull follow up discussions on what went well and what didn't. If as a team there is no time allocated for dialogue, despite accomplishing the task at hand, the same problems will occur over and over again. Not all problems in development are technical-- process and communication are consistent issues that seems to always manifest one way or another when working in a collaborative enviroment and it's important to determine what works well in the current situation. What may have worked in the past on a project, at a previous job, or for one person might not work now.
Congratulations on a job well done, let's open the dialogue and relish in the reflection time... that went well, what now?
Drupal Scaling, not so much
by Clay vanSchalkwijk on February 25, 2009
After working on the migration this week and moving CitySquares.com over to a new environment we ended the week with a load test from Soasta. Starting at several hundred concurrent users to a cap of 2500 simulated users on at once we were able to really pick apart the infrastructure. Couple this with our own internal testing and tweaking with ab over the past few days I've come to the conclusion that Drupal does not scale so well when you're talking about traffic in the tens of millions of page views a month. It's just not possible to get requests in and out fast enough without continually adding more boxes to the front end. I tried every setup from mod_php to php-fcgi, worker, prefork, eaccelerator, xcache, and every configuration therein.
I realized that I was over-thinking the problem so I did a little test to just see how much overhead Drupal is adding. I can execute ten times the number of queries natively through PHP and mysql_connect than if I were to execute those queries from within Drupal with just bootstrap included. Apache answered requests and processed these several million round trips to the database much faster. I understand that as a framework there is a lot being loaded, but a whole decimal point off when dealing with requests/sec is a huge overhead.
The better part of the last few days was spent gutting the inner workings of Drupal and removing as much of it as possible to lover overhead, reduce response time, and hopefully let us scale a little further on the hardware that we have. As it is, traffic is growing a steady 40-50% month over month for a while now and we are already dealing with several million unique visitors a month. That number is growing rapidly with no ceiling in the near future and it's unfortunate that at this point the only way to make this work is to break the upgrade path of Drupal.
Going through the code base, it's easy to see where so much of the bloat comes from. With comments like: "TODO: remove this when we require at least PHP 4.4.0" there is a lot of backwards compatibility that is required to support such a large community. I am seeing more and more Frameworks replace legacy CMS systems that provide a bare bone set of tools. The agility, performance, and maintenance of a ground up site using frameworks like Symfony or Rails leads to longer development cycles and increased costs but there is a point where you stop putting a round peg in a square hole and realize that the cost of maintenance and development down the road will far exceed the up front investment.
I'm glad Drupal has been able to take us this far and I'm very grateful for the community behind it. It's a pleasant surprise that everything is still working and hopefully we can get a few more miles out of it before we are ready to do the next iteration of the site.
SQL script to grab the worst performing indexes
by Clay vanSchalkwijk on February 11, 2009
I have been doing a lot of auditing and clean up of database performance the last few days. We are currently in the middle of a migration and with hardware infrastructure in place it is time to go back and see what changes we can do on the code and database side of things to help bring the site up to optimal performance and lower query times. I found this gem in the MySQL Forge site; which turned out to be a great resource for MySQL tidbits.
/* SQL script to grab the worst performing indexes in the whole server */ SELECT t.TABLE_SCHEMA AS `db` , t.TABLE_NAME AS `table` , s.INDEX_NAME AS `inde name` , s.COLUMN_NAME AS `field name` , s.SEQ_IN_INDEX `seq in index` , s2.max_columns AS `# cols` , s.CARDINALITY AS `card` , t.TABLE_ROWS AS `est rows` , ROUND(((s.CARDINALITY / IFNULL(t.TABLE_ROWS, 0.01)) * 100), 2) AS `sel %` FROM INFORMATION_SCHEMA.STATISTICS s INNER JOIN INFORMATION_SCHEMA.TABLES t ON s.TABLE_SCHEMA = t.TABLE_SCHEMA AND s.TABLE_NAME = t.TABLE_NAME INNER JOIN ( SELECT TABLE_SCHEMA , TABLE_NAME , INDEX_NAME , MAX(SEQ_IN_INDEX) AS max_columns FROM INFORMATION_SCHEMA.STATISTICS WHERE TABLE_SCHEMA != 'mysql' GROUP BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME ) AS s2 ON s.TABLE_SCHEMA = s2.TABLE_SCHEMA AND s.TABLE_NAME = s2.TABLE_NAME AND s.INDEX_NAME = s2.INDEX_NAME WHERE t.TABLE_SCHEMA != 'mysql' /* Filter out the mysql system DB */ AND t.TABLE_ROWS > 10 /* Only tables with some rows */ AND s.CARDINALITY IS NOT NULL /* Need at least one non-NULL value in the field */ AND (s.CARDINALITY / IFNULL(t.TABLE_ROWS, 0.01)) < 1.00 /* Selectivity < 1.0 b/c unique indexes are perfect anyway */ ORDER BY `sel %`, s.TABLE_SCHEMA, s.TABLE_NAME /* Switch to `sel %` DESC for best non-unique indexes */ LIMIT 10;
To audit just one database if you are running on a server with several different databases, just adjust the where clause to WHERE t.TABLE_SCHEMA = 'mytable'. This would have been very useful when working with the cluster to recover memory/space from indexes that aren't being used and to optimize queries to hit indexes that are meaningful.
MySQL Proxy Quick Start
by Clay vanSchalkwijk on February 10, 2009
There isn't too much out there for easy reliable mysql load balancing. I recently built a master/master mysqld setup and needed automatic roll over on the software side. After going through the options I ended up going with MySQL Proxy 0.60 for low overhead, ease of setup, and ease of maintenance. I installed the proxy on each of my web servers (already load balanced through hardware) and configured the code to just use the local proxy. A quick start for RHEL systems, you can get it up and running in just a few simple steps:
- Download the latest version at http://dev.mysql.com/downloads/mysql-proxy/.
- yum install glib2-devel.x86_64 ncurses-devel.x86_64 libevent-devel.x86_64 mysql-devel.x86_64
- ./configure -without-lua (for a straight proxy, you will not need LUA and this will save you having to install additional dependencies).
- make && make install
Easy enough to get it installed in the system, but I noticed that the default parameters and as with most software is rather wide open so be sure to lock down the ip addresses you are listening on.
mysql-proxy --proxy-backend-addresses=192.168.1.101:3306 --proxy-backend-addresses=192.168.1.102:3306 --admin-address=127.0.0.1:4041 --proxy-address=127.0.0.1:4040 --proxy-skip-profiling --daemon
The admin address and proxy address will be accessible by anyone hitting your server unless you lock it down to your local host. Now configure your web servers to use localhost:4040 to connect to MySQL and all should be working fine. You can find additional configuration parameters, admin commands, and LUA scripts at http://forge.mysql.com/wiki/MySQL_Proxy.
There and Back Again, an EC2 MySQL Cluster
by Clay vanSchalkwijk on January 26, 2009
Limitations of EC2 as a web platform:
- Price- An m1.xlarge instance will run you ~$600 with data transfer costs. Managed hosting solutions run cheaper especially if you plan on purchasing in bulk. The grid is designed for on-demand computation and not as a cost efficient web services.
- Configuration- There are a limited number of options and you will not be able to tailor the hardware to your application. Databases over 10GB of size will have performance issues since that is the memory cap.
- Network storage- The primary disks offers limited storage, additional volumes will need to be attached across the network and at an additional cost.
- Software - No hardware based solutions for load balancing or custom application servers. The model is software driven so all needs must be met with a software solution. In a managed hosting or collocation solution you will at least have the option of adding additional hardware and having a private network. No dedicated switching, routers, firewall, or load balancers.
EC2 might be right for you if:
- Distribution Awareness- Your application was designed to scale horizontally from the get-go and you can take advantage of grid computing.
- Research and Development- EC2 & Rightscale will allow for you to bring up new servers, test configuration, and scale quickly. If you are not sure what your hardware demands will be or the scope of the project it will allow for some flexibility to get this right before committing to rather lengthy contracts with other hosting options.
- Disaster Recovery- If you need an off-site mirror for your site that you can keep dormant and activate as needed.
Over the past several months I have been doing extensive development using Amazon's EC2 as my hardware infrastructure. I was tasked with taking CitySquares.com from a New England area hyper-local search and business directory to a national site in a few months. Due to the memory limitations of EC2 instances, m1.xlarge only providing 15GB the jump from a 15GB database to anything larger becomes very costly. When everything was able to be contained in two servers in a master/slave environment we were able to provide redundancy, performance, and easy management when working with the database. The final estimations of the national roll out would put our core data at 50GB. Far too large for any one EC2 instance. Going to disk was not an option as everything works off of EBS attached storage and that any disk writes means traveling over the network. An additional overhead which degrades performance even more when switching off RAM. Then, there was also the nature of the data, which means that at any page load, any piece of data could be requested.
With OS overhead, index storage, and ndb overhead, each x1.mlarge instance gave about 12GB of usable storage. Include replication, and it comes down to 6GB of storage per node. To store a ~50GB database that I had planned on requires 8 storage nodes, two management servers, and two mysqld api servers. This is where it became important to understand the advantages of vertical scaling versus horizontal scaling. EC2 provided fast horizontal scaling and configuration. Servers can be launched on demand and their configurations scripted. While I appreciated that aspect of cloud computing and being able to bring that many servers up and configure each one relatively quickly I really just needed two decent boxes with 64GB of RAM in each and a master/slave setup. The operations costs for the cluster was $6000/month, a hefty bill considering I could buy all the hardware needed to run the cluster in just a few months of paying for EC2.
Twelve servers later and a working cluster we were able to successfully roll out our MySQL Cluster with minimal performance loss. A lot of it was due to tweaking every query top to bottom. The site runs on a Drupal core, which meant a lot of the queries were not designed with distribution awareness both from code in house and the core. This was another added growing pain since the network overhead of running 12 servers on shared resources, with mediocre latency, and throughput limitations was amplifying flaws in the database design and every poorly designed query and join degraded performance significantly.
To give credit where credit is due, EC2 did allow us to scale the site up rather quickly. We were able to test server configurations, new applications, and have easy management. It would not have been possible for us to push out the data, handle influx of new traffic, and expand as fast as we had without it. Long term however, it made absolutely no sense that once we were finished scaling up, to stay on EC2. It is a great platform for start-ups to be able to configure and launch servers for their service or application and grow rapidly.
As a database platform, until EC2 offers more configuration options in it's hardware and the ability to increase memory, the cap of 15GB will make EC2 problematic for any database that plans on growing past that. It is important to understand your application and database needs before considering EC2. It is no surprise to me that even with the RightScale interface, and easy management of EC2 web sites are reluctant to switch off their own hardware or managed hosting.
Both Sun and Continuent are pursuing MySQL Clustering on cloud computing. As of this post, Continuent for EC2 was still in closed beta and testing and Sun is doing their own research into offering more support for database clusters on compute clouds. Maybe in the future this can be something to revisit, but it would require more ndb configuration options on the network layer to cope with shared bandwidth and additional hardware configurations by Amazon (someday).
The 6.4 release which is in beta now offers new features which would make MySQL Clustering more attractive on cloud computing architecture:
- Ability to add nodes and node groups online. This will allow the database to scale up without taking the cluster down.
- Data node multithreading support. The m1.xlarge instance comes with 4 cores.
If you are setting up a MySQL cluster, the following resources will help get you up and running quickly: