Running your own hardware Vs EC2 and RightScale — Part 2
by Justin Leider on September 16, 2008
This week I've been reminded of a very important lesson... No matter how abstracted you are from your hardware, you still inherently rely on its smooth and consistent operation.
This past week CitySquares' NFS server went down for the count and was completely unresponsive to any type of communication. In fact, the EC2 instance was so FUBAR we couldn't even terminate it from our RightScale dashboard; it took a post on Amazon's EC2 board to get it terminated. It turns out the physical hardware our instance was running on had suffered a catastrophic failure of some sort. Normally, or so I'm told, instances are automatically migrated off of machines running in a degraded state.
Needless to say, the very reasons for deciding against running our own hardware have come back to plague us. Granted, we weren't responsible for replacing the hardware, but we were still affected by the troublesome machine. We weren't just slightly affected by the loss of our NFS server either. Since we are running off of a heavily modified Drupal CMS, our web servers depend on having a writable files directory. As it turned out, Apache just spun waiting for a response from the file system, and our web services ground to a halt waiting on a machine that was never going to respond... ever. Talk about a single point of failure! A non-critical component, serving mainly images and photos, managed to take down our entire production deployment.
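One way to notice a hung file server before everything piles up behind it is to probe the mount with a time limit. What follows is only a minimal sketch under stated assumptions: the function name `probe_dir`, the mount point `/mnt/files`, and the five-second limit are made up for illustration, and it relies on the coreutils `timeout` command. A process blocked on a hard NFS mount can sit in uninterruptible sleep and ignore the signal, so a probe like this is most useful with soft or interruptible mounts.

```shell
#!/bin/sh
# probe_dir DIR SECONDS
# Returns 0 if DIR answers a stat call within SECONDS, non-zero otherwise.
# Hypothetical helper; requires the coreutils `timeout` command.
probe_dir() {
    timeout "$2" stat "$1" >/dev/null 2>&1
}

# Example: /mnt/files is an assumed mount point for the shared files directory.
if probe_dir /mnt/files 5; then
    echo "files directory is responding"
else
    echo "files directory is missing or not responding" >&2
fi
```

A check like this could feed a monitoring alert or a health check, so the failure shows up as a clear signal instead of a pile of stuck Apache processes.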
This event has prompted us to move forward with a rewrite of Drupal's core file handling functionality. The rewrite will include automatically directing file uploads to a separate domain name like csimg.com or something similar. Yahoo goes into more detail on this in their performance best practices. However, editing the Drupal core is generally frowned upon and heavily discouraged, since it usually conflicts with the upgrade path and makes the Drupal core much harder to maintain. While we haven't stayed out of the Drupal core entirely, the changes we have made are minor and only for performance improvements. I believe it is possible to stay out of the core file handling by hooking into it with the nodeapi, but it seems like more trouble than it's worth.
The idea behind the file handling rewrite is to serve our images and photos directly from our co-location facility while keeping a local files directory on each EC2 instance for non-user-generated things like CSS and JS aggregation caching, among other simple cache-related items coming from the Drupal core. This rewrite will allow us to run one less EC2 instance, saving us some money as well as removing our dependence on a catastrophic single point of failure.
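The split boils down to a routing decision on each file path. The real change would live in Drupal's PHP file handling, so the shell function below is only a rough illustration of the rule. The function name `file_url`, the `css/` and `js/` prefixes, and the host `www.example.com` are assumptions; `csimg.com` is the example image domain mentioned earlier.

```shell
#!/bin/sh
IMAGE_HOST="csimg.com"        # example image domain mentioned in the post
LOCAL_HOST="www.example.com"  # placeholder for the web server's own host name

# file_url PATH
# Aggregated CSS/JS caches stay in the local files directory of each instance;
# everything else (user-uploaded images and photos) goes to the image host.
file_url() {
    case "$1" in
        css/*|js/*) printf 'http://%s/files/%s\n' "$LOCAL_HOST" "$1" ;;
        *)          printf 'http://%s/%s\n' "$IMAGE_HOST" "$1" ;;
    esac
}

file_url "css/aggregate.css"
file_url "photos/storefront.jpg"
```

With a rule like this, losing the image host breaks image URLs but leaves the web servers with nothing remote to block on.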
For the time being we have set up another NFS server, this time based on Amazon's new EBS product, which I spoke about in a previous post. One of the issues we had when the last NFS server went down was the loss of user-generated content. Once the instance went down, all the storage associated with that instance went down with it. There was no way to recover from the loss; it was just gone. This is just one of the many possible problems you can run into with the cloud. On the pro side, you don't have to worry about owning your own hardware; on the con side, you can't recover from failures the way you can with your own hardware. This is a very distinct difference and should be seriously considered before dumping your current architecture for the cloud.
2 comments
Nice to see some bad experiences with the cloud. Unfortunately everyone seems to think they are bulletproof. Researching methods on which to build fully redundant cloud infrastructure is always interesting.
Unfortunately, the hype about EC2 or any other cloud or grid type infrastructure does not take into account the fine details. Everyone (developers and CTOs) just thinks slapping it all on the cloud is great.
Things like EBS/XFS kernel module bugs that prevent instances from remounting EBS volumes after a reboot never get taken into account. Then digging to determine what DOESN’T have problems becomes a spiralling, looping journey through Amazon forums and Google groups.
Making it all fully redundant in the cloud is quite complex, and the 24-page slideshows do not go into that. Complex and costly… as you add this functionality and that functionality, branching here and there.
Building in the cloud is not the problem… recovering in the cloud when it goes wrong is… because of some kernel module or the like, which is somewhat out of the control of the maintainer. Anyway, I just wanted to say thanks for sharing your experiences.
PS – EBS and XFS? Check your kernel..
dmesg | head -n 1
If it reports Linux version 2.6.21.7-2.fc8xen-ec2-v1.0, then perhaps start looking through some of the following threads, as it seems it may just happen again, even with EBS (I'm not certain about the EBS boot scenario).
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=28968
http://developer.amazonwebservices.com/connect/message.jspa?messageID=151961
http://groups.google.com/group/ec2ubuntu/browse_thread/thread/0cfca179e77a880f?pli=1
The latest Ubuntu kernel has the XFS module compiled into it so those problems should not occur on it.
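The kernel check above can be scripted. This is a minimal sketch, assuming `uname -r` reports the same release string that shows up in the dmesg line; the function name `is_suspect_kernel` is made up for illustration.

```shell
#!/bin/sh
# The kernel release named above in connection with the EBS/XFS remount problem.
SUSPECT_RELEASE="2.6.21.7-2.fc8xen-ec2-v1.0"

# is_suspect_kernel STRING
# Returns 0 if STRING (a release or a dmesg line) contains the suspect release.
is_suspect_kernel() {
    case "$1" in
        *"$SUSPECT_RELEASE"*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_suspect_kernel "$(uname -r)"; then
    echo "kernel matches $SUSPECT_RELEASE; read the threads above before relying on EBS with XFS"
else
    echo "kernel does not match $SUSPECT_RELEASE"
fi
```

The function also accepts the first dmesg line directly, since it only looks for the release string somewhere in its argument.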
I find it hard to get my mind around running production appliances on EC2, when EC2 itself is a single point of failure.
Is there a storm brewing in the clouds..?
by earthgecko on March 15, 2010 at 11:31 am. #
Hi Justin, just a thought: http://www.subcloud.com might be a better solution than an EC2 instance + NFS, in terms of both cost and reliability (i.e., no need to run an EC2 instance for NFS)
by Randy Rizun on October 7, 2008 at 9:18 pm. #