Month: June 2012

Thoughts on AWS

This is something that I’ve wanted to post about for a while. Of course, there is no shortage of opinions on AWS, ranging from someone running a few instances to Netflix. At my last company, I spent two years living with AWS every day. A little bit about the actual environment:

    • LAMP stack, although the “A” has been partially swapped out for Nginx
    • JVM/Tomcat/Mongo stack
    • Between 200 and 400 instances, ranging from small to 4XL
    • ~25 million requests a day, with a lot of heavy queries. Peak load is ~1,000 requests/second.
    • Memcache/Hazelcast for caching
    • Lots of everything else

The tools used to track, deploy, and monitor all of this come up throughout the post: Haproxy, Nginx, SmokePing, Hudson, and Puppet/Chef, among others.

Some of the AWS features and services that were used:

  1. ELB – there isn’t a whole lot to say about it. It’s quick and dirty and fairly effective. If you need to quickly load balance across a few instances, it does the job and does it well. With anything more complicated, it’s not really up to par. That generally holds true across most AWS offerings: they are simple and relatively easy to get off the ground, but once your demands change, you will typically need something else. In my infrastructure, nearly everything went through Haproxy, which gives much more granular control over load balancing. If you want to do anything like rewrites or rule-based load balancing, ELB isn’t really an option (see the sketch after this list). It’s also a little faster to add and remove servers from the pool with Haproxy, and you get much better visibility into what’s going on than you will with ELB. The one big caveat with Haproxy is that it will not do SSL termination. Stunnel is a popular option, but we chose to terminate SSL with Nginx.
  2. CloudFront – this is Amazon’s CDN. It doesn’t perform quite as well as some of the alternatives, but it’s decent enough.
  3. VPC – this is Amazon’s Virtual Private Cloud. It seems to generate some controversy. I’ve seen a number of blog posts and opinions that it’s really nothing but a crutch for people moving from the “enterprise” world, something to make the cloud look familiar to them. In a way that’s true, but I like the option a lot anyway. There isn’t anything you can do with VPC that you can’t do on the general EC2 side, but it does make some things a lot easier. Subnets are superfluous in some sense, but they make the infrastructure somewhat simpler to manage: the delineation between different areas becomes much clearer. The ability to manage egress traffic is very useful from a security perspective. Yes, this can be done with iptables on individual instances (a sketch follows this list), but again, the VPC makes it easier. Lastly, if you’re going to deploy any internal corporate apps to EC2, the ability to run a VPN is a useful feature.
  4. EBS – This is probably the most widely discussed and derided part of AWS. We’ve done the usual kabuki dance of striping across multiple volumes (sketched after this list), but that’s roughly the equivalent of putting an air freshener next to a pile of dog shit. The IO remains terrible. Unless you can scale horizontally, EBS will be a bottleneck. Using ephemeral storage is somewhat better, but it’s not a solution. EBS is also expensive: next to RAM, storage eats a huge chunk of your budget. You also have to make sure that your snapshots are well managed, because they will get out of hand quickly.
  5. RDS – We tested this about a year ago. At the time, it didn’t do as well as MySQL running on our own dedicated instances.
  6. S3 – it’s useful. We primarily used it for deployments: Hudson would upload the build to a bucket, and the instances would pull the latest code version from there (a sketch follows this list).
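
To make the Haproxy comparison in item 1 concrete, here is a minimal sketch of the kind of rule-based routing that ELB can’t do, assuming Nginx terminates SSL in front and forwards plain HTTP to Haproxy. The pool names, ports, and addresses are made up for illustration:

    # haproxy.cfg excerpt: Nginx terminates SSL and proxies plain HTTP to :8080
    frontend www
        mode http
        bind 0.0.0.0:8080
        acl is_api path_beg /api          # route API calls to their own pool
        use_backend api_pool if is_api
        default_backend web_pool

    backend web_pool
        mode http
        balance roundrobin
        option httpchk GET /health        # pull unhealthy servers automatically
        server web1 10.0.1.11:80 check inter 2000 rise 2 fall 3
        server web2 10.0.1.12:80 check

    backend api_pool
        mode http
        balance leastconn
        server api1 10.0.2.11:80 check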
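
For the VPC egress point in item 3, the per-instance iptables equivalent looks roughly like this (the addresses are hypothetical). The VPC lets you express the same policy once, at the subnet level, instead of on every box:

    # Default-deny egress, with explicit holes (run on each instance)
    iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A OUTPUT -d 10.0.0.0/16 -j ACCEPT        # internal traffic
    iptables -A OUTPUT -p udp --dport 53 -j ACCEPT     # DNS lookups
    iptables -A OUTPUT -j DROP                         # everything else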
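
The EBS striping “kabuki dance” from item 4, for reference: stripe several attached volumes into a single RAID 0 device with mdadm. The device names and volume count are illustrative, and as noted, this only softens the IO problem:

    # Stripe four attached EBS volumes into one RAID 0 device
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
          /dev/sdf /dev/sdg /dev/sdh /dev/sdi
    mkfs.xfs /dev/md0
    mount /dev/md0 /data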
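
Finally, the S3 deployment flow from item 6, roughly. The bucket and file names are hypothetical, and s3cmd here is just one way to do the transfer:

    # On the Hudson box, after a successful build:
    s3cmd put build/app-$BUILD_NUMBER.tar.gz s3://deploy-bucket/app-latest.tar.gz

    # On each instance, at deploy time:
    s3cmd get --force s3://deploy-bucket/app-latest.tar.gz /tmp/app.tar.gz
    tar -xzf /tmp/app.tar.gz -C /var/www/app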

In general, this shouldn’t have been a very complicated setup. What made it difficult was a massive and poorly written PHP code base, backed by a myriad of MySQL slaves that included shards, federated tables, ETLs, multiple versions, and multiple masters in a hierarchy.

So what did we get with this setup? The number one headache in AWS is not even the poor IO. Variable performance, and especially variable latency, is what’s going to trip up a lot of people, particularly those coming out of a DC or colo environment. In a dedicated environment, if your network latency is >5ms, it means there is a problem and it will be fixed. In EC2, this happens constantly and unpredictably. You absolutely have to factor that into application design, and you skip it at your peril. My daily mantra was “short timeouts and quick retries”, together with “fail quickly and fail open”. Here is a more specific example:

A lot of queries were very, very expensive. Rather than spending the time fixing the queries or looking at different algorithms, people decided to bump the timeouts. That’s a poor design choice in any case, but you might be able to get away with it in a colo. What happens on a LAMP stack in AWS is that instances will occasionally slow down, for one reason or another. If you are running Apache on the front end, you have a finite number of threads/workers, and if they are set to time out after, say, 2 minutes on MySQL connections, a single slow query will quickly back up the entire site. Web servers start running out of threads; since requests are coming in faster than they time out, you find yourself in a site-down situation. It becomes a rather nasty cascading effect. A sketch of the short-timeout, fail-open approach follows.
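
Here is a minimal PHP sketch of the “short timeouts, fail open” idea, assuming mysqli and a memcache fallback; the host names and cache keys are illustrative. The point is that a connection attempt gives up in a couple of seconds and releases the Apache worker, instead of holding it for two minutes:

    <?php
    // Fail fast: give MySQL two seconds to connect instead of the default
    $db = mysqli_init();
    $db->options(MYSQLI_OPT_CONNECT_TIMEOUT, 2);

    if (!@$db->real_connect('db-host', 'user', 'pass', 'appdb')) {
        // Fail open: serve stale cached content rather than tying up the worker
        $cache = new Memcache();
        if (@$cache->connect('cache-host', 11211, 1)) {  // 1-second connect timeout
            $page = $cache->get('page:' . $_SERVER['REQUEST_URI']);
            if ($page !== false) {
                echo $page;
                exit;
            }
        }
        header('HTTP/1.1 503 Service Unavailable');
        exit;
    }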

On the cost side, I haven’t crunched the numbers completely, so this is more of a casual observation: AWS is expensive. That does come with a few caveats. If you’re running 0-10 instances, AWS is great. If you’re Netflix and you need to spin up 1,000 instances for a crunching job, AWS will be terrific. If you have highly variable traffic patterns and might need to scale up quickly, that’s certainly a sweet spot for AWS. Between those extremes, though, there is a wide range of experiences. Purely as a back-of-the-envelope calculation: my prior company ran in a colo with leased dedicated hardware, basic remote hands, and shared storage, and that setup performed better at perhaps 50-75% of the cost. Of course, getting reserved instances will reduce the price significantly, especially over the 3-year term. The problem with that route is that it removes a lot of flexibility, which is the selling point of AWS to begin with. Not to mention that reserved instances are tied to a specific availability zone.

In conclusion:

  • SmokePing was indispensable for flushing out latency problems.
  • Try to reduce the data velocity within the infrastructure. The less replication/synchronization that has to be done, the less the chance for problems.
  • Swapping Apache out for Nginx was a big gain. Even for a site that wasn’t heavy on static files, breaking PHP out into its own tier allowed independent scaling and easier troubleshooting.
  • Don’t try to fix broken instances. Kill them and relaunch. This is where you have to have absolute confidence in your monitoring and auto configuration (Puppet/Chef).
  • Unless you use EC2’s strengths, you’re likely to get better performance/cost in a colo.
  • Make sure that all services are idempotent and fail quickly.
  • Anything that’s not thread-based (nginx, haproxy, node, etc.) is probably going to do better in EC2.
  • Go for many small instances vs a few large ones if you can. This will scale better and make your infrastructure more resilient. If you use EBS-backed instances, you’ll have the option to scale up anyway.