A long and very informative post about what TripAdvisor found out when they tested AWS for their infrastructure. There are a lot of interesting tidbits in there, some of which are hard to analyze without seeing precise numbers. What I find interesting is that they essentially ported their existing datacenter setup to AWS. Granted, their stated goal was to look at cost/performance without changing the operational model. However, in my experience with AWS, simply reusing your datacenter architecture isn’t sufficient and will likely lead to a lot of disappointment. There are a couple of things that stood out and that would likely have improved their experience:

  • “Cloudwatch/monitoring was sufficient” – that was said with the caveat that it was enough for scaling decisions and that detailed monitoring would be more helpful. I would disagree there. Even in their own results they didn’t have enough visibility to figure out what was wrong with GC, because they couldn’t see inside the JVM. As far as scaling decisions go, it depends on the complexity of the application and the underlying architecture. If you can make the decision based simply on the CPU load of a given instance, then CloudWatch is great. However, in a lot of cases you need far more detail to understand which tier to scale and whether scaling is even going to help. Also, depending on the availability tolerance of your application, 5-minute intervals might not be good enough.
  • Log collection – that seems to be done in a pretty antiquated way; clearly it’s not real time, and it’s heavily dependent on local instance storage. Something like Graylog2/Logstash or Flume/Hadoop is far better.
  • Configuration management – they use a custom in-house solution with a naming database. That is usually very difficult to change for historical reasons, but something along the lines of Puppet/Chef/Salt will give better results. The process is also somewhat reversed, with each instance responsible for figuring out what it needs to be, though it’s arguable which approach is better.
  • Use of ELB – ELB is relatively cheap and pretty fast, but using something like HAProxy would give them far more granularity, visibility and better balancing overall.
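
To make the point about 5-minute intervals concrete, here is a minimal sketch in Python. The CPU readings, the 70% threshold and the `bucket_averages` helper are all made up for illustration; none of them come from the TripAdvisor post or from the CloudWatch API. It only shows how averaging over a long window can hide a short saturation that a finer window would catch.

```python
from statistics import mean


def bucket_averages(samples, bucket_size):
    """Average consecutive, non-overlapping buckets of `bucket_size` samples."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Hypothetical data: one CPU reading per second for five minutes,
# a 30% baseline with a 60-second saturation in the middle.
cpu = [30.0] * 120 + [100.0] * 60 + [30.0] * 120

THRESHOLD = 70.0  # assumed scale-out threshold, purely illustrative

five_min = bucket_averages(cpu, 300)  # one 5-minute data point
one_min = bucket_averages(cpu, 60)    # five 1-minute data points

print(five_min)  # [44.0]
print(one_min)   # [30.0, 30.0, 100.0, 30.0, 30.0]

# Would a simple threshold rule have fired at each granularity?
print(any(v > THRESHOLD for v in five_min))  # False
print(any(v > THRESHOLD for v in one_min))   # True
```

With these made-up numbers, the instance is pegged for a full minute, yet the 5-minute average sits at 44% and a threshold rule never fires. Whether that matters depends on how much a minute of degraded service costs you.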

In any case, it’s a worthy read if you’re considering AWS.