Category: IT

IT at Intuit

This is doing it right. The CIO establishes the culture, the processes are developed, and the result is great: small cross-functional teams that are able to deliver results and iterate quickly. This overlaps with a lot of DevOps concepts, like agility, breaking down silos, etc. It’s a great example of how this can work well even when applied to enterprise IT.

http://www.computerworld.com/s/article/9232594/Intuit_forces_IT_engineers_into_room_until_they_get_it_right?taxonomyId=237&pageNumber=1

 

CERN & Puppet

A presentation from CERN at PuppetConf. Some very interesting items in there:

  • I was somewhat surprised at how much diversity they have. I thought they ran what is effectively a grid compute network with identical nodes.
  • They operate at huge scale. It requires a completely different way of thinking about power and data and resource management.
  • “Evaluate solutions, identify functional gaps and challenge them” – a very succinct way to describe a core IT function.
  • I like the analogy of thinking of your machines as pets and cattle. You care for your pets, but you shoot your cattle if something is wrong. Your infrastructure should be made out of “cattle”.
  • Their tool chain (Puppet/Foreman/OpenStack/MCollective/Bamboo/Git) is accessible to anyone and they understand the value of an active community.
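
The pets-versus-cattle idea from the list above can be sketched in a few lines of Python. This is only an illustration and not CERN's tooling: `Node`, `make_provisioner` and `reconcile` are hypothetical names, and a real setup would call out to tools like Puppet and OpenStack instead of building objects in memory.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass(frozen=True)
class Node:
    name: str
    healthy: bool


def make_provisioner(role: str) -> Callable[[], Node]:
    """Return a factory that builds fresh, identically configured nodes (hypothetical)."""
    serial = count(1)

    def provision() -> Node:
        return Node(name=f"{role}-{next(serial):04d}", healthy=True)

    return provision


def reconcile(fleet: List[Node], provision: Callable[[], Node]) -> List[Node]:
    """Replace every unhealthy node with a new one; never repair in place."""
    return [node if node.healthy else provision() for node in fleet]


fleet = [Node("web-a", healthy=True), Node("web-b", healthy=False)]
fleet = reconcile(fleet, make_provisioner("web"))
print([node.name for node in fleet])  # ['web-a', 'web-0001']
```

The broken node is never debugged or patched. It is dropped and a freshly provisioned one takes its place, which only works if every node can be rebuilt from configuration alone.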

The overview is here. If you want to skip the CERN background, the technical part of the talk starts at around the 11:00 mark. This follow-up talk gets into more technical detail about their Puppet use.

 

Cloud Load Balancing, Part II

I wrote a post yesterday somewhat criticizing a statement by Radware. There was a very legitimate comment by the author asking about details, which I didn’t provide. This response is in a separate post, because it will get long. With the caveat that my knowledge of their product is minimal, here are the details:

  • “key differences between a shared, cloud load balancer instance – offered by virtually all cloud providers (i.e. Amazon ELB, Rackspace CLB)” – that statement is misleading at best. AFAIK, Amazon hasn’t disclosed the architecture of their ELBs. It’s quite possible that they are shared, but that can’t be claimed for certain. In any case, it’s irrelevant. What matters is performance and features. ELB features are bare bones, whereas the performance is debatable and any argument in that regard should be driven by tests and data.
  • “When a load balancer fails, a new one with an identical configuration takes over” – that really depends on the distinction between DR and high availability. Amazon doesn’t provide SLAs for their ELBs, but you could run a single ELB across multiple availability zones, run multiple ELBs, or run ELBs in different regions. In these cases failover would typically be handled by DNS, either by distributing multiple A records initially or by updating DNS when a failure occurs (albeit that depends on your RTO tolerance). There are other more obscure DNS methods as well. If you’re going with an HAProxy approach, then your failover method likely includes a monitoring daemon (for logs, service state, etc.) that kicks off API calls including, at a minimum, DisassociateAddress and AssociateAddress.
  • “a failure induced by any tenant can cause a broader failure impacting multiple tenants, (i.e Amazon ELB failure June-29th, 2012)” – In theory this might be true, but in practice it’s false. In almost any major cloud you’re in a shared environment. Typically another tenant can affect your workload, but if someone’s failure can impact another tenant, then it’s a security breach of gigantic proportions. More importantly: Amazon’s ELB failure on June 29th had nothing to do with shared tenants whatsoever. At least as far as ELB is concerned, it’s a false statement. It was a bug in AWS and that can happen with anyone’s offering.
  • “the need to redesign the application due to lack of advanced…functionality” – compared to ELB, that’s true (if you need these features). However, as with most AWS functions, they don’t claim more than what they do. If you use HAProxy, nginx or another load balancer you’ll get all of this functionality and more. If you’re willing to pay the price you could even run a NetScaler.
  • “lack of control over the load balancer performance and capacity” – needs proof with tests/data. Again, with something like HAProxy you have full control, though your performance may be affected.
  • “inability to define custom health monitoring” – with ELB the functionality is limited, though you could point the check at an HTTP page that runs a custom-written, more sophisticated health check. That does require more work. Again, I might sound like a broken record, but HAProxy and others will load balance whatever you want with very complex checks.
  • “inability to load balance and optimize application delivery across multiple data centers” – this goes back to an old debate about GSLB and all the issues associated with it. However, that is largely true. Balancing across regions can get complicated, but in my experience it’s driven more by the application data model than by load balancing itself.
  • “The ADC’s enterprise features alleviate all the shortcomings of cloud based load balancer” – honestly, I am not even sure what to say here. FUD * Marketing-talk.
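
The HAProxy failover step mentioned in the list above (a monitoring daemon that calls DisassociateAddress and AssociateAddress) could look roughly like the sketch below. It makes several assumptions: `ec2` is a boto3-style EC2 client (for example `boto3.client("ec2")`), the Elastic IP is identified by a VPC-style allocation ID, and `is_healthy` is a hypothetical probe you would write yourself. It is a minimal outline, not a production daemon.

```python
import time
from typing import Callable, Optional


def fail_over(ec2, allocation_id: str, standby_instance_id: str) -> str:
    """Move an Elastic IP to the standby instance and return the new association ID."""
    addresses = ec2.describe_addresses(AllocationIds=[allocation_id])["Addresses"]
    current = addresses[0].get("AssociationId") if addresses else None
    if current:
        # DisassociateAddress: detach the address from the failed load balancer.
        ec2.disassociate_address(AssociationId=current)
    # AssociateAddress: attach the same address to the standby load balancer.
    result = ec2.associate_address(
        AllocationId=allocation_id, InstanceId=standby_instance_id
    )
    return result["AssociationId"]


def check_and_fail_over(
    ec2,
    is_healthy: Callable[[], bool],
    allocation_id: str,
    standby_instance_id: str,
    failures_needed: int = 3,
    interval: float = 0.0,
) -> Optional[str]:
    """Probe the active balancer; fail over only after consecutive failed probes.

    Returns None if any probe succeeds, otherwise the new association ID.
    """
    for _ in range(failures_needed):
        if is_healthy():
            return None
        time.sleep(interval)  # assumed pause between probes
    return fail_over(ec2, allocation_id, standby_instance_id)
```

Requiring several failed probes in a row is a design choice to avoid flapping on a single dropped check. The number of probes and the pause between them set a floor on how quickly traffic moves, so they should be chosen with your RTO in mind.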

To be perfectly fair here, I don’t want to come off as a staunch defender of AWS. It has a number of significant shortcomings, some of which I’ve written about before. By no means is it perfect for everyone, and anyone considering deploying a significant presence there (or with any other cloud provider) should do their own research. Having said that, there are enough legitimate criticisms that there is no need to resort to FUD. I haven’t touched a Radware product in close to a decade, but I do recall that I had good impressions. You could take an Alteon VA and run it in a colo, in your datacenter or on a virtualized platform and load balance your cloud presence. That’s a valid approach and may work for a lot of people. However, my guess is that most customers would probably be better off with some detailed technical analysis, performance tests and data, and some well-thought-out technical whitepapers and diagrams.

 

Cloud Load Balancing

A complete “straw man” argument made on Radware’s blog. I understand they have to promote and sell their product, but it should be able to (and can) stand on its own merits. The author is talking about ELB and its purported failures, but there are plenty of ELB alternatives (like HAProxy), not to mention multiple strategies for avoiding a single point of failure at the LB.

TripAdvisor’s architecture

A long and very informative post about what TripAdvisor found out when they tested AWS for their infrastructure. There are a lot of interesting tidbits in there, some of which are hard to analyze without seeing precise numbers. What I find interesting is that they essentially ported their existing datacenter setup to AWS. Granted, their stated goals were to look at cost/performance and not to change the operational model. However, in my experience with AWS, simply reusing your datacenter architecture isn’t sufficient and will likely lead to a lot of disappointments. There are a couple of things that stood out that would have likely improved their experience:

  • “Cloudwatch/monitoring was sufficient” – that was said with a caveat that it was enough for scaling decisions and detailed monitoring would be more helpful. I would disagree there. Even in their results they didn’t have enough visibility to figure out what was wrong with GC, because they couldn’t see inside the JVM. As far as scaling decisions go, it depends on the complexity of the application and underlying architecture. If you can make the decision simply based on the CPU load of a given instance, then CloudWatch is great. However, in a lot of cases you need far more detail to understand which tier to scale and if that’s even going to help. Also, depending on the availability tolerance of your application, 5-minute intervals might not be good enough.
  • Log collection – that seems to be done in a pretty antiquated way; clearly it’s not real time and it’s heavily dependent on local instance storage. Something like Graylog2/Logstash or Flume/Hadoop is far better.
  • Configuration management – they use a custom in-house solution with a naming database. That is usually very difficult to change for historical reasons, but something along the lines of Puppet/Chef/Salt will give better results. The process is somewhat reversed too, with an instance responsible for figuring out what it needs to be, though it’s arguable which is the better approach.
  • Use of ELB – ELB is relatively cheap and pretty fast. Using something like HAProxy would give them far more granularity, visibility and better balancing overall.
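
To make the point about 5-minute intervals concrete, here is a small sketch with made-up numbers. The CPU samples and the 70% scaling threshold are assumptions for illustration, not TripAdvisor's data. It averages the same samples into one 5-minute datapoint and into five 1-minute datapoints.

```python
from statistics import mean
from typing import List


def bucket_averages(samples: List[float], samples_per_bucket: int) -> List[float]:
    """Average consecutive samples into fixed-size buckets, as a coarse monitor would."""
    return [
        mean(samples[i:i + samples_per_bucket])
        for i in range(0, len(samples), samples_per_bucket)
    ]


# Hypothetical CPU utilisation, one sample every 10 seconds for 5 minutes:
# 20% throughout, except for a 60-second burst pinned at 100%.
cpu = [20.0] * 12 + [100.0] * 6 + [20.0] * 12

THRESHOLD = 70.0  # assumed scale-up trigger

five_minute = bucket_averages(cpu, 30)  # one datapoint per 5 minutes
one_minute = bucket_averages(cpu, 6)    # one datapoint per minute

print(five_minute)  # [36.0]
print(one_minute)   # [20.0, 20.0, 100.0, 20.0, 20.0]
print(max(five_minute) > THRESHOLD, max(one_minute) > THRESHOLD)  # False True
```

A full minute of saturation shows up as a harmless 36% in the 5-minute view, so a scaling rule keyed on that datapoint never fires.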

In any case, it’s worth a read if you’re considering AWS.