When you're in enterprise I.T. you care a lot about those words--sometimes too much.
Like most I.T. shops, we have a very complex environment: multiple hardware platforms, operating systems, databases, and programming languages. You'd expect a fair number of outages.
Some of our biggest outages this year, however, came as a result of (or were made worse by) our attempts to insulate ourselves from outages or to provide better scale:
- The network load balancer to our data center failed. Luckily we were redundant, right? Wrong. The redundant load-balancing element failed, too. BUT it didn't know it had failed, so the system thought it was operating when we were actually down. It took us 30 minutes to realize our applications were down because of the failed load balancer.
- We had an eerily similar situation happen with a database load-balancing solution.
- We use a storage area network (SAN) primarily to allow us to scale storage at a cheaper price. These SAN cabinets are big iron city. Guess where four of our biggest outages this last year happened. That's right: our SAN. We've since moved to a new vendor for storage.
So what do you do? Introducing additional protection introduces additional points of failure. Is it worth it?
What's your experience?
Strange - to my thinking, "additional protection" should involve reducing the number of points of failure: primarily eliminating single-point-of-failure (SPOF) elements, while always providing ways of maintaining overall system availability. Additionally, the fact that your monitoring tools give you faulty information when you encounter a double failure (the load-balancing examples) should be a giant red flag, IMHO.
Just my $0.02.
Redundancy is big in our IT "shop." I do incident tracking for our SAP systems and we began to notice a trend not only in SAP, but across our other IT services, of failed fail-overs. Right now we're in a study to see how to improve fail-over. But there are a lot of incidents where fail-over works and the end users never know ... so in my opinion it's worth it. The biggest cause of incidents and outages for us is process & procedural errors ... implementing changes. For what it's worth, the biggest outage that we've had (at least on my watch) was a SAN. We knew it wasn't reliable and we were planning to migrate off of it, but the weekend before the migration, one of the motherboards bombed and caused a 3-day outage for our dry goods tracking system. The lesson learned was spend the money on reliable systems. The bum SAN was the decision of a former manager who was trying to save a buck.
I think anyone creating a disaster recovery or redundancy plan needs to perform a thorough risk analysis. Sometimes the added complexity isn't worth the risk. The question should be asked, "Is the cost of an outage greater than the cost of implementing complex redundant systems?" I think there is a balance that can be met but it takes thought and experience.
One of the things not mentioned is failover testing. We used to have issues with failovers not happening as planned for a variety of reasons. Now our operations team regularly tests all redundant circuits and reports any issues they find. This not only gets the redundant paths/equipment tested, it also gets the Ops folks much smarter on the network topology and more familiar with the different methods of failover from a network perspective.
Time intensive? Possibly, but I'd rather spend 30 minutes in a dark window troubleshooting issues than during the production day with everyone clamoring for network status. It's an investment in quality sleep time... :)
In my former life (before retirement) I consulted with IT departments of large companies on "high availability." Once upper management agreed on the need to investigate, one of my approaches, early on, was to determine what an outage really cost them. It involved determining tangible (measurable) things such as order entry people trying to take orders without computer assistance, warehouse people waiting for orders to pick, loading dock people sitting around waiting for product to ship, and so on. The intangible consequences were the number of unhappy customers calling a competitor because they couldn't place an order (this was especially true in a high-volume "mushroom company" I dealt with... a highly perishable product in a highly competitive business).
For those who tried to get by "on the cheap," I included a section on total cost of ownership (TCO), which included not only purchase/lease costs of equipment but also maintenance costs, power consumption, air conditioning costs, and the cost of staff required to handle operational overhead.
I can recall one estimate I did where the money lost due to an outage was in the range of $35,000-$50,000 per hour. (Your mileage will vary.) It shocked management into considering redundant systems and into upgrading old, very power-hungry equipment to newer models with smaller footprints that ran cooler (less A/C) and more efficiently (less power), etc.
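A back-of-the-envelope sketch of that kind of tangible-cost math, in Python. Every number here is a made-up example for illustration, not a figure from the engagement described above:

```python
# Hypothetical inputs: (people idled, loaded cost per hour) for each group,
# plus an estimate of orders lost to competitors while the system is down.
idle_staff = {
    "order entry clerks": (12, 28.0),
    "warehouse pickers":  (20, 22.0),
    "loading dock crew":  (8,  25.0),
}
lost_orders_per_hour = 40
avg_order_margin = 900.0  # gross margin lost on each order that goes to a competitor

labor_cost_per_hour = sum(count * rate for count, rate in idle_staff.values())
lost_margin_per_hour = lost_orders_per_hour * avg_order_margin

print(f"Idle labor:        ${labor_cost_per_hour:,.0f}/hour")
print(f"Lost order margin: ${lost_margin_per_hour:,.0f}/hour")
print(f"Rough total:       ${labor_cost_per_hour + lost_margin_per_hour:,.0f}/hour")
```

Even with modest assumptions, the lost-order side of the ledger tends to dwarf the idle-labor side, which is usually what gets management's attention.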
Running a server "farm" can be expensive, especially if the staff needed to maintain and service the center keeps increasing as you add servers. Labor costs are an ongoing, expensive liability. (Nothing new in that statement.)
SANs are a great investment, but the more disk arms involved, the greater the chance of a failure. Some sort of RAID implementation is essential. RAID 5 (striping with distributed parity) is a good compromise of cost vs. recovery. RAID 1 (mirroring) is more expensive, but recovery time is dramatically reduced.
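To make the cost-vs.-recovery tradeoff concrete, here is a quick capacity comparison. The disk count and size are hypothetical example values, not a recommendation:

```python
# Compare usable capacity for the RAID levels mentioned above,
# using a hypothetical shelf of 8 x 2 TB disks.
disks = 8
disk_tb = 2.0

raw_tb = disks * disk_tb
raid5_usable = (disks - 1) * disk_tb   # one disk's worth of capacity goes to parity
raid1_usable = (disks // 2) * disk_tb  # mirrored pairs: half the raw space

print(f"Raw capacity:   {raw_tb:.0f} TB")
print(f"RAID 5 usable:  {raid5_usable:.0f} TB  (cheaper per usable TB, slower rebuilds)")
print(f"RAID 1 usable:  {raid1_usable:.0f} TB  (half the raw space, but much faster recovery)")
```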
Many books have been written on this subject and many ideas can be brought forward for consideration. But it boils down to this: "What is my business worth, and how much 'insurance' (i.e., redundancy, backup, fail-over systems, etc.) can I afford to keep it running and still provide the level of service I promised my customers/clients?" ---which leads to the next topic: Service Level Agreements (SLAs)... for another time.
These types of experiences raise the issue of whether IT is focused on managing "systems" or "services." We IT types have historically focused on the performance of specific system components (servers, network, storage, etc.). Our customers, on the other hand, are really only interested in the services we provide them (messaging, accounting records, document storage and retrieval). There are solutions now appearing in the market that are supposed to monitor "service availability" from the customer perspective. Perhaps something like that would guard against multiple failures going undetected, as Joel described. (Should we assume that some users noticed there was a problem with their "service," before IT detected the problem with their "system"?)
This phenomenon is actually quite common. Indeed, adding redundant components frequently increases the risk of catastrophic failure.
I recommend reading a treatise called "The Problem of Redundancy Problem" by Scott Sagan. You'll find out what jumbo jets, an erstwhile prime minister of India, and your IT systems have in common.
http://iis-db.stanford.edu/pubs/20274/Redundancy_Risk_Analysis.pdf
In my experience computer hardware is like race car hardware - the harder you push it, the faster it breaks (specifically disks). My solution to reliability is to over-provision hardware so it rarely exceeds 50% utilization. Kind of like running an F1 car at a NASCAR race. If disks are reading and writing at capacity 24/7, they will wear out much faster than if they run at less than 50% capacity. For disk reliability and scalability I prefer RAID 10 with disks from different manufacturers on each RAID 0 set.
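A minimal, Linux-only sketch of how you might watch that "stay under ~50% busy" rule of thumb, by sampling /proc/diskstats. The device name and sample interval are just example values:

```python
# Sample the "time spent doing I/Os" counter for a disk twice and report
# what percentage of the interval the device was busy.
import time

def io_busy_ms(device):
    """Milliseconds the device has spent doing I/O, from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])  # field: time spent doing I/Os (ms)
    raise ValueError(f"device {device!r} not found")

def busy_percent(device, interval=5.0):
    start = io_busy_ms(device)
    time.sleep(interval)
    end = io_busy_ms(device)
    return 100.0 * (end - start) / (interval * 1000.0)

if __name__ == "__main__":
    device = "sda"  # example device name
    print(f"{device} was ~{busy_percent(device):.1f}% busy over the sample window")
```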
I agree with Phillip Cox that fail-over testing is essential. It is our policy to use our fail-over systems during all system upgrades. We fail over, upgrade the primary system, return the primary system to production, then upgrade the fail-over system. You kill two birds with one stone by testing fail-over and upgrading in the same process. This basically trains the staff to treat a system failure like a normal upgrade (which would have to occur anyway to fix whatever broke).
People. Process. Technology.
We've all heard it before - I think - yet we often fail to put systems and methodologies in place that actually monitor all three. Sometimes it isn't about adding more protection and complexity; it's about simplifying, and watching what already exists more closely or more effectively, so you can spot symptoms and recover more quickly.
You also have to be willing to invest in all three. Skimping on one usually means you'll be reinvesting in one of the other areas - skimp a few dollars on the technology and you'll likely invest in people and process to keep it in play. Neglect the growth of your staff and there's a good chance you'll be adding process or technology to compensate. Keep in mind that symptoms don't always point you to the area that is the REAL cause.
So what do you do to break the cycle? Take steps to understand where you are today. Are your existing processes effective in supporting your business objectives? Are your people able to use them effectively in their roles? Examine the source of your outages - I mean the REAL source. Just like any self-critique, you're going to learn things you don't like and that aren't fun to admit, but it's the only way to really get on the path to improvement.
I don't want to ramble on here, so lastly I'll just say: don't be afraid to get some outside help. Being the "self-reliant" culture that we are, it's not something we're very good at doing, or at knowing when to do. That "objective" perspective can often provide valuable insight that can be learned in no other way.
I recently had a personal experience that taught me something about redundancy. I had an issue a while back when installing a beta version of a "newly released office system" *ahem*. During the install it asked me to reboot, and I blue-screened. I was extremely distraught. I had all our family photos, etc., on there and wasn't sure what to do. It just so happened that a month prior I had been testing a product that basically makes an ISO image of your system and restores it exactly to that point. I searched all over and found the test image I had created of my system. It was a lifesaver.
I have since come to realize that having things in place to recover your information is extremely important, even on a personal level.
I'm a Unix guy and, as you probably know, Unix/Linux abounds in tools.
One tool I always run on my box is gkrellm, which incidentally has a plug-in to keep track of running servers. If one fails (as in a server becomes unavailable), I'll see it on the screen, get a pop-up, or I can set it to send me an e-mail.
So far it has never failed me; I'm always the first one to know if something "dies in action" and can take countermeasures.
Sometimes the most effective tools are the simplest of them all ;-)
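Not gkrellm itself, but here is a minimal sketch of the same idea in Python: poll a few services over TCP and e-mail yourself when one stops answering. The host list and the localhost SMTP relay are assumptions for illustration:

```python
import smtplib
import socket
from email.message import EmailMessage

SERVICES = [("www.example.com", 80), ("db.example.com", 5432)]  # hypothetical hosts

def is_up(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def alert(host, port):
    """Send a simple e-mail through a local mail relay (assumed to exist)."""
    msg = EmailMessage()
    msg["Subject"] = f"ALERT: {host}:{port} is not responding"
    msg["From"] = "monitor@example.com"
    msg["To"] = "admin@example.com"
    msg.set_content(f"TCP connect to {host}:{port} failed.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    for host, port in SERVICES:
        if not is_up(host, port):
            alert(host, port)
```

Run from cron every few minutes and, like the gkrellm plug-in, it makes you one of the first to know when something dies.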
I've found that in our data center it is always the thing we least expect to fail that does. That means that while redundant systems are good to prevent against foreseeable problems, they won't protect you against the problems you didn't anticipate.
I recommend really good centralized monitoring software, so you can determine very quickly what the less-anticipated problems really are, fix them, and get back online as fast as possible.
You said: "It took us 30 minutes to realize our applications were down because of the failed load balancer." You could have shaved a bunch of minutes off your downtime if you had scripts or monitors watching your load balancer. This sort of solution is generally much more affordable than the "figure out whether the load balancer is robust enough and replace it with something more robust if it isn't" approach.
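A hedged sketch of that suggestion: a tiny watchdog that tests the application through the load balancer rather than trusting the balancer's own status. The URL and check interval are made-up examples:

```python
import time
import urllib.request
import urllib.error

VIP_URL = "http://app.example.com/healthcheck"  # hypothetical balanced endpoint

def vip_is_serving(url, timeout=5.0):
    """True if a request through the load balancer returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    while True:
        if not vip_is_serving(VIP_URL):
            # In real life, page someone or open a ticket here.
            print("Requests through the load balancer are failing -- check the balancer")
        time.sleep(60)
```

The point is that the check exercises the same path your users do, so a balancer that "thinks" it is healthy still gets caught within a minute or so.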
Testing! You can plan, prevent, and pretend to protect all day long! I've been working in HA environments for nearly 10 years now, and if there is one lesson I've learned, it's that you must test your failover/load-balancing mechanisms. You typically test this behavior at the time of implementation, but what about 60, 120, 240 days into it? Regular, scheduled failure testing is the only way I know to minimize (notice: not eliminate completely) the risk in your HA environment. The second thing I've learned is that people are the biggest problem. Your change management process needs to be disciplined, SAs need to trust and use it (one of the biggest challenges), and plans need to be made, tested, and followed. Most technical types, IMHO, are process-oriented, and if you can get them to trust the change process and follow it, that too goes a long way.
It certainly is worth it, if done right. The fact that it took 30 minutes to pin it on the load balancer leads me to question both the competence of staff and your monitoring capabilities, or lack thereof. But I digress:
* Written properly, monitoring scripts and scenarios can easily pinpoint the failure in your applications/systems. You need to monitor not only your balanced app, but also each individual app instance; doing so might have helped pinpoint the failure at the balancer (a rough sketch of the idea follows this comment).
* How is moving to a new vendor going to solve a poorly-architected system? Most SANs I've ever known of are crazily redundant out of the box. Either your datacenter was hit by a nuke, or a monkey architected your SAN, plain and simple.
[Joel: Turns out the vendor agreed the SANs were bad. We tried new drivers, new hardware components, etc., etc. Finally they tried to upsell us to a higher quality SAN at a deep discount.]
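Here is the rough sketch of the per-instance monitoring idea from the first bullet above. All hostnames are hypothetical: probe the balanced URL and each individual app instance, then reason about which layer failed.

```python
import urllib.request
import urllib.error

VIP = "http://app.example.com/healthcheck"              # through the balancer
INSTANCES = [
    "http://app1.internal.example.com/healthcheck",     # direct to each node
    "http://app2.internal.example.com/healthcheck",
]

def up(url, timeout=5.0):
    """True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    if up(VIP):
        print("Service is reachable through the balancer")
    elif any(up(u) for u in INSTANCES):
        print("App instances answer directly but the VIP does not: suspect the load balancer")
    else:
        print("Nothing answers: suspect the app tier, the network, or something upstream of both")
```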
The only thing worse than not having redundancy is thinking you have it and then relaxing.
Unless fail-over configurations are tested frequently they will not work when you need them.
>>Finally they tried to upsell us to a higher quality SAN at a deep discount.
You gotta love it when they agree their own product is junk ;) Hope everything worked out.