High Availability

To provide highly available networks, we need two things: standby hardware and good network design principles. No amount of duplicate hardware will help if the network doesn’t recover from a failure and you can’t get access to the wiring closet to bring up a cold spare. Equally, good network design cannot account for hardware failures or accidental cable disconnections unless you have duplicate hardware waiting in the wings for just such an occasion.

What we look for in good network design is a network that automatically detects problems and recovers from them as rapidly as possible. Most of us consider human intervention to mitigate a fault inelegant, although for now we humans are still required to actually fix these faults, even when our networks can work around them.

This series of articles is going to look at High Availability in five ways:

  • Local switching
    NIC bundling, Spanning Tree Protocol and LAN Best Practices
  • Local routing
  • Campus (between local buildings)
  • Wide (between cities)
  • Internet

Each of these approaches is related to the others, but for the most part each can be deployed without consideration for the others.

Before we get going, I’ll go over some key terms that are going to figure prominently in these articles.

Failure Mode

The failure mode is the state to which the system (which can include firewalls, routers, switches, servers and client hosts) falls back in the event of a network problem. This is probably the most important part of how we design networks, because we have to consider the state of things in the worst-case scenario.

It might be acceptable to your client if the network is still usable, only with reduced performance; or your client may require 100% performance 100% of the time; or your client may not want to pay any extra money at all, in which case you can define the processes for manual intervention to resolve network failures.

Over-subscription

The idea behind over-subscription is that not everyone will be using a resource at the same time. Instead of just adding up the requirements of every user and building a network from that, a network engineer can assume some fluctuations in demand and design a system that meets the requirements of the user base as a whole.
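To make the idea concrete, here is a minimal sketch of how an over-subscription ratio could be worked out for a single access switch. The function name and the port and uplink figures are hypothetical, chosen only for illustration; they are not a recommendation for any particular ratio.

```python
def oversubscription_ratio(port_count: int, port_speed_mbps: float,
                           uplink_mbps: float) -> float:
    """Return the worst-case edge demand divided by the uplink capacity."""
    if uplink_mbps <= 0:
        raise ValueError("uplink capacity must be positive")
    # Worst case: every edge port transmits at line rate at the same time.
    total_edge_demand = port_count * port_speed_mbps
    return total_edge_demand / uplink_mbps


# Hypothetical access switch: 48 ports at 1000 Mbps, two 10000 Mbps uplinks.
ratio = oversubscription_ratio(48, 1000, 2 * 10000)
print(f"{ratio:.1f}:1")  # prints "2.4:1"
```

A ratio above 1:1 means the uplinks cannot carry every port at full speed simultaneously, which is exactly the bet that over-subscription makes.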

Cold Spare

A cold spare is a duplicate device that is not powered up and sits on a shelf. An administrator must install the device, and possibly configure it, to replace a failed unit. As long as the spare is appropriately chosen, it may act as a spare for many other devices.

Warm Spare

A warm spare is a device that is powered up and configured, and while it is not currently in use, it is ready to come into use as soon as a failure is detected. This should not require human intervention, although performance may not be optimal, so an administrator will still have to review the fault and repair it.

Hot Spare

A hot spare is a device that is powered up, configured, and currently in use. An optimal design will see each hot spare run at no more than 50% of its capacity, so that in a failure mode the surviving device is asked to carry, at most, its maximum load. This should not require human intervention, although performance may not be optimal, so an administrator will still have to review the fault and repair it.
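The 50% guideline can be checked with some simple arithmetic. The sketch below assumes a pair of identical devices sharing load, and the function name and figures are hypothetical: if one device fails, the survivor inherits both loads, so the pair is only safe when the combined load fits within a single device's capacity.

```python
def survives_peer_failure(load_a: float, load_b: float,
                          capacity: float) -> bool:
    """Return True if one device can carry the load of both.

    load_a and load_b are the current loads on each device and capacity
    is the maximum load of a single device, all in the same unit.
    """
    return load_a + load_b <= capacity


# Hypothetical pair of devices, each rated for 10 Gbps.
print(survives_peer_failure(4.0, 5.0, 10.0))  # prints "True"
print(survives_peer_failure(6.0, 5.0, 10.0))  # prints "False"
```

In the second case one device is running above half its capacity, so a failure of either unit would leave the survivor overloaded.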

Redundant Hardware

I don’t like the word redundant in this context, because it has a negative connotation and it isn’t exactly appropriate.

One dictionary definition of redundant is “Exceeding what is natural or necessary”, so you could say that redundant hardware is more hardware than is necessary at a minimum for the network to run. That said, the bare minimum probably does not meet the availability needs of your client, so any extra hardware is not actually redundant, but necessary.

The phrase “redundant hardware” isn’t appropriate in some cases, because often we design duplicated, fault-tolerant hardware in which both devices are in use all the time — so neither device in that case is redundant.

So what is more appropriate? I’ve been using the phrases “duplicate hardware” and “fault-tolerant network designs”, but really there isn’t a better single word. Until someone coins a phrase that makes more sense, we can continue with this one, but make sure your clients understand what they’re buying into.
