(mis)Adventures with Spanning Tree


I know I’ve gone on and on about STP (Spanning Tree Protocol), but something happened today that reminded me of what Tom Jacoby of IOSecure once said to me. I don’t remember the exact phrase, but the general idea was that STP is prone to code errors and administrative errors, and that even when a properly configured STP setup works, not many people really understand it.

That said, I’ve put together some nice STP setups in my career. Using it for fault tolerance is old tech compared to some of the modern methods we have, but there is something satisfying about building the configuration. Maybe it is because it is hard and complicated; I’m not even sure why I like it.

What is on my mind today is a client who is growing out of a bunch of 3500XLs into 2950 switches. Even the 2950s aren’t exactly new, but this is my most cost-sensitive client, so I’m a bit hamstrung on the hardware. The configuration has a 3750 (a stack of two switches) for the core that provides routing for the network, and a flat network beyond that is pushed to 19 edge switches (3500XLs of varying code levels, and the newer 2950s). There is no trunking, just access ports to the edge switches, and everything is running STP with the 3750 stack as the root switch.

Now this isn’t a best practices network by any means, but the customer has a legacy configuration to migrate from and it may very well be with them for some time. Part of my job is to provide the guidance to move towards a scalable, stable and supportable network.

We had some problems setting up LACP on the 3500XLs (they only supported Cisco’s proprietary EtherChannel, and in a very feature-poor way), but it was working great on the 2950s. My client was so impressed by the idea of having redundancy for the edge switches that we created backup links for each edge switch (even on the 3500XLs) and let STP sort out the loops where we weren’t using EtherChannel. Some time passed, and things seemed to be working fine until, out of the blue, all the 2950s stopped working.


The 2950s just dropped link. My client would connect a different port and it would light up, and then drop the link again — not even a light on the 3750 interface. I suspected some security profile issue, but as there wasn’t any of that configured it seemed like a long shot, so we agreed that my client would revert the 2950s to his stock of 3500XLs and figure out the 2950s later. Unfortunately the problems persisted through the next day so we decided to figure out why the 2950s failed.

The 3750 was reporting weird errors in its log:

Jul 4 22:40:31.438 UTC: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.c6ca.9aff in vlan 1 is flapping between port Po19 and port Fa1/0/6

What the heck? MAC address flapping? There must be a loop somewhere. But because the 3750 wasn’t experiencing any actual CPU pain I left this and focused on the 2950s to see if I could get them working and back into production to save the day.
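If the log fills up with these messages, it helps to see which ports keep turning up. Here is a minimal Python sketch that pulls the MAC, VLAN and the two ports out of a flap notification and counts how often each port appears. The regular expression is modelled only on the single line quoted above, so treat it as an assumption; other IOS versions may word the message differently, and the function names are my own.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this article (an assumption:
# other IOS releases may format the message differently).
MACFLAP_RE = re.compile(
    r"%SW_MATM-4-MACFLAP_NOTIF: Host (?P<mac>[0-9a-fA-F.]+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<port_a>\S+) and port (?P<port_b>\S+)"
)


def parse_macflap(line):
    """Return a dict with mac, vlan, port_a, port_b, or None if no match."""
    match = MACFLAP_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["vlan"] = int(record["vlan"])
    return record


def count_flapping_ports(lines):
    """Count how often each port is named in flap notifications."""
    counts = Counter()
    for line in lines:
        record = parse_macflap(line)
        if record:
            counts[record["port_a"]] += 1
            counts[record["port_b"]] += 1
    return counts


sample = (
    "Jul 4 22:40:31.438 UTC: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.c6ca.9aff "
    "in vlan 1 is flapping between port Po19 and port Fa1/0/6"
)
print(parse_macflap(sample))
```

A port that shows up in nearly every notification is a good place to start looking for the loop.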

The 2950 console reported this:


Another weird error! Err-disable state, and yet we haven’t configured any security policies on the interface. This error message led me to Cisco’s site and an explanation of how the 2950s handle L2 keepalives, which are enabled by default (and so don’t show up in the configuration). Cisco’s explanation is that the 2950s send out a keepalive to detect physical wiring loops, but that it will also trigger in the event of an STP loop anywhere on the network. You can disable the keepalive and get the switches to connect, but that doesn’t make the loop go away, so I set to work on finding it.


I went through each of the 3500XLs, and discovered two in which STP had not blocked the second link to the 3750. This is how I figured out which ones were the culprits:

  1. show cdp neighbors
  2. show spanning-tree (brief)

In the first switch, CDP showed the 3750 connected on ports 24 and 48, but STP didn’t list port 48 at all, so I had to assume that it was forwarding, even though a forwarding port should be listed in the FORWARDING state. In the second switch, CDP showed the 3750 connected on ports 24 and 48, but this time STP actually showed both ports in the FORWARDING state.
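The comparison I did by eye can be sketched in a few lines of Python. This assumes you have already collected, per edge switch, the local ports on which CDP sees the root switch and the STP state of each port; the data structures and function names here are hypothetical, and I have deliberately not tried to parse real command output, since it differs between code levels. It also only applies to uplinks that are not bundled into an EtherChannel.

```python
def suspect_uplinks(cdp_ports_to_root, stp_states):
    """Classify the uplinks that CDP says lead to the root switch.

    cdp_ports_to_root: local ports on which CDP sees the root switch.
    stp_states: mapping of local port -> STP state string as reported.
    Returns (forwarding, missing): ports in the FORWARDING state, and
    ports that STP does not list at all.
    """
    forwarding = [
        port for port in cdp_ports_to_root
        if stp_states.get(port, "").upper() == "FORWARDING"
    ]
    missing = [port for port in cdp_ports_to_root if port not in stp_states]
    return forwarding, missing


def looks_like_loop(cdp_ports_to_root, stp_states):
    """True when more than one uplink to the root is not provably blocked."""
    forwarding, missing = suspect_uplinks(cdp_ports_to_root, stp_states)
    return len(forwarding) + len(missing) > 1


# The two misbehaving switches described above, plus a healthy one.
first = looks_like_loop(["24", "48"], {"24": "FORWARDING"})
second = looks_like_loop(["24", "48"], {"24": "FORWARDING", "48": "FORWARDING"})
healthy = looks_like_loop(["24", "48"], {"24": "FORWARDING", "48": "BLOCKING"})
print(first, second, healthy)
```

Treating a port that STP does not list as suspect is a judgment call; it matches the assumption I made on the first switch.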

To resolve the issue, I just killed the second link. As soon as I did that the MAC flap entries in the 3750 log stopped, and the 2950s were able to connect again.


The moral of this story is don’t put all your eggs in the STP basket — especially with very old switches and very old STP code. Maybe a better moral is don’t implement advanced features with old code — the 3500XLs are doing just fine switching but as soon as we asked them to do something out of the ordinary two of them fell on their faces.

High Availability — LAN — NIC Bundling

The parent article on High Availability.

Switching on a LAN provides some of the most basic network connectivity options, and these are often overlooked. Most switches (Cisco, HP, Dell and others) support these configurations, but you will find limitations on pretty much every platform. If you’re after interoperability, do your testing so you understand these limitations.

Bundle Network Links

We want to bundle network links for two reasons: to aggregate bandwidth (two links give twice the aggregate packet-passing capacity) and for failover (if one link fails, a second is still running).


I discussed LACP in an earlier article, but I would like to go into a little more detail here. Make sure you review Cisco’s documentation on configuring LACP, and the Wikipedia article on link aggregation.

In my experience, LACP is the best solution for link aggregation. It is a common protocol, so interoperability between devices is almost always possible, and the configuration is sensible enough that you can explain it to a layperson.

In the example above, we have bundled two physical links into a single logical link between two switches.

LACP Virtual Adaptors

When we bundle network links with LACP, each host creates a virtual adaptor that represents the bundle. For example, on a Cisco switch we can create an interface called port-channel 1, which represents the two interfaces fastethernet0/1 and fastethernet0/2.

In this case, instead of making changes to or examining the configurations of the physical interfaces, we can work with port-channel 1. Of course you can work with the physical interfaces, but you must take care that all parameters match on all physical interfaces in the bundle.
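Checking that the members of a bundle agree is easy to get wrong by eye. As a rough sketch, assuming you have each member interface's settings in a dictionary (the setting names and values below are made up for illustration, not taken from any real switch), this Python function reports every setting that differs between members:

```python
def mismatched_settings(members):
    """Return {setting: {interface: value}} for settings that differ.

    members: mapping of interface name -> dict of setting name -> value.
    A setting absent on one interface counts as a mismatch (value None).
    """
    all_keys = set().union(*(cfg.keys() for cfg in members.values()))
    diffs = {}
    for key in sorted(all_keys):
        values = {name: cfg.get(key) for name, cfg in members.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs


# Hypothetical settings for the two bundled interfaces.
bundle = {
    "fastethernet0/1": {"speed": "100", "duplex": "full", "vlan": 1},
    "fastethernet0/2": {"speed": "100", "duplex": "half", "vlan": 1},
}
print(mismatched_settings(bundle))
```

An empty result means the members agree on every setting you supplied; it says nothing about settings you did not collect.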

LACP Load-balancing Flows

LACP is flow-aware, and it can be configured to load-balance based on MAC address or IP address; the default in Cisco switches is to load-balance based on MAC address.

Load-balancing only really works when the system is able to identify many unique flows; as each flow is established it is put on one of the bundled links and all subsequent traffic also follows that physical link.

Be aware that load-balancing based on MAC address (the default behaviour) may not be what you want: if your traffic crosses a router, the original source MAC address will be obscured.

In a routed environment (if you’re using VLANs) you will find that any traffic that crosses a routing boundary will have its source MAC address replaced by the router MAC address. This can make many hosts appear (to LACP) as if they’re coming from a single MAC address, which will skew the load-balancing calculations. A better approach is to use IP-based LACP load-balancing, as each host will likely have a unique IP address.
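To make the skew concrete, here is a toy Python illustration. The hash (low-order byte of the address, modulo the number of links) is my own stand-in and not the algorithm any real switch uses, and the router MAC and host addresses are invented. Four hosts sit behind a router, so every frame reaches the switch with the router's source MAC.

```python
def pick_link(address_bytes, num_links):
    """Toy hash: low-order byte of the address modulo the link count.

    Illustration only; real switches use their own hashing.
    """
    return address_bytes[-1] % num_links


def mac_to_bytes(mac):
    """Convert a MAC like 0000.0c12.3456 or 00:00:0c:12:34:56 to bytes."""
    return bytes.fromhex(mac.replace(".", "").replace(":", ""))


def ip_to_bytes(ip):
    """Convert a dotted-quad IPv4 address to bytes."""
    return bytes(int(octet) for octet in ip.split("."))


# Hypothetical values: one router MAC in front of four distinct hosts.
ROUTER_MAC = "0000.0c12.3456"
hosts = ["10.0.1.10", "10.0.1.11", "10.0.1.12", "10.0.1.13"]

# Every flow carries the router's MAC, so MAC hashing always picks one link.
links_by_mac = [pick_link(mac_to_bytes(ROUTER_MAC), 2) for _ in hosts]
# Hashing on the source IP sees four different addresses.
links_by_ip = [pick_link(ip_to_bytes(ip), 2) for ip in hosts]

print("by source MAC:", links_by_mac)  # [0, 0, 0, 0]
print("by source IP: ", links_by_ip)   # [0, 1, 0, 1]
```

With MAC hashing all four flows land on link 0 and the second link sits idle; with IP hashing the same flows spread across both links.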

A good rule of thumb is to use MAC address load-balancing if you’ve got a flat Layer 2 network. It is easier for the switch to identify MAC addresses and in a network like this, every flow should be coming from a unique MAC address.

Server Based Failover and Load Balancing

Some NIC manufacturers have provided software to accommodate NIC failover and load-balancing without using LACP. See HP’s document describing the options they offer.

These configurations do work; however, they are complex (in terms of traffic flow) and are therefore harder to troubleshoot in the event of network problems. Use LACP where possible, and these server-based methods where necessary.