(mis)Adventures with Spanning Tree

Introduction

I know I’ve gone on and on about STP (Spanning Tree Protocol), but something happened today that reminded me of what Tom Jacoby of IOSecure once said to me. I don’t remember the exact phrase, but the general idea was that STP is prone to both code errors and administrative errors, and that even when a properly configured STP setup works, not many people really understand it.

That said, I’ve put together some nice STP setups in my career. Compared to the more modern options we have for fault tolerance it is old tech, but there is something satisfying about building the configuration. Maybe it is because it is hard and complicated; I’m not even sure why I like it.

What is on my mind today is a client who is growing out of a bunch of 3500XLs into 2950 switches. Even the 2950s aren’t exactly new, but this is my most cost-sensitive client, so I’m a bit hamstrung on the hardware. The core is a 3750 (a stack of two switches) that provides routing for the network. Beyond that, a flat network is pushed out to 19 edge switches (3500XLs of varying code levels, plus the newer 2950s), with no trunking, just access ports to the edge switches. Everything runs STP with the 3750 stack as the root switch.

Now this isn’t a best-practice network by any means, but the customer has a legacy configuration to migrate from, and it may very well be with them for some time. Part of my job is to provide the guidance to move them towards a scalable, stable and supportable network.

We had some problems setting up LACP on the 3500XLs (they only support Cisco’s proprietary EtherChannel, and in a very feature-poor way), but it was working great on the 2950s. My client was so impressed by the idea of having redundancy for the edge switches that we created backup links for each edge switch (even on the 3500XLs) and let STP sort out the loops where we weren’t using EtherChannel. Some time passed, and things seemed to be working fine until, out of the blue, all the 2950s stopped working.

Problem

The 2950s just dropped link. My client would connect a different port and it would light up, and then drop the link again, with not even a light on the 3750 interface. I suspected some kind of security policy, but since nothing like that was configured it seemed like a long shot, so we agreed that my client would swap the 2950s back out for his stock of 3500XLs and figure out the 2950s later. Unfortunately the problems persisted through the next day, so we decided to figure out why the 2950s had failed.

The 3750 was reporting weird errors in its log:

Jul 4 22:40:31.438 UTC: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.c6ca.9aff in vlan 1 is flapping between port Po19 and port Fa1/0/6

What the heck? MAC address flapping? There must be a loop somewhere. But because the 3750 wasn’t experiencing any actual CPU pain I left this and focused on the 2950s to see if I could get them working and back into production to save the day.
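When a log fills up with these notifications, it helps to see which pairs of ports keep showing up together, since each pair points at the two ends of a possible loop. The following is a minimal Python sketch of that idea. It assumes the message wording is exactly as in the line quoted above (other IOS releases may phrase it differently), and the function names `parse_macflap` and `count_port_pairs` are my own, not anything Cisco provides.

```python
import re
from collections import Counter

# Assumption: the notification is worded as in the log line quoted in this
# post. Other platforms or IOS releases may use different wording.
MACFLAP_RE = re.compile(
    r"%SW_MATM-4-MACFLAP_NOTIF: Host (?P<mac>[0-9a-f.]+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<port_a>\S+) and port (?P<port_b>\S+)",
    re.IGNORECASE,
)


def parse_macflap(line):
    """Return (mac, vlan, port_a, port_b) for a MAC flap line, else None."""
    match = MACFLAP_RE.search(line)
    if match is None:
        return None
    return (
        match.group("mac"),
        int(match.group("vlan")),
        match.group("port_a"),
        match.group("port_b"),
    )


def count_port_pairs(lines):
    """Count flap notifications per unordered pair of ports."""
    pairs = Counter()
    for line in lines:
        parsed = parse_macflap(line)
        if parsed is not None:
            _mac, _vlan, port_a, port_b = parsed
            pairs[frozenset((port_a, port_b))] += 1
    return pairs


sample = (
    "Jul 4 22:40:31.438 UTC: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.c6ca.9aff "
    "in vlan 1 is flapping between port Po19 and port Fa1/0/6"
)
print(parse_macflap(sample))
```

In my case the pair would have been Po19 and Fa1/0/6, which is a hint about where on the core the two sides of the loop are attached.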

The 2950 console reported this:

%PM-4-ERR_DISABLE: loopback error detected on Gi4/1, putting Gi4/1 in err-disable state

Another weird error! An err-disable state, and yet we hadn’t configured any security policies on the interface. This error message led me to Cisco’s site and an explanation of how the 2950s handle L2 keepalives, which are enabled by default (and so don’t show up in the configuration). Cisco’s explanation is that the 2950s send out a keepalive to detect physical wiring loops: if a port receives its own keepalive back, the port is err-disabled. That will also trigger in the event of an STP loop anywhere on the network. You can disable the keepalive and get the switches to connect, but that doesn’t make the loop go away, so I set to work on finding it.
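Since an err-disabled port can have many causes, it is worth separating the loopback (keepalive) cases from everything else before chasing a loop. Here is a small Python sketch that does this for console output. It assumes messages of the general shape `%PM-4-ERR_DISABLE: <cause> error detected on <port>, ...`, as in the 2950 message above; the helper names are hypothetical.

```python
import re

# Assumption: err-disable messages follow the shape
#   %PM-4-ERR_DISABLE: <cause> error detected on <port>, putting ...
# which matches the 2950 console message discussed in this post.
ERRDISABLE_RE = re.compile(
    r"%PM-4-ERR_DISABLE: (?P<cause>\S+) error detected on (?P<port>\S+?),"
)


def parse_errdisable(line):
    """Return (cause, port) for an err-disable message, else None."""
    match = ERRDISABLE_RE.search(line)
    if match is None:
        return None
    return match.group("cause"), match.group("port")


def loopback_ports(lines):
    """Return the sorted set of ports err-disabled with cause 'loopback'."""
    ports = set()
    for line in lines:
        parsed = parse_errdisable(line)
        if parsed is not None and parsed[0] == "loopback":
            ports.add(parsed[1])
    return sorted(ports)


console = [
    "%PM-4-ERR_DISABLE: loopback error detected on Gi4/1, "
    "putting Gi4/1 in err-disable state",
]
print(loopback_ports(console))
```

Any port this reports went down because it saw its own keepalive come back, which means the fix is to find the loop, not to look for a security feature.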

Resolution

I went through each of the 3500XLs, and discovered two in which STP had not blocked the second link to the 3750. This is how I figured out which ones were the culprits:

  1. show cdp neighbors
  2. show spanning-tree (brief)

On the first switch, CDP showed the 3750 connected on ports 24 and 48, but STP didn’t list port 48 at all. A forwarding port should show up in the FORWARDING state, but since this one clearly wasn’t being blocked I had to assume it was forwarding anyway. On the second switch, CDP again showed the 3750 connected on ports 24 and 48, and this time STP actually showed both ports in the FORWARDING state.
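The check I was doing by eye boils down to this: on an edge switch with two separate (non-EtherChannel) links to the root, only one of them should be anything other than blocking. Below is a rough Python sketch of that rule. It assumes the CDP and STP output has already been reduced to simple Python structures by hand or by some other tool. The port names `Fa0/24` and `Fa0/48` and the function name `unblocked_uplinks` are made up for illustration.

```python
def unblocked_uplinks(cdp_uplinks, stp_states):
    """Flag redundant uplinks to the root that STP is not blocking.

    cdp_uplinks: ports on which CDP sees the core switch.
    stp_states:  mapping of port -> STP state string (lowercase).

    Assumes the uplinks are separate links, not members of one EtherChannel.
    A port missing from stp_states is treated as "not blocking", as are the
    transient listening and learning states.
    Returns the non-blocking uplinks if there is more than one, else [].
    """
    not_blocking = [
        port for port in cdp_uplinks if stp_states.get(port) != "blocking"
    ]
    return not_blocking if len(not_blocking) > 1 else []


# First switch: STP does not list the second uplink at all.
print(unblocked_uplinks(["Fa0/24", "Fa0/48"], {"Fa0/24": "forwarding"}))

# Second switch: both uplinks are forwarding.
print(unblocked_uplinks(
    ["Fa0/24", "Fa0/48"],
    {"Fa0/24": "forwarding", "Fa0/48": "forwarding"},
))

# Healthy switch: one forwarding, one blocking, so nothing is flagged.
print(unblocked_uplinks(
    ["Fa0/24", "Fa0/48"],
    {"Fa0/24": "forwarding", "Fa0/48": "blocking"},
))
```

Treating a missing port the same as a forwarding one is a deliberate choice here: a redundant link that STP does not even report is exactly the case that bit me.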

To resolve the issue, I just killed the second link. As soon as I did that the MAC flap entries in the 3750 log stopped, and the 2950s were able to connect again.

Conclusion

The moral of this story is: don’t put all your eggs in the STP basket, especially with very old switches and very old STP code. Maybe a better moral is: don’t implement advanced features with old code. The 3500XLs do just fine at plain switching, but as soon as we asked them to do something out of the ordinary, two of them fell on their faces.