Wireless 802.1x Machine Authentication

Last week a client of mine asked me to put together an interesting wireless configuration. The requirement was 802.1x wireless authentication, but when we first set it up they found that if a technician stepped over to someone else's (wirelessly connected) laptop, they could not log in because their credentials were not cached on that system. Only by connecting the computer to a wired port, so the OS could talk to the AD server, were they finally able to log in.

They asked me if we could set up the laptops to maintain a wireless connection using machine authentication at the login screen, and switch to user authentication when somebody logs in. This would let them control who has access to the wireless network. I have to admit I was initially skeptical that this would work, but despite the complexity of the configuration I managed to do it.

Before I go into the details, here are the caveats as I see them.

  1. Because laptops will connect to wireless even when nobody is logged in, they will be using their radio all the time. This could drain a laptop's battery more quickly than might otherwise be expected.
  2. There is always a security risk with an always-connected computer, and the risk is greater if that network is wireless. We are currently pretty safe with WPA2, but this situation could change.
  3. Lastly, there are some fairly in-depth Windows configurations going on here. I'm not an expert in Windows systems, and the changes I describe below were made by hand. I expect you will want to script them into Group Policy instead of visiting every laptop in your domain one by one.

First, get 802.1x user-level authentication working.

The configuration on the wireless controller is pretty trivial: configure a RADIUS server and set the WLAN to WPA/WPA2 with 802.1x authentication.

The first policy I created in Microsoft IAS (the RADIUS server) restricted authentication to the Domain Computers and Wireless Users groups. Next I created a second RADIUS policy on IAS to restrict authentication to the Domain Computers group only.

You need to configure a CA on your domain. My client used their domain controller, but there might be a better way to do this. From the CA we exported a certificate and loaded it onto the laptop with a USB key; you could probably do this via GPO instead. Importing the certificate posed some problems, because the default behaviour is to store the certificate in the user's profile. This meant the systems could connect to wireless while a user was logged in, but not after a reboot. So when installing the certificate, we placed it in the computer's store manually:

  • Install the CA server certificate on the laptop.
  • Do NOT let the import wizard select the certificate store automatically.
  • Choose "Place all certificates in the following store" and tick the checkbox to show physical stores.
  • Select Trusted Root Certification Authorities\Local Computer.

Next we had to configure the laptops to connect to wireless even when nobody was logged into the computer.  To do this, I used this MS Support article: http://support.microsoft.com/kb/309448

  • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\EAPOL\Parameters\General\Global\AuthMode DWORD=1

Finally I configured the WLAN on the client to handle this configuration.

  • Association tab
      • Network authentication: WPA2
      • Data encryption: AES
  • Authentication tab
      • EAP type: PEAP
      • CHECK authenticate as computer
      • CHECK authenticate as guest
      • PEAP Properties
          • CHECK Validate server certificate
          • CHECK YOUR-CA-HERE in the list of Trusted Root Certification Authorities
          • Select authentication method: Secured password (EAP-MSCHAP v2)
          • CHECK Enable Fast Reconnect
          • EAP-MSCHAPv2 Properties
              • CHECK Automatically use my Windows logon name and password
  • Set the network to connect automatically

After all these steps, the laptops were connecting properly. We passed a laptop around the room and had people who had never touched it before log in with their AD credentials. If one of these users was not in the Wireless Users AD group, they were not permitted wireless access once logged in.

(mis)Adventures with Spanning Tree

Introduction

I know I've gone on and on about STP (Spanning Tree Protocol), but something happened today that reminded me of what Tom Jacoby of IOSecure once said to me. I don't remember the exact phrase. The general idea was that STP is prone to both code errors and administrative errors, and that even when a properly configured STP deployment works, not very many people really understand it.

That said, I've put together some nice STP setups in my career. Compared to some of the more modern methods we have, using STP for fault tolerance is old tech, but there is something satisfying about building the configuration. Maybe it is because it is hard and complicated; I'm not even sure why I like it.

What is on my mind today is a client who is growing out of a bunch of 3500XLs into 2950 switches. Even the 2950s aren't exactly new, but this is my most cost-sensitive client, so I'm a bit hamstrung on the hardware. The configuration has a 3750 (a stack of two switches) at the core, providing routing for the network. Beyond that, a flat network is pushed out to 19 edge switches (3500XLs of varying code levels, plus the newer 2950s). There is no trunking, just access ports to the edge switches, and everything is running STP with the 3750 stack as the root switch.

Now, this isn't a best-practices network by any means. The customer has a legacy configuration to migrate from, and it may very well be with them for some time. Part of my job is to provide guidance on moving towards a scalable, stable and supportable network.

We had some problems setting up LACP on the 3500XLs (they only support Cisco's proprietary EtherChannel, and in a very feature-poor way), but it was working great on the 2950s. My client was so impressed by the idea of having redundancy for the edge switches that we created backup links for each edge switch (even on the 3500XLs). We let STP sort out the loops where we weren't using EtherChannel. Some time passed, and things seemed to be working fine until, out of the blue, all the 2950s stopped working.

Problem

The 2950s just dropped link. My client would connect a different port and it would light up, then drop the link again, with not even a light on the 3750 interface. I suspected some security profile issue, but none was configured, so it seemed like a long shot. We agreed that my client would replace the 2950s with switches from his stock of 3500XLs and figure out the 2950s later. Unfortunately the problems persisted through the next day, so we decided to figure out why the 2950s failed.

The 3750 was reporting weird errors in its log:

Jul 4 22:40:31.438 UTC: %SW_MATM-4-MACFLAP_NOTIF: Host 0000.c6ca.9aff in vlan 1 is flapping between port Po19 and port Fa1/0/6

What the heck? MAC address flapping? There must be a loop somewhere. But because the 3750 wasn’t experiencing any actual CPU pain I left this and focused on the 2950s to see if I could get them working and back into production to save the day.

The 2950 console reported this:

```text
%PM-4-ERR_DISABLE: loopback error detected on Gi4/1, putting Gi4/1 in err-disable state
```

Another weird error! The port was in err-disable state, and yet we hadn't configured any security policies on the interface. This error message led me to Cisco's site and an explanation of how the 2950s handle L2 keepalives. Keepalives are on by default, so they don't show up in the configuration. According to Cisco, the 2950s send a keepalive to detect physical wiring loops, but it will also trigger in the event of an STP loop anywhere on the network. You can disable the keepalive and get the switches to connect, but that doesn't make the loop go away, so I set to work on finding it.
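Here is a minimal sketch of the checks and the stopgap on the 2950. GigabitEthernet0/1 is a hypothetical port name; use the one from your own log message. The sketch also assumes your IOS release accepts `no keepalive` on switch ports. Disabling the keepalive only hides the symptom, and the loop is still there.

```text
! Err-disabled ports show "err-disabled" in the Status column
show interfaces status
!
configure terminal
 interface GigabitEthernet0/1
  ! Stopgap only: stop the loopback keepalive from err-disabling this port
  no keepalive
  ! Bounce the port to bring it out of err-disable
  shutdown
  no shutdown
 end
```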

Resolution

I went through each of the 3500XLs, and discovered two in which STP had not blocked the second link to the 3750. This is how I figured out which ones were the culprits:

  1. show cdp neighbors
  2. show spanning-tree (brief)

On the first switch, CDP showed the 3750 connected on ports 24 and 48, but STP didn't show anything for port 48. I had to assume that port was forwarding too, even though STP should have listed it as FORWARDING in that case. On the second switch, CDP again showed the 3750 connected on ports 24 and 48, but this time STP actually showed both ports in the FORWARDING state.

To resolve the issue, I just killed the second link. As soon as I did that the MAC flap entries in the 3750 log stopped, and the 2950s were able to connect again.
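Here is a minimal sketch of shutting the redundant link on one of the misbehaving 3500XLs. It assumes a 48-port model where port 48 is FastEthernet0/48. The description text is only an illustration.

```text
! On the affected 3500XL: disable the backup uplink that STP failed to block
configure terminal
 interface FastEthernet0/48
  description Backup uplink to 3750 - shut, STP was not blocking it
  shutdown
 end
! Save the change so it survives a reload
copy running-config startup-config
```

Afterwards, the MACFLAP_NOTIF messages on the 3750 are a good indicator: if they stop, the loop is gone.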

Conclusion

The moral of this story is: don't put all your eggs in the STP basket, especially with very old switches and very old STP code. Maybe a better moral is: don't implement advanced features with old code. The 3500XLs are doing just fine switching, but as soon as we asked them to do something out of the ordinary, two of them fell on their faces.

Upgrading The IOS On A Switch (The Right Way)

Since I’ve made a couple of postings describing flash upgrade horror stories, I thought I’d include a description of how to do it the right way.

Selecting an image

First, verify your hardware using either the web interface or show version at the CLI.

  1. Go to cisco.com and log in with your CCO account.
  2. Click Support from the menu
  3. Click Download Software in the “select a task” window
  4. Choose Switches Software from the list
  5. Choose LAN Switches from the list
  6. Expand the Cisco Catalyst 2950 Series Switches group
  7. Choose the type of 2950 switch, in this example choose Cisco Catalyst 2950G 48 EI Switch
  8. Choose IOS Software from the list
  9. And select the software version you want

Some advice on choosing images:

  • Generally speaking, I'd say take the latest image. If your hardware is relatively old, the newest code is probably bugfixes only and will not be introducing new features (and their bugs). If this is newer hardware, I'd recommend you carefully read the release notes.
  • Never load deferred releases; these have serious bugs in stability or performance. If the code you're running right now is deferred, or isn't listed at all, then I'd say an upgrade to newer code is very important.
  • Choose an image that fits your flash resources. I like the C2950 EI AND SI IOS CRYPTO AND WEB BASED DEVICE MANAGER image because it supports SSH and has a web interface (SDM) built in. But the web SDM takes up flash space, so review your switch's flash resources to make sure it can handle the larger file.

Prepare the TFTP server with images

You will need a TFTP SERVER, not just the client. Check your documentation, but in Linux you usually have to modify /etc/default/tftpd to enable the tftpd service. SolarWinds makes a free Win32 TFTP server as well.

The tftpd root folder can be anywhere, but usually it is either /tftpboot or /var/lib/tftpboot; in Windows you can easily specify this folder to be wherever you want, but I’d recommend you create a tftpboot folder off the root of your drive. Put the tar files in the tftpd root folder.

It is a little easier to troubleshoot from a PC, so you can save some time by first testing whether you can download the image with a PC TFTP client. Windows ships with a command-line tftp client (on newer versions you may need to enable it as a Windows feature), so just log in and download a file to test. If you're running Linux, you might have to install the tftp client from your distribution's repos.

Loading the new image

Connect to the CLI, and go into enable mode. You can tell you’re in enable mode when you see the # in the prompt.

```text
User Access Verification
Password:
SW02>en
Password:
SW02#
```

Verify the space in the flash with dir. Chances are pretty good there is only room for one image at a time, but if you’re lucky you can fit two images.

```text
SW02#dir
Directory of flash:/
2   -rwx     3721946  Mar 07 1993 22:39:02 +00:00  c2950-i6k2l2q4-mz.121-22.EA13.bin
3   -rwx        2316  May 29 2008 20:32:10 +00:00  vlan.dat
4   -rwx         112  Mar 07 1993 22:36:10 +00:00  info
16  -rwx        2804  Mar 01 1993 00:08:25 +00:00  config.old
22  -rwx         333  Mar 07 1993 22:40:48 +00:00  env_vars
334 -rwx          24  Mar 01 1993 21:04:28 +00:00  private-config.text
6   drwx        4416  Mar 07 1993 22:40:04 +00:00  html
19  -rwx         112  Mar 07 1993 22:40:39 +00:00  info.ver
24  -rwx        2068  Mar 01 1993 21:04:28 +00:00  config.text
7741440 bytes total (2138624 bytes free)
```

If by some happy chance you've got 16 MB of flash or more, you probably have room for two images. Most likely that isn't the case, so you will have to erase the flash. You can go file by file with delete, but you're better off just recursively wiping it like this:

```text
SW02#delete /recursive flash:
```
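You may want to keep config.text, vlan.dat, env_vars and private-config.text in place. If so, here is a minimal sketch of the file-by-file alternative. It assumes the old image files are the ones in the dir listing above: the .bin file, the html directory, info and info.ver. Substitute whatever your own listing shows, and check `dir` afterwards to confirm what is left and how much space is free.

```text
! Delete only the old image and its web files; leave the configuration and VLAN files alone
SW02#delete /force flash:c2950-i6k2l2q4-mz.121-22.EA13.bin
SW02#delete /force /recursive flash:html
SW02#delete /force flash:info
SW02#delete /force flash:info.ver
! Check what remains and how much space is free
SW02#dir
```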

Give yourself some peace of mind and write the current running configuration back to flash so it isn't lost when you reboot. On the 2950 this step really matters. The startup configuration is stored in flash as config.text (you can see it in the dir listing above), so the recursive delete has just removed it.

```text
SW02#copy run start
```

Now the switch is ready to load the images from your TFTP server. Remember that TFTP runs over UDP and has no inherent error correction. Make sure the link between the TFTP server and the switch is not lossy (like wireless or the Internet); a server on the same IP subnet is a good solution.

You can load a binary IOS image by TFTP with copy tftp flash, then just follow the prompts. The problem is that the binary IOS image doesn't include the web interface files. You (probably) also don't have the space to copy the archive of binary plus web files and then extract it locally. Instead, Cisco provides a command that downloads the archive and extracts it on the fly:

```text
SW02#archive tar /xtract tftp://192.168.2.11/c2950-i6k2l2q4-tar.121-22.EA13.tar flash:
```

And now you watch the magic. Once the system is done downloading and extracting the images, you will reload and if all went well, you’ll be running on the new code.

NB: If you managed to squeeze two IOS binaries onto the flash, you will have to specify which one you want to boot from. This is not an issue if you erased the flash as described earlier.

  1. Figure out exactly which binary you want to boot from; use dir to list the files and look for those that end in “.bin”
  2. Make sure that there isn't already a boot variable set; sh run | i boot should list any entries. If they're in there, remove them. These boot parameters are processed in order, so make sure your new image is first in the config. You can add lines for any other images on the flash after your new image.
  3. Set the new boot image:
```text
SW02#conf t
SW02(config)#boot system flash:/c2950-i6k2l2q4-mz.121-22.EA13.bin
SW02(config)#end
```
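The check-and-clear from step 2, combined with step 3 and a final verification, might look like the sketch below. It assumes `show boot` displays the current BOOT path on your 2950, and that `no boot system` with no arguments clears every existing entry. Check the second `show boot` before you reload.

```text
! Step 2: list any existing boot entries
SW02#show boot
SW02#conf t
! Remove all existing boot system entries
SW02(config)#no boot system
! Step 3: add the new image first
SW02(config)#boot system flash:/c2950-i6k2l2q4-mz.121-22.EA13.bin
SW02(config)#end
! Confirm the BOOT path before reloading
SW02#show boot
```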

Finally you write the new configuration and reload.

```text
SW02#copy run start
SW02#reload
```

Recovering A Bad Flash — Part 2

Today I encountered another flash problem. I was working on an older switch, a WS-C2950G-48-EI running 12.1(13)EA1c. My client had asked me to upgrade the software because he wanted access to the html interface on the switch.

Now I’m at least an hour away from this client so I wanted to do the upgrade from my office, but I don’t like to do TFTP over the greater Internet as there’s no error correction with this.

With newer hardware we can upgrade with HTTP — this is fantastic because I can just load the code onto a webserver here in my office, punch a hole in the firewall and upgrade the hardware. Easy as pie.

But this old switch only supports FTP and TFTP. I figured I’d give FTP a whirl — it is a bit more complex than http (okay a lot more complex) but it should work.

```text
conf t
ip ftp username paul
ip ftp password somepass
end
copy ftp flash
```

And away you go! You can also do this:

```text
copy ftp://paul:somepass@serverip/filename.bin flash:
```

You can do this too if you’re working with Cisco’s tar files:

```text
archive tar /xtract ftp://paul:somepass@serverip/filename.tar flash:
```

I initially had a problem with FTP getting through the firewall; the client was throwing an error: "no such user". That didn't make sense, because the username definitely worked. I double-checked locally, and remotely from another server, to make sure my credentials worked.

The problem was that FTP uses a data channel (port 20) in addition to the control channel (port 21), and firewalls can struggle to track these sessions. So I turned on FTP inspection on my office firewall and things seemed to work; at least the debug suggested that the client was able to log in.

You can debug FTP client sessions like this. The term mon command redirects the debug output to your current session.

```text
debug ip ftp
term mon
```

But it still wasn't working. The sessions would log in and start a download but never finish. The debugs showed that the client sent an ABOR command, which aborts the download.

Interestingly, many clients don’t handle the ABOR command well. The site FTPGuide.com says: The abort command may require “special action” to force recognition from the server. Unfortunately the only special action I had at my immediate disposal was to kill the terminal session.

That was my fatal mistake.

Three FTP sessions later, and I can’t view the flash drive anymore. sh flash and dir both report:

```text
%Error opening flash:/ (No such device)
```

Now I’m in trouble. I’ve deleted the old image to make space for the new one, and now I can’t write the new image because the flash is non-responsive. I was able to erase flash and format flash but while these claimed to work, neither of them helped to make the flash usable again.

Still I suspected this problem was related to the FTP sessions I attempted, so I found a new command (new to me):

```text
#sh file descriptors
File Descriptors:
FD  Position  Open  PID  Path
0   0         0001  52   ftp://server/c2950-i6k2l2q4-tar.121-22.EA13.tar
1   0         0001  65   ftp://server/html.tar
2   0         0001  66   ftp://server/foo.txt
```

Ah ha! Clearly this is my problem: these FTP sessions are tying up the flash. If only I could kill the PIDs listed here! After much searching, though, I found that Cisco doesn't let us kill individual PIDs. I guess they feel that if a process doesn't close cleanly, something is so wrong with the device that it needs more help than just a kill.

But that didn't help me. Then I realized that I could clear the TCP sessions originating from the switch, as long as I knew the source and destination ports. I had my firewall log open and was able to pick out the ports and clear these sessions, but I could also have used show ip sockets to figure it out.

```text
#clear tcp local switchIP 11020 remote serverIP 21
```
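Without a firewall log to hand, you can sketch the same process from the switch itself. This assumes your IOS release supports `show tcp brief`, which lists each connection's local and foreign address with the port appended. The addresses and port number are the same placeholders as in the command above.

```text
! List TCP connections; Local Address and Foreign Address are shown as address.port
#show tcp brief
! Clear each stale FTP control session using the pairs from that list
#clear tcp local switchIP 11020 remote serverIP 21
! Confirm the file descriptors have been released
#show file descriptors
```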

After clearing a few of these, I was able to access the flash again and load an image back on. This time I asked my client to set up a TFTP server locally, and I used that.

The moral of the story — don’t use FTP.

Recovering A Bad Flash — Part 1

Some people are perfect, they never make mistakes, and they never put themselves in a position where they have to dig themselves out of whatever problem they’ve caused.

I’m not that person. I make mistakes, but I learn from them. I learn how to solve the problems those mistakes cause, and I learn more about the systems themselves. I have to think this is pretty normal — and as my mentor Tom Jacoby always says: “the definition of an expert is someone who has already made all the mistakes!”

So yesterday I made a mistake that made me feel like a junior tech all over again — if only until I knew how I was going to fix it. I was upgrading software on four pre-production routers, and one of these had a flash that was just big enough for one image at a time. Standard procedure here is to delete the old image, load a new one on and reload the system.

This should have been no problem: I had already done an identical machine and it came up fine. The trouble was that this was the last router I had to work on, so while it was copying I started cleaning up. Then I found myself holding the power cables for all the routers in my hands. They were all offline.

And now the last router will not boot — it just drops into ROMMON. I’ve recovered hardware over console xmodem and that is not nice — it can take hours to transfer a 30 meg image. Thankfully these routers use CF cards, and thankfully I have a USB CF reader. I mounted the card on my laptop, copied the new image over, booted the system and I was off to the races.
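A rough lower bound shows why console recovery is so slow. This assumes the console runs at the common default of 9600 baud with 8N1 framing (10 bits on the wire per byte) and that 30 MB means 30 × 2^20 bytes. It also ignores Xmodem's own block and acknowledgement overhead, so real transfers take longer.

```latex
\[
30 \times 2^{20}\ \text{bytes} \times 10\ \tfrac{\text{bits}}{\text{byte}} = 314{,}572{,}800\ \text{bits}
\]
\[
\frac{314{,}572{,}800\ \text{bits}}{9600\ \text{bits/s}} = 32{,}768\ \text{s} \approx 9.1\ \text{h},
\qquad
\frac{314{,}572{,}800\ \text{bits}}{115{,}200\ \text{bits/s}} \approx 2{,}731\ \text{s} \approx 45.5\ \text{min}
\]
```

Even at 115200 baud, if your ROMMON supports it, the transfer takes the better part of an hour.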

So what did I learn? Don’t rush the job. Think about what you’re doing and if you’re in a hurry just put that hurry out of your mind. Small mistakes can cause huge amounts of pain — if these were production routers instead of pre-production it would have been a very difficult day for me.

All About Spanning Tree Protocol

Spanning Tree Protocol

I discussed STP in a couple of earlier articles, but I would like to go into a little more detail because I think this is really important.

Spanning Tree Protocol can help us design fault-tolerant networks in two ways: primarily by detecting and disabling port misconfigurations, and secondly by allowing administrators to build failover network links.

STP is a complicated protocol, and it comes with a suite of optional features that can help fine-tune the system. I highly recommend that a network engineer study the Cisco documentation on STP and then build a lab environment before deploying.

CIO.com has a fascinating article that describes how critical STP is, and how good network design can help eliminate the worst of these consequences.

STP Root

The first thing to remember about STP is to ensure that the root switch is defined properly. The root switch should be the most central switch on the network (not necessarily the most powerful), and it should be the least disturbed switch — this means the LAN core switches are often the best choice as a STP root.

STP can choose a root switch automatically, but unfortunately this is not usually what you want. When every switch has the default bridge priority, STP elects the switch with the lowest MAC address, which is usually the oldest switch on the network. That in itself wouldn't be a problem, except that on many networks the oldest switches are pushed to the edge of the network.
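Here is a minimal sketch of pinning the root on the core instead. It assumes a flat network in VLAN 1, as in my earlier post, and is run on the core switch. `root primary` picks a priority low enough to win the election at the time you run it. Setting the priority explicitly is the alternative; lower wins, and the default is 32768.

```text
configure terminal
 spanning-tree vlan 1 root primary
 ! Alternative: set the bridge priority explicitly (multiples of 4096)
 ! spanning-tree vlan 1 priority 4096
 end
! The Root ID section should now say "This bridge is the root"
show spanning-tree vlan 1
```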

STP Parameter Tuning

STP is designed with an average network in mind. But your network isn’t average — to get the best performance out of STP you will have to modify the parameters that make it work.

Here is a Cisco guide on this: Understanding and Tuning Spanning Tree Protocol Timers. Note Cisco's caution: if you make mistakes tuning STP, you risk "LAN meltdown".

The simplest way to tune the STP timers is to set the STP diameter. The link above has a guide to help calculate the STP diameter of your network — but be wary while setting this parameter. If your network changes (future adds/moves) it may expand beyond the STP diameter that you have set and then bad things happen.
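For reference, the diameter-to-timer relationship is shown below as I understand it from the IEEE 802.1D recommendations that the Cisco timer document draws on. Here hello is the hello time in seconds and dia is the STP diameter. The defaults (hello = 2, dia = 7) work out to the familiar 20-second max age and a 15-second forward delay, which is a useful sanity check.

```latex
\[
\text{max\_age} = 4\cdot\text{hello} + 2\cdot\text{dia} - 2,
\qquad
\text{forward\_delay} = \frac{4\cdot\text{hello} + 3\cdot\text{dia}}{2}
\]
\[
\text{hello} = 2,\ \text{dia} = 7:\quad
\text{max\_age} = 8 + 14 - 2 = 20\ \text{s},
\qquad
\text{forward\_delay} = \frac{8 + 21}{2} = 14.5 \approx 15\ \text{s}
\]
```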

The Cisco document includes an example topology that illustrates one of these calculations. You need to identify the worst case for the number of switches a frame has to cross. In that example, the longest paths are C-A-D-B-E and D-A-C-B-E, so the STP diameter is 5.
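You might decide to tune the timers despite the advice that follows. If so, IOS can derive max age and forward delay from the diameter for you. Here is a sketch, run on the root switch, assuming VLAN 1 and the diameter of 5 from the example:

```text
configure terminal
 ! Make this the root for VLAN 1 and derive max-age and forward-delay from diameter 5
 spanning-tree vlan 1 root primary diameter 5
 end
! Check the Max Age and Forward Delay values now in use
show spanning-tree vlan 1
```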

My advice is to leave these parameters alone unless you have a good reason to make changes. The risk is high, and what is gained is a slightly more rapid recovery from a network failure — the long term consequence is that someone will have to remember that this was done.

Loop Protection

Loops in Layer 2 networks are very, very bad. The layer 2 header has no time-to-live value, so a looped frame can continue to loop forever. Add in some broadcast traffic and you have a recipe for disaster; Cisco calls it LAN meltdown.

A key element of STP is that it prevents loops: STP is designed to detect and resolve Layer 2 loops. It is not enough to run STP only on the core switches of a network (although it helps); to fully protect a LAN, all switches must be running STP.

Redundant Connections

STP can be used to create failover links within the LAN, allowing the network administrator to design networks with multiple links between switches. Essentially the network administrator intentionally creates a loop, and allows STP to block one of the links. If the primary link were to fail, then STP would re-calculate the topology and would bring up the secondary link.
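You can also choose which of the two links STP blocks instead of leaving it to the tie-breakers. One option is to raise the STP cost of the backup link on the edge switch. The primary link then remains the root port and the backup goes to blocking. The sketch assumes the backup uplink is GigabitEthernet0/2, a hypothetical name, in VLAN 1.

```text
configure terminal
 interface GigabitEthernet0/2
  description Backup uplink to core
  ! Higher cost = less preferred path to the root (the default gigabit cost is 4)
  spanning-tree vlan 1 cost 100
 end
! The primary uplink should be the root port; this one should be blocking
show spanning-tree vlan 1
```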

Host Port Configurations

Running STP on your network has some interesting side effects. It sure is great knowing that Layer 2 loops are automatically detected and mitigated, but sometimes running STP can make life difficult for users.

For example, when a port comes online, STP does not trust the interface. It listens on the port to see whether any BPDUs come through, but it will not forward any traffic. The progression from Blocking to Listening, to Learning and finally to the Forwarding state can take up to a minute. No traffic passes the interface in the meantime, which can cause hosts that depend on DHCP for IP address assignment to fail. If you unplug and reconnect the network cable (or disable and enable the interface), the cycle starts again, so that tried-and-true end-user approach to fixing network faults will not help. Thankfully, a user can usually just "Repair" the network connection in software, and the OS will make a successful DHCP transaction.

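Cisco (and other network vendors) have created a clever solution for this problem. spanning-tree portfast lets a port skip the Listening and Learning states and go straight to Forwarding when the link comes up. STP is still running on the port, and it still sends BPDUs; it just doesn't wait before forwarding.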
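Here is a minimal sketch of enabling PortFast on a single user-facing access port. FastEthernet0/10 is a hypothetical port name. IOS prints a warning when you enable it, as a reminder that PortFast belongs only on ports facing end hosts.

```text
configure terminal
 interface FastEthernet0/10
  description User desk port
  switchport mode access
  ! Go straight to Forwarding when the link comes up
  spanning-tree portfast
 end
```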

Unfortunately, spanning-tree portfast leaves your network at risk from a user inadvertently connecting a single cable to two ports. Because the ports forward immediately, a loop forms before STP has had a chance to block it, and it can potentially eat up all the CPU processing on the switch. Cisco (and other network vendors) have come up with a solution for this too. spanning-tree portfast bpduguard will automatically disable a port as soon as a single BPDU frame is detected. This is not an intelligent feature, so do NOT use spanning-tree portfast bpduguard on any switch uplinks. It will definitely shut them down as soon as they receive the first BPDU frame.
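Here is a sketch of both ways to turn on BPDU guard, assuming FastEthernet0/10 is a hypothetical user-facing port. Exact syntax varies between IOS releases, so check the command reference for yours.

```text
! Per port: err-disable this port if any BPDU arrives on it
configure terminal
 interface FastEthernet0/10
  spanning-tree bpduguard enable
 end
!
! Or globally: apply BPDU guard to every port that has PortFast enabled
configure terminal
 spanning-tree portfast bpduguard default
 end
```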

Usually when a port is shut down, the user (or network administrator) will notice the problem and remove the loop. The port, however, stays disabled until an administrator manually brings it back online. Cisco decided this is too much work for us network administrators, so they defined the errdisable recovery cause bpduguard command. It brings the port back online after 5 minutes (by default). If the loop still exists, it will be detected and the port will go offline again.
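Here is a minimal sketch of automatic recovery for ports that BPDU guard has err-disabled. The interval is in seconds; 300 matches the 5-minute default and is shown only to make it explicit.

```text
configure terminal
 ! Re-enable ports that BPDU guard err-disabled
 errdisable recovery cause bpduguard
 ! Time before recovery, in seconds
 errdisable recovery interval 300
 end
! Shows which causes are enabled and any ports waiting to recover
show errdisable recovery
```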

Advanced Configurations

All of the below configurations are discussed in detail in the Cisco documentation on STP, but I will review them in brief.

Bpdufilter is an alternative to bpduguard. Like bpduguard it watches for incoming BPDU frames, but in addition it filters outbound BPDU frames (less traffic for hosts which discard these frames anyway), and if a BPDU frame is received on this port (a switching loop is created) then the port loses its portfast status and STP starts to monitor the port. There is some risk here in the event of a user looping two ports at their own desk, as the loop would not be detected.
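The behaviour described above matches the global form of BPDU filtering. Here is a sketch:

```text
configure terminal
 ! PortFast ports stop sending BPDUs; if one is received, the port loses
 ! PortFast status and takes part in normal STP
 spanning-tree portfast bpdufilter default
 end
```

Be careful with the interface-level command, spanning-tree bpdufilter enable, which behaves differently. It stops the port from sending and receiving BPDUs altogether, which effectively turns STP off on that port.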

Uplinkfast allows the network administrator to define redundant switch uplinks so they skip the Listening and Learning state in the event of a link failure. This allows the network to converge faster (Cisco says about 5 seconds) than it would have done otherwise.
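UplinkFast is a global setting meant for access (edge) switches with redundant uplinks, not the root or core switches. Enabling it also raises the switch's bridge priority and port costs so that the switch is unlikely to become root. Here is a sketch, run on an edge switch:

```text
configure terminal
 spanning-tree uplinkfast
 end
! The summary should report UplinkFast as enabled
show spanning-tree summary
```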

Backbonefast allows switch backbone interfaces to detect and resolve indirect link failures faster than the normal STP timeouts would allow. Essentially, it allows a switch to detect a network link failure on a neighboring switch, and update its own topology very quickly.
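BackboneFast only helps if it is enabled on every switch in the STP domain. The switches exchange root link query messages with their neighbours to confirm an indirect failure. Here is a sketch, to be run on each switch:

```text
configure terminal
 spanning-tree backbonefast
 end
! The summary should report BackboneFast as enabled
show spanning-tree summary
```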

Rootguard prevents a particular port from becoming a root port. This prevents a new switch connected to the edge of the network from becoming the root switch. Recall that the network administrator spends a lot of effort designing the STP topology so the root switch is in a logical location, usually the core.

Loopguard protects against a port that stops receiving BPDUs, which is what happens on a unidirectional link (usually a wiring fault). Without it, a blocked port that stops hearing BPDUs eventually moves to forwarding and creates a loop. With loopguard, the port instead goes into a loop-inconsistent (blocking) state until BPDUs arrive again. There is another Layer 2 solution to this problem called UDLD, which may provide more interoperability and flexibility than loopguard; Cisco has documentation comparing UDLD and loopguard.
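Here is a minimal sketch of root guard on the core. It is applied to a port facing an edge switch that must never lead towards the root. GigabitEthernet1/0/24 is a hypothetical port on the core stack.

```text
configure terminal
 interface GigabitEthernet1/0/24
  spanning-tree guard root
 end
! Ports blocked by root guard are listed as root-inconsistent
show spanning-tree inconsistentports
```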
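Here is a sketch of both options, assuming GigabitEthernet0/1 is an uplink (a hypothetical name). Loop guard can be enabled globally or per port. The UDLD interface syntax varies by platform and IOS release, so confirm it against your command reference.

```text
configure terminal
 ! Option 1: loop guard on all point-to-point links
 spanning-tree loopguard default
 !
 ! Option 2: per uplink interface instead
 interface GigabitEthernet0/1
  spanning-tree guard loop
  ! UDLD in aggressive mode on the same link
  udld port aggressive
 end
! Ports held by loop guard are listed as loop-inconsistent
show spanning-tree inconsistentports
```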

Rapid STP

Cisco has a white paper describing RSTP; the document is very concise and explains the differences between RSTP and STP quite clearly.

In short, RSTP builds in the features of uplinkfast and backbonefast (which were Cisco proprietary extensions to STP). It updates the algorithm so that topology changes propagate quickly across the network instead of through the slow, plodding detection of STP. Finally, it replaces much of the timer-based waiting with a proposal/agreement handshake between neighbours, so convergence happens much faster.
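Here is a sketch of switching a Cisco switch from PVST+ to Rapid PVST+, assuming your platform and IOS release support it. Run it on every switch you can. A port facing a legacy 802.1D neighbour falls back to classic STP behaviour on that link.

```text
configure terminal
 spanning-tree mode rapid-pvst
 end
! The output should now report the protocol as rstp
show spanning-tree vlan 1
```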

Downgrade from Lightweight to Autonomous Using Linux

Hello world,

I recently came across a solution for a small problem I was working on, and I thought I would share.

I have a Cisco 1131 AG lightweight access point that I wanted to downgrade to autonomous mode. Now this is easy if you have a WLC (Wireless LAN Controller) that these APs are designed for, but if you don’t you have to perform a downgrade manually.

There are lots of instructions on the Internet, and Cisco documents this procedure too.

What Cisco leaves out is what the AP expects from your TFTP server. When the AP boots into recovery mode, it tries to reach a TFTP server at the broadcast address, tftp://255.255.255.255. Essentially, it sends its request to every device on the LAN and uses whichever TFTP server answers.

My laptop is running Ubuntu (a flavour of Linux) and I have the tftpd-hpa package as my TFTP server. Normally I like the setup I have, but today it wasn’t enough — the AP wasn’t connecting.

The CLI was throwing this error:

```text
image_recovery: Download default IOS tar image tftp://255.255.255.255/c1130-k9w7-tar.default
examining image...
extracting info (280 bytes)
Premature end of tar file
ERROR: Image is not a valid IOS image archive.
```

And when I used Wireshark to look at the traffic, I found that the TFTP server wasn't responding to the AP's requests. So I configured tftpd-hpa to listen specifically on the broadcast address, like this:

```text
$ cat /etc/default/tftpd-hpa
#Defaults for tftpd-hpa
RUN_DAEMON="yes"
OPTIONS="-c -l -s /var/lib/tftpboot -a 255.255.255.255"
```

Then I restarted the daemon with /etc/init.d/tftpd-hpa restart.

Once I did that, the AP was able to connect and all was well in the world.
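If Wireshark isn’t handy, a few lines of Python can confirm whether the AP’s broadcast requests are reaching the laptop at all. This is a minimal sketch and not part of tftpd-hpa. It listens on UDP port 69 and prints each TFTP read request it sees. It assumes Linux, root privileges (port 69 is privileged), and that tftpd-hpa has been stopped first so that the port is free. The function names are my own.

```python
import socket
import struct

OPCODE_RRQ = 1  # TFTP read request (RFC 1350)


def parse_rrq(packet: bytes):
    """Return (filename, mode) if packet is a TFTP read request, else None."""
    if len(packet) < 4 or struct.unpack("!H", packet[:2])[0] != OPCODE_RRQ:
        return None
    fields = packet[2:].split(b"\x00")
    if len(fields) < 3:  # filename, mode, and the empty field after the last NUL
        return None
    filename = fields[0].decode("ascii", "replace")
    mode = fields[1].decode("ascii", "replace").lower()
    return filename, mode


def watch_for_rrq(port: int = 69) -> None:
    """Print every TFTP read request that reaches this host until Ctrl+C.

    Binding to "" (INADDR_ANY) means the socket also receives datagrams sent
    to 255.255.255.255 on Linux. Port 69 needs root, and it must not already
    be in use (stop tftpd-hpa first).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    try:
        while True:
            data, (host, src_port) = sock.recvfrom(1024)
            request = parse_rrq(data)
            if request:
                print(f"RRQ from {host}:{src_port} for {request[0]!r} (mode {request[1]})")
    finally:
        sock.close()
```

To use it, call `watch_for_rrq()` from a `sudo python3` session, power-cycle the AP into recovery mode, and look for a line naming c1130-k9w7-tar.default. If that line appears, the request is reaching the laptop, and the problem lies in how the TFTP server handles it.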

High Availability — LAN — STP

The parent article on High Availability.

LAN switching provides some of the most basic network connectivity options, and they are often overlooked. Most switches (Cisco, HP, Dell and others) support these configurations, but I can guarantee that you will find limitations on pretty much every platform. If you need interoperability between vendors, do your testing so that you understand these limitations.

Spanning Tree Protocol

I discussed STP in an earlier article, but I would like to go into a little more detail here.

Spanning Tree Protocol can help us design fault-tolerant networks in two ways: primarily by detecting and disabling port misconfigurations, and secondly by allowing administrators to build failover network links.

STP is a complicated protocol, and it comes with a suite of optional features that can help fine-tune its behaviour. I highly recommend that network engineers study the Cisco documentation on STP, then build a lab environment before deploying it.

I also recommend Cisco’s STP Problems and Design Considerations document. It just might help identify why things are happening the way they are.

Loop Protection

Loops in Layer 2 networks are very, very bad. The Layer 2 (Ethernet) header has no time-to-live value, so a looped frame can continue to loop forever. Add in some broadcast traffic and you have a recipe for disaster; Cisco calls it a LAN meltdown.
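To see why the missing time-to-live matters, here is a toy Python simulation of flooding with no loop prevention. Each switch forwards a broadcast out of every switch-to-switch link except the one it arrived on, and nothing ever discards the frame. It is an illustration under simplifying assumptions: host ports are ignored, every hop takes the same time, and the switch names are made up. It is not a model of any real switch.

```python
from itertools import combinations


def flood_copies(links, start, hops):
    """Count broadcast copies on the wire after each hop, with no STP and no TTL.

    links: list of (switch_a, switch_b) pairs, one per physical link
           (parallel links between the same two switches are allowed).
    Host ports are ignored because they do not feed frames back into the loop.
    """
    def ports(switch):
        return [i for i, (a, b) in enumerate(links) if switch in (a, b)]

    def far_end(link, switch):
        a, b = links[link]
        return b if a == switch else a

    # Each frame in flight is (link it is travelling on, switch it is heading to).
    in_flight = [(i, far_end(i, start)) for i in ports(start)]
    counts = [len(in_flight)]
    for _ in range(hops - 1):
        in_flight = [
            (out, far_end(out, switch))
            for link, switch in in_flight
            for out in ports(switch)
            if out != link  # flood out of every port except the one it arrived on
        ]
        counts.append(len(in_flight))
    return counts


def full_mesh(switches):
    """One link between every pair of switches."""
    return list(combinations(switches, 2))


print(flood_copies([("SW1", "SW2"), ("SW1", "SW2")], "SW1", 5))  # [2, 2, 2, 2, 2]
print(flood_copies(full_mesh("ABCD"), "A", 5))                    # [3, 6, 12, 24, 48]
print(flood_copies([("A", "B"), ("B", "C")], "A", 3))            # [1, 1, 0]
```

With two switches joined by two links, the same two copies circulate forever. In a four-switch full mesh, the number of copies doubles with every hop. A loop-free chain drains to zero, which is the situation STP works to guarantee.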

A key element of STP is that it prevents loops: it is designed to detect and resolve Layer 2 loops. Running STP on the core switches of a network helps, but it is not enough. To fully protect a LAN, all switches must be able to run STP.

Redundant Connections

STP Example

If you connect two switches with two network links, STP will detect the loop and block one of the ports. There are calculations that help STP decide which interface to block, but those are for a more technical review of STP.

In the example on the left, I’ve configured two switches to use two links between them. As long as the configuration stays simple, this would actually work with an STP-capable switch and a dumb switch, or even a hub.

STP detects the second link and blocks the port. Which port gets blocked is decided by an algorithm based largely on interface speeds, which STP converts into path costs; bridge and port IDs break any ties. These values are customizable, so you can make sure that STP opens and fails over predictably.
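Here is a simplified Python sketch of the two decisions involved in a two-switch, two-link setup. The first is electing the root bridge: lowest priority wins, and the lowest MAC address breaks ties. The second is choosing the non-root switch's root port: lowest total path cost wins, with the upstream bridge ID and port ID as tie-breakers. The path costs are the IEEE 802.1D "short" values. The switch names, MAC addresses, and port names are hypothetical. Real STP also weighs port priorities and designated ports on shared segments, and those are left out here.

```python
# IEEE 802.1D "short" path costs, keyed by link speed in Mbps.
PATH_COST = {10: 100, 100: 19, 1000: 4, 10000: 2}


def bridge_id(priority, mac):
    """Bridge ID ordering: lower priority wins, lower MAC breaks ties.
    Assumes every MAC is written in the same aa:bb:cc:dd:ee:ff form, so
    comparing the strings compares the addresses."""
    return (priority, mac.lower())


def elect_root(bridges):
    """bridges maps a switch name to (priority, mac); returns the root's name."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


def pick_root_port(uplinks):
    """Choose the root port on a non-root switch.

    uplinks is a list of dicts, one per port that receives BPDUs, with keys:
      port            local port name
      speed           link speed in Mbps (must be a key of PATH_COST)
      upstream_cost   root path cost advertised by the neighbour
      upstream_bridge neighbour's bridge ID, as returned by bridge_id()
      upstream_port   neighbour's port number (lower wins)
    Returns (root_port, ports_to_block).
    """
    def rank(u):
        return (u["upstream_cost"] + PATH_COST[u["speed"]],
                u["upstream_bridge"], u["upstream_port"], u["port"])

    ordered = sorted(uplinks, key=rank)
    return ordered[0]["port"], [u["port"] for u in ordered[1:]]


# Hypothetical example: SW1 and SW2 joined by a gigabit link and a 100 Mbps link.
BRIDGES = {"SW1": (32768, "00:11:22:33:44:55"),
           "SW2": (32768, "00:11:22:33:44:66")}
ROOT = elect_root(BRIDGES)  # "SW1": equal priority, lower MAC
SW2_UPLINKS = [
    # upstream_cost is 0 because the neighbour on both links is the root itself.
    {"port": "Gi0/1", "speed": 1000, "upstream_cost": 0,
     "upstream_bridge": bridge_id(*BRIDGES[ROOT]), "upstream_port": 1},
    {"port": "Fa0/24", "speed": 100, "upstream_cost": 0,
     "upstream_bridge": bridge_id(*BRIDGES[ROOT]), "upstream_port": 2},
]
print(pick_root_port(SW2_UPLINKS))  # ('Gi0/1', ['Fa0/24'])
```

Lowering SW2's priority (for example, to 4096) would make it the root instead. That is the kind of customization that keeps STP predictable.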

Failure Mode

If something happens to the active interface, STP detects the change and stops passing traffic so that it can recalculate the topology. Once the recalculation is complete, things look as they do on the left.

Here STP detected a failure on the active interface and opened the secondary connection.

STP doesn’t depend on physical failures to detect network changes. The root switch sends out Hello BPDUs (every 2 seconds by default), and the other switches relay them downstream. If a switch stops receiving them for longer than the max age timer (20 seconds by default), it discards the stale information and the topology is recalculated. Rapid STP (RSTP) shortens this to three missed Hellos.
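Classic STP ages out the root's information after the max age timer (20 seconds by default), while RSTP gives up after three missed Hellos. The difference is easy to see in a small Python simulation. This is a sketch, not switch code. `PortBpduTimer` and `detection_time` are made-up names. Hellos are assumed to arrive every 2 seconds until the path to the root fails silently at t = 10 s, and RSTP's behaviour is approximated as a 6-second max age.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortBpduTimer:
    """Tracks how stale the root's information on one port has become.

    max_age=20 mimics classic STP's default max age; max_age=6 roughly
    mimics RSTP's three missed 2-second Hellos.
    """
    max_age: float = 20.0
    received_at: Optional[float] = None
    message_age: float = 0.0  # age already accumulated upstream, carried in the BPDU

    def receive(self, now: float, message_age: float = 0.0) -> None:
        """Record a Hello BPDU arriving at time `now` (in seconds)."""
        self.received_at = now
        self.message_age = message_age

    def expired(self, now: float) -> bool:
        """True once the stored information is too old to trust."""
        if self.received_at is None:
            return True
        return self.message_age + (now - self.received_at) >= self.max_age


def detection_time(timer, hello_time=2.0, last_hello=10.0, step=0.5):
    """Deliver Hellos every hello_time seconds up to last_hello, then stop
    (as if the path to the root failed silently) and return the first
    simulated time at which the port gives up on the root."""
    t = 0.0
    while True:
        if t <= last_hello and t % hello_time == 0:
            timer.receive(t)
        if timer.expired(t):
            return t
        t += step


print(detection_time(PortBpduTimer()))             # 30.0: 20 s after the last Hello
print(detection_time(PortBpduTimer(max_age=6.0)))  # 16.0: 6 s after the last Hello
```

Detection is only the first step. After it, a classic STP port still has to pass through the listening and learning states before it forwards traffic.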

Advanced Configurations

When designing a network, you must always consider the complexity of your design and the requirements of your client. For a client without a network-savvy administrator to maintain the network, relying on STP for redundancy is sometimes a bad idea, and there are other options.

High Availability – LAN – STP

Hello everyone, I have moved my blog postings so that they are viewed directly on my site. I will not be making any more postings on WordPress, but you can get my full content, including old postings as well as the new stuff, here:

http://wozney.ca/blog

And here is a link to the original article that you were looking for!
http://wozney.ca/2008/10/30/high-availability-%E2%80%94-lan-%E2%80%94-stp/

See you there!

Paul