Part of my day job is assisting people with their infrastructure needs. One large chunk of that is being on point for upgrades to hardware and software – especially since this portion of the inventory lifecycle is usually the most fraught with issues. 🙂
As such, I’ve done a lot of work with Cisco UCS around infrastructure and management software upgrades. In this post, I’ll cover a somewhat rare issue that has hit me twice so far when upgrading from firmware build 2.0.x to 2.1.x. It’s a documented defect that is resolved in later builds, so eventually it will go away. I would imagine that many people are still in the process of moving off 2.0.x today, so here we go.
The Upgrade Symptom
While upgrading a blade server from 2.0.x to 2.1.x, you might monitor the Finite State Machine (FSM) tab and see the upgrade process stall out at around step 30 or 37 with a large number of retries and ultimately a failure result. Both of these steps have to do with the BIOS – either waiting on it to load, or waiting on it to upgrade.
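If you want to write the symptom down as a quick check, here is a minimal Python sketch. It does not talk to UCS Manager at all; you would type in what you read off the FSM tab. The `FsmSnapshot` type, the one-step tolerance, and the retry threshold of 5 are all my own assumptions for illustration, since the only hard facts here are "around step 30 or 37" and "a large number of retries".

```python
from dataclasses import dataclass

# Steps where the stall shows up, per the symptom described above.
BIOS_STEPS = (30, 37)


@dataclass
class FsmSnapshot:
    """Hypothetical record of what the FSM tab shows at one moment."""
    step: int      # current FSM step number
    retries: int   # retry count shown for that step
    failed: bool   # True once the FSM reports a failure result


def looks_like_bios_stall(snap: FsmSnapshot,
                          tolerance: int = 1,
                          retry_threshold: int = 5) -> bool:
    """Return True if the snapshot resembles the BIOS stall symptom.

    tolerance and retry_threshold are guesses, not Cisco-documented values.
    """
    # "Around" step 30 or 37: allow the step number to be off by `tolerance`.
    near_bios_step = any(abs(snap.step - s) <= tolerance for s in BIOS_STEPS)
    # Either the FSM already failed, or it is piling up retries.
    return near_bios_step and (snap.failed or snap.retries >= retry_threshold)
```

For example, a blade sitting at step 37 with a dozen retries would match, while one that failed at step 12 would not, and would point to some other problem.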
Eventually, the upgrade workflow will fail, the blade server will reboot, and the process loops infinitely until you stop it. Defect CSCub55065 describes the issue somewhat vaguely, but offers solid advice on how to fix the problem without an RMA:
I’m going to walk through both the decommission and corrupt BIOS recovery below. So far, the corrupt BIOS recovery step has fixed my issue both times, but your mileage may vary.
Trick #1 – Decommission and Re-commission
The act of decommissioning a blade is much like admitting that you’re going to remove the hardware from inventory. The system is told to flush out any data for the object. It’s common to trigger a decommission whenever you wish to actually remove hardware (imagine that?) without having the system go bonkers with alerts about a missing blade or chassis.
The workflow is super simple. Navigate to the Equipment tab of UCS Manager, click on the blade server, choose the Server Maintenance option in the Actions menu, and then select Decommission. I’ve outlined the process in the graphic below:
Once you click OK, a warning will present itself stating “Decommissioning this server will shut it down. Are you sure you want to decommission?” Since this server is already out of service for a firmware upgrade, there shouldn’t be anything actively running on it, and you can click Yes. If you aren’t absolutely, 100% sure, click No and double check that you have the correct blade server selected. My OCD usually flares up around this point to double (or triple) check.
Tip: I changed the User Label on the problem blade server to “Busted Server” to help draw my attention to it. It should also have an orange or red box around it due to the firmware installation failure. Again, I’m very cautious.
Once you accept the decommission action, UCS Manager will acknowledge with a “maintenance task successfully started for Server X” popup and the server status will change to Needs Resolution. Click on the Re-acknowledge slot action to allow UCS Manager to scan the blade server and place it back into inventory, as shown below:
After a brief period of time, the blade server should once again be available inside of UCS Manager. You can now try the firmware upgrade again. If it is not successful, more drastic measures may be needed (see the next section).
Trick #2 – Recover the Corrupted BIOS Firmware
In some situations, the BIOS firmware is simply corrupt on the blade. Bummer! Fortunately for you, this is relatively simple to fix.
Starting from the Equipment tab in UCS Manager, navigate to the blade server with the issue, click on the Inventory tab, then the Motherboard sub-tab, then choose the Recover Corrupt BIOS Firmware link. A window will appear asking what version you wish to activate. You can refer to the BIOS information (near #6 in my graphic below) for the running and startup versions of BIOS firmware. As you change the version to be activated, the value of the startup version will change to match. Once you’ve selected the version to activate – which could just be the same one that is loaded or a newer version if you’re still trying to upgrade – click the OK button.
The entire process is shown below on a B200 M2 server running 2.2(1b) as an example:
Acknowledge the warning that pops up, then sit around nervously waiting to see if it does the trick. I tend to monitor the FSM (Finite State Machine) tab to keep an eye on the overall progress. This step seems to be the most effective at clearing BIOS-related upgrade issues when moving from UCS 2.0 to a later version.
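Putting the two tricks together, the order of operations can be sketched in a few lines of Python. This is only an outline of the escalation logic: the step actions and the `retry_upgrade` callable are hypothetical placeholders that you would back with your own clicks in UCS Manager (or whatever tooling you use), not calls into any Cisco API.

```python
from typing import Callable, Iterable, Optional, Tuple

# A recovery step is a label plus a callable that performs it.
# Both are placeholders; nothing here touches a real UCS domain.
Step = Tuple[str, Callable[[], None]]


def recover_and_retry(steps: Iterable[Step],
                      retry_upgrade: Callable[[], bool]) -> Optional[str]:
    """Run recovery steps in order, retrying the upgrade after each one.

    Returns the label of the step after which the upgrade succeeded,
    or None if every step was tried and the upgrade still failed.
    """
    for name, action in steps:
        action()                 # e.g. decommission and re-acknowledge
        if retry_upgrade():      # True means the firmware upgrade completed
            return name
    return None
```

You would pass the steps least drastic first, for example `[("decommission and re-acknowledge", do_decommission), ("recover corrupt BIOS firmware", do_bios_recovery)]`, where both functions are yours to supply. If the function returns `None`, neither trick worked and it is probably time to open a case.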
I’m not trying to put a target on UCS saying that it’s a bad product, as I’ve literally upgraded firmware on hundreds of blades across dozens of Fabric Interconnects and only encountered this BIOS firmware bug twice. But it caught me off guard, so I’m documenting it here for both you and me to have for safekeeping. 🙂
I love the fact that I can lay down new or current firmware without needing a boot CD, ISO, or any other wizardry. I just tell UCS Manager to do it and it, well … does it. Groovy.