Abnormal patchmgr Termination when upgrading RoCE Leaf Switches may leave Ports in a Shutdown State (Doc ID 2984407.1)

Last updated on NOVEMBER 03, 2023

Applies to:

Cisco Nexus Switch - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

If the patchmgr process terminates or the network connection fails while patching a RoCE switch, the switch ports can be left in a shutdown state.

Example of a failed RoCE switch upgrade:

With arguments: --roceswitches /u01/patches/rocesw_group --upgrade
2023-10-15 09:52:10 -0400 1 of 2:Running upgrade on switch rocea0
2023-10-15 09:52:14 -0400: [INFO ] Performing Nodes connectivity tests on rocea0
2023-10-15 09:52:20 -0400: [SUCCESS ] Nodes connectivity tests on rocea0 are successful
2023-10-15 09:52:23 -0400: [INFO ] EPLD Version - Found: 0x5/0x11, Required: 0x5/0x16
2023-10-15 09:52:30 -0400: [INFO ] Switch rocea0 will be upgraded from nxos.7.0.3.I7.9.bin to nxos64-cs.10.2.4.M.bin
2023-10-15 09:52:30 -0400: [INFO ] Checking for free disk space on switch
2023-10-15 09:52:30 -0400: [INFO ] disk is 91.00% free, available: 107050528768 bytes
2023-10-15 09:52:30 -0400: [SUCCESS ] There is enough disk space to proceed
2023-10-15 09:52:31 -0400: [INFO ] Found nxos64-cs.10.2.4.M.bin on switch, skipping download
2023-10-15 09:52:31 -0400: [INFO ] Verifying sha256sum of bin file on switch
2023-10-15 09:52:57 -0400: [SUCCESS ] sha256sum matches: 84f930ca02487dd8a881049d65fd1bbdc8882841de88cae0bd176c494054aff2
2023-10-15 09:55:09 -0400: [INFO ] Performing FW install of nxos64-cs.10.2.4.M.bin on rocea0
2023-10-15 09:57:09 -0400: [INFO ] reload of rocea0 is in progress
2023-10-15 10:03:44 -0400: [FAIL ] [FirmwareUpgradeError] switch rocea0 failed to come up <---- Patchmgr waits about 6 minutes after the reload

SUMMARY OF ERRORS:

2023-10-15 10:03:44 -0400: [FAIL ] [FirmwareUpgradeError] switch rocea0 failed to come up
2023-10-15 10:03:44 -0400 :FAILED : upgrade 2 RoCE switch(es) to 10.2.4
2023-10-15 10:03:44 -0400 :ERROR : FAILED run of command:./patchmgr --roceswitches /u01/patches/rocesw_group --upgrade
2023-10-15 10:03:45 -0400 :INFO : upgrade performed on switch(es) in file /u01/patches/rocesw_group: [ rocea0 roceb0]

NOTE: In the above example, patchmgr reports that rocea0 did not come up, which caused patchmgr to terminate. In most cases this is due to some
other network issue or a long delay in the switch responding to patchmgr after a reload.
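The wait noted in the log annotation can be checked from the two timestamps. The following is a minimal sketch, assuming the log lines begin with a 25-character timestamp as in the example above; the function names are hypothetical and are not part of patchmgr.

```python
from datetime import datetime

# Two lines copied from the example log above.
LOG = """\
2023-10-15 09:57:09 -0400: [INFO ] reload of rocea0 is in progress
2023-10-15 10:03:44 -0400: [FAIL ] [FirmwareUpgradeError] switch rocea0 failed to come up
"""


def parse_ts(line):
    # Assumption: the first 25 characters are "YYYY-MM-DD HH:MM:SS +ZZZZ".
    return datetime.strptime(line[:25], "%Y-%m-%d %H:%M:%S %z")


def reload_wait_seconds(log_text):
    """Return seconds between the reload message and the first
    'failed to come up' message that follows it, or None if either is missing."""
    reload_ts = None
    for line in log_text.splitlines():
        if "reload of" in line and "in progress" in line:
            reload_ts = parse_ts(line)
        elif "failed to come up" in line and reload_ts is not None:
            return (parse_ts(line) - reload_ts).total_seconds()
    return None


if __name__ == "__main__":
    print(reload_wait_seconds(LOG))
```

For the example log this gives 395 seconds (6 minutes 35 seconds), which matches the "about 6 minutes" annotation.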

Log in to the switch (ssh admin@<IP-ADDR>) and check for ports in the Administratively down state:

rocea0#show interface brief

--------------------------------------------------------------------------------
Ethernet VLAN Type Mode Status Reason Speed Port
Interface Ch #
--------------------------------------------------------------------------------
Eth1/1 1 eth access down XCVR not inserted auto(D) --
Eth1/2 1 eth access down XCVR not inserted auto(D) --
Eth1/3 1 eth access down XCVR not inserted auto(D) --
Eth1/4 1 eth trunk up none 100G(D) 100 <--- In this case ISLs are up on a single rack
Eth1/5 1 eth trunk up none 100G(D) 100 <--- In this case ISLs are up on a single rack
Eth1/6 1 eth trunk up none 100G(D) 100 <--- In this case ISLs are up on a single rack
Eth1/7 1 eth trunk up none 100G(D) 100 <--- In this case ISLs are up on a single rack
Eth1/8 3888 eth access down Administratively down 100G(D) -- <--- Node ports 8-29 are shut down
Eth1/9 3888 eth access down Administratively down 100G(D) --
Eth1/10 3888 eth access down Administratively down 100G(D) --
Eth1/11 3888 eth access down Administratively down 100G(D) --
Eth1/12 3888 eth access down Administratively down 100G(D) --
Eth1/13 3888 eth access down Administratively down 100G(D) --
Eth1/14 3888 eth access down Administratively down 100G(D) --
Eth1/15 3888 eth access down Administratively down 100G(D) --
Eth1/16 3888 eth access down Administratively down 100G(D) --
Eth1/17 3888 eth access down XCVR not inserted 100G(D) --
Eth1/18 3888 eth access down Administratively down 100G(D) --
Eth1/19 3888 eth access down XCVR not inserted 100G(D) --
Eth1/20 3888 eth access down Administratively down 100G(D) --
Eth1/21 3888 eth access down Administratively down 100G(D) --
Eth1/22 3888 eth access down Administratively down 100G(D) --
Eth1/23 3888 eth access down Administratively down 100G(D) --
Eth1/24 3888 eth access down Administratively down 100G(D) --
Eth1/25 3888 eth access down Administratively down 100G(D) --
Eth1/26 3888 eth access down Administratively down 100G(D) --
Eth1/27 3888 eth access down Administratively down 100G(D) --
Eth1/28 3888 eth access down Administratively down 100G(D) --
Eth1/29 3888 eth access down Administratively down 100G(D) --
Eth1/30 1 eth trunk up none 100G(D) 100 <--- In this case ISLs are up on a single rack
Eth1/31 1 eth trunk up none 100G(D) 100 <--- In this case ISLs are up on a single rack
Eth1/32 1 eth trunk up none 100G(D) 100 <--- In this case ISLs are up on a single rack
Eth1/33 1 eth trunk up none 100G(D) 100 <--- In this case ISLs are up on a single rack
Eth1/34 1 eth access down XCVR not inserted auto(D) --
Eth1/35 1 eth access down XCVR not inserted auto(D) --
Eth1/36 1 eth access down XCVR not inserted auto(D) --
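When the output is long, the affected ports can be picked out by script. The following is a minimal sketch, assuming the output of show interface brief has been saved as text in the layout shown above; the function name and the sample rows are illustrative only.

```python
# A few rows in the same layout as the output above (illustrative subset).
SAMPLE = """\
Eth1/4 1 eth trunk up none 100G(D) 100
Eth1/8 3888 eth access down Administratively down 100G(D) --
Eth1/17 3888 eth access down XCVR not inserted 100G(D) --
Eth1/18 3888 eth access down Administratively down 100G(D) --
"""


def admin_down_ports(output):
    """Return interface names whose row reports 'Administratively down'."""
    ports = []
    for line in output.splitlines():
        fields = line.split()
        # Only consider Ethernet interface rows; skip headers and separators.
        if fields and fields[0].startswith("Eth") and "Administratively down" in line:
            ports.append(fields[0])
    return ports


if __name__ == "__main__":
    for port in admin_down_ports(SAMPLE):
        print(port)
```

Rows reported as XCVR not inserted are skipped on purpose: those ports have no transceiver and are down for a different reason.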

Cause
