Grid Upgrade to 19c Fails During rootupgrade.sh. Nodes Reboot, Voting File Relocates, ASM Listener, and ASM Don't Start Automatically.
(Doc ID 2651972.1)
Last updated on OCTOBER 18, 2023
Applies to:
Oracle Database - Enterprise Edition - Version 19.1.0.0.0 and later
Information in this document applies to any platform.
Symptoms
Upgrading from 12.1.0.2 to 19.5
- Script rootupgrade.sh is run as part of upgrade
- Nodes are evicted
- Permission errors
- ASM fails to start
- Cluster fails to start
- ASM starts when started manually
- rootupgrade.sh rerun on the node where it didn't complete
- Cluster starts
- Upgrade verified to be successful
The details below may vary depending on the order in which the servers were upgraded and the number of nodes in the cluster.
++ Upgrade to 19c completed on nodes 1,3,5,7
++ When running rootupgrade.sh on node 2, all nodes except node 4 were evicted.
++ After that, the cluster would not start on any of the nodes.
++ Verified and found the cluster up and running only on node 4, but it was not responding to any crsctl commands.
++ Tried restarting node 1, but it still failed because of a connectivity failure with node 8.
++ Stopped the clusterware on all the nodes and started it one by one.
Stopped the cluster on all the nodes, but ocssd processes were still running as orphan processes.
Killed these processes manually on all the nodes using the kill -9 command (a command sketch follows the verification output below).
++ Followed the workaround in the document below to resolve this issue and rebooted the node once.
GI Upgrade From 12c To 18c Fails (Doc ID 2497014.1)
++ Now the cluster was up and running on all the nodes without any issues.
++ Restarted rootupgrade.sh on node 2.
rootupgrade.sh completed without any issues on nodes 2 and 4.
++ Upgrade completed successfully on all the nodes:
$ dcli -l oracle -g /home/oracle/dbs_group ' /u01/app/19.0.0.0/grid/bin/crsctl query crs activeversion -f '
Node1: Oracle Clusterware active version on the cluster is [19.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [2737317647].
Node2: Oracle Clusterware active version on the cluster is [19.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [2737317647].
...
Node8: Oracle Clusterware active version on the cluster is [19.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [2737317647].
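A minimal sketch of the stop-and-kill sequence described above, assuming the 19c grid home at /u01/app/19.0.0.0/grid as shown in the command output (verify the PIDs before killing anything):

# As root, force-stop the clusterware stack on each node
/u01/app/19.0.0.0/grid/bin/crsctl stop crs -f

# Check for ocssd processes that survived the stop
ps -ef | grep '[o]cssd'

# As root, kill any leftover ocssd processes by PID
kill -9 <pid>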
The same scenario has occurred on multiple clusters and for different customers.
Various Messages Seen:
=================
CRS-41053: checking Oracle Grid Infrastructure for file permission issues <-------- These privilege checking messages are a new feature of 19c and were not the cause of the issue.
> PRVH-0111 : Path "/u01/app/oracle/diag" with permissions "rwxr-x---" does not have read permissions for others on node "Node2".
> PRVH-0109 : Path "/u01/app/oracle/diag" with permissions "rwxr-x---" does not have write permissions for the file's group on node "Node2".
> PRVH-0113 : Path "/u01/app/oracle/diag" with permissions "rwxr-x---" does not have execute permissions for others on node "Node2".
"PRVF-5157 : Could not verify ASM group "DATA" for Voting Disk location" issue on Node 8 and Node 6 <-------- ASM was down and had to be started manually using "sqlplus sys / as sysasm"
CRS upgrade has completed.
Rolling upgrade of CRS is finished. Turning ON New PE
Node Node1 upgraded to version 19.0.0.0.0
CRS-8013: reboot advisory message text: oracssdmonitor is about to reboot this node due to no I/O completions with majority of voting disks.
osysmond shows:
CRFM:4011058944: Version mismatch. Expected: major=19 minor=3, Got: major=12 minor=0, j=0
Because ASM is not started, there will also be errors related to anything that requires a file on ASM, such as:
KFOD-00301: Unable to contact Cluster Synchronization Services (CSS). Return code 2 from kgxgncin.
KFOD-00105: Could not open pfile 'init@.ora'
PRVG-2071 : Disk group for ocr location "+DATA" is not available on "Node8"
PRVF-4557 : Node application "ora.Node8.vip" is offline on node "Node8"
PRVF-4557 : Node application "ora.net1.network" is offline on node "Node8"
PRVF-5155 : Failure to retrieve ASM Disk Groups on node "Node8"
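Once ASM is started, the disk group and resource state can be re-checked from the grid home. A minimal sketch, using the DATA disk group named in the messages above:

$ srvctl status asm
$ asmcmd lsdg
$ crsctl stat res ora.DATA.dg -t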
Messages seen in ocssd.trc prior to the reboots:
CSSD:2398549760: (:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 5 configured voting disks available, need 3 ocssd_73779.trc [ Node8 ]
CSSD:2398549760: (:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 57570 ms for voting file ...[ Node8]
CSSD:2398549760: (:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 57570 ms for voting file ...[ Node8]
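A minimal sketch for locating these voting-disk I/O messages across the traces, assuming a grid ORACLE_BASE of /u01/app/grid (the hostname portion of the path varies per node):

$ grep -E 'CSSNM000(18|58)' /u01/app/grid/diag/crs/*/crs/trace/ocssd*.trc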
Messages showing the voting files relocating in the ASM alert logs:
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Refresh completed on diskgroup DATA. Found 5 voting file(s).
NOTE: Voting file relocation is required in diskgroup DATA <------------------ This is the main contributing issue - voting file relocation during the upgrade when the servers are temporarily on different GI versions
NOTE: Attempting voting file relocation on diskgroup DATA
NOTE: Successful voting file relocation on diskgroup DATA
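After the upgrade, the voting file locations and states can be confirmed with crsctl; each voting file should be reported as ONLINE in the DATA disk group. A minimal sketch:

$ crsctl query css votedisk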
/var/log/messages on Node 1 shows 12 reboots of that node alone over the course of 3 hours during the upgrades.
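A minimal sketch for confirming the reboot count from the OS logs on a Linux node, assuming standard syslog (each kernel boot banner corresponds to one reboot):

$ last reboot | head
$ grep -c 'Linux version' /var/log/messages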