My Oracle Support Banner

12.2: Instance Termination When Second Instance Tries to Join (Doc ID 2432859.1)

Last updated on AUGUST 29, 2018

Applies to:

Oracle Database - Enterprise Edition - Version 12.2.0.1 to 12.2.0.1 [Release 12.2]
Information in this document applies to any platform.

Symptoms

On a 12.2.0.1 Real Application Cluster in Solaris environment, instance 2 currently running gets terminated when Instance 1 tries to join (started).

Running instance 2 terminated during reconfiguration while joining instance 1

alert_ORCL2.log (inside <oracle base>/diag/rdbms/orcl/ORCL2/trace):

2018-07-20T03:49:44.645319+03:00
Reconfiguration started (old inc 2, new inc 4)
List of instances (total 2) :
1 2
New instances (total 1) :
1
My inst 2
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
2018-07-20T03:49:45.141915+03:00
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2018-07-20T03:49:45.142721+03:00
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2018-07-20T03:49:51.664847+03:00
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
2018-07-20T03:55:04.202351+03:00
minact-scn: useg scan erroring out with error e:12751
2018-07-20T03:55:24.833171+03:00
LMON (ospid: 29905): terminating the instance due to error 481

 

Instance 1 startup shows reconfiguration started messages and terminated itself

alert_ORCL1.log (inside <oracle base>/diag/rdbms/orcl/ORCL1/trace):

2018-07-20T03:48:56.897647+03:00
Starting ORACLE instance (normal) (OS id: 13868)

.
2018-07-20T03:49:44.644278+03:00
Reconfiguration started (old inc 0, new inc 4)
List of instances (total 2) :
1 2
My inst 1 (I'm a new instance)

Global Resource Directory frozen
Communication channels reestablished
2018-07-20T03:49:44.966673+03:00
* domain 0 valid = 1 (flags x0, pdb flags x0) according to instance 2
2018-07-20T03:49:44.973578+03:00
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
2018-07-20T03:49:45.006139+03:00
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2018-07-20T03:49:45.006664+03:00
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
2018-07-20T03:49:45.102941+03:00
Set master node info
2018-07-20T03:49:51.673913+03:00
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
2018-07-20T03:54:01.090692+03:00
Instance reconfiguration could not complete in time.
Terminate itself as it is still unmounted.
Please check LMON trace file for more detail.
2018-07-20T03:54:01.094309+03:00
Starting background process LCK0
.
.
2018-07-20T03:55:25.145248+03:00
Dumping diagnostic data in directory=[cdmp_20180720035524], requested by (instance=2, osid=4294997201 (LMON)), summary=[abnormal instance termination].
2018-07-20T03:55:31.714649+03:00
Reconfiguration started (old inc 4, new inc 6)
List of instances (total 1) :
1
Dead instances (total 1) :
2
My inst 1

 

  

At a time only one instance can be up and running. As in when instance 2 is shutdown instance 1 can be started and vice versa. 

ORCL2_lmon_29905.trc shows Reconfiguration timeout:

2018-07-20 03:49:44.644 :
Reconfiguration started (old inc 2, new inc 4)
* instance join = TRUE (1 new instances)
List of instances (total 2) :
1 2
New instances (total 1) :
1
My inst 2
.
2018-07-20 03:49:52.351 : * Begin lmon rcfg step KJGA_RCFG_ENQGRANT
All grantable enqueues granted
2018-07-20 03:49:52.410 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY

*** 2018-07-20T03:51:14.009355+03:00
2018-07-20 03:51:14.008 : * kjfclmsync: waited 82 secs for rmv/lms0 to finish parallel work pwait 1 pstep 10 nreplays 0

*** 2018-07-20T03:51:25.009651+03:00
2018-07-20 03:51:25.009 : * kjfclmsync: waited 93 secs for rmv/lms0 to finish parallel work pwait 1 pstep 10 nreplays 0
.
*** 2018-07-20T03:55:19.008761+03:00
2018-07-20 03:55:19.008 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance
* step 21 (rcfg step 21, drm step 0), state xa, flags x2000
* lms 0: pwfm 1, pwfl 1, prstep 10, pdstep 0, nrpl 0
* wfm 0, wfl 0
* lms 1: pwfm 1, pwfl 1, prstep 10, pdstep 0, nrpl 0
* wfm 0, wfl 0
DUMP state for lms0 (ospid 29907)
Dumping process 23.29907 info:

*** 2018-07-20T03:55:19.009407+03:00
Process diagnostic dump for oracle@racnode2 (LMS0), OS id=29907,
pid: 23, proc_ser: 1, sid: 4349, sess_ser: 11546
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 03:55:18 ]
NOTE: scheduling delay has not been sampled for 0.434670 secs
0.000000 secs from [ 03:55:14 - 03:55:19 ], 5 sec avg
0.000000 secs from [ 03:54:19 - 03:55:19 ], 1 min avg
0.000000 secs from [ 03:50:19 - 03:55:19 ], 5 min avg

ORCL1_lmon_14281.trc file of First INstance shows below errors and it was terminated:

2018-07-20 03:49:52.410 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY

*** 2018-07-20T03:49:52.541563+03:00

* kjfclmsync: lmses complete parallel work in step 21.0.0
send ftd(21) to 2:
lms 0 replayed 0
lms 1 replayed 0
total out rply cnt 0
2018-07-20 03:49:52.542 : Submitted all GCS cache requests

*** 2018-07-20T03:54:01.072642+03:00
2018-07-20 03:54:01.072 : * kjfcrfg: waited 249 secs for LMD to receive all ftdones
2018-07-20 03:54:01.073 : kjfsprn: sync status inst 1 tmout 0 (sec)
2018-07-20 03:54:01.074 : kjfsprn: sync propose inc 0 level 0
2018-07-20 03:54:01.074 : kjfsprn: ftdone bitmap ver 4 (step 22.0/0.0)
.

*** 2018-07-20T03:55:31.658239+03:00
Name Service frozen
kjxgmcs: Setting state to 4 1.
kjxgrssvote: reconfig bitmap chksum 0x34b12 cnt 1 master 1 ret 0
kjxgrpropmsg: SSMEMI: inst 1, st 0x0003, cap 0x3 - disk vote required
kjxgrpropmsg: SSVOTE: Master indicates Disk Voting required
2018-07-20 03:55:31.659 : kjxgrDiskVote: nonblocking method is chosen
kjxggpoll: change db group poll time to 0 ms
2018-07-20 03:55:31.659 : kjxgrDiskVote: start the disk vote w/ seqno 5
2018-07-20 03:55:31.659 : kjxgrDiskVote: skip the disk vote as it is unmounted
kjxggpoll: change db group poll time to 50 ms
kjxgrpropstchg: SSAPPR-Disk Voting Flag reset
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmmeminfo: can not invalidate inst 2
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 6 2.
kjfmSendAbortInstMsg: send an abort message to instance 2

Netstat output (from oswatcher) for the first node shows Huge increase in udpInCksumErrs and udpInOverflows

racnode1_netstat_18.07.20.0300.dat (first node):

zzz ***Fri Jul 20 03:48:12 AST 2018


IPv4

tcpInErrs = 0 udpNoPorts = 1219
udpInCksumErrs = 0 udpInOverflows = 0 <<< when polled at Jul 20 03:48

zzz ***Fri Jul 20 03:55:12 AST 2018
IPv4

tcpInErrs = 0 udpNoPorts = 1511
udpInCksumErrs = 304 udpInOverflows = 69263 <<< when polled at Jul 20 03:48 

Changes

 

Cause

To view full details, sign in with your My Oracle Support account.

Don't have a My Oracle Support account? Click to get started!


In this Document
Symptoms
Changes
Cause
Solution


My Oracle Support provides customers with access to over a million knowledge articles and a vibrant support community of peers and Oracle experts.