11.2.0.1 Grid Infrastructure Installation Failed at Second Node While Running root.sh Due To ASM Crash Caused by lmon Timeout (Doc ID 1239123.1)

Last updated on OCTOBER 12, 2016

Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
***Checked for relevance on 29-May-2013***

Symptoms

While installing Oracle Grid Infrastructure 11.2.0.1, root.sh runs successfully on the first node but fails on the second node, reporting that the OCR location in an ASM disk group is inaccessible.

alert_nodename.log:

2010-08-26 19:16:15.416
[cssd(17484)]CRS-1605:CSSD voting file is online: /db/app/oracle/ocr_vote_n01; details in /db/app/crs/11.2_Grid_Home/log/rmodbd03/cssd/ocssd.log.
2010-08-26 19:16:17.432
[cssd(17484)]CRS-1601:CSSD Reconfiguration complete. Active nodes are d02 d03 .
2010-08-26 19:16:19.057
[ctssd(17512)]CRS-2403:The Cluster Time Synchronization Service on host d03 is in observer mode.
2010-08-26 19:16:19.063
[ctssd(17512)]CRS-2407:The new Cluster Time Synchronization Service reference node is host d02.
2010-08-26 19:16:19.961
[ctssd(17512)]CRS-2401:The Cluster Time Synchronization Service started on host d03.
2010-08-26 19:21:22.696
[ohasd(15890)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.asm'. Details at (:CRSPE00111:) in /db/app/crs/11.2_Grid_Home/log/rmodbd03/ohasd/ohasd.log.
2010-08-26 19:21:24.798
[crsd(19090)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /db/app/crs/11.2_Grid_Home/log/rmodbd03/crsd/crsd.log.
2010-08-26 19:21:25.427
[ohasd(15890)]CRS-2765:Resource 'ora.crsd' has failed on server 'd03'.
2010-08-26 19:21:26.523
[crsd(19119)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /db/app/crs/11.2_Grid_Home/log/rmodbd03/crsd/crsd.log.



alert_+ASM2.log shows:

Thu Aug 26 19:16:25 2010
Reconfiguration started (old inc 0, new inc 4)
ASM instance
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
Thu Aug 26 19:21:57 2010
IPC Send timeout detected. Sender: ospid 17593 [oracle@rmodbd03 (PING)]
Receiver: inst 1 binc 63701371 ospid 7549
Thu Aug 26 19:22:16 2010
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
LMS0 (ospid: 17603): terminating the instance due to error 481



The lmon trace shows:

SKGXP:[fffffd7ffcbecd28.6]:[ctx]: (ms) prev wait(ms) before
SKGXP:[fffffd7ffcbecd28.7]:[ctx]: --------- -------------- ----------- --------- -----------
SKGXP:[fffffd7ffcbecd28.8]:[ctx]: 88 0 0 NORMAL TIMEDOUT
SKGXP:[fffffd7ffcbecd28.9]:[ctx]: 80 0 0 NORMAL TIMEDOUT
SKGXP:[fffffd7ffcbecd28.10]:[ctx]: 88 0 0 NORMAL TIMEDOUT

SKGXP:[fffffd7ffcbecd28.35]:[ctx]: admno 0x3911544a admport:
SKGXP:[fffffd7ffcbecd28.36]:[ctx]: SSKGXPT 0xfcbee024 flags SSKGXPT_LOCAL_PORT sockno 10 IP 192.168.1.78 UDP 40467

SKGXP:[fffffd7ffcbecd28.70]:[ctx]: flags=8 nreqs=1100 free_rbufs=1100 msgsz=8192 min_frag_sz_ach=8192
SKGXP:[fffffd7ffcbecd28.71]:[ctx]: OS Level Port
SKGXP:[fffffd7ffcbecd28.72]:[ctx]: SSKGXPT 0xfca36a80 flags SSKGXPT_LOCAL_PORT sockno 25 IP 192.168.1.178 UDP 40469
SKGXP:[fffffd7ffcbecd28.73]:[ctx]: OS Level Port ID
SKGXP:[fffffd7ffcbecd28.74]:[ctx]: SKGXPGPID Internet address 192.168.1.78 UDP port number 40469
SKGXP:[fffffd7ffcbecd28.317]:[obj]: SSKGXPT 0xfca2352c flags SSKGXPT_WRITE sockno 10 IP 192.168.1.162 UDP 63320
SKGXP:[fffffd7ffcbecd28.318]:[obj]: Remote data port
SKGXP:[fffffd7ffcbecd28.319]:[obj]: SSKGXPT 0xfca23598 flags SSKGXPT_WRITE sockno 10 IP 192.168.1.162 UDP 63322
SKGXP:[fffffd7ffcbecd28.320]:[obj]: next seqno 32770 last ack 32765 credits 3 total credits 8 ertt 16 resends on con 116390

SKGXP:[fffffd7ffcbecd28.70]:[ctx]: flags=8 nreqs=1100 free_rbufs=1100 msgsz=8192 min_frag_sz_ach=8192
ICMP Time exceeded during reassembly from bd02 (192.168.1.78)
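
The signals in the trace excerpt above can be pulled out mechanically. The following is a minimal Python sketch, not an Oracle tool: the function name summarize_skgxp_trace and the sample string are hypothetical, and the patterns assume the trace lines look exactly like the ones quoted in this note.

```python
import re

def summarize_skgxp_trace(text):
    """Collect message sizes, TIMEDOUT rows and reassembly errors from trace text.

    Hypothetical helper: the patterns are based only on the excerpt in this note.
    """
    lines = text.splitlines()
    # Distinct msgsz= values reported by the SKGXP context dumps.
    msg_sizes = sorted({int(m) for m in re.findall(r"\bmsgsz=(\d+)", text)})
    # Rows of the wait table whose last column is TIMEDOUT.
    timed_out = sum(1 for line in lines if line.rstrip().endswith("TIMEDOUT"))
    # ICMP messages that report fragments which were never reassembled.
    reassembly = [line.strip() for line in lines
                  if "Time exceeded during reassembly" in line]
    return {"msg_sizes": msg_sizes,
            "timed_out": timed_out,
            "reassembly_errors": reassembly}

# Lines copied from the lmon trace excerpt in this note.
sample = """\
SKGXP:[fffffd7ffcbecd28.8]:[ctx]: 88 0 0 NORMAL TIMEDOUT
SKGXP:[fffffd7ffcbecd28.9]:[ctx]: 80 0 0 NORMAL TIMEDOUT
SKGXP:[fffffd7ffcbecd28.10]:[ctx]: 88 0 0 NORMAL TIMEDOUT
SKGXP:[fffffd7ffcbecd28.70]:[ctx]: flags=8 nreqs=1100 free_rbufs=1100 msgsz=8192 min_frag_sz_ach=8192
ICMP Time exceeded during reassembly from bd02 (192.168.1.78)
"""

summary = summarize_skgxp_trace(sample)
print(summary)
```

A result that combines an 8192-byte message size, repeated TIMEDOUT rows and a reassembly error is the pattern this note associates with large packets failing to cross the network.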


The message size is 8k (msgsz=8192), and the timeouts match the ICMP reassembly error at the end of the trace.

The failure occurs because 8k packets cannot get through the network. A typical cause is an MTU mismatch: the MTU on the NIC is set for jumbo frames, but the MTU on the switch is not.
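
To see why the MTU matters for an 8k message, the arithmetic can be sketched in Python. This is an illustration under stated assumptions, not part of the original note: it assumes IPv4 without options (20-byte header), UDP (8-byte header), and that msgsz=8192 is the UDP payload size. The function name ipv4_fragments is hypothetical.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes

def ipv4_fragments(udp_payload, mtu):
    """Return how many IPv4 packets are needed to carry one UDP datagram."""
    ip_payload = udp_payload + UDP_HEADER
    if ip_payload + IPV4_HEADER <= mtu:
        return 1  # fits in a single packet, no fragmentation
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)

jumbo = ipv4_fragments(8192, 9000)     # NIC configured for jumbo frames
standard = ipv4_fragments(8192, 1500)  # standard Ethernet MTU
print(jumbo, standard)                 # prints "1 6"
```

Under these assumptions, a host with a 9000-byte MTU sends the 8k message as one 8220-byte packet, which only arrives if every device on the path accepts frames that large. With a 1500-byte MTU the same message is split into six fragments, and the receiver can reassemble it only if all six arrive.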

Note that this issue also occurred in versions prior to 11gR2, where it showed as a CRS hang on the second node. Because 11gR2 Grid Infrastructure includes ASM, the symptom now shows as an ASM crash due to the lmon timeout.

Cause
