Solaris: Corrupt Packets on the Network causes CSS to REBOOT NODE (Doc ID 400778.1)

Last updated on JUNE 02, 2011

Applies to:

Oracle Server - Enterprise Edition - Version: 10.2.0.2 to 10.2.0.2 - Release: 10.2 to 10.2
Oracle Solaris on x86-64 (64-bit)
Sun Solaris x86-64 (64-bit)

Symptoms

Corrupt TCP packets affect the CSS heartbeat and voting disk, causing the node to reboot. Network tools such as 'ping -s' or 'traceroute -r -F <IP Address>', as used by OS Watcher, will not find corrupt packets; they only check the network link and its latencies. In this case we had a corrupt packet, and capturing the traffic with a tool such as snoop is more useful:

# snoop -o /var/tmp/snoop.out not port 1234

A capture like this helps the OS vendor considerably in verifying the TCP packets used by the heartbeat.
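
As a rough illustration of how such a capture could be inspected offline, the sketch below walks a snoop capture file using the record layout published in RFC 1761 (snoop version 2). The names `read_snoop`, `SnoopRecord`, and `suspicious` are hypothetical helpers introduced here, not part of any Oracle or Solaris tool, and the sketch does not verify TCP checksums; it only flags records that were truncated or empty in the capture.

```python
import struct
from typing import BinaryIO, Iterator, List, NamedTuple

SNOOP_MAGIC = b"snoop\x00\x00\x00"  # RFC 1761 identification pattern


class SnoopRecord(NamedTuple):
    orig_len: int   # packet length as seen on the wire
    incl_len: int   # number of bytes actually saved in the capture
    drops: int      # cumulative count of packets the capture dropped
    ts_sec: int     # timestamp, seconds
    ts_usec: int    # timestamp, microseconds
    data: bytes     # captured packet bytes (padding stripped)


def read_snoop(stream: BinaryIO) -> Iterator[SnoopRecord]:
    """Yield packet records from a snoop capture (RFC 1761, version 2)."""
    header = stream.read(16)
    if len(header) < 16 or header[:8] != SNOOP_MAGIC:
        raise ValueError("not a snoop capture file")
    version, _datalink = struct.unpack(">II", header[8:])
    if version != 2:
        raise ValueError(f"unsupported snoop version {version}")
    while True:
        rec_hdr = stream.read(24)
        if len(rec_hdr) < 24:
            return  # end of file, or a truncated trailing record
        orig_len, incl_len, rec_len, drops, ts_sec, ts_usec = struct.unpack(
            ">6I", rec_hdr
        )
        if rec_len < 24 + incl_len:
            raise ValueError("inconsistent record length in capture")
        body = stream.read(rec_len - 24)  # packet data followed by padding
        yield SnoopRecord(orig_len, incl_len, drops, ts_sec, ts_usec, body[:incl_len])


def suspicious(records) -> List[SnoopRecord]:
    """Return records that were truncated in the capture or carry no data."""
    return [r for r in records if r.incl_len < r.orig_len or r.incl_len == 0]
```

To try it against the file written by the snoop command above, open `/var/tmp/snoop.out` in binary mode and pass the file object to `read_snoop`. Decoding the Ethernet, IP, and TCP headers inside `data` is left out to keep the sketch short.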

Case 1:

Here a corrupt TCP packet affects CSS and its heartbeat, forcing node2 to be evicted and rebooted.

ocssd.log - node 1

[    CSSD]2006-11-20 22:07:21.821 [18] >TRACE:   clssnmWaitThread: thrd(2),
timeout(1000), wakeonpost(0)
[    CSSD]2006-11-20 22:07:21.910 [13] >TRACE:   clsc_receive: (ded0a0)
messgae of size 0
[   CSSD]2006-11-20 22:07:21.910 [13] >TRACE: clsc_receive: (ded0a0) error 2
 

[    CSSD]2006-11-20 22:07:21.910 [13] >WARNING: clssnmeventhndlr: Receive
failure with node 2 (node2), rc=2
[    CSSD]2006-11-20 22:07:21.910 [13] >TRACE:   clssnmDiscHelper: node
lsymxdbi25 (2) connection failed
[    CSSD]2006-11-20 22:07:21.910 [13] >TRACE:   clssnmDiscHelper: clean up
existing connection.

ocssd.log - node 2

Entering select blocking
[    CSSD]2006-11-20 22:07:27.199 [9] >TRACE:   clssnmvDiskKillCheck:
Checking for kill
[    CSSD]2006-11-20 22:07:27.200 [9] >ERROR:   clssnmvDiskKillCheck: Evicted
by node 1, sync 21, stamp 557798570,

The packet was sent cleanly from node2, but it was corrupted in transit before it reached node1, so node1 received a zero-length message caused by the network. When this problem occurs on a connection, CSS disconnects it and forces the other node to be evicted, using the kill block on the voting disk to communicate the eviction and reboot node2.
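
The signature described above (a receive failure logged on one node, followed by an "Evicted by node" entry on the other) can be searched for mechanically. The following is a minimal sketch, assuming the log text looks like the excerpts quoted in this note, where one entry may wrap over two lines; `join_entries` and `find_events` are hypothetical helper names, and the patterns are derived only from the lines shown here.

```python
import re
from typing import Iterable, List, Tuple

TIMESTAMP = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
ENTRY_START = re.compile(r"^\[\s*CSSD\]")
RECV_FAILURE = re.compile(
    r"\[\s*CSSD\]" + TIMESTAMP + r".*"
    r"clssnmeventhndlr: Receive\s+failure with node (?P<node>\d+)"
)
EVICTED = re.compile(
    r"\[\s*CSSD\]" + TIMESTAMP + r".*"
    r"clssnmvDiskKillCheck:\s+Evicted\s+by node (?P<node>\d+)"
)


def join_entries(lines: Iterable[str]) -> List[str]:
    """Glue wrapped continuation lines back onto the entry they belong to."""
    entries: List[str] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if ENTRY_START.match(text) or not entries:
            entries.append(text)         # a new "[ CSSD]..." entry starts here
        else:
            entries[-1] += " " + text    # continuation of the previous entry
    return entries


def find_events(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Return (timestamp, kind, node number) for each matching entry."""
    events = []
    for entry in join_entries(lines):
        for kind, pattern in (("receive_failure", RECV_FAILURE),
                              ("evicted", EVICTED)):
            match = pattern.search(entry)
            if match:
                events.append((match.group("ts"), kind, int(match.group("node"))))
    return events
```

Running `find_events` over the ocssd.log of each node and comparing timestamps shows whether a receive failure on one node precedes the eviction recorded on the other, as in the excerpts above.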

Refer to Solution 1

Case 2:

Here everything appears normal for an eviction caused by loss of the network
connection. Both nodes report missed checkins, and node 2 gets evicted.

ocssd.log - lsymxdbi24 (node 1)

[    CSSD]2006-11-11 20:33:40.290 [14] >TRACE:   clssgmClientConnectMsg: 
Connect from con(e1d040) proc(e22ea0) pid() proto(10:2:1:1)
[    CSSD]2006-11-11 20:33:59.575 [18] >TRACE:   clssnmPollingThread: node 
lsymxdbi24 (1) missed(2) checkin(s)
[    CSSD]2006-11-11 20:34:13.615 [18] >TRACE:   clssnmPollingThread: node 
lsymxdbi25 (2) missed(2) checkin(s)
......
...
.
[    CSSD]2006-11-11 20:34:41.846 [18] >TRACE:   clssnmPollingThread: node 
lsymxdbi25 (2) is impending reconfig
[    CSSD]2006-11-11 20:34:41.846 [18] >TRACE:   clssnmPollingThread: 
Eviction started for node node2 (2), flags 0x000d, state 3, wt4c 0
[    CSSD]2006-11-11 20:34:41.846 [20] >TRACE:   clssnmDoSyncUpdate: 
Initiating sync 3
 .
ocssd.log - lsymxdbi25 (node 2)

[    CSSD]2006-11-11 20:33:29.064 [14] >TRACE:   clssgmClientConnectMsg: 
Connect from con(e40620) proc(e1eb60) pid() proto(10:2:1:1)
 [    CSSD]2006-11-11 20:33:40.238 [18] >TRACE:   clssnmPollingThread: node 
node1 (2) missed(2) checkin(s)
 [    CSSD]2006-11-11 20:34:13.538 [18] >TRACE:   clssnmPollingThread: node 
lsymxdbi24 (1) missed(2) checkin(s)
 [    CSSD]2006-11-11 20:34:14.548 [18] >TRACE:   clssnmPollingThread: node 
node1 (1) missed(3) checkin(s)
 ....
 ...
 ..
 [    CSSD]2006-11-11 20:34:41.817 [18] >TRACE:   clssnmPollingThread: 
Eviction started for node node2 (1), flags 0x000f, state 3, wt4c 0
 [    CSSD]2006-11-11 20:34:41.817 [20] >TRACE:   clssnmDoSyncUpdate: 
Initiating sync 3
 [    CSSD]2006-11-11 20:34:41.817 [20] >TRACE:   clssnmDoSyncUpdate: 
diskTimeout set to (27000)ms
 ....
 ..
 .
[    CSSD]2006-11-11 20:34:45.773 [8] >TRACE:   clssnmReadDskHeartbeat: 
node(1) is down. rcfg(3) wrtcnt(101555) LATS(109213095) Disk 
lastSeqNo(101555)
[    CSSD]2006-11-11 20:34:45.857 [20] >ERROR:   clssnmCheckDskInfo: 
Terminating local instance to avoid splitbrain.
[    CSSD]2006-11-11 20:34:45.857 [20] >ERROR:                 : Node(2), 
Leader(2), Size(1) VS Node(1), Leader(1), Size(1)
[    CSSD]2006-11-11 20:34:45.857 [20] >TRACE:   clssscctx:  dump of 
0x520740, len 3808
[ CSSD]2006-11-11 20:34:45.857 [20] >TRACE: 0x520740 b0 93 84 00 00 00 
00 00 - 40 13 73 00 00 00 00 00 ........@.s.....
[    CSSD]2006-11-11 20:34:45.857 [20] >TRACE:   0x520750 00 00 00 00 00 00 
00 00 - 70 05 52 00 00 00 00 00 ........p.R.....

Refer to Solution 2
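
To see how the missed checkins build up before the eviction in Case 2, the polling-thread entries can be tallied per node. This is a minimal sketch, assuming entries shaped like the clssnmPollingThread lines quoted above; `missed_checkins` is a hypothetical helper name. Results are keyed by node number rather than node name, because the names in the excerpts are not consistent.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

MISSED = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\] >TRACE: "
    r"clssnmPollingThread: node (?P<name>\S+) \((?P<num>\d+)\) "
    r"missed\((?P<count>\d+)\) checkin\(s\)"
)


def missed_checkins(log_text: str) -> Dict[int, List[Tuple[str, int]]]:
    """Map node number to a list of (timestamp, missed checkin count)."""
    # Collapse all whitespace so that entries wrapped over two lines
    # become a single searchable run of text.
    flat = " ".join(log_text.split())
    history: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for match in MISSED.finditer(flat):
        history[int(match.group("num"))].append(
            (match.group("ts"), int(match.group("count")))
        )
    return dict(history)
```

Applied to both ocssd.log files, the per-node lists show when each node first reported the other as missing checkins and how quickly the count grew before "Eviction started" was logged.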

Changes

Using e1000 (Driver Version 5.0.9)
Solaris 10 (118855-19) on X86-64

Cause
