Stale o2net Sockets Prevent Oracle VM Server From Rejoining the Cluster (Doc ID 1600513.1)

Last updated on APRIL 17, 2017

Applies to:

Oracle VM - Version 3.2.1 to 3.2.9 [Release OVM32]
Linux x86-64

Symptoms

Oracle VM Server has rebooted, but it cannot rejoin the poolfs cluster.

The following messages can be seen in ovs-agent logs:

[2013-09-17 16:04:36 5196] WARNING (startup:27) Error init pool filesystem: Command: ['mount', '/dev/mapper/ovspoolfs', '/poolfsmnt/0004fb00000500009c7839f48785f23f'] failed (1): stderr: mount.ocfs2: Invalid argument while mounting /dev/mapper/ovspoolfs on /poolfsmnt/0004fb00000500009c7839f48785f23f. Check 'dmesg' for more information on this error.

System logs on the server show the following messages:

...
Sep 17 16:04:11 server6 kernel: device-mapper: nfs: version 1.0.0 loaded
Sep 17 16:04:12 server6 o2cb.init: online b33a10c3ab918346
Sep 17 16:04:24 server6 kernel: o2hb: Heartbeat started on region 0004FB00000500009C7839F48785F23F (dm-2)
Sep 17 16:04:24 server6 kernel: o2cb: This node is not connected to nodes: 0.
...


The messages show that the rebooted server (server6) is not connected to Node 0 (which is server5 in this example). The following messages can be seen on server5:

...
Sep 17 16:04:16 server5 kernel: o2net: Attempt to connect from node 'server6' at 192.168.1.21:58492 but it already has an open connection
Sep 17 16:04:16 server5 kernel: o2net: Connection to node server6 (num 4) at 192.168.1.21:7777 shutdown, state 7
Sep 17 16:04:16 server5 kernel: o2net: Accepted connection from server6 (num 4) at 192.168.1.21:7777
Sep 17 16:04:16 server5 kernel: o2net: No longer connected to node server6 (num 4) at 192.168.1.21:7777
... 
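
The rejected reconnect attempts shown above can be picked out of a system log automatically. The following is a minimal sketch, assuming the log lines have the same wording as the excerpt above; the function name `rejected_reconnects` and the regular expression are illustrative and are not part of any Oracle tool.

```python
import re

# Matches the o2net message logged when a node tries to reconnect while
# the peer still holds an old connection for it. The wording is taken
# from the excerpt in this note; other kernel versions may differ.
REJECTED = re.compile(
    r"o2net: Attempt to connect from node '(?P<node>[^']+)' "
    r"at (?P<addr>\S+) but it already has an open connection"
)


def rejected_reconnects(log_text):
    """Return (node, address) pairs for every rejected reconnect attempt."""
    return [(m.group("node"), m.group("addr")) for m in REJECTED.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "Sep 17 16:04:16 server5 kernel: o2net: Attempt to connect from node "
        "'server6' at 192.168.1.21:58492 but it already has an open connection\n"
    )
    for node, addr in rejected_reconnects(sample):
        print(node, addr)
```

On a live system the text would typically come from the system log file; the path is not given in this note, so it is left to the reader.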

The output of the "netstat -an" command on Node 0 (server5) shows that connections exist to all of the other nodes in the cluster except this one, whose sockets are stuck in the CLOSE_WAIT state.

...
tcp       33      0 17.171.176.20:7777 17.171.176.21:56221         CLOSE_WAIT  0          0 -                   off (0.00/0/0) 
tcp       33      0 17.171.176.20:7777 17.171.176.21:47576         CLOSE_WAIT  0          0 -                   off (0.00/0/0) 
tcp       33      0 17.171.176.20:7777 17.171.176.21:58210         CLOSE_WAIT  0          0 -                   off (0.00/0/0) 
tcp       33      0 17.171.176.20:7777 17.171.176.21:35170         CLOSE_WAIT  0          0 -                   off (0.00/0/0)
... 
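
The check for stale sockets can be scripted. The following is a minimal sketch, assuming the netstat column layout shown above (protocol, Recv-Q, Send-Q, local address, foreign address, state) and the o2net port 7777 seen in the logs; the function name `stale_o2net_sockets` is hypothetical.

```python
def stale_o2net_sockets(netstat_text, port=7777):
    """Return (local, remote) address pairs for TCP sockets that are in
    CLOSE_WAIT and whose local port is the o2net port."""
    stale = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, separators, and non-TCP lines.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        # The port is whatever follows the last colon of the local address.
        if state == "CLOSE_WAIT" and local.rsplit(":", 1)[-1] == str(port):
            stale.append((local, remote))
    return stale


if __name__ == "__main__":
    sample = (
        "tcp       33      0 17.171.176.20:7777 17.171.176.21:56221"
        "         CLOSE_WAIT  0          0 -                   off (0.00/0/0)\n"
    )
    for local, remote in stale_o2net_sockets(sample):
        print(local, "<-", remote)
```

Any entry returned for the rebooted node's address suggests that the peer still holds an old connection, which matches the "already has an open connection" message in the logs.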

Earlier, Node 0 (server5) had evicted Node 4 (server6):

...
Sep 17 14:08:57 server5 kernel: o2net: Connection to node server6 (num 4) at 192.168.1.21:7777 has been idle for 60.56 secs, shutting it down.
Sep 17 14:08:57 server5 kernel: o2net: No longer connected to node server6 (num 4) at 192.168.1.21:7777
Sep 17 14:09:57 server5 kernel: o2net: No connection established with node 4 after 60.0 seconds, giving up.
Sep 17 14:09:59 server5 kernel: o2cb: o2dlm has evicted node 4 from domain ovm
Sep 17 14:09:59 server5 kernel: o2cb: o2dlm has evicted node 4 from domain 0004FB00000500009C7839F48785F23F
... 

After Node 4 boots up, the logs on Node 0 show that Node 4 is still unable to rejoin the cluster:

...
Sep 17 16:04:16 server5 kernel: o2net: Attempt to connect from node 'server6' at 192.168.1.21:58492 but it already has an open connection
Sep 17 16:04:16 server5 kernel: o2net: Connection to node server6 (num 4) at 192.168.1.21:7777 shutdown, state 7
Sep 17 16:04:16 server5 kernel: o2net: Accepted connection from node server6 (num 4) at 192.168.1.21:7777
Sep 17 16:04:16 server5 kernel: o2net: No longer connected to node server6 (num 4) at 192.168.1.21:7777
... 

 

Changes

 

Cause
