Stale o2net Sockets Prevent Oracle VM Server From Rejoining the Cluster
(Doc ID 1600513.1)
Last updated on OCTOBER 13, 2021
Applies to:
Oracle VM - Version 3.2.1 to 3.2.9 [Release OVM32]
Oracle Cloud Infrastructure - Version N/A and later
Linux x86-64
Symptoms
Oracle VM Server has rebooted, but it cannot rejoin the poolfs cluster.
The following messages can be seen in ovs-agent logs:
[2013-09-17 16:04:36 5196] WARNING (startup:27) Error init pool filesystem: Command: ['mount', '/dev/mapper/ovspoolfs', '/poolfsmnt/0004fb00000500009c7839f48785f23f'] failed (1): stderr: mount.ocfs2: Invalid argument while mounting /dev/mapper/ovspoolfs on /poolfsmnt/0004fb00000500009c7839f48785f23f. Check 'dmesg' for more information on this error.
System logs on the server show the following messages:
...
Sep 17 16:04:11 server6 kernel: device-mapper: nfs: version 1.0.0 loaded
Sep 17 16:04:12 server6 o2cb.init: online b33a10c3ab918346
Sep 17 16:04:24 server6 kernel: o2hb: Heartbeat started on region 0004FB00000500009C7839F48785F23F (dm-2)
Sep 17 16:04:24 server6 kernel: o2cb: This node is not connected to nodes: 0.
...
These messages show that server6 has no connection to Node 0 (server5 in this example). The following messages can be seen on server5:
...
Sep 17 16:04:16 server5 kernel: o2net: Attempt to connect from node 'server6' at 192.168.1.21:58492 but it already has an open connection
Sep 17 16:04:16 server5 kernel: o2net: Connection to node server6 (num 4) at 192.168.1.21:7777 shutdown, state 7
Sep 17 16:04:16 server5 kernel: o2net: Accepted connection from server6 (num 4) at 192.168.1.21:7777
Sep 17 16:04:16 server5 kernel: o2net: No longer connected to node server6 (num 4) at 192.168.1.21:7777
...
The output of the "netstat -an" command on Node 0 (server5) shows that connections exist to all of the other nodes in the cluster, while the sockets for this node (192.168.1.21) are stuck in the CLOSE_WAIT state:
...
tcp 33 0 192.168.1.20:7777 192.168.1.21:56221 CLOSE_WAIT 0 0 - off (0.00/0/0)
tcp 33 0 192.168.1.20:7777 192.168.1.21:47576 CLOSE_WAIT 0 0 - off (0.00/0/0)
tcp 33 0 192.168.1.20:7777 192.168.1.21:58210 CLOSE_WAIT 0 0 - off (0.00/0/0)
tcp 33 0 192.168.1.20:7777 192.168.1.21:35170 CLOSE_WAIT 0 0 - off (0.00/0/0)
...
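As a rough aid for spotting this condition, the sketch below counts CLOSE_WAIT sockets on the o2net port per peer address. It assumes the netstat column layout shown in the excerpt above and takes port 7777 from the same excerpt; the function name `stale_o2net_peers` is hypothetical and is not part of any Oracle tool.

```python
def stale_o2net_peers(netstat_output, port=7777):
    """Count CLOSE_WAIT sockets on the local o2net port, grouped by peer IP.

    Assumes whitespace-separated columns in the order:
    proto, recv-q, send-q, local address, foreign address, state, ...
    """
    counts = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip elision markers, headers, and non-TCP lines.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state != "CLOSE_WAIT":
            continue
        # Keep only sockets whose local end is the o2net listener port.
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        peer_ip = foreign.rsplit(":", 1)[0]
        counts[peer_ip] = counts.get(peer_ip, 0) + 1
    return counts


# Sample input copied from the netstat excerpt in this note.
SAMPLE = """\
tcp 33 0 192.168.1.20:7777 192.168.1.21:56221 CLOSE_WAIT 0 0 - off (0.00/0/0)
tcp 33 0 192.168.1.20:7777 192.168.1.21:47576 CLOSE_WAIT 0 0 - off (0.00/0/0)
tcp 33 0 192.168.1.20:7777 192.168.1.21:58210 CLOSE_WAIT 0 0 - off (0.00/0/0)
tcp 33 0 192.168.1.20:7777 192.168.1.21:35170 CLOSE_WAIT 0 0 - off (0.00/0/0)
"""

print(stale_o2net_peers(SAMPLE))  # {'192.168.1.21': 4}
```

A peer that appears in this result while no established connection to it exists is a candidate for the stale-socket condition described here.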
Earlier, Node 0 (server5) had evicted Node 4 (server6):
...
Sep 17 14:08:57 server5 kernel: o2net: Connection to node server6 (num 4) at 192.168.1.21:7777 has been idle for 60.56 secs, shutting it down.
Sep 17 14:08:57 server5 kernel: o2net: No longer connected to node server6 (num 4) at 192.168.1.21:7777
Sep 17 14:09:57 server5 kernel: o2net: No connection established with node 4 after 60.0 seconds, giving up.
Sep 17 14:09:59 server5 kernel: o2cb: o2dlm has evicted node 4 from domain ovm
Sep 17 14:09:59 server5 kernel: o2cb: o2dlm has evicted node 4 from domain 0004FB00000500009C7839F48785F23F
...
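To confirm which nodes were evicted and from which domains, the eviction lines can be pulled out of the system log. The following is a minimal sketch that assumes the message wording shown in the excerpt above; the helper name `evictions` is hypothetical.

```python
import re

# Pattern based on the "o2dlm has evicted node" lines quoted in this note.
EVICTED = re.compile(r"o2cb: o2dlm has evicted node (\d+) from domain (\S+)")


def evictions(log_text):
    """Return {node_number: [domain, ...]} for every o2dlm eviction line."""
    result = {}
    for line in log_text.splitlines():
        match = EVICTED.search(line)
        if match:
            result.setdefault(int(match.group(1)), []).append(match.group(2))
    return result


# Sample input copied from the server5 log excerpt in this note.
LOG = """\
Sep 17 14:09:57 server5 kernel: o2net: No connection established with node 4 after 60.0 seconds, giving up.
Sep 17 14:09:59 server5 kernel: o2cb: o2dlm has evicted node 4 from domain ovm
Sep 17 14:09:59 server5 kernel: o2cb: o2dlm has evicted node 4 from domain 0004FB00000500009C7839F48785F23F
"""

print(evictions(LOG))
```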
After Node 4 boots up, the logs on Node 0 show that Node 4 still cannot rejoin the cluster:
...
Sep 17 16:04:16 server5 kernel: o2net: Attempt to connect from node 'server6' at 192.168.1.21:58492 but it already has an open connection
Sep 17 16:04:16 server5 kernel: o2net: Connection to node server6 (num 4) at 192.168.1.21:7777 shutdown, state 7
Sep 17 16:04:16 server5 kernel: o2net: Accepted connection from node server6 (num 4) at 192.168.1.21:7777
Sep 17 16:04:16 server5 kernel: o2net: No longer connected to node server6 (num 4) at 192.168.1.21:7777
...
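The rejected reconnect attempts can be extracted from the system log in the same way. This sketch assumes the o2net message text quoted above and IPv4 addresses; `rejected_reconnects` is a hypothetical helper name.

```python
import re

# Pattern based on the "already has an open connection" line quoted in this note.
REJECTED = re.compile(
    r"o2net: Attempt to connect from node '([^']+)' at "
    r"(\d+\.\d+\.\d+\.\d+):(\d+) but it already has an open connection"
)


def rejected_reconnects(log_text):
    """Return (node_name, ip, source_port) for each rejected reconnect attempt."""
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in REJECTED.finditer(log_text)
    ]


# Sample input copied from the server5 log excerpt in this note.
LOG = """\
Sep 17 16:04:16 server5 kernel: o2net: Attempt to connect from node 'server6' at 192.168.1.21:58492 but it already has an open connection
Sep 17 16:04:16 server5 kernel: o2net: Connection to node server6 (num 4) at 192.168.1.21:7777 shutdown, state 7
"""

print(rejected_reconnects(LOG))  # [('server6', '192.168.1.21', 58492)]
```

A node that shows up here and also has CLOSE_WAIT sockets on port 7777 in the netstat output matches the symptom pattern in this note.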
Cause