HA scaling issue with large VM clusters on shared port HCA

(Doc ID 2369186.1)

Last updated on MARCH 14, 2018

Applies to:

Linux OS - Version Oracle Linux 6.9 with Unbreakable Enterprise Kernel [4.1.12] and later
Linux x86-64

Symptoms

During the implementation of Multi Rack Cabling, one of the leaf switches was shut down.

The Grid clusterware failed to start and reported the following message:
CRS-8503: Oracle Clusterware GIPCD process with operating system process ID 11249 experienced fatal signal or exception code 11

While the issue was occurring, the load on the Oracle VM server rose sharply:

top - 19:34:02 up 21 days, 49 min, 0 users, load average: 15.43, 7.72, 5.96
top - 19:34:07 up 21 days, 49 min, 0 users, load average: 16.60, 8.09, 6.09
top - 19:34:12 up 21 days, 49 min, 0 users, load average: 18.95, 8.72, 6.31
top - 19:35:03 up 21 days, 50 min, 0 users, load average: 91.12, 28.55, 13.09  --->>>
top - 19:35:08 up 21 days, 50 min, 0 users, load average: 96.16, 30.63, 13.85
top - 19:35:13 up 21 days, 50 min, 0 users, load average: 100.79, 32.68, 14.60
top - 19:35:18 up 21 days, 50 min, 0 users, load average: 105.29, 34.74, 15.37
top - 19:35:24 up 21 days, 50 min, 0 users, load average: 108.95, 36.67, 16.10
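
To locate the point where the load jumps in a longer capture of this kind, the `top` header lines can be scanned programmatically. The following is a minimal sketch, not an Oracle-provided tool: the function names `parse_load` and `find_spikes` and the threshold `factor=2.0` are assumptions chosen for illustration, and the sample data is copied from the output above.

```python
import re

# Matches the header line printed by top, capturing the timestamp and
# the 1-, 5- and 15-minute load averages.
LOAD_RE = re.compile(
    r"top - (\d{2}:\d{2}:\d{2}) .*load average: ([\d.]+), ([\d.]+), ([\d.]+)"
)

# Sample header lines taken from the output shown above.
SAMPLE = """
top - 19:34:02 up 21 days, 49 min, 0 users, load average: 15.43, 7.72, 5.96
top - 19:34:07 up 21 days, 49 min, 0 users, load average: 16.60, 8.09, 6.09
top - 19:34:12 up 21 days, 49 min, 0 users, load average: 18.95, 8.72, 6.31
top - 19:35:03 up 21 days, 50 min, 0 users, load average: 91.12, 28.55, 13.09  --->>>
top - 19:35:08 up 21 days, 50 min, 0 users, load average: 96.16, 30.63, 13.85
top - 19:35:13 up 21 days, 50 min, 0 users, load average: 100.79, 32.68, 14.60
top - 19:35:18 up 21 days, 50 min, 0 users, load average: 105.29, 34.74, 15.37
top - 19:35:24 up 21 days, 50 min, 0 users, load average: 108.95, 36.67, 16.10
""".strip()


def parse_load(lines):
    """Return (timestamp, 1-minute load average) for each top header line."""
    samples = []
    for line in lines:
        match = LOAD_RE.search(line)
        if match:
            samples.append((match.group(1), float(match.group(2))))
    return samples


def find_spikes(samples, factor=2.0):
    """Return (timestamp, previous, current) where the 1-minute load
    average grew by at least `factor` between consecutive samples.
    The default factor is an arbitrary illustrative threshold."""
    spikes = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if prev > 0 and cur / prev >= factor:
            spikes.append((ts, prev, cur))
    return spikes


if __name__ == "__main__":
    for ts, prev, cur in find_spikes(parse_load(SAMPLE.splitlines())):
        print(f"{ts}: 1-minute load average rose from {prev} to {cur}")
```

With the samples above, the only consecutive pair that exceeds the threshold is the step from 18.95 to 91.12, which is the line marked with the arrow.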

The Oracle VM servers rebooted automatically and did not come back online properly after the switch was powered down. The following messages were logged in /var/log/messages:

Jan 18 19:33:17 xxx kernel: [1816851.096548] mlx4_core 0000:00:04.0: mlx4_ib: Port 2 logical link is down
Jan 18 19:33:17 xxx kernel: [1816851.096581] RDS/IB: PORT-EVENT: ERROR, PORT: mlx4_0/port_2/ib1 : port state transition to DOWN (portlayers 0x6)
Jan 18 19:33:17 xxx kernel: [1816851.105447] RDS/IB: NET-EVENT: NETDEV-CHANGE, PORT mlx4_0/port_2/ib1 : port state transition NONE - port retained in state DOWN (portlayers 0x4)
Jan 18 19:33:17 xxx kernel: [1816851.110115] RDS/IP: IP xxx.xx.x.x migrated from ib1 to ib0:P02
Jan 18 19:33:17 xxx kernel: [1816851.110152] RDS/IB: connection <xxx.xx.x.x,yyy.yy.yy.yy,0> dropped due to 'ADDR_CHANGE event'
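
When checking other servers for the same sequence, the relevant kernel messages can be filtered out of /var/log/messages with a small script. This is a rough sketch under stated assumptions: the patterns are derived only from the five lines shown above, so other kernel or driver versions may word these messages differently, and the event labels (`link_down`, `port_down`, `ip_migrated`, `conn_dropped`) and the function `classify` are hypothetical names introduced here.

```python
import re

# Patterns derived solely from the log excerpt above; the labels on the
# left are illustrative names, not identifiers used by the kernel.
PATTERNS = [
    ("link_down", re.compile(r"mlx4_ib: Port (\d+) logical link is down")),
    ("port_down", re.compile(
        r"RDS/IB: PORT-EVENT: .*PORT: (\S+) : port state transition to DOWN")),
    ("ip_migrated", re.compile(r"RDS/IP: IP (\S+) migrated from (\S+) to (\S+)")),
    ("conn_dropped", re.compile(
        r"RDS/IB: connection <([^>]*)> dropped due to '([^']*)'")),
]

# Sample lines copied from the excerpt above.
SAMPLE = """
Jan 18 19:33:17 xxx kernel: [1816851.096548] mlx4_core 0000:00:04.0: mlx4_ib: Port 2 logical link is down
Jan 18 19:33:17 xxx kernel: [1816851.096581] RDS/IB: PORT-EVENT: ERROR, PORT: mlx4_0/port_2/ib1 : port state transition to DOWN (portlayers 0x6)
Jan 18 19:33:17 xxx kernel: [1816851.105447] RDS/IB: NET-EVENT: NETDEV-CHANGE, PORT mlx4_0/port_2/ib1 : port state transition NONE - port retained in state DOWN (portlayers 0x4)
Jan 18 19:33:17 xxx kernel: [1816851.110115] RDS/IP: IP xxx.xx.x.x migrated from ib1 to ib0:P02
Jan 18 19:33:17 xxx kernel: [1816851.110152] RDS/IB: connection <xxx.xx.x.x,yyy.yy.yy.yy,0> dropped due to 'ADDR_CHANGE event'
""".strip()


def classify(line):
    """Return (label, captured_fields) for the first matching pattern,
    or None if the line matches none of them."""
    for label, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return label, match.groups()
    return None


if __name__ == "__main__":
    for line in SAMPLE.splitlines():
        event = classify(line)
        if event is not None:
            label, fields = event
            print(label, fields)
```

On a live system, the lines would be read from /var/log/messages instead of `SAMPLE`. The NET-EVENT line is deliberately left unclassified, since it only reports that the port stayed in the DOWN state.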

Changes

 

Cause
