Exadata X9M-2 OEDA step 6 fails due to cellsrv repeatedly crashing
(Doc ID 2800487.1)
Last updated on FEBRUARY 14, 2022
Applies to:
Oracle Exadata Storage Server Software - Version 21.2.1.0.0 and later
Information in this document applies to any platform.
Symptoms
During OEDA deployment step 6 (Create Cell Disks), the following errors are encountered on the storage cell.
The storage cell software (CELLSRV) keeps crashing:
[RS] Service CELLSRV with pid 33238 not responding
Errors in file /opt/oracle/cell/log/diag/asm/cell/<hostname>-adm/trace/rstrc_60273_omt.trc (incident=1):
RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []
Incident details in: /opt/oracle/cell/log/diag/asm/cell/<hostname>-adm/incident/incdir_1/rstrc_60273_omt_i1.tr
56 2021-08-05T14:23:40-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
57 2021-08-05T14:24:45-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
58 2021-08-05T14:25:50-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
59 2021-08-05T14:27:00-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
60 2021-08-05T14:28:01-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
61 2021-08-05T14:29:07-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will not be restarted] [] [] [] [] [] [] [] [] [] []"
62 2021-08-05T14:29:07-04:00 critical "RS-7445 [CELLSRV service shutdown] [Detected a flood of restarts] [] [] [] [] [] [] [] [] [] []"
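The alert history entries above can be summarized to see how many hangs RS detected and over what period before it stopped restarting CELLSRV. The following is a minimal sketch in Python, assuming the entries have been saved as text in the format shown above. The function name `summarize_rs_alerts` and the regular expression are illustrative assumptions and are not part of any Oracle tool.

```python
import re
from datetime import datetime

# Matches alert history lines of the form shown above, for example:
# 56 2021-08-05T14:23:40-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] ..."
ALERT_RE = re.compile(
    r'^\s*(\d+)\s+(\S+)\s+(\w+)\s+"RS-7445 \[([^\]]*)\] \[([^\]]*)\]'
)


def summarize_rs_alerts(lines):
    """Count CELLSRV hang alerts and report whether RS gave up restarting."""
    hangs = []
    flood_shutdown = False
    for line in lines:
        match = ALERT_RE.match(line)
        if not match:
            continue  # not an RS-7445 alert history entry
        _entry_id, stamp, _severity, event, detail = match.groups()
        if event == "Serv CELLSRV hang detected":
            hangs.append(datetime.fromisoformat(stamp))
        elif detail == "Detected a flood of restarts":
            flood_shutdown = True
    # Seconds between the first and the last hang alert (0.0 if fewer than two).
    span = (hangs[-1] - hangs[0]).total_seconds() if len(hangs) > 1 else 0.0
    return {
        "hangs": len(hangs),
        "span_seconds": span,
        "flood_shutdown": flood_shutdown,
    }
```

Passing the seven entries listed above to `summarize_rs_alerts` yields the hang count, the elapsed time between the first and last hang, and a flag indicating that the "flood of restarts" shutdown was recorded.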
The CellSrv Trace has the following messages:
---
2021-08-06 15:59:48.898 :00000048: host: *.*.*.21;*.*.*.22
2021-08-06 15:59:48.899 :0000004A: mon_proc_pid oldpid: 0
2021-08-06 15:59:48.906 :0000004B: mon_proc_pid newpid: 0
2021-08-06 15:59:48.906 :0000004C: mon_proc_pid2: 0
2021-08-06 15:59:48.906 :0000004D: return val for ossrsos_prep_monproc -63
2021-08-06 15:59:48.906 :0000004E: Missed a heartbeat for process CELLSRV or leaking memory, error: -75
2021-08-06 15:59:48.906 :0000004F: Service CELLSRV was not alive, try starting
2021-08-06 15:59:48.922 :00000050: ossrsos_get_all_cellsrv_processes: returned 0 PIDs
2021-08-06 15:59:48.922 :00000051: ossrsos_kill_all_server_processes: killing 0 cellsrv processes
2021-08-06 15:59:48.936 :00000052: ossrsos_get_all_ocl_processes: returned 0 PIDs
2021-08-06 15:59:48.936 :00000053: ossrsos_kill_all_server_processes: killing 0 celloflsrv processes
*******
The RS trace shows the following errors:
rstrc_27779_omt_i115.trc
******
ORACLE_HOME: /opt/oracle/cell
System name: Linux
Node name: <hostname>.gratiscard.com
Release: 4.14.35-2047.502.5.el7uek.x86_64
Version: #2 SMP Wed Apr 14 15:08:41 PDT 2021
Machine: x86_64
CELL SW Version: OSS_21.2.1.0.0_LINUX.X64_210608
*** 2021-08-05 14:24:03.252
[TOC00000]
Jump to table of contents
Dump continued from file: /opt/oracle/cell/log/diag/asm/cell/<hostname>/trace/rstrc_27779_omt.trc
[TOC00001]
RS-7445 [Serv CELLSRV hang detected] [It will not be restarted] [] [] [] [] [] [] [] [] [] []
*****
The messages file shows the following dropped RDS connections:
****
Aug 6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.1,0> dropped due to 'DISCONNECTED event'
Aug 6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.2,0> dropped due to 'DISCONNECTED event'
Aug 6 03:41:23 <hostname> kernel: RDS/IB: Passive conn ffff9607a9d8c000 i_cm_id ffff960a54abf000, frag 16KB, connected <::ffff:*.*.*.23,::ffff:*.*.*.1,0> version 4.1
Aug 6 03:41:23 <hostname> kernel: RDS/IB: Passive conn ffff9607a9ca8138 i_cm_id ffff960a54ab8400, frag 16KB, connected <::ffff:*.*.*.23,::ffff:*.*.*.2,0> version 4.1
Aug 6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.7,0> dropped due to 'DISCONNECTED event'
Aug 6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.8,0> dropped due to 'DISCONNECTED event'
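To see which peers are affected, the dropped-connection messages above can be tallied per address pair. The following is a minimal sketch in Python, assuming the kernel log lines keep the format shown above and that the first address in the angle brackets is the local one and the second is the peer. The function name `count_rds_drops` is hypothetical.

```python
import re
from collections import Counter

# Matches: RDS/IB: connection <LOCAL,PEER,TOS> dropped due to 'REASON'
# Assumption: the first address is local and the second is the remote peer.
DROP_RE = re.compile(
    r"RDS/IB: connection <([^,>]+),([^,>]+),\d+> dropped due to '([^']*)'"
)


def count_rds_drops(lines):
    """Return a Counter keyed by (peer address, drop reason)."""
    drops = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            _local, peer, reason = match.groups()
            drops[(peer, reason)] += 1
    return drops
```

Lines that report a new connection ("Passive conn ... connected") do not match the pattern and are ignored, so the result reflects drops only.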
*****
ifconfig shows RX errors on the RoCE interfaces:
*****
re0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2300
inet *.*.*.25 netmask 255.255.252.0 broadcast *.*.*.255
ether b8:ce:f6:21:6d:2a txqueuelen 1000 (Ethernet)
RX packets 184257 bytes 13585232 (12.9 MiB)
RX errors 17930650 dropped 0 overruns 0 frame 17930650 >>>>>>>>>>>>RX errors
TX packets 181480 bytes 13418650 (12.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
***************
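The RX error count in the ifconfig output above is far larger than the RX packet count, which is easier to spot across several interfaces when extracted programmatically. The following is a minimal sketch in Python, assuming ifconfig output in the format shown above is available as text. The function name `rx_error_report` is hypothetical.

```python
import re

IFACE_RE = re.compile(r"^(\S+): flags=")
RX_PACKETS_RE = re.compile(r"RX packets (\d+)")
RX_ERRORS_RE = re.compile(r"RX errors (\d+)")


def rx_error_report(ifconfig_text):
    """Map each interface name to its RX packet and RX error counters."""
    report = {}
    current = None
    for raw in ifconfig_text.splitlines():
        # Tolerate decoration (asterisks) pasted in front of the interface name.
        line = raw.strip().lstrip("*")
        match = IFACE_RE.match(line)
        if match:
            current = match.group(1)
            report[current] = {"rx_packets": 0, "rx_errors": 0}
            continue
        if current is None:
            continue  # counters seen before any interface header
        match = RX_PACKETS_RE.search(line)
        if match:
            report[current]["rx_packets"] = int(match.group(1))
        match = RX_ERRORS_RE.search(line)
        if match:
            report[current]["rx_errors"] = int(match.group(1))
    return report
```

An interface whose `rx_errors` value exceeds its `rx_packets` value, as re0 does above, is a candidate for closer inspection.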
Pinging with jumbo frames between the nodes did not work on the private network:
ping -s 8690 <private IP>
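When repeating this test across several private addresses, it can help to build the ping arguments consistently and to know the largest ICMP payload that fits in a single IPv4 packet for a given MTU. The following is a minimal sketch in Python, assuming Linux iputils `ping` (its `-c`, `-s`, and `-M do` options) and an IPv4 header without options. The function names and the default count of 3 are illustrative assumptions and are not taken from this note.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu):
    """Largest ping payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_args(target, payload, count=3, dont_fragment=False):
    """Build an argument list for Linux iputils ping (assumed available)."""
    args = ["ping", "-c", str(count), "-s", str(payload)]
    if dont_fragment:
        # -M do prohibits fragmentation, so oversized packets fail outright.
        args += ["-M", "do"]
    args.append(target)
    return args
```

The returned list can be passed to `subprocess.run`. For the 2300-byte MTU reported for re0 above, `max_icmp_payload(2300)` gives 2272 bytes, so the 8690-byte payload used in the test would be sent as multiple fragments unless fragmentation is prohibited.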
Changes
New Deployment
Cause