
Exadata X9M-2: OEDA Step 6 Fails Because CELLSRV Keeps Crashing (Doc ID 2800487.1)

Last updated on FEBRUARY 14, 2022

Applies to:

Oracle Exadata Storage Server Software - Version 21.2.1.0.0 and later
Information in this document applies to any platform.

Symptoms

During OEDA deployment step 6 (Create Cell Disks), the following errors are encountered on the storage cell.

The storage cell software keeps crashing:
[RS] Service CELLSRV with pid 33238 not responding
Errors in file /opt/oracle/cell/log/diag/asm/cell/<hostname>-adm/trace/rstrc_60273_omt.trc (incident=1):
RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []
Incident details in: /opt/oracle/cell/log/diag/asm/cell/<hostname>-adm/incident/incdir_1/rstrc_60273_omt_i1.tr


  56 2021-08-05T14:23:40-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
  57 2021-08-05T14:24:45-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
  58 2021-08-05T14:25:50-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
  59 2021-08-05T14:27:00-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
  60 2021-08-05T14:28:01-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []"
  61 2021-08-05T14:29:07-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will not be restarted] [] [] [] [] [] [] [] [] [] []"
  62 2021-08-05T14:29:07-04:00 critical "RS-7445 [CELLSRV service shutdown] [Detected a flood of restarts] [] [] [] [] [] [] [] [] [] []"
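The pattern in the alert entries above (a hang detected roughly once a minute until RS stops restarting CELLSRV) can be summarized with a short script. This is a minimal sketch, not an Oracle tool: the function name `summarize_restarts` and the regular expression are illustrative assumptions, and the sample lines are abbreviated to their first two bracketed fields.

```python
import re
from datetime import datetime

# Alert entries from the excerpt above, abbreviated to the first two bracketed fields.
SAMPLE = '''\
  56 2021-08-05T14:23:40-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted]"
  57 2021-08-05T14:24:45-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted]"
  58 2021-08-05T14:25:50-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted]"
  59 2021-08-05T14:27:00-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted]"
  60 2021-08-05T14:28:01-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will be restarted]"
  61 2021-08-05T14:29:07-04:00 critical "RS-7445 [Serv CELLSRV hang detected] [It will not be restarted]"
  62 2021-08-05T14:29:07-04:00 critical "RS-7445 [CELLSRV service shutdown] [Detected a flood of restarts]"
'''

# Hypothetical pattern for this excerpt's layout: id, timestamp, severity, quoted message.
ALERT_RE = re.compile(
    r'^\s*(?P<id>\d+)\s+(?P<ts>\S+)\s+(?P<sev>\w+)\s+'
    r'"(?P<code>RS-\d+) \[(?P<what>[^\]]*)\] \[(?P<action>[^\]]*)\]'
)

def summarize_restarts(text):
    """Count CELLSRV hang alerts, the gaps between them, and whether RS gave up."""
    hangs = []
    flood = False
    for line in text.splitlines():
        m = ALERT_RE.match(line)
        if not m:
            continue
        if m["what"] == "Serv CELLSRV hang detected":
            hangs.append(datetime.fromisoformat(m["ts"]))
        if m["action"] == "Detected a flood of restarts":
            flood = True
    # Seconds between consecutive hang detections.
    gaps = [int((b - a).total_seconds()) for a, b in zip(hangs, hangs[1:])]
    return {"hangs": len(hangs), "gaps_s": gaps, "flood_shutdown": flood}

if __name__ == "__main__":
    print(summarize_restarts(SAMPLE))
```

Run against the excerpt, the summary shows hang detections spaced about 61 to 70 seconds apart before the flood-of-restarts shutdown.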

The CellSrv Trace has the following messages:
---
2021-08-06 15:59:48.898 :00000048: host: *.*.*.21;*.*.*.22
2021-08-06 15:59:48.899 :0000004A: mon_proc_pid oldpid: 0
2021-08-06 15:59:48.906 :0000004B: mon_proc_pid newpid: 0
2021-08-06 15:59:48.906 :0000004C: mon_proc_pid2: 0
2021-08-06 15:59:48.906 :0000004D: return val for ossrsos_prep_monproc -63
2021-08-06 15:59:48.906 :0000004E: Missed a heartbeat for process CELLSRV or leaking memory, error: -75
2021-08-06 15:59:48.906 :0000004F: Service CELLSRV was not alive, try starting
2021-08-06 15:59:48.922 :00000050: ossrsos_get_all_cellsrv_processes: returned 0 PIDs
2021-08-06 15:59:48.922 :00000051: ossrsos_kill_all_server_processes: killing 0 cellsrv processes
2021-08-06 15:59:48.936 :00000052: ossrsos_get_all_ocl_processes: returned 0 PIDs
2021-08-06 15:59:48.936 :00000053: ossrsos_kill_all_server_processes: killing 0 celloflsrv processes


*******

The RS trace shows the following errors:

rstrc_27779_omt_i115.trc

******


ORACLE_HOME:    /opt/oracle/cell
System name: Linux
Node name: <hostname>.gratiscard.com
Release: 4.14.35-2047.502.5.el7uek.x86_64
Version: #2 SMP Wed Apr 14 15:08:41 PDT 2021
Machine: x86_64
CELL SW Version: OSS_21.2.1.0.0_LINUX.X64_210608

*** 2021-08-05 14:24:03.252
[TOC00000]
Jump to table of contents
Dump continued from file: /opt/oracle/cell/log/diag/asm/cell/<hostname>/trace/rstrc_27779_omt.trc
[TOC00001]
RS-7445 [Serv CELLSRV hang detected] [It will not be restarted] [] [] [] [] [] [] [] [] [] []

 

*****

The messages file shows the following dropped RDS connections:

****

 

Aug  6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.1,0> dropped due to 'DISCONNECTED event'
Aug  6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.2,0> dropped due to 'DISCONNECTED event'
Aug  6 03:41:23 <hostname> kernel: RDS/IB: Passive conn ffff9607a9d8c000 i_cm_id ffff960a54abf000, frag 16KB, connected <::ffff:*.*.*.23,::ffff:*.*.*.1,0> version 4.1
Aug  6 03:41:23 <hostname> kernel: RDS/IB: Passive conn ffff9607a9ca8138 i_cm_id ffff960a54ab8400, frag 16KB, connected <::ffff:*.*.*.23,::ffff:*.*.*.2,0> version 4.1
Aug  6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.7,0> dropped due to 'DISCONNECTED event'
Aug  6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.8,0> dropped due to 'DISCONNECTED event'
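To see which peers are affected, the dropped-connection lines can be tallied per remote address. The following is a rough sketch under the assumption that the lines follow the layout shown above; `count_drops` and the regular expression are illustrative names, not part of any Oracle or kernel tooling. The sample is a subset of the excerpt.

```python
import re
from collections import Counter

# Subset of the messages-file excerpt above (addresses are masked in the source).
SAMPLE = """\
Aug  6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.1,0> dropped due to 'DISCONNECTED event'
Aug  6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.2,0> dropped due to 'DISCONNECTED event'
Aug  6 03:41:23 <hostname> kernel: RDS/IB: Passive conn ffff9607a9d8c000 i_cm_id ffff960a54abf000, frag 16KB, connected <::ffff:*.*.*.23,::ffff:*.*.*.1,0> version 4.1
Aug  6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.7,0> dropped due to 'DISCONNECTED event'
Aug  6 03:41:23 <hostname> kernel: RDS/IB: connection <::ffff:*.*.*.23,::ffff:*.*.*.8,0> dropped due to 'DISCONNECTED event'
"""

# Matches only the "dropped" lines; "Passive conn ... connected" lines are skipped.
DROP_RE = re.compile(
    r"RDS/IB: connection <(?P<local>[^,]+),(?P<peer>[^,]+),\d+> "
    r"dropped due to '(?P<reason>[^']*)'"
)

def count_drops(text):
    """Return (drops per peer address, drops per reason) as Counters."""
    peers, reasons = Counter(), Counter()
    for line in text.splitlines():
        m = DROP_RE.search(line)
        if m:
            peers[m["peer"]] += 1
            reasons[m["reason"]] += 1
    return peers, reasons

if __name__ == "__main__":
    peers, reasons = count_drops(SAMPLE)
    for peer, n in sorted(peers.items()):
        print(peer, n)
    print(dict(reasons))
```

On a live system the same function could be fed the contents of the messages file instead of the embedded sample.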

*****

ifconfig shows RX errors on the RoCE interfaces:

*****

 

re0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2300
       inet *.*.*.25  netmask 255.255.252.0  broadcast *.*.*.255
       ether b8:ce:f6:21:6d:2a  txqueuelen 1000  (Ethernet)
       RX packets 184257  bytes 13585232 (12.9 MiB)
       RX errors 17930650  dropped 0  overruns 0  frame 17930650  >>>>>>>>>>>>RX errors
       TX packets 181480  bytes 13418650 (12.7 MiB)
       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
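The counters in the ifconfig output can be extracted and compared programmatically, which makes the imbalance easier to spot across several interfaces. This is a small sketch that assumes the net-tools ifconfig layout shown above; `parse_ifconfig` and the `FIELDS` table are hypothetical helper names.

```python
import re

# ifconfig output for re0 as shown above (address lines omitted).
SAMPLE = """\
re0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2300
       RX packets 184257  bytes 13585232 (12.9 MiB)
       RX errors 17930650  dropped 0  overruns 0  frame 17930650
       TX packets 181480  bytes 13418650 (12.7 MiB)
       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
"""

# One capture group per counter of interest.
FIELDS = {
    "rx_packets": r"RX packets (\d+)",
    "rx_errors": r"RX errors (\d+)",
    "rx_frame": r"RX errors .*\bframe (\d+)",
    "tx_errors": r"TX errors (\d+)",
}

def parse_ifconfig(text):
    """Extract interface name, MTU, and selected counters from one ifconfig stanza."""
    stats = {}
    m = re.search(r"^(\S+): flags=.*\bmtu (\d+)", text, re.M)
    if m:
        stats["name"], stats["mtu"] = m.group(1), int(m.group(2))
    for key, pattern in FIELDS.items():
        m = re.search(pattern, text)
        if m:
            stats[key] = int(m.group(1))
    return stats

if __name__ == "__main__":
    s = parse_ifconfig(SAMPLE)
    print(s)
    if s["rx_errors"] > s["rx_packets"]:
        print(f"{s['name']}: more RX errors than RX packets received")
```

For re0, every RX error is also counted as a frame error, the error count is far larger than the count of packets received, and the TX side reports no errors.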

 

***************

Jumbo-frame pings between the nodes failed on the private network:

 

ping -s 8690 <private IP>
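For context on the size used in this test, the arithmetic below relates the `-s` payload to the resulting IP packet size and to an interface MTU. It is only a sketch under stated assumptions: IPv4 with a 20-byte header (no options), an 8-byte ICMP header, and the MTU of 2300 taken from the re0 output above. The function names are illustrative, and the calculation does not by itself establish the cause of the failure.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header

def ping_packet_size(payload):
    """Total IPv4 packet size produced by `ping -s <payload>`."""
    return payload + ICMP_HEADER + IPV4_HEADER

def max_unfragmented_payload(mtu):
    """Largest `-s` value that fits in a single packet for the given MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER

def fragments_needed(payload, mtu):
    """Number of IPv4 fragments needed to carry the echo request."""
    ip_payload = payload + ICMP_HEADER
    # Fragment data is carried in multiples of 8 bytes (except the last fragment).
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)

if __name__ == "__main__":
    print(ping_packet_size(8690))          # total packet size for -s 8690
    print(max_unfragmented_payload(2300))  # largest single-packet payload at MTU 2300
    print(fragments_needed(8690, 2300))    # fragments required at MTU 2300
```

Under these assumptions, a payload of 8690 bytes yields an 8718-byte packet, which cannot be sent as a single packet on an interface with an MTU of 2300 and has to be fragmented.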

 

 

Changes

 New Deployment

Cause
