Exadata RS process reports RS-7445 [Serv MS is absent] [It will be restarted] (Doc ID 1361316.1)

Last updated on NOVEMBER 19, 2013

Applies to:

Oracle Exadata Hardware - Version 11.2.1.2.1 to 11.2.1.2.1 [Release 11.2]
Linux x86-64
***Checked for relevance on 04-Apr-2013***

Symptoms

This issue is seen on Exadata storage cells running an Exadata software version earlier than 11.2.2.4.0.

A notification like this is generated by a storage cell:

Event Time [date and time]
Description RS-7445 [Serv MS is absent] [It will be restarted]
Affected Cell Name [server name]
Server Model SUN MICROSYSTEMS SUN FIRE X4270 [M2] SERVER SATA [SAS]
Chassis Serial Number [serial number]
Release Version [cell software version]
Release Label OSS_11.2.0.3.0_LINUX.X64_[date]


The storage cell reports an RS-7445 error in the cell alert log:

[date and time]
Errors in file /opt/oracle/cell11.2.2.3.2_LINUX.X64_110520/log/diag/asm/cell/[server name]/trace/rstrc_11492_5.trc (incident=1):
RS-7445 [Serv MS is absent] [It will be restarted] []
Incident details in: /opt/oracle/cell11.2.2.3.2_LINUX.X64_110520/log/diag/asm/cell/[server name]/incident/incdir_1/rstrc_11492_5_i1.trc
Sweep [inc][1]: completed
[RS] Started Service MS with pid 28805
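
The alert log pattern above (an RS-7445 entry followed by an MS restart line) can be searched for with a small script. The following is a minimal sketch, not an Oracle-supplied tool: the function name find_ms_restarts is hypothetical, and the patterns are taken only from the excerpt above, so they may need adjusting for other software versions.

```python
import re

# Patterns copied from the alert log excerpt above; real logs may differ.
INCIDENT_RE = re.compile(r"RS-7445 \[Serv MS is absent\] \[It will be restarted\]")
RESTART_RE = re.compile(r"\[RS\] Started Service MS with pid (\d+)")


def find_ms_restarts(lines):
    """Return one entry per RS-7445 line: the pid from the next
    '[RS] Started Service MS' line, or None if no restart line follows."""
    restarts = []
    pending = False  # True after an RS-7445 line until a restart line is seen
    for line in lines:
        if INCIDENT_RE.search(line):
            if pending:
                restarts.append(None)  # earlier incident had no restart line
            pending = True
            continue
        match = RESTART_RE.search(line)
        if match and pending:
            restarts.append(int(match.group(1)))
            pending = False
    if pending:
        restarts.append(None)
    return restarts


# Sample lines taken from the excerpt above.
sample = [
    "RS-7445 [Serv MS is absent] [It will be restarted] []",
    "Sweep [inc][1]: completed",
    "[RS] Started Service MS with pid 28805",
]
print(find_ms_restarts(sample))
```

To scan a real alert log, pass an open file object instead of the sample list. The alert log location is not given in this note, so the path is left to the reader.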



NOTE: cell11.2.2.3.2_LINUX.X64_110520 corresponds to cell software version 11.2.2.3.2. If you are running a different version, that version appears in the log and trace file paths instead.
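
The mapping from the directory name to the cell software version can be sketched as follows. This is an illustration only: the function names are hypothetical, the regular expression assumes the cellN.N.N.N.N_LINUX.X64_NNNNNN naming seen in the paths above, and the 11.2.2.4.0 bound is the one stated under Symptoms.

```python
import re

# Directory naming as seen in the paths above,
# e.g. cell11.2.2.3.2_LINUX.X64_110520 (assumed to be the general pattern).
CELL_DIR_RE = re.compile(r"cell(\d+(?:\.\d+)+)_LINUX\.X64_\d+")


def cell_version_from_path(path):
    """Extract the cell software version as a tuple of ints, or None."""
    match = CELL_DIR_RE.search(path)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_earlier_than(version, bound=(11, 2, 2, 4, 0)):
    """True if version is earlier than the bound given under Symptoms."""
    return version < bound


path = "/opt/oracle/cell11.2.2.3.2_LINUX.X64_110520/log/diag/asm/cell"
version = cell_version_from_path(path)
print(version, is_earlier_than(version))
```

Versions are compared as integer tuples rather than strings so that, for example, 11.2.2.10.0 would sort after 11.2.2.4.0.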


The contents of the RS trace file (in this case rstrc_11492_5.trc) look like this:

*** [date and time]

Service MS no longer alive: ossrsutl_error_type: -1
MS process is not alive. Pid is missing.
Service MS with pid 12706 is not present
Serv MS is absent It will be restarted
-------------------------------------------------------------------------------
Trace Bucket Dump Begin: default trace bucket
TIME(*=approx):SEQ: DATA
-------------------------------------------------------------------------------
[date and time] :00000006: serv_status - 2, exit_status - 0, cur_inc_num - 0, prev_inc_num - 0
[date and time] :00000007: service MS already has status 2, modify to 1
[date and time] :00000015: monitoring process /opt/oracle/cell11.2.2.3.2_LINUX.X64_110520/cellsrv/bin/cellrsmmt (pid: 11498) returned with error: 0
[date and time] :0000005B: serv_status - 1, exit_status - 0, cur_inc_num - 0, prev_inc_num - 0
[date and time] :0000005C: service MS already has status 1, modify to 1
[date and time] :0000005D: Started monitoring process /opt/oracle/cell11.2.2.3.2_LINUX.X64_110520/cellsrv/bin/cellrsmmt with pid 12705
[date and time] :0000005E: socket open error: Port no: 8888. Received errorno 111. Connection refused
[date and time] :0000005F: MS process is not alive. Pid is missing.
[date and time] :00000060: Service MS was not alive, try starting
[date and time] :00000061: Exec new process /usr/java/jdk1.5.0_15//bin/java
[date and time] :00000062: socket open error: Port no: 8888. Received errorno 111. Connection refused
...
[date and time] :0000006A: Service MS with pid 12706 is not present
[date and time] :0000006B: Serv MS is absent It will be restarted
-------------------------------------------------------------------------------
Trace Bucket Dump End: default trace bucket
DDE: Flood control is not active
Incident 1 created, dump file: /opt/oracle/cell11.2.2.3.2_LINUX.X64_110520/log/diag/asm/cell/[server name]/incident/incdir_1/rstrc_11492_5_i1.trc
RS-7445 [Serv MS is absent] [It will be restarted] []
Exec new process /usr/java/jdk1.5.0_15//bin/java
socket open error: Port no: 8888. Received errorno 111. Connection refused
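
The "socket open error" lines in the trace show a connection to port 8888 being refused with error number 111, which is ECONNREFUSED on Linux. A comparable connection test can be sketched as below. This is a hedged illustration rather than what RS itself does: the function name ms_port_status is hypothetical, and it assumes the test is run on the storage cell against the local address 127.0.0.1.

```python
import errno
import socket


def ms_port_status(host="127.0.0.1", port=8888, timeout=2.0):
    """Attempt a TCP connect; return 'open', 'refused', or 'error: <reason>'.

    Port 8888 is the port shown in the RS trace above; the host is an
    assumption (the local cell).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns an errno value instead of raising for
        # errors reported by the underlying connect() call.
        code = sock.connect_ex((host, port))
    except OSError as exc:
        return f"error: {exc}"
    finally:
        sock.close()
    if code == 0:
        return "open"
    if code == errno.ECONNREFUSED:  # 111 on Linux, as in the trace
        return "refused"
    return f"error: {errno.errorcode.get(code, code)}"


print(ms_port_status())
```

A result of "refused" matches the condition recorded in the trace: nothing is accepting connections on that port at the time of the check.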


The contents of the RS incident trace file (in this case rstrc_11492_5_i1.trc) look like this:

*** [date and time]
Dump continued from file: /opt/oracle/cell11.2.2.3.2_LINUX.X64_110520/log/diag/asm/cell/[server name]/trace/rstrc_11492_5.trc
RS-7445 [Serv MS is absent] [It will be restarted] []

========= Dump for incident 1 (RS 7445) ========
Starting a Diag Context default dump (level=3)

----- Call Stack Trace -----
kgdsdstsg()+48 call kgdsdst() 000000000 ? 000000000 ?
dbgc_dmp()+146 call kgdsdstsg() 00F7055B0 ? 0433D7CD0 ?
dbgexPhaseII()+1960 call dbgc_dmp() 00F7055B0 ? 000000003 ?
dbgexProcessError() call dbgexPhaseII() 00F703C90 ? 00F718760 ?
dbgeExecuteForError call dbgexProcessError() 00F703C90 ? 00F718760 ?
()+138 000000000 ? 000000000 ?
dbgePostErrorDirect call dbgeExecuteForError 00F703C90 ? 00F718760 ?
()+2502 () 000000000 ? 000000000 ?
ossrsutl_dump_incid call dbgePostErrorDirect 00F703C90 ? 00F718760 ?
ent()+253 () 000000000 ? 000000000 ?
000000000 ? 000000001 ?
ossrsutl_monitor_sr call ossrsutl_dump_incid 000000000 ? 00F705760 ?
vc()+8164 ent() 000000000 ? 000000000 ?
ossrsutl_monitor_sr call ossrsutl_monitor_sr 0433E1A54 ? 2B57508B5190 ?
vc_prc()+2440 vc() 000000014 ? 000000000 ?
sossrs_prc_start()+ call ossrsutl_monitor_sr 7FFFDB213B90 ? 2B57508B5190 ?
1224 vc_prc() 000000014 ? 000000000 ?
ossrsutl_monitor_mo call sossrs_prc_start() 00041A226 ? 7FFFDB213B90 ?
npr_thd()+1370 000000014 ? 000000000 ?


Before MS runs out of memory, there may be two list metriccurrent commands, followed by OutOfMemoryError exceptions in ms-odl.trc and ms.err.
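
One way to check the MS files for this symptom is to count the lines that mention OutOfMemoryError. The sketch below makes no assumption about where ms-odl.trc and ms.err are located, because this note does not give their paths. The caller supplies them, and the function names are hypothetical.

```python
def count_out_of_memory(lines):
    """Count lines that mention OutOfMemoryError."""
    return sum(1 for line in lines if "OutOfMemoryError" in line)


def scan_files(paths):
    """Map each path to its OutOfMemoryError line count.

    A value of None means the file was missing or unreadable.
    """
    counts = {}
    for path in paths:
        try:
            with open(path, errors="replace") as handle:
                counts[path] = count_out_of_memory(handle)
        except OSError:
            counts[path] = None
    return counts


# Replace these placeholder names with the full paths on the cell.
print(scan_files(["ms-odl.trc", "ms.err"]))
```

A non-zero count in either file is consistent with MS having run out of memory before RS reported it absent.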

 

 

Cause
