Database hangs after adding more instances but not changing Huge Page configuration : Process W001 Died, Process J000 Died Instance Unreachable

(Doc ID 1448833.1)

Last updated on APRIL 23, 2013

Applies to:

Oracle Exadata Storage Server Software - Version 11.2.2.4.2 and later
Information in this document applies to any platform.

Symptoms

Users are unable to log in to the database, which had previously been reachable
CRS cannot communicate between instances
CRS shows as up
Instances show as up

Further review of the ALERT.LOG shows processes dying
- no hard errors are seen in any log files


From the ALERT.LOG
------------------------------------------
Thu Apr 05 04:05:22 2012
Process W001 died, see its trace file
Process W001 died, see its trace file
Process W001 died, see its trace file
Thu Apr 05 04:10:27 2012
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /u01/app/oracle/diag/rdbms/prod/prod1/trace/proddb_cjq0_11131.trc:
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /u01/app/oracle/diag/rdbms/prod/prod1/trace/proddb_cjq0_11131.trc:
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /u01/app/oracle/diag/rdbms/prod/prod1/trace/proddb1_cjq0_11131.trc:
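
When the alert log is long, it can help to tally the "died" messages per process name rather than read them line by line. The following is a minimal Python sketch, not an Oracle-supplied tool. The function name summarize_alert_log and the sample text are illustrative assumptions; the message formats are taken from the excerpt above.

```python
import re
from collections import Counter

# Message formats as they appear in the alert log excerpt above.
DIED = re.compile(r"^Process (\w+) died, see its trace file")
SPAWN_FAIL = "unable to spawn jobq slave process"


def summarize_alert_log(text):
    """Count 'Process X died' messages per process and jobq spawn failures."""
    died = Counter()
    spawn_failures = 0
    for line in text.splitlines():
        match = DIED.match(line.strip())
        if match:
            died[match.group(1)] += 1
        elif SPAWN_FAIL in line:
            spawn_failures += 1
    return died, spawn_failures


# Shortened, illustrative sample in the same format as the excerpt.
sample = """\
Thu Apr 05 04:05:22 2012
Process W001 died, see its trace file
Process W001 died, see its trace file
Thu Apr 05 04:10:27 2012
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
"""

died, spawn_failures = summarize_alert_log(sample)
print(dict(died), spawn_failures)
```

A high count of repeated deaths for background or job queue processes, with no accompanying ORA- errors, matches the pattern described in this note.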



In this user's case, two of the four instances are now unreachable:

SQL> select inst_id,instance_name from gv$instance;

INST_ID INSTANCE_NAME
---------- ----------------
3 prod03
4 prod04



But CRS reports everything as online:

crsctl stat res ora.proddb -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.proddb

1 ONLINE ONLINE  db1-4 Open
2 ONLINE ONLINE  db1-3 Open
3 ONLINE ONLINE  db1-2 Open
4 ONLINE ONLINE  db1-1 Open


No error or trace information points directly to the source of the problem.


Further investigation revealed key details pointing to the potential source of the problem:
  • HugePages had recently been implemented
  • The size / number of HugePages created was tuned for two existing instances (or some other fixed number of instances)
  • After setting HugePages and very large SGA sizes for the two existing instances (e.g. 256 GB), the user also added more instances on the same nodes but did not tune HugePages to take these new instances into account
  • Once the existing nodes were unreachable, other nodes showed the same problem
  • On occasion the user may be able to associate the problem with the last instance(s) started
  • It may also be possible to encounter the same symptoms due to incorrect HugePages configuration without adding instances
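
The sizing gap described in the list above can be estimated with simple arithmetic. The following Python sketch is a rough check under stated assumptions, not the procedure from this note. It assumes every SGA on the node is meant to fit entirely in HugePages, a 2048 kB huge page size, and a hypothetical third instance with a 64 GiB SGA. The function names and the sample /proc/meminfo text are illustrative. Real sizing should be based on the actual shared memory segments of all running instances, which may round differently.

```python
def parse_meminfo(text):
    """Extract HugePages_* counters and Hugepagesize (kB) from /proc/meminfo text."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            info[key] = int(rest.split()[0])
    return info


def pages_needed(sga_bytes_per_instance, hugepage_kb):
    """Huge pages required to hold every SGA on the node, rounding each one up."""
    page_bytes = hugepage_kb * 1024
    return sum(-(-sga // page_bytes) for sga in sga_bytes_per_instance)


def shortfall(sga_bytes_per_instance, meminfo_text):
    """Pages missing from the configured pool (0 if the pool is large enough)."""
    info = parse_meminfo(meminfo_text)
    needed = pages_needed(sga_bytes_per_instance, info["Hugepagesize"])
    return max(0, needed - info["HugePages_Total"])


GIB = 2**30

# Hypothetical node: pool sized for two 256 GiB SGAs only.
sample_meminfo = """\
HugePages_Total:  262144
HugePages_Free:        0
Hugepagesize:       2048 kB
"""

original = [256 * GIB, 256 * GIB]
with_new_instance = original + [64 * GIB]  # hypothetical added instance

print(shortfall(original, sample_meminfo))
print(shortfall(with_new_instance, sample_meminfo))
```

With these assumed figures, the pool covers the two original instances exactly, and the added instance leaves it 32768 pages short. On a live system the text would come from /proc/meminfo on each database node rather than from a sample string.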

Changes

Recently implemented Huge Pages
Recently added instances

Cause
