ODA Servers Panicked And Rebooted After Changing Core Count And Crs Now Fails To Start
(Doc ID 2678132.1)
Last updated on FEBRUARY 05, 2024
Applies to:
Oracle Database Appliance Software - Version 19.6.0.0.0 Bare Metal and later
Information in this document applies to any platform.
This issue only applies to ODA systems running the dcs stack, not oakcli or Virtualized OS.
Symptoms
While CRS and the database were up, we ran:
# odacli update-cpucore -c 4
Both ODA nodes panicked and rebooted, and CRS failed to start automatically after the reboot. We then tried to start CRS manually and got the following messages:
[root@db1 ~]# crsctl start cluster -all
CRS-4404: The following nodes did not reply within the allotted time:
db1, db2
CRS-4705: Start of Clusterware failed on node db1.
CRS-4705: Start of Clusterware failed on node db2.
CRS-4000: Command Start failed, or completed with errors.
The storage is also down, as the storage topology validation shows:
[root@db1 root]# /opt/oracle/dcs/bin/odacli validate-storagetopology
INFO : ODA Topology Verification
INFO : Running on Node0
INFO : Check hardware type
SUCCESS : Type of hardware found : X8-2
INFO : Check for Environment(Bare Metal or Virtual Machine)
SUCCESS : Type of environment found : Bare Metal
INFO : Check number of Controllers
SUCCESS : Number of ahci controller found : 1
SUCCESS : Number of External SCSI controllers found : 2
INFO : Check for Controllers correct PCIe slot address
SUCCESS : Internal LSI SAS controller :
SUCCESS : External LSI SAS controller 0 : 3b:00.0
SUCCESS : External LSI SAS controller 1 : e3:00.0
INFO : Check if JBOD powered on
SUCCESS : 0JBOD : Powered-on
INFO : Check for correct number of EBODS(2 or 4)
FAILURE : Check for correct number of EBODS(2 or 4) : 1
ERROR : 1 EBOD found on the system, which is less than 2 EBODS with 1 JBOD
INFO : Above details can also be found in the log file=/opt/oracle/oak/log/db1/storagetopology/StorageTopology-2020-06-02-17:29:44_20679_22146.log
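The validation output above is long, and only the FAILURE and ERROR lines matter for diagnosis. As a minimal sketch, assuming the output keeps the `LEVEL : message` layout shown above, a small filter can isolate them. The function name `show_topology_failures` is hypothetical and is not part of odacli.

```shell
# Hypothetical helper (not an Oracle tool): read validation output on stdin
# and print only the lines whose level is FAILURE or ERROR.
show_topology_failures() {
    grep -E '^[[:space:]]*(FAILURE|ERROR)[[:space:]]*:'
}
```

It could be used as `/opt/oracle/dcs/bin/odacli validate-storagetopology | show_topology_failures`, which for the run above would leave only the two EBOD lines.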
In the /var/log/messages file, you will see command timeout errors like this:
[74632.674538] mpt3sas_cm1: Command Timeout
[74659.298540] mpt3sas_cm1: _base_display_fwpkg_version: timeout
[74659.305462] mpt3sas_cm1: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:8969/_scsih_refresh_expander_links()!
[74685.922554] mpt3sas_cm1: _base_display_fwpkg_version: timeout
[74728.418536] mpt3sas_cm0: Command Timeout
[74737.122537] mpt3sas_cm0: Command Timeout
[74737.122540] mpt3sas_cm1: Command Timeout
[74786.274538] mpt3sas_cm1: Command Timeout
[74786.274539] mpt3sas_cm0: Command Timeout
[74802.658544] mpt3sas_cm0: _base_display_fwpkg_version: timeout
[74812.898564] mpt3sas_cm1: _base_display_fwpkg_version: timeout
[74843.106545] mpt3sas_cm1: Command Timeout
[74869.730551] mpt3sas_cm1: _base_display_fwpkg_version: timeout
[74899.938548] mpt3sas_cm1: Command Timeout
[74949.090549] mpt3sas_cm1: Command Timeout
[74975.714554] mpt3sas_cm1: _base_display_fwpkg_version: timeout
[74986.466552] mpt3sas_cm0: Command Timeout
[74988.002553] mpt3sas_cm0: Command Timeout
[75016.162555] mpt3sas_cm1: Command Timeout
[75037.154556] mpt3sas_cm0: Command Timeout
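To see quickly whether both SAS controllers are affected, the timeout messages can be tallied per controller. The following is a rough sketch, assuming the log lines have the same form as the excerpt above. The function name `count_mpt3sas_timeouts` is hypothetical, and the default path is simply the file named in this note.

```shell
# Hypothetical helper (not an Oracle tool): count "Command Timeout" messages
# per mpt3sas controller in a kernel log file.
count_mpt3sas_timeouts() {
    log_file="${1:-/var/log/messages}"
    # Keep only the Command Timeout lines, extract the controller name
    # (for example mpt3sas_cm0), then tally occurrences per controller.
    grep -E 'mpt3sas_cm[0-9]+: Command Timeout' "$log_file" \
        | grep -oE 'mpt3sas_cm[0-9]+' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

Running `count_mpt3sas_timeouts` as root with no argument reads /var/log/messages and prints one line per controller with its timeout count. Timeouts reported on both cm0 and cm1, as in the excerpt above, suggest the problem is not limited to a single controller.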
Changes
No changes in hardware. This issue occurs when reducing the number of CPU cores.
Because of multiple bugs in core reduction in the dcs stack, we recommend performing the CPU core reduction using the procedure outlined in the Solution section.
Cause