SM BIOS Uncorrectable CPU-complex Error in ILOM SEL and system hard hangs when running sosreport or commands such as lspci
(Doc ID 1490545.1)
Last updated on JUNE 05, 2018
Applies to:
Sun Storage 6Gb SAS PCIe RAID HBA - Version Not Applicable to Not Applicable [Release N/A]
Oracle Exalogic Elastic Cloud X2-2 Full Rack - Version X2 to X4 [Release X2 to X4]
Exadata Database Machine X2-2 Full Rack - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine X2-8 - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
A system fitted with an LSI Sun StorageTek 6Gb/s SAS PCIe RAID HBA - SGX-SAS6-R-INT-Z (and possibly the Blade REM equivalent, SGX-SAS6-R-REM-Z) may hard hang when low-level hardware commands such as lspci, or similar commands invoked by sosreport scripts, are run.
Note that this is just *ONE* possible cause of an "Uncorrectable CPU-complex Error"; other, unrelated triggers can cause the same kind of error. This document refers specifically to an "Uncorrectable CPU-complex Error" and system "hard hang", followed by a system reset, triggered by running sosreport, sundiag, or specific low-level hardware commands such as lspci, udevinfo, dmraid, dmidecode, x86info, or lshal on Linux. The event is more likely to be triggered if these commands are run repeatedly, or if a mixture of these types of commands is run in parallel (as happens when sosreport is run). If you are seeing this error under other conditions, it may be unrelated to this issue.
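As a rough illustration of the trigger list above, the following Python sketch scans the text of a shell script or cron job for the commands named in this document, so they can be reviewed before the script is run on a system that still has the affected firmware. The function name find_trigger_commands and the whole-word matching rule are assumptions made for this example; they are not part of any Oracle tool. The match is purely textual, so a command name that appears only in a comment is reported as well.

```python
import re

# Commands named in this document as possible triggers on Linux.
TRIGGER_COMMANDS = (
    "sosreport", "sundiag", "lspci", "udevinfo",
    "dmraid", "dmidecode", "x86info", "lshal",
)


def find_trigger_commands(script_text):
    """Return the trigger commands found as whole words, in the order listed above.

    Hypothetical helper: it only inspects text and never runs any command.
    """
    found = []
    for name in TRIGGER_COMMANDS:
        # Reject matches embedded in a longer identifier such as "my-lspci-wrapper".
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        if re.search(pattern, script_text):
            found.append(name)
    return found
```

For example, passing the contents of a diagnostic script that calls /sbin/lspci and dmidecode would return both names, which may be a hint to defer that script until the firmware has been updated.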
When the system hangs, FMA should flag one or more CPUs in the system as faulty with a failure code of "fault.cpu.intel.internal". The CPU that is flagged as faulty can change on each occurrence; the CPU itself is *NOT* actually at fault and should not be replaced. The event logs on the ILOM will also report an uncorrectable MCA error. The ereport logs will show "ereport.cpu.intel.caterr" followed by "ereport.cpu.intel.internal_timer" (see the Symptoms section below for an example). If you do not see this sequence, it may be a different issue.
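The ereport sequence described above can be checked mechanically once the ereport log has been saved as plain text. The Python sketch below is a minimal illustration only: the two class strings are taken from this document, but the function name matches_signature and the assumption that each ereport appears on its own line in chronological order are hypothetical. A positive result does not by itself confirm this issue.

```python
# Ereport classes quoted in this document; the log layout itself is an assumption.
CATERR = "ereport.cpu.intel.caterr"
INTERNAL_TIMER = "ereport.cpu.intel.internal_timer"


def matches_signature(log_lines):
    """Return True if a caterr ereport is followed later by an internal_timer ereport."""
    seen_caterr = False
    for line in log_lines:
        # Only count internal_timer when a caterr was seen on an earlier line.
        if seen_caterr and INTERNAL_TIMER in line:
            return True
        if CATERR in line:
            seen_caterr = True
    return False
```

The function takes any iterable of lines, so an open file object for the saved log can be passed directly.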
The issue is caused by a firmware defect on the LSI PCIe card that causes a ROB time-out to occur.
This issue has been seen on systems running Oracle VM 3.1, but may also be seen on any system running Oracle Enterprise Linux, or even the Red Hat releases on which Oracle VM is based.
This issue has now also been found to affect Exadata and Exalogic nodes running Oracle Enterprise Linux or Oracle VM. It has also been found to affect Exalytics X2-4 and X3-4 systems running old LSI firmware versions.
This issue does NOT affect systems running Solaris or Windows.
The following are examples of the kind of error you may see in the ILOM event logs and SEL after the system hang: