Troubleshooting Sun Fire[TM] Correctable CPU and Memory Error(s) on Solaris[TM] 8 and 9
(Doc ID 1006513.1)
Last updated on MAY 06, 2019
Applies to:
Solaris Operating System - Version 8.0 to 9 9/05 HW U9 [Release 8.0 to 9.0]
All Platforms
Purpose
This document addresses correctable CPU/Memory errors reported on systems running Solaris[TM] 8 and Solaris[TM] 9.
Before proceeding, note that some correctable errors are expected. In almost all cases they have no detectable impact on system performance.
Note: This document does not apply to Solaris[TM] 10 or higher, because FMA automates the diagnosis of these types of faults. See Document 1018939.1 Solaris[TM] 10 Operating System: Displaying the list of Fault Management Architecture (FMA) resources currently believed to be faulted.
Your system may have one or more of the following symptoms.
- The system may have received CE, ECC errors, or recoverable memory errors.
- The system may be described as having reported CPU or memory errors.
- Example error messages which may have been reported are shown below:
Correctable ECC error on a read from system memory
The following are types of main memory correctable ECC errors reported by the CPUs and also an example from a Schizo (I/O bridge chip):
Example #1: Main Memory Corrected ECC error detected by CPU3 from data read from the memory DIMM in Slot B J8000
SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU3 at TL=0, errID 0x...
    AFSR 0x00000002<CE>.00000058 AFAR 0x000000b1.08033f40
    Fault_PC 0x1002603c Esynd 0x0058 Slot B: J8000
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Corrected Memory Error on Slot B: J8000 is Persistent
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Data Bit 68 was in error and corrected
Example #2: A Main Memory Corrected MTAG ECC error detected by CPU1 on data read from Slot A J3000
SUNW,UltraSPARC-III+: NOTICE: [AFT0] EMC Event detected by CPU1 at TL=0, errID 0x...
    AFSR 0x00010000<EMC>.000b0000 AFAR 0x000000a1.1b01b730
    Fault_PC <0x10351860> Msynd 0x000b Slot A: J3000
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Corrected Mtag Error on Slot A: J3000 is Persistent
SUNW,UltraSPARC-III+: [AFT0] errID 0x... MTAG Data Bit 1 was in error and corrected
Example #3: A Main Memory Corrected ECC error detected by Schizo id 8
pcisch: NOTICE: correctable error detected by pci0 (safari id 8) during DVMA read transaction
pcisch: Transaction was a block operation.
pcisch: dvma access, Memory safari command, address 000000b1.a8030170, owned_in not asserted.
pcisch: AFSR=40000000.c800013c AFAR=000000b1.a8030170, quad word offset 00000000.00000003, Memory Module <Slot B: J8000> port id 8.
pcisch: syndrome bits 13c
pcisch: mtag 0, mtag ecc syndrome 0
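When many of these messages accumulate, it can help to pull the key fields out of each one so that errors can be grouped by CPU or DIMM slot. The following Python sketch is only an illustration: the function name parse_aft_event and the regular expressions are hypothetical and are inferred solely from Examples #1 and #2 above. Treating the two dotted halves of AFAR as the upper and lower 32 bits of one address is also an assumption. The pcisch format of Example #3 is not handled.

```python
import re

# Patterns inferred only from the example messages in this document;
# other kernel patch levels may format these fields differently.
EVENT_RE = re.compile(r"\[AFT(\d)\].*?Event detected by (CPU\d+)")
AFSR_RE = re.compile(r"AFSR\s+0x([0-9a-fA-F]{8})<([^>]+)>\.([0-9a-fA-F]{8})")
AFAR_RE = re.compile(r"AFAR\s+0x([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})")
SYND_RE = re.compile(r"\b([EM])synd\s+0x([0-9a-fA-F]+)")
SLOT_RE = re.compile(r"Slot\s+([A-Z]):\s+(J\d+)")


def parse_aft_event(text):
    """Return a dict of the fields found in one [AFTn] message, or None."""
    event = EVENT_RE.search(text)
    if event is None:
        return None
    fields = {"aft_level": int(event.group(1)), "cpu": event.group(2)}
    afsr = AFSR_RE.search(text)
    if afsr:
        fields["afsr_flags"] = afsr.group(2)
    afar = AFAR_RE.search(text)
    if afar:
        # Assumption: the dotted halves are the upper and lower 32 bits.
        fields["afar"] = int(afar.group(1) + afar.group(2), 16)
    synd = SYND_RE.search(text)
    if synd:
        fields["syndrome_type"] = synd.group(1)  # "E" for Esynd, "M" for Msynd
        fields["syndrome"] = int(synd.group(2), 16)
    slot = SLOT_RE.search(text)
    if slot:
        fields["slot"] = f"Slot {slot.group(1)}: {slot.group(2)}"
    return fields


# First message of Example #1, copied from this document.
EXAMPLE_1 = (
    "SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event "
    "detected by CPU3 at TL=0, errID 0x... AFSR 0x00000002<CE>.00000058 "
    "AFAR 0x000000b1.08033f40 Fault_PC 0x1002603c Esynd 0x0058 Slot B: J8000"
)

if __name__ == "__main__":
    print(parse_aft_event(EXAMPLE_1))
```

Each field is matched independently, so the sketch tolerates the message being wrapped across several lines as long as the whole message is passed in as one string.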
CPU correctable ECC and parity errors
CPU Correctable ECC errors are detected and corrected by the CPU module containing the fault.
An example of a CPU L2SRAM Corrected ECC error detected by CPU1 from its own L2SRAM:
SUNW,UltraSPARC-III+: NOTICE: [AFT0] EDC Event detected by CPU1 at TL=0, errID 0x...
    AFSR 0x00000010<EDC>.00000141 AFAR 0x00000000.a745ad50
    Fault_PC 0xfe0ba520 Esynd 0x0141
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Data Bit 93 was in error and corrected
Additional Events
Several other CPU correctable events can be reported, including a number of recoverable parity errors:
DPE      D$ parity event
DDSPE    D$ data parity event
DTSPE    D$ physical tag parity event
IPE      I$ parity event
IDSPE    I$ data parity event
ITSPE    I$ physical tag parity event
TSCE     software correctable single-bit E$ tag ECC event
THCE     hardware corrected single-bit E$ tag ECC event
UCC      software correctable E$ ECC event
EDC      hardware corrected E$ ECC event
WDC      hardware corrected E$ ECC event for writeback (victimization)
CPC      hardware corrected E$ ECC event for copyout (snoop request)
L3_MECC  Both 16-byte data of L3 cache data access have ECC error (either correctable or uncorrectable ECC error)
L3_THCE  single bit ECC error on L3 cache tag access
L3_EDC   single bit ECC error on L3 cache data access for P-cache and W-cache request
L3_UCC   single bit ECC error on L3 cache data access for I-cache and D-cache request
L3_CPC   single bit ECC error on L3 cache data access for copyout
L3_WDC   single bit ECC error on L3 cache data access for writeback
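The mnemonic that appears between angle brackets in the AFSR field (for example <EDC>) can be looked up against the list above. The sketch below shows one way to do that in Python. The names CPU_CE_EVENTS and describe_afsr_flags are hypothetical, the descriptions are copied from the list above, and the handling of more than one comma-separated flag inside the brackets is an assumption that none of the examples in this document confirms.

```python
import re

# Mnemonic descriptions copied from the event list above.
CPU_CE_EVENTS = {
    "DPE": "D$ parity event",
    "DDSPE": "D$ data parity event",
    "DTSPE": "D$ physical tag parity event",
    "IPE": "I$ parity event",
    "IDSPE": "I$ data parity event",
    "ITSPE": "I$ physical tag parity event",
    "TSCE": "software correctable single-bit E$ tag ECC event",
    "THCE": "hardware corrected single-bit E$ tag ECC event",
    "UCC": "software correctable E$ ECC event",
    "EDC": "hardware corrected E$ ECC event",
    "WDC": "hardware corrected E$ ECC event for writeback (victimization)",
    "CPC": "hardware corrected E$ ECC event for copyout (snoop request)",
    "L3_MECC": "both 16-byte data of L3 cache data access have ECC error",
    "L3_THCE": "single bit ECC error on L3 cache tag access",
    "L3_EDC": "single bit ECC error on L3 cache data access "
              "for P-cache and W-cache request",
    "L3_UCC": "single bit ECC error on L3 cache data access "
              "for I-cache and D-cache request",
    "L3_CPC": "single bit ECC error on L3 cache data access for copyout",
    "L3_WDC": "single bit ECC error on L3 cache data access for writeback",
}

FLAG_RE = re.compile(r"AFSR\s+0x[0-9a-fA-F]+<([^>]+)>")


def describe_afsr_flags(message):
    """Return (mnemonic, description) pairs for the flags in an AFSR field."""
    match = FLAG_RE.search(message)
    if match is None:
        return []
    # Assumption: if several flags are set, they are comma-separated.
    names = [name.strip() for name in match.group(1).split(",")]
    return [(n, CPU_CE_EVENTS.get(n, "not in the list above")) for n in names]


if __name__ == "__main__":
    # AFSR field taken from the L2SRAM example in this document.
    print(describe_afsr_flags("AFSR 0x00000010<EDC>.00000141"))
```

Main memory flags such as CE and EMC are not part of this list, so the sketch reports them as not found rather than guessing a description.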
- When browsing messages files and observing console output, note that [AFT0] is included in these messages.
- AFT0 represents the "Asynchronous Fault Trap" for correctable and recoverable errors.
- AFT1 is used for uncorrectable errors; AFT2 and AFT3 can be ignored in almost all cases.
- The above error messaging may change slightly depending on your kernel update patch version.
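As a rough first pass over a messages file, the AFT tags described above can be counted per level to see whether only correctable (AFT0) events are present. The Python sketch below is illustrative only: count_aft_levels and MEANING are hypothetical names, and the pattern assumes the tag always appears literally as [AFTn], which may not hold for every kernel patch level. Because the errID values are elided in the examples here, the sketch counts message lines rather than distinct events.

```python
import re
from collections import Counter

AFT_RE = re.compile(r"\[AFT([0-3])\]")

# Meanings as stated in the notes above.
MEANING = {
    "AFT0": "correctable or recoverable",
    "AFT1": "uncorrectable",
    "AFT2": "can be ignored in almost all cases",
    "AFT3": "can be ignored in almost all cases",
}


def count_aft_levels(lines):
    """Count message lines per AFT level (lines, not distinct events)."""
    counts = Counter()
    for line in lines:
        match = AFT_RE.search(line)
        if match:
            counts["AFT" + match.group(1)] += 1
    return counts


# Lines taken from Examples #1 and #3 in this document.
SAMPLE = [
    "SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event "
    "detected by CPU3 at TL=0, errID 0x...",
    "SUNW,UltraSPARC-III+: [AFT0] errID 0x... Corrected Memory Error on "
    "Slot B: J8000 is Persistent",
    "SUNW,UltraSPARC-III+: [AFT0] errID 0x... Data Bit 68 was in error "
    "and corrected",
    "pcisch: syndrome bits 13c",
]

if __name__ == "__main__":
    for level, n in sorted(count_aft_levels(SAMPLE).items()):
        print(level, n, MEANING[level])
```

The same function accepts an open file object, so a messages file (on Solaris typically /var/adm/messages) could be passed in directly. A non-zero AFT1 count would suggest that the uncorrectable error document is the more relevant one.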
Uncorrectable Error Events
If Solaris has panicked, crashed, or rebooted and you are seeing uncorrectable errors, please see Document 1006517.1 Troubleshooting Sun Fire[TM] Uncorrectable CPU and Memory Error(s) on Solaris[TM] 8 and 9 instead.
Troubleshooting Steps