Troubleshooting Sun Fire[TM] Correctable CPU and Memory Error(s) on Solaris[TM] 8 and 9 (Doc ID 1006513.1)

Last updated on MAY 06, 2019

Applies to:

Solaris Operating System - Version 8.0 to 9 9/05 HW U9 [Release 8.0 to 9.0]
All Platforms

Purpose

This document addresses correctable CPU/Memory errors reported on systems running Solaris[TM] 8 and Solaris[TM] 9.

Before proceeding, note that some correctable errors are expected on any system. In almost all cases, these errors have no detectable impact on system performance.

Note: This document does not apply to Solaris[TM] 10 or later, because FMA automates the diagnosis of these types of faults. See Document 1018939.1, Solaris[TM] 10 Operating System: Displaying the list of Fault Management Architecture (FMA) resources currently believed to be faulted.

Your system may have one or more of the following symptoms.

 

Correctable ECC error on a read from system memory

 

The following examples show main memory correctable ECC errors reported by CPUs, followed by one reported by a Schizo (I/O bridge chip):

Example #1: A Main Memory Corrected ECC error detected by CPU3 on data read from the memory DIMM in Slot B J8000

SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU3 at TL=0, errID 0x... 
AFSR 0x00000002<CE>.00000058 AFAR 0x000000b1.08033f40 Fault_PC 0x1002603c Esynd 0x0058 Slot B: J8000
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Corrected Memory Error on Slot B: J8000 is Persistent
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Data Bit 68 was in error and corrected

Example #2:  A Main Memory Corrected MTAG ECC error detected by CPU1 on data read from Slot A J3000

SUNW,UltraSPARC-III+: NOTICE: [AFT0] EMC Event detected by CPU1 at TL=0, errID 0x... AFSR 0x00010000<EMC>.000b0000 
AFAR 0x000000a1.1b01b730 Fault_PC <0x10351860> Msynd 0x000b Slot A: J3000
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Corrected Mtag Error on Slot A: J3000 is Persistent
SUNW,UltraSPARC-III+: [AFT0] errID 0x... MTAG Data Bit 1 was in error and corrected
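
When many events like the two CPU-reported examples above appear in a messages file, it can help to pull the key fields out of each one. The following is a minimal Python sketch, not an Oracle-supplied tool. The function name `parse_aft_event` and every regular expression are assumptions derived only from the sample lines in this document; message wording may differ with the kernel update patch version.

```python
import re

# Patterns inferred from the sample messages in this document (an assumption);
# they are not taken from any official message format specification.
_FIELDS = {
    "aft_level": re.compile(r"\[AFT(\d)\]"),
    "cpu": re.compile(r"detected by CPU(\d+)"),
    "event": re.compile(r"AFSR 0x[0-9a-fA-F]+<([A-Z0-9_,]+)>"),
    "afar": re.compile(r"AFAR (0x[0-9a-fA-F]+\.[0-9a-fA-F]+)"),
    "syndrome": re.compile(r"[EM]synd (0x[0-9a-fA-F]+)"),
    "slot": re.compile(r"Slot ([A-Z]): (J\d+)"),
}


def parse_aft_event(text):
    """Return a dict of the fields found in one (possibly wrapped) event message.

    Fields that do not appear in the text are simply left out of the result.
    """
    joined = " ".join(text.split())  # undo console line wrapping
    result = {}
    for name, pattern in _FIELDS.items():
        match = pattern.search(joined)
        if match:
            # "slot" has two capture groups (bank letter and J-number).
            result[name] = " ".join(match.groups())
    return result


if __name__ == "__main__":
    sample = (
        "SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event "
        "detected by CPU3 at TL=0, errID 0x... \n"
        "AFSR 0x00000002<CE>.00000058 AFAR 0x000000b1.08033f40 "
        "Fault_PC 0x1002603c Esynd 0x0058 Slot B: J8000"
    )
    print(parse_aft_event(sample))
```

The fields are searched for independently rather than with one large pattern because the samples wrap at different points (AFSR falls on the second line in Example #1 but on the first line in Example #2).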

Example #3:  A Main memory corrected ECC error detected by Schizo id 8

pcisch: NOTICE: correctable error detected by pci0 (safari id 8) during DVMA read transaction
pcisch:    Transaction was a block operation.
pcisch:    dvma access, Memory safari command, address 000000b1.a8030170, owned_in not asserted.
pcisch:    AFSR=40000000.c800013c AFAR=000000b1.a8030170, quad word offset 00000000.00000003, 
Memory Module <Slot B: J8000> port id 8.
pcisch: syndrome bits 13c
pcisch:    mtag 0, mtag ecc syndrome 0
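
The Schizo message in Example #3 uses a different layout from the CPU messages, so it needs its own patterns. The sketch below is a hedged illustration only: `parse_pcisch_ce` is a hypothetical helper, and its patterns are assumptions based solely on the sample text above.

```python
import re


def parse_pcisch_ce(text):
    """Extract device, Safari id, AFAR, DIMM slot and syndrome from a pcisch message.

    All patterns are inferred from the single sample in this document.
    """
    joined = " ".join(text.split())  # treat the wrapped lines as one message
    out = {}

    m = re.search(r"detected by (\w+) \(safari id (\d+)\)", joined)
    if m:
        out["device"] = m.group(1)
        out["safari_id"] = int(m.group(2))

    m = re.search(r"AFAR=([0-9a-fA-F]+\.[0-9a-fA-F]+)", joined)
    if m:
        out["afar"] = m.group(1)

    m = re.search(r"Memory Module <Slot ([A-Z]): (J\d+)>", joined)
    if m:
        out["slot"] = f"{m.group(1)} {m.group(2)}"

    # "syndrome bits" is the data ECC syndrome; the separate
    # "mtag ecc syndrome" field is deliberately not matched here.
    m = re.search(r"syndrome bits ([0-9a-fA-F]+)", joined)
    if m:
        out["syndrome"] = m.group(1)

    return out
```

Reducing the message to the slot name makes it easier to see that this event and the CPU-detected event in Example #1 both name Slot B: J8000.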

CPU correctable ECC and parity errors

 

CPU Correctable ECC errors are detected and corrected by the CPU module containing the fault.

An example of a CPU L2SRAM Corrected ECC error detected by CPU1 from its own L2SRAM:

SUNW,UltraSPARC-III+: NOTICE: [AFT0] EDC Event detected by CPU1 at TL=0, errID 0x... AFSR 0x00000010<EDC>.00000141 
AFAR 0x00000000.a745ad50 Fault_PC 0xfe0ba520 Esynd 0x0141
SUNW,UltraSPARC-III+: [AFT0] errID 0x... Data Bit 93 was in error and corrected

Additional Events

Several other CPU correctable events can be reported, including a number of recoverable parity errors:

DPE     D$ parity event
DDSPE   D$ data parity event
DTSPE   D$ physical tag parity event
IPE     I$ parity event
IDSPE   I$ data parity event
ITSPE   I$ physical tag parity event
TSCE    software correctable single-bit E$ tag ECC event
THCE    hardware corrected single-bit E$ tag ECC event
UCC     software correctable E$ ECC event
EDC     hardware corrected E$ ECC event
WDC     hardware corrected E$ ECC event for writeback (victimization)
CPC     hardware corrected E$ ECC event for copyout (snoop request)
L3_MECC   Both 16-byte data of L3 cache data access have ECC error (either correctable or uncorrectable ECC error).
L3_THCE   single bit ECC error on L3 cache tag access 
L3_EDC    single bit ECC error on L3 cache data access for P-cache and W-cache request
L3_UCC    single bit ECC error on L3 cache data access for I-cache and D-cache request 
L3_CPC    single bit ECC error on L3 cache data access for copyout 
L3_WDC    single bit ECC error on L3 cache data access for writeback
  • When browsing messages files and observing console output, note that [AFT0] is included in these messages.
    • AFT0 represents the "Asynchronous Fault Trap" for correctable and recoverable errors.
    • AFT1 is used for uncorrectable errors; AFT2 and AFT3 can be ignored in almost all cases.
  • The above error messaging may change slightly depending on your kernel update patch version.
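
The event list and the AFT note above can be combined into a small lookup for annotating message lines. This is a minimal sketch under stated assumptions: the name `describe_event` is hypothetical, the table is copied from the list above, and the handling of several comma-separated flags inside the AFSR angle brackets is a guess about the format rather than something shown in the samples.

```python
import re

# Mnemonics and descriptions copied from the list in this document.
EVENT_DESCRIPTIONS = {
    "DPE": "D$ parity event",
    "DDSPE": "D$ data parity event",
    "DTSPE": "D$ physical tag parity event",
    "IPE": "I$ parity event",
    "IDSPE": "I$ data parity event",
    "ITSPE": "I$ physical tag parity event",
    "TSCE": "software correctable single-bit E$ tag ECC event",
    "THCE": "hardware corrected single-bit E$ tag ECC event",
    "UCC": "software correctable E$ ECC event",
    "EDC": "hardware corrected E$ ECC event",
    "WDC": "hardware corrected E$ ECC event for writeback (victimization)",
    "CPC": "hardware corrected E$ ECC event for copyout (snoop request)",
    "L3_MECC": "both 16-byte data of L3 cache data access have ECC error",
    "L3_THCE": "single bit ECC error on L3 cache tag access",
    "L3_EDC": "single bit ECC error on L3 cache data access for P-cache and W-cache request",
    "L3_UCC": "single bit ECC error on L3 cache data access for I-cache and D-cache request",
    "L3_CPC": "single bit ECC error on L3 cache data access for copyout",
    "L3_WDC": "single bit ECC error on L3 cache data access for writeback",
}


def describe_event(line):
    """Return (aft_level, [(mnemonic, description), ...]) for one message line.

    aft_level is None when no [AFTn] tag is present; mnemonics that are not
    in the table (for example CE or EMC) get a placeholder description.
    """
    level = re.search(r"\[AFT(\d)\]", line)
    flags = re.search(r"AFSR 0x[0-9a-fA-F]+<([^>]*)>", line)
    names = flags.group(1).split(",") if flags else []
    return (
        int(level.group(1)) if level else None,
        [(n, EVENT_DESCRIPTIONS.get(n, "not in the list above")) for n in names],
    )
```

A returned level of 0 corresponds to the correctable and recoverable class described above, while a level of 1 indicates an uncorrectable error, which this document does not cover.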

Uncorrectable Error Events

If Solaris has panicked, crashed, or rebooted and you are seeing uncorrectable errors, see Document 1006517.1, Troubleshooting Sun Fire[TM] Uncorrectable CPU and Memory Error(s) on Solaris[TM] 8 and 9, instead.

Troubleshooting Steps

To view full details, sign in with your My Oracle Support account.
