Troubleshooting Sun Fire[TM] Uncorrectable CPU and Memory Error(s) on Solaris[TM] 8 and 9
(Doc ID 1006517.1)
Last updated on NOVEMBER 29, 2016
Applies to:
Sun Fire E25K Server
Sun Fire E4900 Server
Sun Fire E6900 Server
Sun Fire 12K Server
Sun Fire E20K Server
This document addresses uncorrectable CPU/Memory errors reported on systems running Solaris[TM] 8 and Solaris[TM] 9.
Your system may have one or more of the following symptoms:
- The system may have unexpectedly rebooted and the cause is unknown.
- The system may have received UE, ECC errors, or recoverable memory errors.
- The system may be described as crashed, gone down, paniced, panic'd, panic'ed, panicked, rebooted, or received CPU or memory errors.
- Example error messages which may have been reported are as follows:
A. Uncorrectable ECC error on a read from system memory
Main memory uncorrectable ECC error detected by CPU3 from the bank of DIMMs in Slot A: J8100 J8101 J8201 J8200
WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU3 in
Privileged mode at TL=0, errID 0x... AFSR 0x00100004.000000aa AFAR
0x000000a0.0c06f1e0 Fault_PC 0x1015725c Esynd 0x00aa Slot A: J8100
J8101 J8201 J8200
SUNW,UltraSPARC-IV: [AFT1] errID 0x... Two Bits were in error
Main memory uncorrectable ECC error for a prefetch or store queue fill read.
SUNW,UltraSPARC-IV: [ID 581396 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x... AFSR 0x00400000.000000aa AFAR 0x000000a0.0c0ab1f0 Fault_PC 0xff1c1c80 Esynd 0x00aa Slot A: J8100 J8101 J8201 J8200
SUNW,UltraSPARC-IV: [ID 468316 kern.notice] [AFT1] errID 0x... Two Bits were in error
Main memory uncorrectable ECC error detected by Schizo id 9
pcisch: WARNING: uncorrectable error detected by pci0 (safari id 00000000.00000009) during DVMA read transaction
pcisch: Transaction was a block operation.
pcisch: dvma access, Memory safari command, address 000000d0.cb1489a0, owned_in not asserted.
pcisch: AFSR=40000000.89000063 AFAR=000000d0.cb1489a0, quad word offset 00000000.00000002, Memory Module Slot D: J3100 J3101 J3201 J3200 id 9.
pcisch: mtag 0, mtag ecc syndrome 0
Uncorrectable Mtag ECC errors from main memory cause a fatal reset, domain pause or dstop depending on the platform.
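When several pcisch messages appear in a messages file, it can help to pull out the register values and the reported memory module programmatically. The following is a minimal sketch, assuming the line layout matches the sample above; the function name parse_pcisch and both patterns are illustrative and not part of any Sun or Oracle tool.

```python
import re

# Patterns modeled on the sample pcisch message in this document.
# Real messages may vary with kernel patch level (assumption).
PCISCH_REGS = re.compile(
    r"AFSR=(?P<afsr>[0-9a-fA-F]+\.[0-9a-fA-F]+)\s+"
    r"AFAR=(?P<afar>[0-9a-fA-F]+\.[0-9a-fA-F]+)"
)
PCISCH_SLOT = re.compile(
    r"Memory Module (?P<slot>Slot \w+): (?P<dimms>(?:J\d+\s*)+)"
)


def parse_pcisch(line):
    """Return AFSR, AFAR and (if present) slot and DIMM labels, or None."""
    regs = PCISCH_REGS.search(line)
    if regs is None:
        return None
    result = regs.groupdict()
    slot = PCISCH_SLOT.search(line)
    if slot:
        result["slot"] = slot.group("slot")
        result["dimms"] = slot.group("dimms").split()
    return result


if __name__ == "__main__":
    sample = (
        "pcisch: AFSR=40000000.89000063 AFAR=000000d0.cb1489a0, "
        "quad word offset 00000000.00000002, "
        "Memory Module Slot D: J3100 J3101 J3201 J3200 id 9."
    )
    print(parse_pcisch(sample))
```

Lines that do not carry an AFSR=/AFAR= pair, such as the "Transaction was a block operation" line, return None and can be skipped.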
B. CPU Uncorrectable ECC errors
SUNW,UltraSPARC-III+: WARNING: [AFT1] EDU Event detected by CPU1 at TL=0, errID 0x.... AFSR 0x00000018.0000017c AFAR 0x000000a0.0c0ab1f0 Fault_PC 0x1000c19c Esynd 0x017c
SUNW,UltraSPARC-III+: [AFT1] errID 0x.... Four Bits were in error
UCU         uncorrectable E$ ECC event
EDU:ST      uncorrectable E$ ECC event for store merge
EDU:BLD     uncorrectable E$ ECC event for block load
WDU         uncorrectable E$ ECC event for writeback (victimization)
CPU         uncorrectable E$ ECC event for copyout (snoop request)
L3_TUE_SH   multiple-bit ECC error on L3 cache tag access due to copyback, or tag update from foreign Fireplane device, snoop request
L3_TUE      multiple-bit ECC error on L3 cache tag access due to core specific tag access
L3_EDU      multiple-bit ECC error on L3 cache data access for P-cache and W-cache request
L3_UCU      multiple-bit ECC error on L3 cache data access for I-cache and D-cache request
L3_CPU      multiple-bit ECC error on L3 cache data access for copyout
L3_WDU      multiple-bit ECC error on L3 cache data access for writeback
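The event codes listed above can be kept in a small lookup table so that a code seen in a log can be translated quickly. This is only a sketch: the descriptions are copied from this document, while the names CPU_UE_EVENTS and describe_event are hypothetical, and the "D-cache" wording in the L3_UCU entry is an assumed reading.

```python
# Event codes and descriptions taken from the list in this document.
CPU_UE_EVENTS = {
    "UCU": "uncorrectable E$ ECC event",
    "EDU:ST": "uncorrectable E$ ECC event for store merge",
    "EDU:BLD": "uncorrectable E$ ECC event for block load",
    "WDU": "uncorrectable E$ ECC event for writeback (victimization)",
    "CPU": "uncorrectable E$ ECC event for copyout (snoop request)",
    "L3_TUE_SH": "multiple-bit ECC error on L3 cache tag access due to "
                 "copyback, or tag update from foreign Fireplane device, "
                 "snoop request",
    "L3_TUE": "multiple-bit ECC error on L3 cache tag access due to "
              "core specific tag access",
    "L3_EDU": "multiple-bit ECC error on L3 cache data access for "
              "P-cache and W-cache request",
    # "D-cache" is an assumed reading of the source text.
    "L3_UCU": "multiple-bit ECC error on L3 cache data access for "
              "I-cache and D-cache request",
    "L3_CPU": "multiple-bit ECC error on L3 cache data access for copyout",
    "L3_WDU": "multiple-bit ECC error on L3 cache data access for writeback",
}


def describe_event(code):
    """Return the description for an event code, or a fallback string."""
    return CPU_UE_EVENTS.get(code.strip().upper(), "not listed in this document")


if __name__ == "__main__":
    for code in ("WDU", "L3_EDU", "XYZ"):
        print(code, "->", describe_event(code))
```

Codes that are not in the table return a fallback string rather than raising an error, since message wording may differ between kernel patch levels.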
Error Messaging Notes
- When browsing messages files and observing console output, note that these messages include [AFT1]. The 1 denotes the "Asynchronous Fault Trap" level used for uncorrectable and unrecoverable errors. AFT0 is used for correctable errors; AFT2 and AFT3 can be ignored in almost all cases.
- The above error messaging may change slightly depending on your kernel update patch version.
- It is important to understand that uncorrectable ECC errors can be reported by multiple components. At no point will the corrupted data actually be used.
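To separate uncorrectable ([AFT1]) events from correctable ([AFT0]) ones when scanning a messages file, a short script can extract the AFT level, the event name, the detecting CPU, and the register values. The sketch below is modeled only on the sample messages in this document; the function names are illustrative, and messages that wrap across several lines would need to be joined before parsing.

```python
import re

# Patterns modeled on the sample messages in this document; actual
# wording may differ by kernel update patch version (assumption).
AFT_EVENT = re.compile(
    r"\[AFT(?P<level>\d)\]\s+"
    r"(?P<event>.+?)\s+Event detected by CPU(?P<cpu>\d+)"
)
FIELD = re.compile(
    r"\b(AFSR|AFAR|Fault_PC|Esynd)\s+(0x[0-9a-fA-F]+(?:\.[0-9a-fA-F]+)?)"
)


def parse_aft(message):
    """Return a dict of fields from one AFT event message, or None."""
    m = AFT_EVENT.search(message)
    if m is None:
        return None
    record = {
        "level": int(m.group("level")),
        "event": m.group("event"),
        "cpu": int(m.group("cpu")),
    }
    # Add whichever register fields are present on the line.
    record.update(dict(FIELD.findall(message)))
    return record


def uncorrectable_events(lines):
    """Yield parsed records for [AFT1] event messages only."""
    for line in lines:
        rec = parse_aft(line)
        if rec is not None and rec["level"] == 1:
            yield rec


if __name__ == "__main__":
    sample = (
        "SUNW,UltraSPARC-III+: WARNING: [AFT1] EDU Event detected by CPU1 "
        "at TL=0, errID 0x.... AFSR 0x00000018.0000017c "
        "AFAR 0x000000a0.0c0ab1f0 Fault_PC 0x1000c19c Esynd 0x017c"
    )
    for rec in uncorrectable_events([sample]):
        print(rec)
```

Follow-up lines such as "[AFT1] errID 0x... Two Bits were in error" do not name an event and a CPU, so they return None and are skipped.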
This document does not apply to Solaris[TM] 10, as FMA automates the diagnosis of these types of faults. See <Document:1018939.1> Solaris[TM] 10 Operating System: Displaying the list of Fault Management Architecture (FMA) resources currently believed to be faulted.
If Solaris has not panicked, crashed, or rebooted and you are seeing only correctable errors, please see <Document:1006513.1> Troubleshooting Sun Fire[TM] Correctable CPU and Memory Error(s) on Solaris[TM] 8 and 9.