Troubleshooting gc block lost and Poor Network Performance in a RAC Environment
(Doc ID 563566.1)
Last updated on DECEMBER 10, 2024
Applies to:
Oracle Database - Enterprise Edition - Version 9.2.0.1 and laterOracle Database Exadata Express Cloud Service - Version N/A and later
Gen 1 Exadata Cloud at Customer (Oracle Exadata Database Cloud Machine) - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Service - Version N/A and later
Information in this document applies to any platform.
Oracle Clusterware & Oracle Real Application Clusters
Symptoms
Summary
In Oracle RAC environments, RDBMS gathers global cache work load statistics which are reported in STATSPACK, AWRs and GRID CONTROL. Global cache lost blocks statistics ("gc cr block lost" and/or "gc current block lost") for each node in the cluster as well as aggregate statistics for the cluster represent a problem or inefficiencies in packet processing for the interconnect traffic. These statistics should be monitored and evaluated regularly to guarantee efficient interconnect Global Cache and Enqueue Service (GCS/GES) and cluster processing. Any block loss indicates a problem in network packet processing and should be investigated.
The vast majority of escalations attributed to RDBMS global cache lost blocks can be directly related to faulty or mis-configured interconnects. This document serves as guide for evaluating and investigating common (and sometimes obvious) causes.
Even though much of the discussion focuses on Performance issues, it is possible to get a node/instance eviction due to these problems. Oracle Clusterware & Oracle RAC instances rely on heartbeats for node memberships. If network Heartbeats are consistently dropped, Instance/Node eviction may occur. The Symptoms below are therefore relevant for Node/Instance evictions.
Symptoms:
Primary:
- "gc cr block lost" / "gc current block lost" in top 5 or significant wait event
Secondary:
- SQL traces report multiple gc cr requests / gc current request /
- gc cr multiblock requests with long and uniform elapsed times
- Poor application performance / throughput
- Packet send/receive errors as displayed in ifconfig or vendor supplied utility
- Netstat reports errors/retransmits/reassembly failures
- Node failures and node integration failures
- Abnormal cpu consumption attributed to network processing
Cause:
- Probable causes are noted in the Diagnostic Guide below (ordered by most likely to least likely cause)
Global Cache Block Loss Diagnostic Guide
- Faulty or poorly seated cables/cards/Switches
Description: Faulty network cable connections, the wrong cable, poorly constructed cables, excessive length and wrong port assignments, faulty switch can result in inferior bit rates, corrupt frames, dropped packets and poor performance.
Action: Engage network vendor to perform physical network checking, replace faulty network parts. CAT 5 grade cables or better should be deployed for interconnect links. All cables should be securely seated and labeled according to LAN/port and aggregation, if applicable. Cable lengths should conform to vendor ethernet specifics. -
Poorly sized UDP receive (rx) buffer sizes / UDP buffer socket overflows
Description: Oracle RAC Global cache block processing is bursty in nature and, consequently, the OS may need to buffer receive(rx) packets while waiting for CPU. Unavailable buffer space may lead to silent packet loss and global cache block loss. 'netstat -s' or 'netstat -su' on most UNIX will help determine UDPInOverflows, packet receive errors, dropped frames, or packets dropped due to buffer full errors.
Action: Packet loss is often attributed to inadequate( rx) UDP buffer sizing on the recipient server, resulting in buffer overflows and global cache block loss. The UDP receive (rx) buffer size for a socket is set to 128k when Oracle opens the socket when the OS setting is less than 128k. If the OS setting is larger than 128k Oracle respects the value and leaves it unchanged. The UDP receive buffer size will automatically increase according to database block sizes greater than 8k, but will not increase beyond the OS dependent limit. UDP buffer overflows, packet loss and lost blocks may be observed in environments where there are excessive timeouts on "global cache cr requests" due to inadequate buffer setting when DB_FILE_MULTIBLOCK_READ_COUNT is greater than 4. To alleviate this problem, increase the UDP buffer size and decrease the DB_FILE_MULTIBLOCK_READ_COUNT for the system or active session.
To determine if you are experiencing UDP socket buffer overflow and packet loss, on most UNIX platforms, execute
udp_max_buf controls how large send and receive buffers (in bytes) can be for a UDP socket. The default setting, 262,144 bytes, may be inadequate for STREAMS applications. sq_max_size is the depth of the message queue. - For AIX platform only, VIPA and DGD setting incorrect
If Virtual IP Address (VIPA) is used for cluster_interconnect for AIX platform, then Dead Gateway Detection (DGD) must be configed to allow UDP failover.
The default DGD parameters are recommended as a starting point, but may need to be tuned based on customer environment, however in all cases must be set to a value greater than one. The default settings are:
dgd_packets_lost = 3
dgd_ping_time = 5
dgd_retry_time = 5
Refer check with AIX / IBM for Using VIPA and Dead Gateway Detection on AIX for High Availability Networks, including Oracle RAC - For Solaris + Veritas LLT environment, misconfigured switch
It is observed from VCS command lltstat, whenever "Snd retransmit data" increases, gc block lost count also increases.
Change the interconnect switch speed from fixed to auto-negotiate and in the interconnect switch, distribute the cables more evenly to each modules, helps to stop the "gc blocks lost". - For 12.1.0.2, <Bug 20922010> FALSE 'GC BLOCKS LOST' REPORTED ON 12.1 AFTER UPGRADING FROM 11.2.0.3
It has been fixed in 12.1.0.2.161018 and 12.2.0.1, please refer to the following documents for more information:
<Note 20922010.8> Bug 20922010 - False 'gc blocks lost' reported on 12.1 after upgrading from 11.2
<Note 2096299.1> False increase of 'Global Cache Blocks Lost' or 'gc blocks lost' after upgrade to 12c
Changes
As explained above, Lost blocks are generally caused by unreliable Private network. This can be caused by a bad patch or faulty network configuration or hardware issue.
Cause
To view full details, sign in with your My Oracle Support account. |
|
Don't have a My Oracle Support account? Click to get started! |
In this Document
Symptoms |
Summary |
Symptoms: |
Cause: |
Global Cache Block Loss Diagnostic Guide |
Changes |
Cause |
Solution |
Community Discussions |
References |