Exadata - Possible high latency for redo log writes when using IORM limits (Doc ID 2197953.1)

Last updated on NOVEMBER 30, 2016

Applies to:

Oracle Exadata Storage Server Software - Version 12.1.2.2.1 to 12.1.2.3.2 [Release 12.1]
Information in this document applies to any platform.

Symptoms

When IORM is configured on the cells using limits, as shown in the following example:

cellCLI> list iormplan detail
name: cel02_IORMPLAN
catPlan:
dbPlan: name=DB1,level=1,allocation=30,limit=50
name=DB2,level=1,allocation=30,limit=50
name=DB3,level=1,allocation=20,limit=30
name=DB4,level=1,allocation=10,limit=20
name=other,level=1,allocation=10,limit=15
objective: auto
status: active

Upgrade to 12.1.2.3.2 could cause a performance degradation on the system with the following symptoms:

  1. Wait event log file sync shows significant increase in Wait time when the same workload is run and the same IORM plan is used.
  2. AWR IOStat by Function summary section shows an increase in average time for LGWR.
  3. If cell parameter _cell_latency_threshold_print_warning is enabled, operation 'RedoLog Write' reports latency exceeds threshold.
  4. IO throttling is evident in the IORM performance statistics in a cell system state dump.

 

High latency on wait event log file sync

Same IORM plan before the upgrade, log file sync reported avg wait under 1 ms. After upgrade, log file sync reported up to 30 ms.

From AWR IO profile:

Before Upgrade

IOStat by Function summary

'Data' columns suffixed with M,G,T,P are in multiples of 1024 other columns suffixed with K,M,G,T,P are in multiples of 1000
ordered by (Data Read + Write) desc

Function Name Reads: Data Reqs per sec Data per sec Writes: Data Reqs per sec Data per sec Waits: Count Avg Tm(ms)
Smart Scan 9.7T 5806.65 5.5G 0M 0.00 0M 0
Buffer Cache Reads 70.2G 4072.38 39.835M 0M 0.00 0M 7086.7K 1.77
DBWR 0M 0.00 0M 27.3G 1412.25 15.501M 0
LGWR 1M 0.03 .001M 13.5G 1701.03 7.633M 1163.5K 0.75
Others 5.5G 31.74 3.099M 4.7G 4.92 2.667M 17.6K 1.98
Direct Writes 0M 0.00 0M 2.2G 2.72 1.237M 0
Direct Reads 340M 2.71 .188M 9M 0.59 .005M 0
TOTAL: 9.8T 9913.51 5.6G 47.7G 3121.50 27.043M 8267.8K 1.63

After Upgrade

IOStat by Function summary

'Data' columns suffixed with M,G,T,P are in multiples of 1024 other columns suffixed with K,M,G,T,P are in multiples of 1000
ordered by (Data Read + Write) desc

Function Name Reads: Data Reqs per sec Data per sec Writes: Data Reqs per sec Data per sec Waits: Count Avg Tm(ms)
Smart Scan 4.9T 2932.73 2.8G 0M 0.00 0M 0
Buffer Cache Reads 82.7G 3430.62 46.717M 0M 0.00 0M 5430.8K 1.86
Direct Writes 72M 0.08 .04M 22.8G 100.42 12.896M 0
DBWR 3M 0.25 .002M 18G 798.22 10.153M 458 3.96
Others 7.3G 15.32 4.119M 7G 6.03 3.956M 15.7K 3.20
LGWR 1M 0.03 .001M 12.8G 1078.00 7.253M 715.1K 1.23
Direct Reads 5G 9.82 2.825M 751M 1.15 .414M 0
Streams AQ 3M 0.15 .002M 0M 0.00 0M 274 2.80
TOTAL: 5T 6388.99 2.8G 61.4G 1983.81 34.672M

But the latency on log file sync event is High:

Before the upgrade:

Top 10 Foreground Events by Total Wait Time

Event Waits Total
Wait Time (sec) Wait Avg(ms) % DB time Wait Class
cell single block physical read 6,861,297 14.9K 2 27.1 User I/O
SQL*Net message from dblink 990,818 12.7K 13 23.1 Network
DB CPU 11.2K 20.3
cell smart table scan 2,944,674 2603.1 1 4.7 User I/O
log file sync 290,032 2055 7 3.7 Commit
enq: KO - fast object checkpoint 37,123 1158.6 31 2.1 Application
cell list of blocks physical read 98,833 532.8 5 1.0 User I/O
gc current block busy 76,850 529.5 7 1.0 Cluster
cell multiblock physical read 109,138 404.4 4 .7 User I/O
row cache lock 1,397,482 247.1 0 .4 Concurrency

After the Upgrade:

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tota Wait % DB
Event Waits Time Avg(ms) time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
SQL*Net message from dblink 1,184,777 10.9 9 18.0 Network
log file sync 270,355 7972 29 13.2 Commit
DB CPU 6817 11.3
cell single block physical rea 1,823,734 5700 3 9.4 User I/O
gc buffer busy acquire 112,535 3946 35 6.5 Cluster
gc current block busy 76,320 3430 45 5.7 Cluster
gc cr block busy 60,699 2407 40 4.0 Cluster
gc buffer busy release 19,917 1503 75 2.5 Cluster
cell smart table scan 1,878,133 763. 0 1.3 User I/O
direct path write temp 35,802 383. 11 .6 User I/O

Enabling parameter _cell_latency_threshold_print_warning on the cells, it will report RedoLog writes exceeds the threshold

CellCLI> alter cell events = "immediate cellsrv.cellsrv_setparam('_cell_latency_threshold_print_warning', 'true')"

Tue Aug 30 12:36:49 2016
[79] Latency threshold check: IO Completed: Operation 'RedoLog Write'. Latency: '1088 ms' exceeds threshold: '500 ms'
[79] Client host: db05.mydomain.com, pid: 240096, RQ_Tag_1860373023_62127: lastError: 0 job (0x6000e2378570) state: 2, jobType: 4, startTime: Tue Aug 30 12:36:47 2016 992 msec currentTime: Tue Aug 30 12:36:49 2016 80 msec
[79] Griddisk: DATA_EXA_CD_04_cel03, Celldisk: CD_04_cel03
[79] Pending IOs [S: 1024 bytes R: 0, W: 0, R: 0 bytes, W: 0 bytes]
[79] Phase: IO_issue_to_IO_completion (1088000 us) (1 trips)
[79] Total for all phases: (1088000 us)

Generating a system state dump, it will show Disk IO throttling

After enabling IORM tracing on the cell and generating system state dump (see DIAGNOSTIC section), signs of IO throttling on the hardDisks:

$ grep dbidx svtrc_7300_0.trc |grep maxcapwait |grep -v '#maxcapwaits=0'
dbidx=0 bitmap=0x0 DB1 limit=50, #maxcapwaits=907
dbidx=1 bitmap=0x0 DB2 limit=50, #maxcapwaits=2796
dbidx=2 bitmap=0x0 DB3 limit=30, #maxcapwaits=294
dbidx=0 bitmap=0x0 DB4 limit=50, #maxcapwaits=1113

Also the maxcap is only reported for the HardDisks:

We can see that maxcapwaits is only for the harddisks, not for the flashdisks
IORM stats for disk=/dev/sdd
dbidx=0 bitmap=0x0 DB1 limit=50, #maxcapwaits=838
dbidx=1 bitmap=0x0 DB2 limit=50, #maxcapwaits=2246
dbidx=2 bitmap=0x0 DB3 limit=30, #maxcapwaits=267
IORM stats for disk=/dev/sdc
dbidx=0 bitmap=0x0 DB1 limit=50, #maxcapwaits=64
dbidx=1 bitmap=0x0 DB2 limit=50, #maxcapwaits=428
dbidx=2 bitmap=0x0 DB3 limit=30, #maxcapwaits=20

 

Changes

This behavior was seen after updating storage servers to 12.1.2.3.2.

Cause

Sign In with your My Oracle Support account

Don't have a My Oracle Support account? Click to get started

My Oracle Support provides customers with access to over a
Million Knowledge Articles and hundreds of Community platforms