ODA HA: Massive IO Performance Issue On One of Two Nodes: Same OS version, Same Shared Storage,
(Doc ID 2559921.1)
Last updated on SEPTEMBER 07, 2023
Applies to:
Linux OS - Version Oracle Linux 6.10 to Oracle Linux 6.10 [Release OL6U10]
Oracle Database Appliance Software - Version 12.1.2.10 to 18.1 [Release 12.1 to 12.2]
Information in this document applies to any platform.
Symptoms
A two-node ODA Cluster configuration using the same binaries and shared storage.
Extreme performance degradation for basic OS IO operations on one of two nodes after a database was moved from one node to another.
- The performance problem was only detected after the working database on Node1 was failed over to Node0.
- Both nodes had been patched and up and running for some time.
- Both nodes were at the exact same version of the OS
- Both used the exact same RDBMS version, data set, and queries.
- Both nodes shared the exact same disks, yet the IO performance was substantially worse on one of the two nodes.
OS, Hardware, RDBMS and ODA were all checked for problems via several SRs
- The problem was only discovered after weeks/months in this configuration, and only after the database was failed over from one node to the other.
- Normal database and query tuning steps did not identify any difference in the explain plans or provide a reason for the problem.
- No hardware problems were detected after several checks
- No physical differences in resources or parameters were detected between the nodes, including CPUs, settings, etc.
- No ODA differences were detected between the nodes using normal queries.
Only after the database itself had been ruled out as the source was a generic IO-level test run from each node, which confirmed a system-level problem on one node.
The following script was run for the IO level test:
#!/bin/bash
start_ts=$(date +%H:%M:%S:%N)
for i in {1..1000}
do
  echo "Count: $i"
done
end_ts=$(date +%H:%M:%S:%N)
echo $start_ts
echo $end_ts
exit
It simply echoes 1000 lines to the shell and prints the start and end timestamps of the script.
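The real/user/sys figures in the results suggest the script was run under the shell's time facility with its output captured to a file. A self-contained sketch of such a run is shown below; the output file name is illustrative, and the "Count:" lines are discarded so only the timestamps and timing statistics are kept.

```shell
#!/bin/bash
# Sketch: run the echo loop under bash's "time" keyword, discard the
# 1000 "Count:" lines, and keep only the start/end timestamps plus the
# real/user/sys statistics (output file name is illustrative).
{
  time {
    start_ts=$(date +%H:%M:%S:%N)
    for i in {1..1000}; do
      echo "Count: $i"
    done > /dev/null
    end_ts=$(date +%H:%M:%S:%N)
    echo "$start_ts"
    echo "$end_ts"
  }
} > time_output_node0.txt 2>&1
cat time_output_node0.txt
```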
The results were the following:
[oda00]$ cat time_output_node0.txt
08:23:36:776457699
08:23:36:837322700
real 0m 0.068s
user 0m 0.044s
sys 0m 0.015s
[oda01]$ cat time_output_node1.txt
08:23:25:521320353
08:23:25:537715253
real 0m 0.020s
user 0m 0.010s
sys 0m 0.007s
Another test / pass resulted in the following time differences.
node0: 57.50323ms
node1: 15.596865ms
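The millisecond differences above can be derived directly from the %H:%M:%S:%N timestamps the script prints. A small sketch of that conversion, using the node0 timestamps from the first run shown above:

```shell
#!/bin/bash
# Sketch: convert two %H:%M:%S:%N timestamps into an elapsed time in
# milliseconds (values taken from the node0 output above).
start="08:23:36:776457699"
end="08:23:36:837322700"

to_ns() {
  # Split HH:MM:SS:NS and convert to total nanoseconds since midnight.
  # The 10# prefix forces base-10 so leading zeros are not read as octal.
  IFS=: read -r h m s ns <<< "$1"
  echo $(( (10#$h * 3600 + 10#$m * 60 + 10#$s) * 1000000000 + 10#$ns ))
}

elapsed_ns=$(( $(to_ns "$end") - $(to_ns "$start") ))
echo "elapsed: $(( elapsed_ns / 1000000 )) ms"   # prints "elapsed: 60 ms"
```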
Changes
This particular installation had one upgrade, to 18.3, after being imaged at ODA version 12.1.2.10.
All components show as equal using ODACLI commands.
The OS also appeared to be at the same level.
ODA: 18.3.0.0.0
Linux: 6.10
Kernel: kernel-uek-4.1.12-124.18.6.el6uek.x86_64
However, a closer look at the sosreport did reveal a few differences.
The tainted levels were not equal on the two nodes.
By itself, a nonzero tainted value is not necessarily a problem.
However, this was the first indication of a difference that had previously gone undetected after several HW, OS and ODA checks.
tainted
node0 kernel.tainted = 69633
node1 kernel.tainted = 4097
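The kernel.tainted value (also readable from /proc/sys/kernel/tainted) is a bitmask, so the two values can be decoded to see exactly which taint flags differ. A minimal sketch using the values reported above; per the kernel's tainted-kernels documentation, bit 0 means a proprietary module was loaded and bit 12 means an out-of-tree module was loaded, so the extra bit set on node0 (bit 16) is the difference worth investigating.

```shell
#!/bin/bash
# Sketch: decode the kernel.tainted bitmask values reported on each node.
# Bit meanings are listed in the kernel's tainted-kernels documentation,
# e.g. bit 0 = proprietary module (P), bit 12 = out-of-tree module (O).
for val in 69633 4097; do
  bits=""
  for b in $(seq 0 31); do
    (( (val >> b) & 1 )) && bits="$bits $b"
  done
  echo "kernel.tainted=$val -> bits set:$bits"
done
```

Running this shows node1 with bits 0 and 12 set, and node0 with bits 0, 12 and an additional bit 16 set.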
A closer look while trying to alter debug settings revealed more differences.
Files under the debug directory were not equal on the two nodes; the problem node was missing files.
node0:
Cause