Maximizing Availability with Engineered Systems - Exadata
(Doc ID 1571965.1)
Last updated on SEPTEMBER 29, 2021
Applies to:
Oracle Database Products > Exadata
Information in this document applies to any platform.
Details
Maximizing Availability with Engineered Systems - Exadata
Oracle would like all our customers to ensure that they are able to achieve the highest availability with any application or database deployed on Exadata Database Machine. The Oracle Platinum Service is an integral part of achieving maximum availability and uptime; however, there are other key aspects required to ensure the highest stability and recoverability in the face of various unplanned failures and planned maintenance activities.
The goal of this document is to outline each key component and set realistic expectations for any customer interested in the Oracle Platinum Service. The document covers:
- Customer Review
- Understanding HA Business Requirements
- MAA Architecture to Meet HA Business Requirements
- Configuration and Operational Practices
- Restoration vs Root Cause Analysis
- Customer Worksheet
Customer Review
Each customer will have different High Availability (HA) requirements and different priorities for restoration versus Root Cause Analysis (RCA). The materials below allow each customer to set the correct expectations for restoration of various outages, so that Oracle and the customer can work collaboratively on the customer's top priorities.
Furthermore, configuration and operational practices are critical, and we strongly recommend that an Exachk report is run every month and that any failures or warnings are addressed. The latest Exadata and MAA best practices are reflected in Exachk when using the latest Exachk version.
Understanding HA Business Requirements
It is important that you document your business HA requirements and ensure that they can be met with your chosen architecture. You can use the high availability analysis framework described in HA Overview, Chapter 2 Determining Your High Availability Requirements as a reference. Key HA requirements include:
Cost of Downtime - A complete business impact analysis provides the insight needed to quantify the cost of unplanned and planned downtime. Understanding this cost is essential because it helps prioritize your high availability investment and directly influences the high availability technologies that you choose to minimize the downtime risk.
Recovery Time Objective (RTO) - The business impact analysis will determine your recovery time objective (RTO). RTO is defined as the maximum amount of time that an IT-based business process can be down before the organization starts suffering unacceptable consequences (financial losses, customer dissatisfaction, reputation, and so on). RTO indicates the downtime tolerance of a business process or an organization in general.
Recovery Point Objective (RPO) - The business impact analysis also determines your recovery point objective (RPO). RPO is the maximum amount of data that an IT-based business process may lose without harm to the organization. RPO indicates the data-loss tolerance of a business process or an organization in general.
MAA Architecture to Meet HA Business Requirements
Exadata Database Machine is engineered for high availability. However, for the fastest repair and highest uptime, we recommend the Exadata Maximum Availability Architecture (MAA), consisting of a primary Exadata, a viable standby Exadata, as well as sufficient test and development Exadata machines to validate any change before incorporating it into the primary Exadata. With a standby Exadata using Active Data Guard or Oracle GoldenGate, you can achieve the fastest repair times for outages such as data corruptions, full cluster or database failures, or Database Machine failures due to disasters. Otherwise, Oracle Platinum Service may recommend rebooting servers or restoring from backups for data or storage failures if there is no viable standby failover target available.
We recommend following the Exadata MAA blueprint as described in MAA Best Practices for Oracle Exadata Database Machine (technical brief) or referring to the Exadata MAA best practices on our Exadata MAA OTN website.
For unplanned outages, Exadata Database Machine is fault-tolerant and integrated with the MAA best practices to provide the following benefits:
- Tolerates node and instance failures by using Oracle RAC
- Tolerates disk and cell failures by using Oracle ASM and the Oracle Exadata Storage Server Grid. Exadata MAA recommends ASM high redundancy disk groups for the best data protection and for redundancy during Exadata rolling upgrades.
- Attempts to prevent and automatically repair corruptions by using the Oracle ASM automatic repair mechanism, the Exadata storage built-in corruption checks, and the Oracle generic block corruption parameters. Note that some Oracle corruption prevention, detection, and repair best practices described in Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1) are not implemented by default due to varying performance impact. You will need to take further action to evaluate whether enabling and configuring these best practices is viable for your environment (a configuration sketch follows this list).
- Provides redundant and fault tolerant ports, cables, host channel adapters, and bonded networks
- Provides the ability to quickly repair full cluster or Database Machine failures by using Oracle Data Guard and another Exadata Database Machine if available.
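The following is a minimal sketch of how the corruption-related settings and ASM redundancy mentioned above can be reviewed from SQL*Plus. The parameter values shown are illustrative assumptions; evaluate the performance impact and the exact recommendations in Doc ID 1302539.1 before changing anything in production.

    -- Review current corruption prevention/detection settings
    SQL> SHOW PARAMETER db_block_checksum
    SQL> SHOW PARAMETER db_block_checking
    SQL> SHOW PARAMETER db_lost_write_protect

    -- Illustrative settings on a Data Guard primary (test first;
    -- DB_BLOCK_CHECKING in particular can add noticeable overhead)
    SQL> ALTER SYSTEM SET db_block_checksum     = FULL    SCOPE=BOTH SID='*';
    SQL> ALTER SYSTEM SET db_lost_write_protect = TYPICAL SCOPE=BOTH SID='*';
    SQL> ALTER SYSTEM SET db_block_checking     = MEDIUM  SCOPE=BOTH SID='*';

    -- Confirm ASM disk group redundancy (HIGH is the Exadata MAA recommendation)
    SQL> SELECT name, type FROM v$asm_diskgroup;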
For planned maintenance, Exadata Database Machine provides the following benefits:
- Supports Oracle ASM, Oracle Clusterware, and Oracle RAC rolling upgrade or software changes
- Supports Oracle Exadata Storage Server Software rolling upgrade for patches (a pre-check sketch follows this list)
- Allows application and system changes with Oracle Data Guard and Oracle GoldenGate if available.
- Supports all of the online maintenance capabilities that are generic to the database
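Before rolling storage server maintenance, Exadata MAA practice is to confirm that ASM can tolerate taking each cell offline. The following is a minimal pre-check sketch; it assumes a dcli cell group file (here ~/cell_group), and the actual patching itself is performed with the tooling described in the Exadata patching documentation (Doc ID 1262380.1).

    # Confirm the cluster is healthy on all database servers
    $GRID_HOME/bin/crsctl check cluster -all

    # Confirm every grid disk can go offline without losing ASM redundancy;
    # proceed with a cell only when asmdeactivationoutcome reports "Yes"
    dcli -g ~/cell_group -l root \
      "cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome"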
Configuration and Operational Practices
Oracle Platinum Service provides some of the key configuration and operational best practices; however, several key practices are required of the customer to maintain stability. The details are described in MAA Best Practices for Oracle Exadata Database Machine (technical brief) and highlighted below:
- Understand HA requirements as described above.
- Download and run the latest Exachk – Oracle Exadata Database Machine exachk or HealthCheck (Doc ID 1070954.1) and address all hardware, software, and configuration FAILURES and WARNINGS every month and before or after any maintenance (a sample invocation follows this list).
As part of the Exachk report, you may be notified about Exadata critical issues described in Exadata Critical Issues (Doc ID 1270094.1). If a critical issue is relevant for your environment, you may need to apply the relevant workaround or patch to your test, standby, and eventually your production systems.
- Utilize a test environment, ideally an identical Exadata replica, to evaluate any change (e.g., software, hardware, or application changes) before incorporating it into your standby or production system. Follow the Exadata testing and patching practices described in Exadata Patching Overview and Patch Testing Guidelines (Doc ID 1262380.1), which include functional testing, performance testing, and application HA testing. Oracle conducts a comprehensive set of tests with every release and every Engineered System; however, this does not substitute for testing your own workload and application and their impact on HA and performance. When conducting HA tests, record your achievable restoration times for various outages and ensure they meet your requirements.
- If a standby Exadata is deployed, execute Data Guard role transitions to ensure all procedures are validated and the standby is failover or switchover ready (a role-transition sketch follows this list).
- Utilize Exadata monitoring best practices to manage and monitor database and system performance. Oracle Platinum Service will monitor for hardware faults and configuration best practices using ASR, OCM, and Oracle's customized gateway, but the customer must monitor and alert on database and system performance issues due to application changes or workload growth.
- If a failure takes down a database node, your application must be configured to react to the failure and fail over transparently to achieve the highest availability. Refer to Client Failover Best Practices for Data Guard 11g Release 2, and implement and validate these best practices to achieve the highest application availability (a sample service configuration follows this list).
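A sample Exachk invocation is shown below. The staging directory is an assumption; download and stage the latest Exachk as described in Doc ID 1070954.1 before running it.

    # Run Exachk from the directory where it was unzipped (location is site-specific)
    cd /opt/oracle.SupportTools/exachk      # assumed staging directory
    ./exachk -a                             # run all checks
    # Review the generated HTML report and address every FAIL and WARNING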
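The following is a minimal Data Guard role-transition sketch using the Data Guard broker. The database name "boston" and the connect string are placeholders; use your own broker configuration and follow your documented switchover procedures.

    # Connect with the broker and verify the configuration is healthy
    dgmgrl sys@primary_tns
    DGMGRL> SHOW CONFIGURATION;
    # Perform the planned role transition to the standby, then verify the new roles
    DGMGRL> SWITCHOVER TO boston;
    DGMGRL> SHOW CONFIGURATION;
    # Switch back after validating application connectivity on the new primary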
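Transparent client failover is typically built on a role-based database service plus a connect descriptor that lists both the primary and standby SCAN addresses. The sketch below uses hypothetical names (proddb, prod_svc, primary-scan, standby-scan); the authoritative settings are in the Client Failover Best Practices paper referenced above.

    # Role-based service that only starts when the database is in the PRIMARY role
    srvctl add service -d proddb -s prod_svc -r "proddb1,proddb2" \
      -l PRIMARY -q TRUE -e SELECT -m BASIC -w 5 -z 12
    srvctl start service -d proddb -s prod_svc

    # Client alias listing both SCANs so sessions follow the primary after failover
    PROD_SVC =
      (DESCRIPTION =
        (CONNECT_TIMEOUT=10)(RETRY_COUNT=3)
        (ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=primary-scan)(PORT=1521)))
        (ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=standby-scan)(PORT=1521)))
        (CONNECT_DATA=(SERVICE_NAME=prod_svc)))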
Restoration vs Root Cause Analysis
Restoration of service may require restarting a server or failing over the database to a Data Guard standby or GoldenGate replica. Other examples include restoring from backup when multiple disk failures or corruptions have rendered the database unusable and there is no viable Data Guard standby or GoldenGate replica. For a complete list of outages and restoration action plans, refer to Exadata Platinum Customer Outage Classifications and Restoration Action Plans (Doc ID 1483344.1).
Root cause analysis (RCA), on the other hand, requires:
- Clear problem statement
- All traces and information required to understand and analyze the problem. Oracle attempts to generate the necessary traces at the time of the incident; the customer needs to jointly gather the necessary information and provide it to Support by uploading it to the Service Request. Examples of what needs to be gathered for different incidents are described in Diagnostic Assistant: General Information (Doc ID 201804.1) and Exadata Diagnostic Collection Guide (Doc ID 1353073.1) (a collection sketch follows this list).
- A use case or, in the worst case, a reproducible case if the traces are not sufficient, or additional tracing implemented to gather more details if the problem resurfaces
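A minimal collection sketch for Exadata incidents is shown below. The sundiag.sh path is the standard location on the Exadata system image, but the exact data to gather for a given incident type is described in Doc ID 1353073.1.

    # On each affected database or storage server, as root:
    /opt/oracle.SupportTools/sundiag.sh
    # Upload the resulting tarball (written under /tmp) to the Service Request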
RCA does not expedite restoration and in some cases may hinder restoration of service since restarting the service or database may be postponed for deeper diagnostics and analysis of the existing faulty system or component.
If maximizing availability is your top priority, focus on restoration of service first and then proceed with RCA. If RCA is prioritized, then service uptime may suffer in the short term in some cases.
Customer Worksheet
Please fill in the worksheet below and forward it to your internal DBA and Operations teams for tracking and ongoing management, especially with regard to their responsibilities for executing Exachk and their involvement in restoration and repair scenarios.
Please download the checklist here; a sample is shown below.
For your Exadata system enrolled in Platinum support and its databases, please check or highlight the relevant answers:
Rack Serial Number | Exadata Configuration | Rack Size | Exachk report executed monthly and prior to or after any maintenance | Restoration Plans (Doc ID 1483344.1) review performed by system staff that logs SRs
AK12345678 (sample #) | Please describe: ______________________ | | |
Database Name(s) / Group(s) | Database Type | Backup Restoration Tested | DB Switchover Performed | Application Configured to Failover to Standby | Top Priority
<db_name_1, db_name_1X, …> | | | | |
<db_name_2, db_name_2X, …> | | | | |
<db_name_3, db_name_3X, …> | | | | |
… | | | | |
<db_name_n, db_name_nX, …> | | | | |