Recommended BDA Settings to Improve Performance Slowdowns Characterized by Network Timeouts or Network-Related Issues/Exceptions Like SocketTimeoutException
(Doc ID 2569049.1)
Last updated on APRIL 17, 2023
Applies to: Big Data Appliance Integrated Software - Version 4.5.0 and later
The recommendations for addressing performance slowdowns characterized by network timeouts or network-related issues/exceptions like SocketTimeoutException are:
- Enabling the kernel option cm_ibcrc_as_csum
- Upgrading the HCA firmware to the latest version supported on the BDA for the ConnectX-3 HCA model, which is currently 2.35.5532.
Frequently Asked Questions
1. What does the Linux kernel option cm_ibcrc_as_csum indicate?
cm_ibcrc_as_csum indicates whether to use the IB CRC as the checksum (CSUM) in connected mode. It enables a TCP checksum offload feature: when the option is enabled, IPoIB relies internally on the CRC that InfiniBand connected mode generates at the sender and verifies at the receiver. The HCA performs this work, so performance is good.
The default value is 1 (enabled).
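To see which value a node is currently using, the parameter can be read from sysfs. The following is a minimal sketch, assuming that the option is exposed as a parameter of the ib_ipoib kernel module under /sys/module/ib_ipoib/parameters/. That path and the helper name read_cm_ibcrc_as_csum are assumptions for illustration and are not taken from this document.

```python
from pathlib import Path
from typing import Optional

# Assumed location: module parameters normally appear under
# /sys/module/<module>/parameters/. The module name ib_ipoib is an assumption.
PARAM_PATH = Path("/sys/module/ib_ipoib/parameters/cm_ibcrc_as_csum")


def read_cm_ibcrc_as_csum(path: Path = PARAM_PATH) -> Optional[bool]:
    """Return True if enabled, False if disabled, None if it cannot be determined."""
    try:
        raw = path.read_text().strip()
    except OSError:
        # Parameter file missing or unreadable (module not loaded, no permission).
        return None
    # Kernel parameters may print as 1/0 (integer) or Y/N (boolean).
    if raw in ("1", "Y", "y"):
        return True
    if raw in ("0", "N", "n"):
        return False
    return None


if __name__ == "__main__":
    state = read_cm_ibcrc_as_csum()
    label = {True: "enabled", False: "disabled", None: "unknown"}[state]
    print(f"cm_ibcrc_as_csum is {label}")
```

Returning None rather than raising keeps the helper usable on hosts where the module is not loaded, such as edge nodes without InfiniBand.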
2. If there are multiple clusters in a rack does cm_ibcrc_as_csum need to have the same value in each cluster?
If the hosts in one cluster do not communicate with the hosts in the other cluster, then the value of cm_ibcrc_as_csum in one cluster has no bearing on the value in the other.
The value of cm_ibcrc_as_csum needs to be the same only when hosts in one cluster communicate with hosts in another cluster.
3. If an edge node uses Oracle Linux with the Red Hat Compatible Kernel is the feature "IB CRC as TCP checksum" enabled or not?
The value of this parameter on edge nodes that are not BDA nodes does not matter, since they are not connected via InfiniBand to BDA clusters.
Enabling the kernel option cm_ibcrc_as_csum
The recommendation is to enable the kernel option cm_ibcrc_as_csum.
The cm_ibcrc_as_csum flag enables two features that address known performance issues in the InfiniBand IPoIB network module. In generations prior to X7, these performance issues did not significantly affect BDA servers, so the flag was disabled to allow compatibility with older clusters.
The increased parallelism from the added cores in X7 has increased the impact of these performance issues. In certain cases they become a bottleneck for the kernel network subsystem, resulting in periods of significantly degraded performance characterized by network timeouts or significantly slower job performance. Note, however, that the same behavior can be observed on non-X7 servers as well.
Moving forward, this flag will be enabled by default on BDA servers. Until then, the recommendation is to enable cm_ibcrc_as_csum on any cluster that is running into network timeouts or other network-related performance issues.
Note that all nodes in the cluster of interest, regardless of server hardware (e.g. X5, X6, X7), and all nodes in any other clusters on the fabric that communicate with that cluster need to be updated to have a consistent value of this parameter. Performance may be negatively impacted if two communicating nodes have different values, i.e. one has the option enabled and the other has it disabled.
Edge nodes that are not BDA nodes and are not directly connected to the BDA NM2 InfiniBand switches are not considered to be on the same InfiniBand fabric and are not affected by this.
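The consistency requirement can be checked once the value has been collected from every node that communicates over the fabric. The following is a minimal sketch of such a check. How the values are collected is left open, and the host names and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_hosts_by_value(values: Dict[str, int]) -> Dict[int, List[str]]:
    """Group host names by the cm_ibcrc_as_csum value observed on each host."""
    groups = defaultdict(list)
    for host, value in values.items():
        groups[value].append(host)
    return {value: sorted(hosts) for value, hosts in groups.items()}


def is_consistent(values: Dict[str, int]) -> bool:
    """True when every host reports the same value (or no hosts were given)."""
    return len(group_hosts_by_value(values)) <= 1


if __name__ == "__main__":
    # Hypothetical observations from two clusters that share a fabric.
    observed = {"bda1node01": 1, "bda1node02": 1, "bda2node01": 0}
    for value, hosts in sorted(group_hosts_by_value(observed).items()):
        print(f"cm_ibcrc_as_csum={value}: {', '.join(hosts)}")
    print("consistent:", is_consistent(observed))
```

Grouping by value, rather than reporting only a yes/no result, shows directly which nodes still need to be changed.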
Upgrading the HCA firmware to the latest version supported on the BDA for the ConnectX-3 HCA model, which is currently 2.35.5532
See the discussion in: BDA HCA Firmware Upgrade Frequently Asked Questions (Doc ID 2551285.1).
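When deciding whether a node needs the upgrade, the installed firmware version should be compared numerically, field by field, since a plain string comparison misorders versions such as 2.9.x and 2.35.x. The following is a minimal sketch of that comparison. Only the version 2.35.5532 comes from this document; the function names and the sample installed versions are hypothetical, and obtaining the installed version is covered by the referenced FAQ.

```python
from typing import Tuple

# Latest supported ConnectX-3 firmware version named in this document.
RECOMMENDED_FW = "2.35.5532"


def parse_fw_version(version: str) -> Tuple[int, ...]:
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def needs_upgrade(installed: str, recommended: str = RECOMMENDED_FW) -> bool:
    """True when the installed version is older than the recommended one."""
    return parse_fw_version(installed) < parse_fw_version(recommended)


if __name__ == "__main__":
    # Hypothetical installed versions, for illustration only.
    for installed in ("2.11.1280", "2.35.5532"):
        status = "upgrade needed" if needs_upgrade(installed) else "up to date"
        print(f"{installed}: {status}")
```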
These recommendations apply to any BDA cluster experiencing performance slowdowns characterized by network timeouts or network-related issues.