ILOM sending occasional incorrect sensor readings via IPMI when being polled by hwmgmtd on BDA V4.2 (Doc ID 2081723.1)

Last updated on JULY 12, 2016

Applies to:

Big Data Appliance X4-2 Hardware - Version All Versions and later
Linux x86-64

Symptoms

The problem symptoms are as follows:

1. On BDA V4.2 (X4-2 hardware, OL 6.6, ILOM version 3.1.2.32), hwmgmtd periodically raises "Temperature", "Fan Speed" and "Other" alarms such as the following:

Sep 21 16:12:36 bdanode03 hwmgmtd[13804]: State change: overall alarm state changed from "Cleared" (1) to "Critical" (2).
Sep 21 16:12:36 bdanode03 hwmgmtd[13804]: State change: alarm state of subsystem "Temperature" changed state from "Cleared" (1) to "Critical" (2).
Sep 21 16:12:36 bdanode03 hwmgmtd[13804]: State change: alarm state of subsystem "Fan Speed" changed state from "Cleared" (1) to "Critical" (2).
Sep 21 16:12:36 bdanode03 hwmgmtd[13804]: State change: alarm state of subsystem "Other" changed state from "Cleared" (1) to "Major" (3).
Sep 21 16:13:14 bdanode03 modprobe: WARNING: Deprecated config file /etc/modprobe.conf, all config files belong into /etc/modprobe.d/.
Sep 21 16:14:20 bdanode03 modprobe: WARNING: Deprecated config file /etc/modprobe.conf, all config files belong into /etc/modprobe.d/.
Sep 21 16:15:13 bdanode03 hwmgmtd[13804]: State change: overall alarm state changed from "Critical" (2) to "Cleared" (1).
Sep 21 16:15:13 bdanode03 hwmgmtd[13804]: State change: alarm state of subsystem "Temperature" changed state from "Critical" (2) to "Cleared" (1).
Sep 21 16:15:13 bdanode03 hwmgmtd[13804]: State change: alarm state of subsystem "Fan Speed" changed state from "Critical" (2) to "Cleared" (1).
...


2. Searching for hwmgmtd in /var/log/messages also shows many related indicator state-change messages, such as:

# grep hwmgmtd /var/log/messages
Oct 25 09:08:52 bdanode03 hwmgmtd[12805]: State change: indicator: /SYS/MB/FM0/OK (ID: 208) changed state from "On" (4) to "Off" (3).
Oct 25 09:08:52 bdanode03 hwmgmtd[12805]: State change: indicator: /SYS/MB/FM1/OK (ID: 209) changed state from "On" (4) to "Off" (3).
Oct 25 09:08:52 bdanode03 hwmgmtd[12805]: State change: service indicator: /SYS/SERVICE (ID: 213) changed state from "Off" (3) to "On" (4).
Oct 25 09:08:52 bdanode03 hwmgmtd[12805]: State change: locator indicator: /SYS/LOCATE (ID: 214) changed state from "Off" (3) to "On" (4).
Oct 25 09:08:52 bdanode03 hwmgmtd[12805]: State change: indicator: /SYS/SP/OK (ID: 215) changed state from "On" (4) to "Off" (3).
Oct 25 09:08:52 bdanode03 hwmgmtd[12805]: State change: indicator: /SYS/PS_FAULT (ID: 217) changed state from "Off" (3) to "On" (4).
Oct 25 09:09:54 bdanode03 hwmgmtd[12805]: State change: indicator: /SYS/MB/FM0/OK (ID: 208) changed state from "Off" (3) to "On" (4).
...

3. However, the ILOM snapshot shows that the Fault LEDs are off, FMA did not log any fault, and the SEL events are clear as well.
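
To judge how often and for how long these transient alarms occur, the hwmgmtd lines in /var/log/messages can be summarized with a short script. The following is only a minimal sketch. The helper names (parse, alarm_episodes, flap_counts), the SAMPLE list and the year parameter are hypothetical and not part of any Oracle tool. The regular expression is inferred solely from the log lines quoted above, so other hwmgmtd message formats may not match. Syslog timestamps carry no year, so the year is an assumed placeholder.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the hwmgmtd "State change" lines quoted in this note
# (an assumption, not a documented log format).
LINE_RE = re.compile(
    r'^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ hwmgmtd\[\d+\]: State change: '
    r'(?P<what>.+?) changed(?: state)? from "(?P<old>[^"]+)" \(\d+\) '
    r'to "(?P<new>[^"]+)" \(\d+\)\.'
)

def parse(lines, year=2000):
    """Yield (timestamp, subject, old_state, new_state) per hwmgmtd state change.

    `year` is a placeholder because syslog timestamps do not include one.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            yield ts, m["what"], m["old"], m["new"]

def alarm_episodes(events, subject="overall alarm state"):
    """Return (start, end, seconds) for each span where `subject` left "Cleared"."""
    episodes, start = [], None
    for ts, what, old, new in events:
        if what != subject:
            continue
        if old == "Cleared" and new != "Cleared":
            start = ts
        elif new == "Cleared" and start is not None:
            episodes.append((start, ts, int((ts - start).total_seconds())))
            start = None
    return episodes

def flap_counts(events):
    """Count state changes per indicator path, e.g. /SYS/MB/FM0/OK."""
    counts = Counter()
    for _, what, _, _ in events:
        m = re.search(r"indicator: (\S+)", what)
        if m:
            counts[m.group(1)] += 1
    return counts

# A few lines copied from the excerpts above, used as sample input.
SAMPLE = [
    'Sep 21 16:12:36 bdanode03 hwmgmtd[13804]: State change: overall alarm state changed from "Cleared" (1) to "Critical" (2).',
    'Sep 21 16:13:14 bdanode03 modprobe: WARNING: Deprecated config file /etc/modprobe.conf, all config files belong into /etc/modprobe.d/.',
    'Sep 21 16:15:13 bdanode03 hwmgmtd[13804]: State change: overall alarm state changed from "Critical" (2) to "Cleared" (1).',
    'Oct 25 09:08:52 bdanode03 hwmgmtd[12805]: State change: indicator: /SYS/MB/FM0/OK (ID: 208) changed state from "On" (4) to "Off" (3).',
    'Oct 25 09:09:54 bdanode03 hwmgmtd[12805]: State change: indicator: /SYS/MB/FM0/OK (ID: 208) changed state from "Off" (3) to "On" (4).',
]

events = list(parse(SAMPLE))  # materialize: the generator is consumed twice below
for start, end, seconds in alarm_episodes(events):
    print(f"alarm from {start:%b %d %H:%M:%S} to {end:%H:%M:%S} ({seconds} s)")
for path, count in flap_counts(events).most_common():
    print(f"{path}: {count} state changes")
```

To run this against a real node, replace SAMPLE with the lines of /var/log/messages, for example by passing an open file object to parse. Short alarm episodes that clear on their own, together with no corresponding Fault LED, FMA or SEL entries, would be consistent with the symptoms described in this note.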

Cause
