
[PCA 2.x] OCFS2 Corruption Causes Compute Node Hang and Processes in D State (Doc ID 2582251.1)

Last updated on JULY 25, 2023

Applies to:

Private Cloud Appliance - Version 2.3.3 to 2.3.4 [Release 2.0]
Private Cloud Appliance X5-2 Hardware - Version All Versions to All Versions [Release All Releases]
Linux x86-64

Symptoms

From /var/log/messages on a compute node:

=====================================

kernel: [852841.931594] JBD2: no valid journal superblock found
kernel: [852841.931597] (ocfs2rec,18280,4):ocfs2_replay_journal:1620 ERROR: status = -22
kernel: [852841.931603] (ocfs2rec,18280,4):ocfs2_recover_node:1704 ERROR: status = -22
kernel: [852841.931607] (ocfs2rec,18280,4):__ocfs2_recovery_thread:1411 ERROR: Error -22 recovering node 9 on device (249,34)!
kernel: [852841.931613] (ocfs2rec,18280,4):__ocfs2_recovery_thread:1412 ERROR: Volume requires unmount.
kernel: [852841.931755] ocfs2: Begin replay journal (node 9, slot 8) on device (249,34)
kernel: [852841.933037] JBD2: no valid journal superblock found
kernel: [852841.933039] (ocfs2rec,18280,4):ocfs2_replay_journal:1620 ERROR: status = -22
kernel: [852841.933045] (ocfs2rec,18280,4):ocfs2_recover_node:1704 ERROR: status = -22
kernel: [852841.933049] (ocfs2rec,18280,4):__ocfs2_recovery_thread:1411 ERROR: Error -22 recovering node 9 on device (249,34)!
kernel: [852841.933055] (ocfs2rec,18280,4):__ocfs2_recovery_thread:1412 ERROR: Volume requires unmount.
kernel: [852841.933255] ocfs2: Begin replay journal (node 9, slot 8) on device (249,34)
kernel: [852841.934538] JBD2: no valid journal superblock found
kernel: [852841.934540] (ocfs2rec,18280,4):ocfs2_replay_journal:1620 ERROR: status = -22
kernel: [852841.934546] (ocfs2rec,18280,4):ocfs2_recover_node:1704 ERROR: status = -22
kernel: [852841.934551] (ocfs2rec,18280,4):__ocfs2_recovery_thread:1411 ERROR: Error -22 recovering node 9 on device (249,34)!
kernel: [852841.934556] (ocfs2rec,18280,4):__ocfs2_recovery_thread:1412 ERROR: Volume requires unmount.
kernel: [852841.934713] ocfs2: Begin replay journal (node 9, slot 8) on device (249,34)
kernel: [852841.935983] JBD2: no valid journal superblock found
kernel: [852841.935997] (ocfs2rec,18280,4):ocfs2_replay_journal:1620 ERROR: status = -22
kernel: [852841.936003] (ocfs2rec,18280,4):ocfs2_recover_node:1704 ERROR: status = -22
kernel: [852841.936007] (ocfs2rec,18280,4):__ocfs2_recovery_thread:1411 ERROR: Error -22 recovering node 9 on device (249,34)!
kernel: [852841.936013] (ocfs2rec,18280,4):__ocfs2_recovery_thread:1412 ERROR: Volume requires unmount.
kernel: [852841.936208] ocfs2: Begin replay journal (node 9, slot 8) on device (249,34)
kernel: [852841.937443] JBD2: no valid journal superblock found
kernel: [852841.937445] (ocfs2rec,18280,4):ocfs2_replay_journal:1620 ERROR: status = -22
kernel: [852841.937451] (ocfs2rec,18280,4):ocfs2_recover_node:1704 ERROR: status = -22
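
The symptoms can be confirmed directly on the affected compute node. The commands below are a general sketch; the (major,minor) pair (249,34) comes from the kernel messages above and will differ per system.

List processes stuck in uninterruptible sleep (D state):
# ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

Count the repeating OCFS2 journal-recovery failures in the kernel log:
# grep -c 'ocfs2_recover' /var/log/messages

Map the (major,minor) pair reported in the log back to a device-mapper node:
# ls -l /dev/dm-* | grep ' 249, *34 '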

 

From the output of the multipath command on the compute node:

=======================================

# multipath -ll 3600144f0926a067f00005be088810008
3600144f0926a067f00005be088810008 dm-4 SUN,ZFS Storage 7350
size=10T features='0' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 14:0:0:13 sdi 8:128 failed faulty running
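
To check whether other LUNs on the same node are affected, all multipath maps can be scanned for failed or faulty paths instead of querying a single WWID. A minimal sketch (the WWID below is the one from the output above):

Show every map together with its path states and highlight broken paths:
# multipath -ll | grep -Ei 'dm-|failed|faulty'

Re-check an individual LUN by its WWID:
# multipath -ll 3600144f0926a067f00005be088810008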

 

From the AdminServer.log file on the active management node:

=============================================

####<2019-08-23T06:23:47.781+0200> <Info> <com.oracle.ovm.mgr.event.ovs.Storage> <ovcamn05r1> <AdminServer> <EventProcessor-5> <> <> <e3c3f743-8a1a-4c3f-8417-cc9f599aa141-00000004> <1566534227781> <BEA-000000> <Server: ovcacn28r1, finished processing storage notification: Aug 23 06:23:46 {STORAGE} [CHANGE_DM_SD] (dm-13) 3600144f0926a067f00005c508b380011-14:0:0:27 (failed:iqn.1986-03.com.sun:02:71961d53-82b1-4595-e712-9a64586b5462,2:3600144f0926a067f00005c508b380011)>
####<2019-08-23T06:23:48.807+0200> <Info> <com.oracle.ovm.mgr.event.ovs.Storage> <ovcamn05r1> <AdminServer> <EventProcessor-5> <> <> <e3c3f743-8a1a-4c3f-8417-cc9f599aa141-00000004> <1566534228807> <BEA-000000> <Server: ovcacn31r1, finished processing storage notification: Aug 23 06:23:47 {STORAGE} [CHANGE_DM_SD] (dm-16) 3600144f0926a067f00005be08b63000c-13:0:0:21 (failed:iqn.1986-03.com.sun:02:71961d53-82b1-4595-e712-9a64586b5462,2:3600144f0926a067f00005be08b63000c)>
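
The same failed-path notifications can be filtered out of AdminServer.log on the active management node. The search below is a sketch; the Oracle VM Manager home under /u01/app/oracle/ovm-manager-3 is the usual location but may differ, so the log is located with find rather than hard-coded:

# find /u01/app/oracle/ovm-manager-3 -name 'AdminServer.log*' 2>/dev/null | xargs -r grep 'CHANGE_DM_SD' | grep 'failed:'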

 

 

Dry-run (read-only) file system check on the OCFS2 disk:

========================
# fsck.ocfs2 -fn /dev/sdc
fsck.ocfs2 1.8.6
Checking OCFS2 filesystem in /dev/sdc:
Label: OVS64dd4e49e4a52
UUID: 0004FB000005000064364DD4E49E4A52
Number of blocks: 2603824128
Block size: 4096
Number of clusters: 10171188
Cluster size: 1048576
Number of slots: 32

** Skipping journal replay because -n was given. There may be spurious errors that journal replay would fix. **
** Skipping slot recovery because -n was given. **
/dev/sdc was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks
Pass 2: Checking directory entries
[DX_TREE_MISSING] Directory 514 is missing index. Rebuild? n  =======>> The dry run shows that some directory indexes are missing
Pass 3: Checking directory connectivity
Pass 4a: Checking for orphaned inodes
Pass 4b: Checking inodes link counts
All passes succeeded.
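
On a multipathed node the same read-only check can be pointed at the device-mapper path instead of an individual /dev/sdX device, and the output kept for later reference. A minimal sketch reusing the WWID from the multipath output above; -f forces the check and -n answers "no" to every prompt, so nothing on disk is modified:

# fsck.ocfs2 -fn /dev/mapper/3600144f0926a067f00005be088810008 2>&1 | tee /tmp/fsck_ocfs2_dryrun.out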

Changes

 N/A

Cause



