HSM - Archiver Daemon Is Stalled With sam-arfind Dumping Core and sam-arcopy Exiting
(Doc ID 2496368.1)
Last updated on JANUARY 06, 2025
Applies to:
Oracle Hierarchical Storage Manager (HSM) and StorageTek QFS Software - Version 6.1 to 6.1 [Release 6.0]
Information in this document applies to any platform.
Symptoms
The sam-archiverd daemon is stalled and dumping core.
The sam-arcopy processes are exiting prematurely.
Output of the 'samcmd a' command shows a "Waiting for scheduler" error.
The 'showqueue -v' command reports "Drives busy", but 'samcmd r' shows that most of the drives are available and empty.
The sam-arfind daemon is dumping core with a stack trace like the following:
sam-arfind samfs1
------------ lwp# 1 / thread# 1 ---------------
ffffffff7e04e2f4 ioctl (a0, ffffffff7ffff878, 40, ffffffff7ffff560, 0, 8080808080808080) + c
0000000100016178 FsAct (0, 20122c00, 100022560, 7ffffc00, 8080808080808080, 22) + 318
0000000100007c14 main (2, ffffffff7ffffa48, ffffffff7ffffa60, ffffffff7c000180, 2880, 2800) + 1b4
000000010000775c _start (0, 0, 0, 0, 0, 0) + 17c
------------ lwp# 2 / thread# 2 ---------------
ffffffff7e04dc70 __sigtimedwait (ffffffff7e86b350, 0, 0, ffffffff7bdfbcb0, 0, ffffffff7bb00240) + 8
ffffffff7e022a30 __posix_sigwait (ffffffff7e86b350, ffffffff7bdfbf28, 118, 0, ffffffff7df59da3, 178) + 18
ffffffff7e7496b8 signalWaiter (0, 0, 0, ffffffff7e749670, 0, 1) + 48
ffffffff7e04987c _lwp_start (0, 0, 0, 0, 0, 0)
------------ lwp# 3 / thread# 3 ---------------
ffffffff7ef0683c ArchReqFileName (0, ffffffff7b4fb09a, 0, 1, 0, 0) + 3c
ffffffff7ef078f8 ArchReqValid (0, 0, 148cd0, 0, 0, ffffffff7e18a000) + 28
000000010000a81c Archive (100125690, 0, 0, 10000a3e0, 0, 1) + 43c
ffffffff7e04987c _lwp_start (0, 0, 0, 0, 0, 0)
------------ lwp# 4 / thread# 4 ---------------
ffffffff7e0498fc __lwp_park (100125980, 100125960, ffffffff7b2fbbf0, 10012596c, 0, ffffffff7e18a000) + 14
ffffffff7e0430b4 cond_wait_queue (100125980, 100125960, ffffffff7b2fbbf0, 0, 0, 0) + 4c
ffffffff7e043590 cond_wait_common (100125980, 100125960, ffffffff7b2fbbf0, 2946, 0, 0) + 28c
ffffffff7e043760 __cond_timedwait (100125980, 100125960, ffffffff7b2fbe28, 0, ffffffff7b2fbbf0, 0) + 60
ffffffff7e043824 cond_timedwait (100125980, 100125960, ffffffff7b2fbe28, 526f5, 74c3f3, 3b9aca00) + 14
ffffffff7e043864 pthread_cond_timedwait (100125980, 100125960, ffffffff7b2fbe28, ffffffff7b2fbd40, 0, 0) + c
ffffffff7ef0c73c ThreadsCondTimedWait (100125980, 100125960, 5bc6d80a, 0, 0, ffffffff7e18a000) + 94
000000010001502c ExamInodes (1001256b8, 0, 0, 100014ef0, 0, 1) + 13c
ffffffff7e04987c _lwp_start (0, 0, 0, 0, 0, 0)
------------ lwp# 5 / thread# 5 ---------------
ffffffff7e0498fc __lwp_park (100125a20, 100125a08, 0, 100125a14, 0, ffffffff7e18a000) + 14
ffffffff7e0430b4 cond_wait_queue (100125a20, 100125a08, 0, 0, ffffffff7e190fc0, 0) + 4c
ffffffff7e0436a8 cond_wait (100125a20, 100125a08, 0, 10001695c, 0, 0) + 10
ffffffff7e0436e4 pthread_cond_wait (100125a20, 100125a08, 148cd0, 0, 0, ffffffff7e18a000) + 8
000000010001695c FsExamine (100125708, 0, 0, 1000167c0, 0, 1) + 19c
ffffffff7e04987c _lwp_start (0, 0, 0, 0, 0, 0)
------------ lwp# 6 / thread# 6 ---------------
ffffffff7e04dd64 __so_accept (6, ffffffff7aefbdf2, ffffffff7aefbde4, ffffffff7bb02240, 2e6800, 6) + 8
ffffffff7e74f938 UdsRecvMsg (100125b98, 0, 100125b98, 10012ab7c, 100023a50, d) + 208
000000010001c420 Message (100125730, 0, 0, 10001c308, 0, 1) + 118
ffffffff7e04987c _lwp_start (0, 0, 0, 0, 0, 0)
...
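When reading a trace like this, it helps to pull out just the innermost frame of each LWP to see what every thread was blocked in. Below is a minimal awk sketch over saved pstack output; the sample data is abbreviated from the trace above, and treating the second field of each frame line as the function name is an assumption about this pstack layout:

```shell
# Extract the innermost (top) frame of each LWP from saved pstack output.
# Sample data abbreviated from the sam-arfind trace above; normally you
# would capture the live trace with: pstack <pid> > /var/tmp/arfind.pstack
cat > /var/tmp/arfind.pstack <<'EOF'
------------ lwp# 1 / thread# 1 ---------------
ffffffff7e04e2f4 ioctl (a0, ffffffff7ffff878) + c
0000000100016178 FsAct (0, 20122c00) + 318
------------ lwp# 3 / thread# 3 ---------------
ffffffff7ef0683c ArchReqFileName (0, ffffffff7b4fb09a) + 3c
ffffffff7ef078f8 ArchReqValid (0, 0) + 28
EOF

# On each "lwp#" header line, read the following frame line and print
# its function name (field 2, after the hex return address).
awk '/lwp#/ {hdr = $0; getline; print hdr ": " $2}' /var/tmp/arfind.pstack
```

This prints one line per LWP ending in the function the thread was blocked in (here `ioctl` and `ArchReqFileName`), which makes the stuck thread in ArchReqFileName stand out quickly.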
The showqueue output shows many files being scheduled while the drives are reported as busy:
Filesystem HSM:
Files waiting to start: 0
Files being scheduled: 1,488 <----------------------
Files archiving: 0
Events processed: 5,197,965
archive 0
change 14,057
close 5,179,102
create 1,724
hwm 0
modify 0
rearchive 0
rename 860
remove 862
unarchive 1,360
internal 0
Scan list Examine: noscan
0 2018-10-23 00:00:00 background ---- inodes
Archive requests
samfs.fs_A_XX.2.3657 schedule 2018-10-20 11:20:57 drives:4
files:1488 space: 110.936G flags:
(min: 18.500k) priority: 0 0
Drives busy <------------------------------
Archive file 1:
type:f ino:704705 s:1/f:0 space: 19.500k time:1539754074 priority:0
hsm/20181013/0XXXXb4.blb
type:f ino:704711 s:1/f:0 space: 6.016M time:1539754083 priority:0
hsm/20181013/0XXXXX1.blb
type:f ino:704696 s:1/f:0 space: 277.636M time:1539754066 priority:0
hsm/20181014/0XXXXX.blb
type:f ino:704674 s:1/f:0 space: 1.306G time:1539754027 priority:0
hsm/20181014/0XXXXX.blb
....
Note that most of the tape drives are available:
Removable media samcmd 6.1.45 10:55:38 Oct 22 2018
samcmd on HSM
Removable media status: all
ty eq status act use state vsn
sg 51 ---------p 0 0% notrdy
empty
sg 52 ---------p 0 0% notrdy
empty
sg 53 ---------p 0 0% notrdy
empty
sf 61 ---------p 0 0% notrdy
empty
sf 62 ---------p 0 0% notrdy
empty
sf 63 ---------p 0 0% notrdy
empty
sf 64 ---------p 0 0% notrdy
empty
ti 71 --l----oPr 2 100% ready MXX68
Positioning to 0x1a7068|2b 00 00 00 1a 70 68 00 00 00
ti 72 ---------p 0 0% notrdy
empty
ti 73 ---------p 0 0% notrdy
empty
ti 74 --l----oPr 2 100% ready MXXX2
Positioning to 0x26a2f1|2b 00 00 00 26 a2 f1 00 00 00
ti 75 ---------p 0 0% notrdy
empty
ti 76 ---------p 0 0% notrdy
empty
ti 81 ---------p 0 0% notrdy
empty
tp 82 ---------- 0 0% off
empty
ti 83 ---------p 0 0% notrdy
empty
ti 84 ---------p 0 0% notrdy
empty
ti 85 ---------p 0 0% notrdy
empty
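To reconcile the scheduler's "Drives busy" message with the drive display, it can help to tally drive states from saved 'samcmd r' output. A minimal sketch follows; the sample lines are abbreviated from the display above, and treating field 6 as the state column is an assumption about this samcmd version's layout:

```shell
# Count tape drives by state (ready/notrdy/off) from saved samcmd output.
# Sample data embedded here; normally: samcmd r > /var/tmp/samcmd_r.out
cat > /var/tmp/samcmd_r.out <<'EOF'
sg 51 ---------p 0 0% notrdy
ti 71 --l----oPr 2 100% ready MXX68
ti 74 --l----oPr 2 100% ready MXXX2
ti 75 ---------p 0 0% notrdy
tp 82 ---------- 0 0% off
EOF

# Field 6 holds the drive state on drive lines in this layout.
awk 'NF >= 6 {count[$6]++} END {for (s in count) print s, count[s]}' \
    /var/tmp/samcmd_r.out | sort
# prints: notrdy 2 / off 1 / ready 2 (one per line)
```

A summary like this makes the mismatch concrete: only two drives are actually in use, yet the archiver believes all drives are busy.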
Log messages from sam-archiverd and related archiver processes show fatal OS call errors:
Oct 13 02:32:55 samfs sam-arcopy[25488]: [ID 702911 local7.error] err Fatal OS call error: ArchReqAttach(misc1.1.4) called from: arcopy.c:710: No such file or directory
Oct 13 03:11:14 samfs klmmod: [ID 710424 kern.notice] NOTICE: Received NLM_FREE_ALL (DS3000) from 010.030.208.145
Oct 13 03:45:00 samfs ZFS-status: [ID 702911 cron.alert] Sat Oct 13 03:45:00 EDT 2018 /root/zpXXXXeck.sh: Ended with OK status
Oct 13 10:09:49 samfs af-samfs3[23296]: [ID 702911 local7.error] err Fatal OS call error: MapFileGrow(samfs3.fs_A_DF.1.80049) called from: archreq.c:144
Oct 14 01:32:50 samfs af-samfs4[23295]: [ID 702911 local7.error] err Fatal OS call error: MapFileGrow(samfs4.fs_P_DF.2.2) called from: archreq.c:144
...
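The fatal OS call errors can be isolated from the system log in one pass. A minimal sketch over sample lines (the real messages typically live in /var/adm/messages on Solaris; the sample text is taken from the excerpt above):

```shell
# Isolate the archiver's fatal OS call errors and print the failing
# call names. Sample lines embedded here; normally you would run:
#   grep 'Fatal OS call error' /var/adm/messages
cat > /var/tmp/messages.sample <<'EOF'
Oct 13 02:32:55 samfs sam-arcopy[25488]: [ID 702911 local7.error] err Fatal OS call error: ArchReqAttach(misc1.1.4) called from: arcopy.c:710: No such file or directory
Oct 13 03:45:00 samfs ZFS-status: [ID 702911 cron.alert] Ended with OK status
Oct 13 10:09:49 samfs af-samfs3[23296]: [ID 702911 local7.error] err Fatal OS call error: MapFileGrow(samfs3.fs_A_DF.1.80049) called from: archreq.c:144
EOF

# Keep only the fatal errors and extract the call name (the text
# between 'error: ' and the first '(').
grep 'Fatal OS call error' /var/tmp/messages.sample |
  sed 's/.*Fatal OS call error: \([A-Za-z]*\).*/\1/'
# prints: ArchReqAttach / MapFileGrow (one per line)
```

Seeing which calls fail (ArchReqAttach, MapFileGrow) and on which ArchReq files narrows the problem to the archiver's memory-mapped ArchReq files rather than the tape hardware.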
Certain ArchReqs are set to NULL.
ArchReqs are scheduled and then dequeued a short time later:
2018-10-22 10:51:34 samfs1[12324:3]: DequeueArchReq(prod.1.27551)
2018-10-22 10:51:34 samfs1[12324:3]: Archreq queue (prod.1.27552) files: 1442 space: 100.483G
2018-10-22 10:51:34 sam-archiverd[18081:3]: samfs.XXXod.1.27552 sorted by path
2018-10-22 10:51:34 sam-archiverd[18081:1]: Scheduling archives
2018-10-22 10:51:34 sam-archiverd[18081:1]: samfs5.R.2.0 divided for 2 drives
2018-10-22 10:51:34 sam-archiverd[18081:1]: Scheduling archives
2018-10-22 10:51:34 sam-archiverd[18081:1]: samfs5.R.2.0 divided for 2 drives
2018-10-22 10:51:39 sam-archiverd[18081:4]: schedQ Dequeue(samfs.XXXod.1.27552 0x760f8)
2018-10-22 10:51:39 samfs1[12324:3]: DequeueArchReq(XXXod.1.27552)
2018-10-22 10:51:39 samfs1[12324:3]: Archreq queue (XXXod.1.27553) files: 1328 space: 100.096G
2018-10-22 10:51:39 sam-archiverd[18081:3]: samfs.XXXod.1.27553 sorted by path
2018-10-22 10:51:39 sam-archiverd[18081:1]: Scheduling archives
2018-10-22 10:51:39 sam-archiverd[18081:1]: samfs5.R.2.0 divided for 2 drives
2018-10-22 10:51:39 sam-archiverd[18081:1]: Scheduling archives
2018-10-22 10:51:39 sam-archiverd[18081:1]: samfs5.R.2.0 divided for 2 drives
2018-10-22 10:51:44 sam-archiverd[18081:4]: schedQ Dequeue(samfs.XXXod.1.27553 0x760f8)
2018-10-22 10:51:44 af-samfs1[12324:3]: DequeueArchReq(XXXod.1.27553)
..
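The schedule/dequeue loop is easiest to see by listing just the DequeueArchReq events with their timestamps. A minimal sketch over sample trace lines (the archiver trace location depends on the trace configuration in defaults.conf, so the file name here is illustrative):

```shell
# List each dequeued ArchReq with its time of day to expose the
# roughly 5-second schedule/dequeue cycle. Sample lines embedded
# from the trace excerpt above.
cat > /var/tmp/archiverd.trace <<'EOF'
2018-10-22 10:51:34 samfs1[12324:3]: DequeueArchReq(prod.1.27551)
2018-10-22 10:51:34 sam-archiverd[18081:1]: Scheduling archives
2018-10-22 10:51:39 samfs1[12324:3]: DequeueArchReq(XXXod.1.27552)
2018-10-22 10:51:44 af-samfs1[12324:3]: DequeueArchReq(XXXod.1.27553)
EOF

# Print time-of-day and ArchReq name for every dequeue event.
grep 'DequeueArchReq' /var/tmp/archiverd.trace |
  sed 's/^[^ ]* \([^ ]*\) .*DequeueArchReq(\(.*\))$/\1 \2/'
# prints: 10:51:34 prod.1.27551 (and so on, one line per event)
```

Steadily increasing ArchReq generation numbers at ~5-second intervals, with nothing ever archived, confirm the archiver is cycling through schedule/dequeue without dispatching copies.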
Changes
No known changes to the system.
Cause
To view full details, sign in with your My Oracle Support account.