OCI Bare-Metal Instances can Exhibit Hard Hang during Live Scale-Up or Scale-Down. (Doc ID 2622370.1)

Last updated on DECEMBER 25, 2019

Applies to:

Linux OS - Version Oracle Linux 6.0 and later
Oracle Cloud Infrastructure - Version N/A and later
Information in this document applies to any platform.

Symptoms

In OCI, bare-metal (BM) instances, such as DBaaS BM instances, offer a live feature that scales the assigned number of virtual CPUs (vCPUs), or CPU threads, up or down. However, a live scale-up or scale-down can often trigger a hard hang from which the only recovery is to reboot or hard-reset the affected machine.

The following /var/log/messages excerpt shows an example hard hang triggered by a live scale-up from 6 CPU cores (i.e. 12 vCPUs) to 26 CPU cores (i.e. 52 vCPUs) on a BM instance with template BM.DenseIO2.52.

Aug 13 14:59:55 ocibm kernel: [615822.752221] x86: Booting SMP configuration:
Aug 13 14:59:55 ocibm kernel: [615822.803203] smpboot: Booting Node 0 Processor 1 APIC 0x2
Aug 13 14:59:55 ocibm kernel: [615822.967802] smpboot: Booting Node 0 Processor 10 APIC 0x16
Aug 13 14:59:55 ocibm kernel: [615823.146675] smpboot: Booting Node 0 Processor 100 APIC 0x75
Aug 13 14:59:55 ocibm kernel: [615823.321838] smpboot: Booting Node 0 Processor 101 APIC 0x77
Aug 13 14:59:55 ocibm kernel: [615823.487702] smpboot: Booting Node 0 Processor 102 APIC 0x79
Aug 13 14:59:55 ocibm kernel: [615823.661331] smpboot: Booting Node 0 Processor 103 APIC 0x7b
Aug 13 14:59:56 ocibm kernel: [615823.837043] smpboot: Booting Node 0 Processor 11 APIC 0x18
Aug 13 14:59:56 ocibm kernel: [615824.014997] smpboot: Booting Node 0 Processor 12 APIC 0x1a
Aug 13 14:59:56 ocibm kernel: [615824.181711] smpboot: Booting Node 0 Processor 13 APIC 0x20
Aug 13 14:59:56 ocibm kernel: [615824.346364] smpboot: Booting Node 0 Processor 14 APIC 0x22
Aug 13 14:59:56 ocibm kernel: [615824.517618] smpboot: Booting Node 0 Processor 15 APIC 0x24
Aug 13 14:59:56 ocibm kernel: [615824.698378] smpboot: Booting Node 0 Processor 16 APIC 0x26
Aug 13 14:59:57 ocibm kernel: [615824.867260] smpboot: Booting Node 0 Processor 17 APIC 0x28
Aug 13 15:01:18 ocibm kernel: [615825.034174] smpboot: Booting Node 0 Processor 18 APIC 0x2a
Aug 13 15:01:18 ocibm kernel: [615825.044545] WARNING: CPU: 0 PID: 13452 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.044592] WARNING: CPU: 20 PID: 136 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.048092] WARNING: CPU: 20 PID: 136 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.048158] WARNING: CPU: 0 PID: 13452 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.048177] WARNING: CPU: 20 PID: 136 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.048333] WARNING: CPU: 0 PID: 13452 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.048392] WARNING: CPU: 20 PID: 136 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.049630] WARNING: CPU: 20 PID: 136 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.049643] WARNING: CPU: 1 PID: 13452 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.049750] WARNING: CPU: 13 PID: 50280 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.049810] WARNING: CPU: 20 PID: 136 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.049935] WARNING: CPU: 1 PID: 13452 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.374952] WARNING: CPU: 1 PID: 13452 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615825.375070] WARNING: CPU: 1 PID: 13452 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615833.508626] WARNING: CPU: 20 PID: 136 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615833.508652] WARNING: CPU: 13 PID: 50280 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615833.508791] WARNING: CPU: 20 PID: 136 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615833.508882] WARNING: CPU: 13 PID: 50280 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615833.525281] WARNING: CPU: 13 PID: 50280 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615833.525359] WARNING: CPU: 13 PID: 50280 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615837.175242] WARNING: CPU: 15 PID: 50280 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615837.175358] WARNING: CPU: 15 PID: 50280 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615884.898920] INFO: rcu_sched detected stalls on CPUs/tasks: { 102} (detected by 1, t=60003 jiffies, g=94259555, c=94259554, q=0)
Aug 13 15:01:18 ocibm kernel: [615905.884264] smpboot: Booting Node 0 Processor 2 APIC 0x4
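
The two message types in the excerpt above, the repeated blk-mq warnings and the rcu_sched stall report, can be counted in a saved log with a small shell helper. This is a minimal sketch, not an Oracle-provided tool: the function name count_hang_signatures is hypothetical, and the match strings are taken from this one example, so other kernel versions may log different text.

```shell
# Hypothetical helper (not part of any Oracle tooling): count the two
# message types seen in the example excerpt.
# Usage: count_hang_signatures [logfile]    (default: /var/log/messages)
count_hang_signatures() (
    log="${1:-/var/log/messages}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        exit 1
    fi
    # grep -c prints the number of matching lines (0 when nothing matches);
    # -F treats the pattern as a fixed string rather than a regex.
    warns=$(grep -cF 'at block/blk-mq.c' "$log")
    stalls=$(grep -cF 'rcu_sched detected stalls' "$log")
    echo "blk-mq warnings: $warns"
    echo "rcu_sched stalls: $stalls"
)
```

Calling count_hang_signatures with no argument reads /var/log/messages; a path to a copied log can be passed instead when the original machine is no longer responsive.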

Typical stacks may look like this:

Aug 13 15:01:18 ocibm kernel: [615837.175357] ------------[ cut here ]------------
Aug 13 15:01:18 ocibm kernel: [615837.175358] WARNING: CPU: 15 PID: 50280 at block/blk-mq.c:759 __blk_mq_run_hw_queue+0x2df/0x390()
Aug 13 15:01:18 ocibm kernel: [615837.175382] Modules linked in: des_generic arc4 ecb md4 nls_utf8 cifs dns_resolver rds oracleacfs(PO) oracleadvm(PO) oracleoks(PO) nfsd dummy nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd sunrpc grace mptctl mptbase ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr vfat fat dm_multipath ipmi_ssif i2c_core ipmi_msghandler sg pcspkr shpchp acpi_cpufreq acpi_pad ext4 jbd2 mbcache2 sd_mod crc32c_intel be2iscsi bnx2i cnic bnxt_en uio nvme cxgb4i nvme_core cxgb4 cxgb3i xhci_pci xhci_hcd libcxgbi ipv6 cxgb3 mdio qla4xxx iscsi_boot_sysfs wmi dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
Aug 13 15:01:18 ocibm kernel: [615837.175383] CPU: 15 PID: 50280 Comm: kworker/18:2H Tainted: P W O 4.1.12-124.24.5.el6uek.x86_64 #2
Aug 13 15:01:18 ocibm kernel: [615837.175384] Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42060300 04/15/2019
Aug 13 15:01:18 ocibm kernel: [615837.175386] Workqueue: kblockd blk_mq_run_work_fn
Aug 13 15:01:18 ocibm kernel: [615837.175387] 0000000000000000 ffff8818e65afcf8 ffffffff816ea590 0000000000000000
Aug 13 15:01:18 ocibm kernel: [615837.175389] ffffffff81a0bd51 ffff8818e65afd38 ffffffff8108601a 0000000000000001
Aug 13 15:01:18 ocibm kernel: [615837.175390] ffff88bcbd831000 ffff8818e65afd80 ffff88bd3f298080 00000000000004a0
Aug 13 15:01:18 ocibm kernel: [615837.175391] Call Trace:
Aug 13 15:01:18 ocibm kernel: [615837.175393] [<ffffffff816ea590>] dump_stack+0x63/0x81
Aug 13 15:01:18 ocibm kernel: [615837.175395] [<ffffffff8108601a>] warn_slowpath_common+0x8a/0xc0
Aug 13 15:01:18 ocibm kernel: [615837.175397] [<ffffffff8108610a>] warn_slowpath_null+0x1a/0x20
Aug 13 15:01:18 ocibm kernel: [615837.175398] [<ffffffff81306f2f>] __blk_mq_run_hw_queue+0x2df/0x390
Aug 13 15:01:18 ocibm kernel: [615837.175400] [<ffffffff816ecf6a>] ? __schedule+0x24a/0x810
Aug 13 15:01:18 ocibm kernel: [615837.175402] [<ffffffff816ecf6a>] ? __schedule+0x24a/0x810
Aug 13 15:01:18 ocibm kernel: [615837.175404] [<ffffffff81307392>] blk_mq_run_work_fn+0x12/0x20
Aug 13 15:01:18 ocibm kernel: [615837.175406] [<ffffffff810a02f9>] process_one_work+0x169/0x4a0
Aug 13 15:01:18 ocibm kernel: [615837.175409] [<ffffffff810a0cbc>] worker_thread+0x1ec/0x560
Aug 13 15:01:18 ocibm kernel: [615837.175411] [<ffffffff810a0ad0>] ? flush_delayed_work+0x50/0x50
Aug 13 15:01:18 ocibm kernel: [615837.175413] [<ffffffff810a68fb>] kthread+0xcb/0xf0
Aug 13 15:01:18 ocibm kernel: [615837.175415] [<ffffffff816ecf6a>] ? __schedule+0x24a/0x810
Aug 13 15:01:18 ocibm kernel: [615837.175416] [<ffffffff816ecf6a>] ? __schedule+0x24a/0x810
Aug 13 15:01:18 ocibm kernel: [615837.175418] [<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
Aug 13 15:01:18 ocibm kernel: [615837.175420] [<ffffffff816f3067>] ret_from_fork+0x47/0x90
Aug 13 15:01:18 ocibm kernel: [615837.175422] [<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
Aug 13 15:01:18 ocibm kernel: [615837.175423] ---[ end trace e8f395c5cef38887 ]---
Aug 13 15:01:18 ocibm kernel: [615884.898920] INFO: rcu_sched detected stalls on CPUs/tasks: { 102} (detected by 1, t=60003 jiffies, g=94259555, c=94259554, q=0)
Aug 13 15:01:18 ocibm kernel: [615884.898921] Task dump for CPU 102:
Aug 13 15:01:18 ocibm kernel: [615884.898924] tee R running task 0 50139 50137 0x00000088
Aug 13 15:01:18 ocibm kernel: [615884.898927] ffffffff81f1d188 ffffffffffffffff ffff880d8a18b988 ffffffff813321bf
Aug 13 15:01:18 ocibm kernel: [615884.898929] 000000000000001c 0000000000ffff0a ffff880d8a18b938 00000010001e9f0c
Aug 13 15:01:18 ocibm kernel: [615884.898931] ffff353038383638 ffff88bd4079c380 000080d0f00be600 000000004f7c95cd
Aug 13 15:01:18 ocibm kernel: [615884.898932] Call Trace:
Aug 13 15:01:18 ocibm kernel: [615884.898940] [<ffffffff813321bf>] ? number+0x32f/0x370
Aug 13 15:01:18 ocibm kernel: [615884.898943] [<ffffffff81336280>] ? delay_tsc+0x30/0x70
Aug 13 15:01:18 ocibm kernel: [615884.898946] [<ffffffff813361cd>] ? __const_udelay+0x2d/0x30
Aug 13 15:01:18 ocibm kernel: [615884.898950] [<ffffffff8143dc61>] ? wait_for_xmitr+0x41/0xb0
Aug 13 15:01:18 ocibm kernel: [615884.898954] [<ffffffff816f40c4>] ? apic_timer_interrupt+0xe4/0x1a0
Aug 13 15:01:18 ocibm kernel: [615884.898957] [<ffffffff8143dcec>] ? serial8250_console_putchar+0x1c/0x40
Aug 13 15:01:18 ocibm kernel: [615884.898959] [<ffffffff8143dcd0>] ? wait_for_xmitr+0xb0/0xb0
Aug 13 15:01:18 ocibm kernel: [615884.898961] [<ffffffff8143839f>] ? uart_console_write+0x3f/0x70
Aug 13 15:01:18 ocibm kernel: [615884.898963] [<ffffffff8143fc68>] ? univ8250_console_write+0x128/0x370
Aug 13 15:01:18 ocibm kernel: [615884.898966] [<ffffffff810dda2c>] ? print_time.part.12+0x6c/0x90
Aug 13 15:01:18 ocibm kernel: [615884.898968] [<ffffffff810ddabf>] ? print_prefix+0x6f/0xb0
Aug 13 15:01:18 ocibm kernel: [615884.898970] [<ffffffff810de325>] ? call_console_drivers.constprop.29+0xb5/0x120
Aug 13 15:01:18 ocibm kernel: [615884.898972] [<ffffffff810dfc1d>] ? console_unlock+0x2ed/0x470
Aug 13 15:01:18 ocibm kernel: [615884.898974] [<ffffffff810e0160>] ? vprintk_emit+0x3c0/0x570
Aug 13 15:01:18 ocibm kernel: [615884.898976] [<ffffffff810e046f>] ? vprintk_default+0x1f/0x30
Aug 13 15:01:18 ocibm kernel: [615884.898979] [<ffffffff816e8827>] ? printk+0x49/0x4b
Aug 13 15:01:18 ocibm kernel: [615884.898983] [<ffffffff81053c77>] ? native_cpu_up+0x517/0xb40
Aug 13 15:01:18 ocibm kernel: [615884.898986] [<ffffffff810866aa>] ? _cpu_up+0x19a/0x1d0
Aug 13 15:01:18 ocibm kernel: [615884.898989] [<ffffffff810867c2>] ? cpu_up+0xe2/0x100
Aug 13 15:01:18 ocibm kernel: [615884.898992] [<ffffffff816e3271>] ? cpu_subsys_online+0x41/0xa0
Aug 13 15:01:18 ocibm kernel: [615884.898996] [<ffffffff81479440>] ? device_online+0x70/0xa0
Aug 13 15:01:18 ocibm kernel: [615884.898998] [<ffffffff816f40c4>] ? apic_timer_interrupt+0xe4/0x1a0
Aug 13 15:01:18 ocibm kernel: [615884.898999] [<ffffffff814794ed>] ? online_store+0x7d/0x90
Aug 13 15:01:18 ocibm kernel: [615884.899001] [<ffffffff8147666b>] ? dev_attr_store+0x1b/0x30
Aug 13 15:01:18 ocibm kernel: [615884.899006] [<ffffffff8128ed53>] ? sysfs_kf_write+0x43/0x50
Aug 13 15:01:18 ocibm kernel: [615884.899008] [<ffffffff8128df2a>] ? kernfs_fop_write+0x12a/0x190
Aug 13 15:01:18 ocibm kernel: [615884.899010] [<ffffffff816f4046>] ? apic_timer_interrupt+0x66/0x1a0
Aug 13 15:01:18 ocibm kernel: [615884.899014] [<ffffffff8120b61b>] ? __vfs_write+0x2b/0x110
Aug 13 15:01:18 ocibm kernel: [615884.899016] [<ffffffff8120e369>] ? __sb_start_write+0x49/0x100
Aug 13 15:01:18 ocibm kernel: [615884.899019] [<ffffffff812b4409>] ? security_file_permission+0x29/0xb0
Aug 13 15:01:18 ocibm kernel: [615884.899021] [<ffffffff8120bd79>] ? vfs_write+0xa9/0x1b0
Aug 13 15:01:18 ocibm kernel: [615884.899024] [<ffffffff8120cb86>] ? SyS_write+0x46/0xb0
Aug 13 15:01:18 ocibm kernel: [615884.899025] [<ffffffff816f2b5f>] ? system_call_after_swapgs+0xe9/0x190
Aug 13 15:01:18 ocibm kernel: [615884.899027] [<ffffffff816f2b58>] ? system_call_after_swapgs+0xe2/0x190
Aug 13 15:01:18 ocibm kernel: [615884.899029] [<ffffffff816f2b51>] ? system_call_after_swapgs+0xdb/0x190
Aug 13 15:01:18 ocibm kernel: [615884.899031] [<ffffffff816f2c1e>] ? system_call_fastpath+0x18/0xd8

In this example, by the time the CPU count in /proc/cpuinfo reaches 27 (i.e. 12 + 15), the machine is hard hung and needs a reboot or a hard reset to recover.
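
The task dump above shows the stalled task (tee) inside a write to a sysfs attribute (online_store, device_online, cpu_subsys_online, cpu_up), which is the kernel path taken when a CPU is brought online through its "online" file under /sys/devices/system/cpu. As a read-only sketch, assuming that standard sysfs layout, the current state of each hot-pluggable CPU can be listed as follows. The function name list_cpu_online_state is hypothetical, and the helper only reads these files and never writes to them.

```shell
# Hypothetical read-only helper: print "<cpu> <state>" for every CPU that
# exposes an "online" attribute in sysfs (1 = online, 0 = offline).
# Usage: list_cpu_online_state [sysfs_cpu_dir]
#        (default: /sys/devices/system/cpu, the standard location)
list_cpu_online_state() (
    dir="${1:-/sys/devices/system/cpu}"
    for f in "$dir"/cpu[0-9]*/online; do
        # Skip the unexpanded pattern when no CPU exposes the attribute.
        [ -r "$f" ] || continue
        cpu=${f%/online}    # strip the trailing "/online"
        cpu=${cpu##*/}      # keep only the "cpuN" directory name
        read -r state < "$f"
        echo "$cpu $state"
    done
)
```

On x86, cpu0 usually has no "online" file and is therefore not listed. Comparing this output before and after a scale operation shows which CPUs actually changed state.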

During the hard hang, the number of processes listed in /proc increases rapidly and, after some time, /proc/loadavg can show load averages exceeding 2000 or 3000.

# grep '^processor' /proc/cpuinfo | wc -l
27
#
# ls -1 /proc | grep -E '^[0-9]' | wc -l
6789
#
# cat /proc/loadavg
3758.xx ...
#

Commands like cat(1) and ls(1) may continue to work. However, other commands like mpstat(1), ps(1), top(1), and vmstat(8) hang indefinitely.
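
The three checks above can be combined into one helper that relies only on grep, ls, and plain reads from /proc, and avoids the tools reported to hang. This is a minimal sketch: the function name proc_snapshot is hypothetical, and whether these commands stay responsive on a particular hung machine is an assumption based on the example above.

```shell
# Hypothetical helper: print CPU count, process count, and 1-minute load
# average on one line, using only grep, ls, and shell reads from /proc.
# Usage: proc_snapshot [proc_dir]    (default: /proc)
proc_snapshot() (
    proc="${1:-/proc}"
    # One "processor" line per logical CPU in cpuinfo.
    cpus=$(grep -c '^processor' "$proc/cpuinfo")
    # Purely numeric entries in /proc are process IDs.
    procs=$(ls -1 "$proc" | grep -cE '^[0-9]+$')
    # The first field of loadavg is the 1-minute load average.
    read -r load1 rest < "$proc/loadavg"
    echo "cpus=$cpus procs=$procs load1=$load1"
)
```

Running proc_snapshot every few seconds makes the growth in process count and load average described above visible without calling ps(1) or top(1).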

Changes

N/A

Cause


