
High CPU Usage Of imapd (Doc ID 1935176.1)

Last updated on JUNE 29, 2017

Applies to:

Oracle Communications Messaging Server - Version 7.0.5 to 8.0.0 [Release 7.0.0 to 8.0.0]
Information in this document applies to any platform.

Symptoms

Initial problem description: High CPU usage by imapd processes.

At a specific day and time (every Monday morning at 8 am) the CPU usage rises to almost 100% for a short period of time, about 10 minutes.

From the operating system perspective this is not really a problem; the system is just doing its work, and there is in fact nothing to worry about. The end users are not noticing any performance degradation.

The load averages increase from their normally relatively low values:

08110715/prstat.out:Total: 199 processes, 1024 lwps, load averages: 6.49, 5.48, 4.36
08110715/prstat.out:Total: 204 processes, 1021 lwps, load averages: 6.53, 5.50, 4.38

to:

08110759/prstat.out:Total: 277 processes, 1411 lwps, load averages: 101.46, 71.16, 43.61
08110759/prstat.out:Total: 273 processes, 1414 lwps, load averages: 103.95, 72.27, 44.17
08110759/prstat.out:Total: 274 processes, 1422 lwps, load averages: 105.32, 73.07, 44.59
08110759/prstat.out:Total: 274 processes, 1426 lwps, load averages: 106.60, 73.86, 45.01
08110759/prstat.out:Total: 275 processes, 1393 lwps, load averages: 107.72, 74.63, 45.43
08110759/prstat.out:Total: 268 processes, 1371 lwps, load averages: 108.60, 75.35, 45.83
08110804/prstat.out:Total: 265 processes, 1257 lwps, load averages: 74.89, 87.72, 59.12

Can we determine more specifically what the imapd processes are doing?

The vmstat output shows us the following:

kthr      memory            page            disk          faults      cpu
r b w   swap  free  re  mf pi po fr de sr m0 m1 m2 m1   in   sy   cs us sy id
0 4 0 65003840 16761520 19358 6712 75581 24 23 0 0 7 6 6 19 48424 152160 65080 5 5 90
35 10 0 64695376 14421688 50943 21422 220186 12 12 0 0 20 10 10 0 123053 830200 206332 42 58 0
43 9 0 64884992 14486552 50226 19984 194391 21 19 0 0 5 3 3 0 125440 833326 209844 43 56 0
20 9 0 64776008 14735560 45435 20333 181174 3 3 0 0 4 3 3 0 133133 808160 206783 47 52 2
25 10 0 64874552 14677728 44734 19423 171628 0 0 0 0 1 1 1 0 125355 833720 206866 41 58 0
22 11 0 64855328 14649776 47076 18901 188412 11 11 0 0 1 1 1 0 133534 805234 208260 46 52 1
56 10 0 64843944 14433480 48917 19165 191595 0 0 0 0 6 3 4 0 122588 841402 207352 43 56 0
21 11 0 64740256 14632312 50579 19288 177710 3 3 0 0 2 2 1 0 136365 814893 208663 45 54 2
26 9 0 64846000 14772096 47910 18337 159038 2 0 0 0 4 2 2 0 125972 806246 201941 47 52 1
39 9 0 64840304 14501224 47668 21036 178464 5 5 0 0 11 6 6 0 123431 811882 200949 47 52 1
35 10 0 64843872 14330168 42859 23295 157385 3 3 0 0 23 12 12 0 123231 832752 202742 46 54 0
59 10 0 64845272 14313232 46855 19245 184655 207 207 0 0 2 1 2 0 121656 866612 207329 42 58 0

Using awk to reformat the output and make it a bit easier to read:

$ awk '{printf("%2s %2s %2s %2s %2s\n", $1, $2, $(NF-2), $(NF-1), $NF)}'
 r  b us sy id
 0  4  5  5 90
35 10 42 58  0
43  9 43 56  0
20  9 47 52  2
25 10 41 58  0
22 11 46 52  1
56 10 43 56  0
21 11 45 54  2
26  9 47 52  1
39  9 47 52  1
35 10 46 54  0
59 10 42 58  0

The long-term average (the first row) is 5% user-mode, 5% system-mode and 90% idle.

But during these spike periods there is a significant run queue (the 'r' column) and 0% idle time.

Almost all CPU time is spent processing user-mode threads and system-calls (about 50-50).

With prstat we see the following processes on top:

  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 1472 mailsrv   703M  208M cpu20   32    0   2:40:05 3.1% imapd/13
 1460 mailsrv   598M  261M cpu34   42    0   2:43:43 2.5% imapd/12
 1478 mailsrv   590M  230M cpu43   52    0   2:46:06 2.5% imapd/11
 1471 mailsrv   547M  147M cpu59   52    0   2:39:05 2.5% imapd/12
 1481 mailsrv   565M  161M cpu15   41    0   2:42:28 2.2% imapd/13
 1462 mailsrv   626M  237M cpu0    43    0   2:38:13 2.2% imapd/11
 1482 mailsrv   636M  192M cpu13   41    0   2:42:28 2.1% imapd/11


Please note that in this case service.imap.numprocesses = 32, so there were 25 more imapd processes beyond the seven shown above.
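
As a side note, two generic checks (not part of the original data collection) can add detail here; PID 1472 is taken from the prstat listing above, and configutil assumes a legacy-configuration deployment:

$ configutil -o service.imap.numprocesses
$ prstat -mL -p 1472 5

The first confirms the configured number of imapd processes (32 in this case); the second shows per-thread (per-LWP) microstate accounting, where the USR and SYS columns show how each imapd thread splits its time between user mode and system calls.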

So the problem statement is:
Every Monday morning at about 8 am the imapd processes consume almost all CPU, with about 50% in user-mode threads and 50% in system calls.

It is not I/O related: in the vmstat output we see a high number of threads on the dispatcher queue (the 'r' column), and only as a side effect of that do we see a non-zero number of blocked kernel threads waiting for resources such as I/O or paging (the 'b' column).
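
A quick way to cross-check this (a generic Solaris command, not part of the original data collection) is to watch disk utilization and service times during the spike:

$ iostat -xnz 5

If the %b and asvc_t columns stay low while the CPU is saturated, the disks are not the bottleneck.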

Also, the delivery_perf.d DTrace script shows:


2014 Aug 11 07:57:14 **********************

                              times are in microseconds
                          (divide by 1000 for milliseconds)
FStype/syscall       count       min       avg       max  Process
tmpfs/close             1        13        13        13  stored
tmpfs/write             1        78        78        78  stored
 vxfs/pwrite            1     10756     10756     10756  ims_master
 vxfs/fdsync           13        13     11720    118437  stored
 vxfs/close           557         6        18       250  stored
 vxfs/write          4380         5      1182     28284  ims_master
 vxfs/pwrite         4500         6        70     21050  stored
 vxfs/close          6333         6        23       712  ims_master
 vxfs/writev         7917        13      1351     77362  ims_master
 vxfs/fdsync        10134       470      1902    107645  ims_master
 vxfs/fdsync        20527        10      1321    108653  imapd
 vxfs/writev        33341         7       100     54546  imapd
 vxfs/write        102660         5        81     46205  imapd
 vxfs/close        279356         4        20      2520  imapd


The averages for the imapd-related system calls are at most 1.3 milliseconds, so overall this looks good.

It is just a matter of busy imapd processes for a short period of time; after this period the system returns to normal behavior.
But it is interesting to know what these processes are actually doing during this period.

In other words:

What are the user-mode threads of imapd doing, which is causing the high CPU usage every Monday morning?

For this we used a modified version of the hot-smtp.d DTrace script, adapted as hot-imap.d:

#!/usr/sbin/dtrace -qs

/*
** Use profile probes (which fire on every CPU at the specified rate,
** here 997 Hz) to sample when the specified process is on the CPU
** and report the most frequently occurring stack traces.
** The report is printed and the counters are cleared once per minute.
*/

BEGIN{
   printf("%Y monitoring... will report once per minute. Ctrl/C to end.\n", walltimestamp);
}

/* sample imapd threads that are on CPU in user mode (arg0 == 0: no kernel PC) */
profile-997
/execname=="imapd" && arg0==0/
{
   @u[ustack()]=count();
}

/* sample imapd threads that are on CPU in system (kernel) mode (arg1 == 0: no user PC) */
profile-997
/execname=="imapd" && arg1==0/
{
   @s[stack()]=count();
}


/* once per minute and when the script is stopped, print report */
tick-60s,
END
{
   trunc(@u,20);
   trunc(@s,20);
   printf("%Y\n", walltimestamp);
   printf("top 20 on-CPU user-mode stacks:");
   printa(@u);
   printf("\ntop 20 on-CPU system-mode stacks:");
   printa(@s);
   clear(@u);
   clear(@s);
}

The output of the above script:

2014 Aug 18 08:15:20 monitoring... will report once per minute. Ctrl/C to end.
2014 Aug 18 08:16:19
top 20 on-CPU user-mode stacks:

             imapd`__1cLindex_check6FpnIimap_ctx_pnHmailbox_iiIIi_v_+0x15b0
             imapd`__1cJcmd_fetch6FpnIimap_ctx_pc2i_v_+0x5e4
             imapd`__1cLprocess_cmd6FpnIimap_ctx_ipGpv_v2_i_+0xfa0
             imapd`firstcmdline+0xd2c
             libasock.so`GDispCx_Dispatch+0x1ac
             libasock.so`GDispCx_InternalWork+0x3ac
             libc.so.1`_lwp_start
            3659

             imapd`__1cLindex_check6FpnIimap_ctx_pnHmailbox_iiIIi_v_+0x15b0
             imapd`__1cJcmd_fetch6FpnIimap_ctx_pc2i_v_+0x5e4
             imapd`__1cLprocess_cmd6FpnIimap_ctx_ipGpv_v2_i_+0xfa0
             imapd`firstcmdline+0xd2c
             libasock.so`GDispCx_Dispatch+0x1ac
             libasock.so`GDispCx_InternalWork+0x3ac
             libc.so.1`_lwp_start
            3711


Those two stacks are identical. In fact, 17 of the top 20 thread stacks are exactly the same as the ones above.

Of the other three, two look like this:


             imapd`__1cLindex_check6FpnIimap_ctx_pnHmailbox_iiIIi_v_+0x15ac
             imapd`__1cJcmd_fetch6FpnIimap_ctx_pc2i_v_+0x5e4
             imapd`__1cLprocess_cmd6FpnIimap_ctx_ipGpv_v2_i_+0xfa0
             imapd`firstcmdline+0xd2c
             libasock.so`GDispCx_Dispatch+0x1ac
             libasock.so`GDispCx_InternalWork+0x3ac
             libc.so.1`_lwp_start
            3746

             imapd`__1cLindex_check6FpnIimap_ctx_pnHmailbox_iiIIi_v_+0x15ac
             imapd`__1cJcmd_fetch6FpnIimap_ctx_pc2i_v_+0x5e4
             imapd`__1cLprocess_cmd6FpnIimap_ctx_ipGpv_v2_i_+0xfa0
             imapd`firstcmdline+0xd2c
             libasock.so`GDispCx_Dispatch+0x1ac
             libasock.so`GDispCx_InternalWork+0x3ac
             libc.so.1`_lwp_start
            4921

and one looks like this:



             imapd`__1cLindex_check6FpnIimap_ctx_pnHmailbox_iiIIi_v_+0x1588
             imapd`__1cJcmd_fetch6FpnIimap_ctx_pc2i_v_+0x5e4
             imapd`__1cLprocess_cmd6FpnIimap_ctx_ipGpv_v2_i_+0xfa0
             imapd`firstcmdline+0xd2c
             libasock.so`GDispCx_Dispatch+0x1ac
             libasock.so`GDispCx_InternalWork+0x3ac
             libc.so.1`_lwp_start
            4608


The threads running on CPU in the imapd processes were found to be at three locations in the index_check() function, which is called while processing IMAP FETCH commands.

After checking the code we did not find a reason for the high number of index_check calls, so we created another DTrace script to inspect the calls to index_check.

The script imapd-index_check.d is attached to this Knowledge Document.
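
That script is not reproduced here, but a minimal sketch of the same idea might look like the following (this is only an illustration, not the attached script): use the pid provider against a single imapd process to count the index_check calls and measure their duration and CPU time. The per-user/folder breakdown that the attached script provides is omitted here, because it depends on the function's argument layout.

#!/usr/sbin/dtrace -qs

/*
** Minimal sketch only -- this is NOT the attached imapd-index_check.d.
** It counts index_check() calls in one imapd process and reports the
** min/max/avg wall-clock duration and the avg CPU time per call.
** Run it against a single imapd PID, for example:
**    dtrace -qs ./index_check_sketch.d -p <imapd-pid>
** Press Ctrl-C to print the aggregated results.
*/

pid$target::*index_check*:entry
{
   self->ts  = timestamp;    /* wall-clock start */
   self->vts = vtimestamp;   /* on-CPU (virtual) time start */
}

pid$target::*index_check*:return
/self->ts/
{
   @n["calls"]                = count();
   @dmin["duration min (us)"] = min((timestamp - self->ts) / 1000);
   @dmax["duration max (us)"] = max((timestamp - self->ts) / 1000);
   @davg["duration avg (us)"] = avg((timestamp - self->ts) / 1000);
   @cavg["cpu time avg (us)"] = avg((vtimestamp - self->vts) / 1000);
   self->ts  = 0;
   self->vts = 0;
}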

The output from the attached script, filtered for users/folders that occurred more than 10,000 times, showed:

$ cat imapd-index_check.d.ms11.20140825.out | awk '$1 > 10000 {print}' | grep user  | sort -n
2014 Aug 25 07:50:29     Duration                   CPU time
      count      min      max      avg      min      max      avg folder
...
      20766        2      275        5        1      272        2 user/user3 (328448)
      22587        2      231        5        1      229        2 user/user3 (715776)
      23062        2      400        6        1      397        2 user/user2 (357424)
      27111        2      474        7        0      462        3 user/user1 (864383)
      54438        2      561        6        0      558        3 user/user1 (864128)  

It shows that for user1, index_check was performed 54438 times on one occasion and 27111 times on another (within the same imapd process)!

So what is this user/client doing? Checking the imap logs shows the following:

$ for i in `grep user1 imap | awk '{print substr($8,2,length($8)-2)}' | sort | uniq`
> do
> echo $i
> grep $i imap
> done

a.b.c.d:64364
[25/Aug/2014:08:00:30 +0900] myhost imapd[16808]: Account Information: connect [a.b.c.d:64364]
[25/Aug/2014:08:00:30 +0900] myhost imapd[16808]: Account Information: login [a.b.c.d:64364] user1 PLAIN 100834
[25/Aug/2014:08:00:30 +0900] myhost imapd[16808]: Account Information: select [a.b.c.d:64364] user1 user/user1
[25/Aug/2014:08:00:46 +0900] myhost imapd[16808]: Account Notice: close [a.b.c.d:64364] user1 2014/8/25 8:00:30 0:00:16 305781 789833 1

a.b.c.d:64366
[25/Aug/2014:08:00:30 +0900] myhost imapd[16789]: Account Information: connect [a.b.c.d:64366]
[25/Aug/2014:08:00:30 +0900] myhost imapd[16789]: Account Information: login [a.b.c.d:64366] user1 PLAIN 100999
[25/Aug/2014:08:00:30 +0900] myhost imapd[16789]: Account Notice: close [a.b.c.d:64366] user1 2014/8/25 8:00:30 0:00:00 165 728 0

a.b.c.d:64384
[25/Aug/2014:08:00:30 +0900] myhost imapd[16803]: Account Information: connect [a.b.c.d:64384]
[25/Aug/2014:08:00:30 +0900] myhost imapd[16803]: Account Information: login [a.b.c.d:64384] user1 PLAIN 100961
[25/Aug/2014:08:00:30 +0900] myhost imapd[16803]: Account Notice: close [a.b.c.d:64384] user1 2014/8/25 8:00:30 0:00:00 165 728 0

a.b.c.d:64393
[25/Aug/2014:08:00:30 +0900] myhost imapd[16806]: Account Information: connect [a.b.c.d:64393]
[25/Aug/2014:08:00:30 +0900] myhost imapd[16806]: Account Information: login [a.b.c.d:64393] user1 PLAIN 100925
[25/Aug/2014:08:00:30 +0900] myhost imapd[16806]: Account Information: select [a.b.c.d:64393] user1 user/user1
[25/Aug/2014:08:00:30 +0900] myhost imapd[16806]: Account Information: select [a.b.c.d:64393] user1 user/user1
[25/Aug/2014:08:00:30 +0900] myhost imapd[16806]: Account Notice: close [a.b.c.d:64393] user1 2014/8/25 8:00:30 0:00:00 153 1188 2

a.b.c.d:51433
[25/Aug/2014:08:00:46 +0900] myhost imapd[16800]: Account Information: connect [a.b.c.d:51433]
[25/Aug/2014:08:00:46 +0900] myhost imapd[16800]: Account Information: login [a.b.c.d:51433] user1 PLAIN 101032
[25/Aug/2014:08:00:46 +0900] myhost imapd[16800]: Account Information: select [a.b.c.d:51433] user1 user/user1
[25/Aug/2014:08:00:59 +0900] myhost imapd[16800]: Account Notice: close [a.b.c.d:51433] user1 2014/8/25 8:00:46 0:00:13 305868 790241 1

a.b.c.d:54630
[25/Aug/2014:08:00:59 +0900] myhost imapd[16794]: Account Information: connect [a.b.c.d:54630]
[25/Aug/2014:08:00:59 +0900] myhost imapd[16794]: Account Information: login [a.b.c.d:54630] user1 PLAIN 101198
[25/Aug/2014:08:00:59 +0900] myhost imapd[16794]: Account Information: select [a.b.c.d:54630] user1 user/user1
$

There were four sessions that all started at 08:00:30.

The first did a SELECT of the INBOX, sent and received a lot of data, and ended 16 seconds later.
The second and third did no SELECT, sent and received very little data, and ended immediately.
The fourth did the SELECT twice, also with very little data, and ended within the same second.

The fifth session started at 08:00:46 and was very much like the first.
The sixth session started at 08:00:59, and we have no log entry showing when it ended.

This looks like a specific client issue.
We need to look more in depth by checking the telemetry for this user.

So what did we find in the telemetry files from last week and from today?

We focused on this one user simply because it drew our attention during an earlier data collection.

Background: most IMAP selected-mode commands cause a call to index_check. Due to the way owner flag updates are handled in 7.0.5 relative to 7u4, this call is somewhat more CPU intensive (however, it is much less lock intensive and therefore causes less I/O, which allows more operations). After such optimization work the bottleneck moves; in this case it moved from database locks, which were artificially throttling the load, to CPU and paging effort.


When looking at the first telemetry file, we see that this client is very wasteful with FETCH commands. Look at this:

A2 UID FETCH 843812 (UID)
A3 FETCH 7003 (BODYSTRUCTURE)
A4 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPLocale)])
A5 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (Subject)])
A6 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (Content-transfer-encoding)])
A7 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-Priority)])
A8 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPWebmail-Delivery-Method)])
A9 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPWebmail-Msg-Type)])
A10 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-MTR)])
A11 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (Message-id)])
A12 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-mySingle-SPAMMARK)])
A13 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPApproval-Locale)])
A14 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPHeader)])
A15 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPTrCode)])
A16 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPTrName)])
A17 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-MLAttribute)])
A18 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-RootMTR)])
A19 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-ParentMTR)])
A20 FETCH 7003 (ENVELOPE INTERNALDATE RFC822.SIZE)
A21 FETCH 7003 (FLAGS)
A22 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-USE-EXTERN-BODY)])
A23 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPExternalAttachSize)])
A24 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (From)])
A25 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-SENDER)])
A26 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-Mobile)])
A27 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-NonSamsung)])
A28 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (X-EPWebmail-Reply-Demand)])
A29 FETCH 7003 (BODY.PEEK[HEADER.FIELDS (Content-type)])
A30 FETCH 7003 (BODY[1.1]<0.12986>)
A31 FETCH 7003 (BODY.PEEK[1.1.MIME])
A32 FETCH 7003 (BODY.PEEK[1.2.MIME])
A33 FETCH 7003 (BODY.PEEK[1.3.MIME])
A34 FETCH 7003 (BODY.PEEK[1.4.MIME])
A35 FETCH 7003 (BODY.PEEK[1.5.MIME])
A36 FETCH 7003 (BODY.PEEK[2.MIME])

So the client just did 35 commands (and 35 round-trips) on one message that could have been done in 2 commands (2 round-trips).
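
For illustration only (this exact grouping is a sketch, not taken from the telemetry): the client could fetch the flags, size, envelope, structure and all of the header fields it needs in one command, and then fetch the body parts it learns about from the BODYSTRUCTURE in a second command:

A2 UID FETCH 843812 (FLAGS ENVELOPE INTERNALDATE RFC822.SIZE BODYSTRUCTURE BODY.PEEK[HEADER.FIELDS (X-EPLocale Subject Content-transfer-encoding X-Priority X-EPWebmail-Delivery-Method X-EPWebmail-Msg-Type X-MTR Message-id X-mySingle-SPAMMARK X-EPApproval-Locale X-EPHeader X-EPTrCode X-EPTrName X-MLAttribute X-RootMTR X-ParentMTR X-USE-EXTERN-BODY X-EPExternalAttachSize From X-SENDER X-Mobile X-NonSamsung X-EPWebmail-Reply-Demand Content-type)])
A3 FETCH 7003 (BODY[1.1]<0.12986> BODY.PEEK[1.1.MIME] BODY.PEEK[1.2.MIME] BODY.PEEK[1.3.MIME] BODY.PEEK[1.4.MIME] BODY.PEEK[1.5.MIME] BODY.PEEK[2.MIME])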

When looking at a specific telemetry file:

Our server does no extra work if you try to set an already-set flag, so the client could replace these three commands:

A2 UID FETCH 843523 (UID)
A3 FETCH 6741 (FLAGS)
A4 STORE 6741 +FLAGS (UREPLY)

with one command:

A2 UID STORE 843523 +FLAGS (UREPLY)

And looking at another telemetry file:

The client does not optimize for multiple messages, so we see:

A3 FETCH 1 (BODY.PEEK[HEADER.FIELDS (From)])
A4 FETCH 1 (BODY.PEEK[HEADER.FIELDS (Content-transfer-encoding)])
A5 FETCH 2 (BODY.PEEK[HEADER.FIELDS (From)])
A6 FETCH 2 (BODY.PEEK[HEADER.FIELDS (Content-transfer-encoding)])
...
A14053 FETCH 7005 (BODY.PEEK[HEADER.FIELDS (From)])
A14054 FETCH 7005 (BODY.PEEK[HEADER.FIELDS (Content-transfer-encoding)])

That is 14,052 commands that could be replaced with one command:

A3 FETCH 1:7005 (BODY.PEEK[HEADER.FIELDS (From Content-transfer-encoding)])

And because of the previously mentioned optimized way owner flag updates are handled in 7.0.5, each of these commands is somewhat more CPU intensive.

The results of the data collection during a busy moment are about three times worse (heavier).

What does this mean?

The IMAP client the customer is using makes many very short-lived IMAP connections.

Unlike normal IMAP usage, this client does CONNECT, LOGIN, FETCH, LOGOUT in quick succession, like a POP client rather than an IMAP client.
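
To quantify this behavior, the login events in the imap log can be counted per user; a sketch that assumes the default log format shown above (the username is the third field from the end of each login line) and the same log file name used earlier:

$ awk '/Account Information: login/ {print $(NF-2)}' imap | sort | uniq -c | sort -rn | head

Users at the top of this list are the ones creating the most IMAP connections.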

Changes

Patching to Messaging Server 7u5.

With the change introduced in 7u5 to use Solaris event ports instead of poll() for the main/dispatch thread, the limitations of poll() have been removed.
This solves the underlying problem for which the polldelay=1 workaround was originally created.
So it is expected that 7u5 will be more responsive.

 


