High Number of LWPs in AService or Imapd Processes (Doc ID 1346043.1)

Last updated on SEPTEMBER 14, 2016

Applies to:

Oracle Communications Messaging Server - Version 5.2.0 and later
Information in this document applies to any platform.

Symptoms

The AService process ran at maxthreads (250) all day. Normally it is only around 40.

Response time for all traffic through the MMP may begin to increase.

The amount of CPU used by the AService process (as shown by prstat or top) increases, and vmstat shows that it is mostly system (kernel) mode CPU rather than user mode.

prstat -L shows LWP 1 of the AService process consuming a CPU percentage equivalent to nearly 100% of one processor.

For example:

On a T5220, which has 1 CPU with 8 cores and 8 threads per core for a total of 64 virtual processors (i.e., psrinfo | wc -l == 64), with each zone running an MMP handling about 10,000 simultaneous connections, the following was seen on one of those zones:

prstat

  PID USERNAME SIZE  RSS STATE PRI NICE      TIME  CPU  PROCESS/NLWP
 3926 mailsrv  487M 479M cpu28   0    0 318:59:30  3.2% AService/102
19519 root      38M  21M sleep  59    0  28:21:42  0.2% nscd/36
  .
  .
Total: 42 processes, 348 lwps, load averages: 19.33, 18.70, 18.53


Note that although the top process is AService and it is using only 3.2% of the CPU, the load average is nearly 20.  Remember that the CPU% shown by prstat is a percentage of all CPU available on the system.  Multiplying that percentage by the number of (virtual) CPUs gives the number of individual CPUs being used by that process.  In this case 3.2% x 64 = 204.8%, or slightly more than 2 of the 64 virtual processors on this system.
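The conversion described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name cpus_in_use is hypothetical and is not part of prstat or of Messaging Server.

```python
def cpus_in_use(prstat_cpu_percent: float, virtual_cpus: int) -> float:
    """Convert a system-wide prstat CPU% into an equivalent number of virtual CPUs."""
    return prstat_cpu_percent / 100.0 * virtual_cpus


# Values taken from the example above: AService at 3.2% on a system
# where psrinfo reports 64 virtual processors.
equivalent = cpus_in_use(3.2, 64)
print(f"AService is using about {equivalent:.2f} of 64 virtual CPUs")
```

The same calculation applies to any row of prstat output, provided the virtual processor count is the one visible to the zone where prstat was run.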

prstat -L

  PID USERNAME SIZE  RSS STATE PRI NICE      TIME   CPU PROCESS/LWPID
 3926 mailsrv  487M 479M cpu19   0    0 232:43:14  1.4% AService/1
 3926 mailsrv  487M 479M sleep  59    0   0:00:02  0.1% AService/21180
 3926 mailsrv  487M 479M sleep  59    0   0:00:26  0.0% AService/21071
 3926 mailsrv  487M 479M sleep  59    0   0:00:07  0.0% AService/21150
 3926 mailsrv  487M 479M sleep  50    0   0:00:19  0.0% AService/21083
   .
   .
Total: 47 processes, 356 lwps, load averages: 19.41, 18.83, 18.58


This shows that the main/dispatch thread (LWP 1) in the AService process is using nearly all of a single (virtual) processor (1.4% x 64 = 89.6%), while the worker threads appear almost completely idle (at least from the averaged view shown by prstat).
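As a rough aid for spotting this pattern, the following Python sketch picks the busiest LWP out of prstat -L output and expresses its CPU% as a share of one virtual processor. It assumes the ten-column layout shown above; the function name busiest_lwp and the SAMPLE text are illustrative only, and real output may need extra handling (for example, truncated process names).

```python
SAMPLE = """\
  PID USERNAME SIZE  RSS STATE PRI NICE      TIME   CPU PROCESS/LWPID
 3926 mailsrv  487M 479M cpu19   0    0 232:43:14  1.4% AService/1
 3926 mailsrv  487M 479M sleep  59    0   0:00:02  0.1% AService/21180
 3926 mailsrv  487M 479M sleep  59    0   0:00:26  0.0% AService/21071
"""


def busiest_lwp(prstat_l_output, virtual_cpus):
    """Return (process, lwpid, share_of_one_cpu) for the LWP with the highest CPU%."""
    best = None
    for line in prstat_l_output.splitlines():
        fields = line.split()
        # Data rows have 10 fields and a CPU column ending in '%';
        # this skips the header, the "Total:" line, and filler lines.
        if len(fields) != 10 or not fields[8].endswith("%"):
            continue
        cpu_percent = float(fields[8].rstrip("%"))
        process, lwpid = fields[9].rsplit("/", 1)
        share = cpu_percent / 100.0 * virtual_cpus
        if best is None or share > best[2]:
            best = (process, int(lwpid), share)
    return best


process, lwpid, share = busiest_lwp(SAMPLE, 64)
print(f"{process} LWP {lwpid} is using about {share:.0%} of one virtual CPU")
```

A single LWP close to 100% of one virtual CPU, with the remaining LWPs near zero, matches the symptom described in this document.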

vmstat 5 10

The above prstat and prstat -L commands were run in one of the local zones.  The following vmstat output is from the global zone, but all the zones are running MMP and behaving basically the same.

 kthr      memory               page              disk          faults         cpu
 r b w     swap    free  re  mf pi po fr de sr s0 s1 s2 s3    in    sy    cs us sy id
  <ignore the first line of vmstat output because it is a long-term average>
 6 0 0 27055032 1976080 348 726 13  0  0  0  0  0  1  0  1 18522 46353 30178  5 24 71
 9 0 0 27054560 1975488 283 558  0  2  2  0  0  0  1  0  1 20730 48674 35131  5 25 70
 5 0 0 27056120 1978344  21  76  0  0  0  0  0  0  0  0  0 19051 41152 32728  3 24 73
 4 0 0 27055416 1977736  21  64  0  0  0  0  0 12  0  0  0 18085 37345 30102  3 23 74
 4 0 0 27055432 1977720  32  98  0  0  0  0  0  1 15  0 14 17746 33212 28510  3 24 73
15 0 0 27055256 1977512  21  69  0  0  0  0  0  0  0  0  0 26016 54643 44590  5 25 71
 2 0 0 27054392 1976600  20  35  0  0  0  0  0  0  0  0  0 13929 24426 21641  2 22 76
 5 0 0 27053960 1976136  11  34  0  0  0  0  0  0  0  0  0 21161 50969 34530  5 28 67
 4 0 0 27053184 1975360  21  67  0  0  0  0  0  0  0  0  0 17412 32466 28145  3 22 76


Note that although the CPU is roughly 70% idle throughout (the right-most column), there are always several threads waiting for CPU (the 'r' column).

Also notice that system mode CPU (the 'sy' column) is roughly 5 to 10 times higher than user mode (the 'us' column).
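The two observations above (a non-empty run queue despite idle CPU, and system time dominating user time) can be checked mechanically. The Python sketch below assumes the 22-column vmstat layout shown in this example, with 'r' first and 'us sy id' last; the function name summarize_vmstat is hypothetical, and other platforms or disk counts will produce a different column count.

```python
VMSTAT_ROWS = """\
6 0 0 27055032 1976080 348 726 13 0 0 0 0 0 1 0 1 18522 46353 30178 5 24 71
9 0 0 27054560 1975488 283 558 0 2 2 0 0 0 1 0 1 20730 48674 35131 5 25 70
5 0 0 27056120 1978344 21 76 0 0 0 0 0 0 0 0 0 19051 41152 32728 3 24 73
4 0 0 27055416 1977736 21 64 0 0 0 0 0 12 0 0 0 18085 37345 30102 3 23 74
4 0 0 27055432 1977720 32 98 0 0 0 0 0 1 15 0 14 17746 33212 28510 3 24 73
15 0 0 27055256 1977512 21 69 0 0 0 0 0 0 0 0 0 26016 54643 44590 5 25 71
2 0 0 27054392 1976600 20 35 0 0 0 0 0 0 0 0 0 13929 24426 21641 2 22 76
5 0 0 27053960 1976136 11 34 0 0 0 0 0 0 0 0 0 21161 50969 34530 5 28 67
4 0 0 27053184 1975360 21 67 0 0 0 0 0 0 0 0 0 17412 32466 28145 3 22 76
"""


def summarize_vmstat(text, columns=22):
    """Summarize run queue, idle CPU, and the sy/us ratio across vmstat samples."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only purely numeric rows with the expected column count.
        if len(fields) != columns or not all(f.isdigit() for f in fields):
            continue
        run_queue = int(fields[0])
        user, system, idle = (int(x) for x in fields[-3:])
        samples.append((run_queue, user, system, idle))
    # Skip samples with zero user time to avoid dividing by zero.
    ratios = [system / user for _, user, system, _ in samples if user > 0]
    return {
        "min_run_queue": min(s[0] for s in samples),
        "min_idle": min(s[3] for s in samples),
        "min_sy_us_ratio": min(ratios),
        "max_sy_us_ratio": max(ratios),
    }


summary = summarize_vmstat(VMSTAT_ROWS)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A run queue that never drops to zero while idle CPU stays high, together with a large sy/us ratio, is consistent with a single saturated thread rather than an overloaded system.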

top on Linux

The top command shows the AService process is using the most CPU and it is all in system mode.

Note that other system calls may occur more often, but it is the poll call in the main/dispatch thread that is using all the CPU, and top shows that it is all in system mode.

Difference between imapd and AService

This problem can occur in the imapd process going back to 5.0 (and probably Netscape Messaging Server 4.x).

It would not occur in the AService process (or not look exactly like above) until Messaging Server 7 update 2, when the MMP was changed to use synchronous processing for some functions, such as DNS lookups and initiating connections to the backend store systems.  That change did not cause the problem.  The same scalability issue has existed in both imapd and AService all along, but the symptoms in the MMP would not look the same as in imapd until after 7u2.

Changes

The problem may appear following an outage.  For example, if all the backend message store systems are restarted at the same time, all of the clients will try to reconnect.  The burst of roughly simultaneous connection attempts puts far more load on the MMP than usual, which may cause it to exhibit these symptoms.  The same scenario could happen after a network outage is corrected.  (Also see Doc ID 1277288.1)

Cause
