Multithreaded Server Freeze In Case Of Blocking tpacall() Under High Load (Doc ID 1332540.1)

Last updated on NOVEMBER 04, 2016

Applies to:

Oracle Tuxedo - Version 8.1 and later
Information in this document applies to any platform.

Symptoms


This is a Tuxedo application with a multi-threaded server advertising several services and getting a huge number of incoming requests.

There are asynchronous requests without any expected replies :

tpacall(Service_Name, buffer, ..., TPNOREPLY)


When the incoming message queue of the server is filled up, the mechanism of using a temporary file on its behalf is used.

If a "handling" thread fails to send a message, several tries are done to  resend the message. Different latency periods are used between the different tries.

Up to 8 times, the thread wanting to send the message sleeps 1 second between each attempt to send a message and then sleeps 4 seconds 2 times between each attempt before managing to send the message.

Such a mechanism of temporization for this "handling" thread is not by  itself a problem.

The problem is : the thread "dedicated" to dequeue the awaiting requests in the incoming message queue remains blocked MORE THAN 16 seconds !

This "dedicated" thread is blocked due to a mutex lock awaiting that the "handling" request releases it. And the consequence is : All of the other threads remain idle instead of handling the awaiting requests.

The following test case underscores the problem :

There is a multi-threaded server advertising 3 services (SVC1, SVC2 and SVC3):

  • SVC1 service : within a loop (item number can be adjusted) SVC2 service is asynchronously called.
  • SVC2 service : just returning the incoming buffer.
  • SVC3 service : invoked by ud32 client utility to asynchronously call SVC1 service, an adjustable number of times.

ud32 utility is used to call SVC3 service and to pass the three following adjustable parameters :

- number of asynchronous calls to SVC1 (by default, 50)
- number of asynchronous calls to SVC2 done within SVC1 service (by default, 20000)
- size of buffer returned by SVC2 service (by default, a 256-byte buffer)

The by default parameters may be changed to more easily exhibit the problem.

All of system call done by the server advertising SVC1, SVC2 and SVC3 services must be traced :
strace -f -tt -T -o strace.txt -p <PID of SRV>
Launch ud32 utility, then wait several minutes and/or check under tmadmin utility the awaiting requests :



> pq
Prog Name      Queue Name # Serve  Wk Queued  # Queued  Ave. Len    Machine
---------      -------------------  ---------  -------- --------    -------
BBL            55430             1         0         0       0.0      SITE1
SRV            SRV333            1       650        13      15.3      SITE1

Then, check if a long delay before acquiring a mutex is reported for "main" PID :

grep "futex resumed> ) = 0 <16" strace.txt

e.g., result is : 20740 11:21:54.131833 <... futex resumed> ) = 0 <16.020905>


Piece of strace.txt file gotten with execution of the test case :

From line 7712 until line 8025 :
20740 is the "main" thread identifier.
20805 is the "handling service" thread identifier.


"Main" thread awaiting a mutex :
Line 7712: 20740 11:21:38.110912 futex(0x2ba62b4af720, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
"Main" thread acquiring the mutex 16 seconds later :
Line 8018: 20740 11:21:54.131833 <... futex resumed> ) = 0 <16.020905>

In between, for "handling service" thread (pid 20805)
Sending a message fails. Error is EAGAIN
Line 7713: 20805 11:21:38.110940 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000098>

Use of a temporary file :
Line 7731: 20805 11:21:38.111463 open("/tmp/TUXJ7CJNO1764817216", O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>

Sending a message fails. Error is EAGAIN
Line 7745: 20805 11:21:38.111939 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000049>

8 times the sequence "sleep 1 second/try to send a message" :
Line 7767: 20805 11:21:39.113258 <... nanosleep resumed> {1, 0}) = 0 <1.000889>
Line 7771: 20805 11:21:39.113409 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000050>

Line 7793: 20805 11:21:40.115253 <... nanosleep resumed> {1, 0}) = 0 <1.001424>
Line 7797: 20805 11:21:40.115402 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000051>

Line 7819: 20805 11:21:41.117256 <... nanosleep resumed> {1, 0}) = 0 <1.001410>
Line 7823: 20805 11:21:41.117410 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000051>

Line 7845: 20805 11:21:42.119260 <... nanosleep resumed> {1, 0}) = 0 <1.001430>
Line 7849: 20805 11:21:42.119434 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000050>

Line 7871: 20805 11:21:43.121262 <... nanosleep resumed> {1, 0}) = 0 <1.001405>
Line 7875: 20805 11:21:43.121414 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000051>

Line 7897: 20805 11:21:44.123267 <... nanosleep resumed> {1, 0}) = 0 <1.001431>
Line 7901: 20805 11:21:44.123426 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000051>

Line 7923: 20805 11:21:45.125258 <... nanosleep resumed> {1, 0}) = 0 <1.001407>
Line 7927: 20805 11:21:45.125409 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000050>

Line 7949: 20805 11:21:46.127257 <... nanosleep resumed> {1, 0}) = 0 <1.001427>
Line 7953: 20805 11:21:46.127403 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000050>


2 times the sequence "sleep 4 seconds/try to send a message"
Line 7975: 20805 11:21:50.129262 <... nanosleep resumed> {4, 0}) = 0 <4.001437>
Line 7979: 20805 11:21:50.129411 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000050>

Line 8004: 20805 11:21:54.131280 <... nanosleep resumed> {4, 0}) = 0 <4.001422>
Line 8008: 20805 11:21:54.131442 <... msgsnd resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000051>

Then remove the temporary file :
Line 8010: 20805 11:21:54.131505 unlink("/tmp/TUXJ7CJNO1764817216" <unfinished ...>
Line 8012: 20805 11:21:54.131634 <... unlink resumed> ) = 0 <0.000107>

Then "handling service" thread release the futux awaited by "main" thread :
Line 8014: 20805 11:21:54.131714 futex(0x2ba62b4af720, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
Line 8016: 20805 11:21:54.131785 <... futex resumed> ) = 1 <0.000054>

"handling service" thread reports an error message :
Line 8025: 20805 11:21:54.132045 write(2, "SVC1: tpacall failed: TPETIME - "..., 48 <unfinished ...>

"Handling service" thread fails to acquire another mutex :
Line 8019: 20805 11:21:54.131882 futex(0x37ad752990, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
Line 8022: 20805 11:21:54.131963 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000065>





Cause

Sign In with your My Oracle Support account

Don't have a My Oracle Support account? Click to get started

My Oracle Support provides customers with access to over a
Million Knowledge Articles and hundreds of Community platforms