DSR TCP Connections Did Not Fail Fast Enough When Far End Became Non-Responsive (Doc ID 2552930.1)

Last updated on DECEMBER 04, 2023

Applies to:

Oracle Communications Diameter Signaling Router (DSR) - Version DSR 7.3.0 and later
Tekelec

Goal

The Diameter Watchdog Timer (DWR) value was decreased, but the TCP connections did not fail fast enough. Some connections went down within the time defined by the DWR timer, but others remained up for longer.
Sequence of events summary:

Peer end became unavailable
DSR detected the loss of ingress Diameter messages from the peer on the TCP connections and marked the connections as unavailable

Some of the connections took 20-30 seconds to become unavailable. Other connections remained degraded and became unavailable after more than 2 minutes.

Details of the sequence of events for a non-responsive peer (DSR as Responder) where the connections went down within 20-34 seconds:

(4 - 8 Seconds Elapsed)
Upon the 1st Diameter watchdog timer expiration (twinit timer of 6 seconds, +/- 2 seconds), with no Diameter traffic or peer DWR received, DSR resets the timer and sends a DWR to the peer.
(8 - 16 Seconds Elapsed)
Upon the 2nd consecutive Diameter watchdog timer expiration (twinit timer of 6 seconds, +/- 2 seconds), with no Diameter traffic, peer DWR, or DWA received, DSR resets the timer.
- DSR generates an 8006 event log: ‘Peer failed to send DWA before timeout’
(12 - 24 Seconds Elapsed)
Upon the 3rd consecutive Diameter watchdog timer expiration (twinit timer of 6 seconds, +/- 2 seconds), with no Diameter traffic, peer DWR, or delayed DWA received:
- DSR generates an 8006 event log: ‘Peer failed transport watchdog’
- DSR generates an 8005 event log: ‘Operational state changed. Operational State=Degraded->Unavailable, Congestion Level=CL98->CL99, Congestion Source=TransportCongestion->State’
- DSR begins the internal process of making the Diameter peer connection unavailable for selection in the route list.
(20 - 34 Seconds Elapsed: Additional approx. 8-10 Seconds)
DSR has finished setting the Diameter peer connection as unavailable and routes around the down Diameter peer connection in the route list.
  22051: (Critical) Peer Unavailable
  22055: (Minor) Non-Preferred Route Group In Use
  22101: (Major) Connection Unavailable
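
The elapsed-time ranges above follow from simple arithmetic on the watchdog timer, which the sketch below reproduces. It assumes each expiration fires between twinit minus jitter and twinit plus jitter after the previous one, and that the final route-list update adds 8 to 10 seconds. The function and parameter names are hypothetical and are not DSR configuration names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Earliest and latest elapsed time, in seconds, for one step."""
    earliest_s: float
    latest_s: float


def watchdog_windows(twinit_s=6.0, jitter_s=2.0, expirations=3):
    """Elapsed-time window at each consecutive watchdog expiration.

    Assumption: every expiration occurs between (twinit - jitter) and
    (twinit + jitter) seconds after the previous one.
    """
    lo = twinit_s - jitter_s
    hi = twinit_s + jitter_s
    return [Window(n * lo, n * hi) for n in range(1, expirations + 1)]


def unavailable_window(windows, cleanup_min_s=8.0, cleanup_max_s=10.0):
    """Window in which the connection is fully removed from the route list.

    Assumption: the internal cleanup adds 8 to 10 seconds after the
    last watchdog expiration, as described in the sequence above.
    """
    last = windows[-1]
    return Window(last.earliest_s + cleanup_min_s, last.latest_s + cleanup_max_s)


if __name__ == "__main__":
    windows = watchdog_windows()
    for n, w in enumerate(windows, start=1):
        print(f"expiration {n}: {w.earliest_s:g} - {w.latest_s:g} s")
    done = unavailable_window(windows)
    print(f"unavailable: {done.earliest_s:g} - {done.latest_s:g} s")
```

With the default values (6 second twinit, 2 second jitter), the windows come out as 4-8, 8-16, and 12-24 seconds, and the connection is fully unavailable between 20 and 34 seconds, matching the sequence above.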

Details of the sequence of events for a non-responsive peer (DSR as Responder) where the connections went down after more than 2 minutes:

Peer stops responding
DSR still forwards request messages, which are queued first in the Diameter connection buffer and then in the TCP buffer
Messages time out after 2 seconds in the Diameter connection buffer and are cleared out
The TCP buffer remains full for 2 minutes 20 seconds
Finally the connection determines that it is not working and clears out the TCP buffer
The watchdog process kicks in taking the connection down
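
To put the slow path next to the fast path, the sketch below lays the steps above on a timeline and computes how much later the slow path finishes. It assumes all times are measured from the moment the peer stops responding and that the watchdog acts no earlier than when the TCP buffer is cleared. The article does not state either point explicitly, and the function names are hypothetical.

```python
def slow_path_events(pending_timeout_s=2.0, tcp_stall_s=140.0):
    """Return (elapsed_seconds, description) pairs for the slow path.

    Assumption: every time is measured from the moment the peer stops
    responding; 140 s corresponds to "2 minutes 20 seconds".
    """
    return [
        (0.0, "peer stops responding; DSR keeps forwarding requests"),
        (pending_timeout_s, "queued requests time out of the Diameter connection buffer"),
        (tcp_stall_s, "connection gives up and clears the TCP buffer"),
        (tcp_stall_s, "earliest point at which the watchdog takes the connection down"),
    ]


def extra_delay_s(tcp_stall_s=140.0, fast_path_latest_s=34.0):
    """Minimum extra delay of the slow path over the slowest fast-path case."""
    return tcp_stall_s - fast_path_latest_s


if __name__ == "__main__":
    for elapsed, what in slow_path_events():
        print(f"{elapsed:6.1f} s  {what}")
    print(f"at least {extra_delay_s():g} s slower than the fast path")
```

Under these assumptions the slow path finishes at least 106 seconds after the slowest fast-path case (140 seconds against 34 seconds).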

Based on this sequence of events, please answer the following questions:

Why did some connections go to CL3 and others to CL98?
Was going directly into CL98 congestion initially, rather than CL3, a factor in delaying the start of the expected inactivity/twinit timer sequence of events?
Is there anything that can be configured to make this failover faster?

Solution

