DSR TCP Connections Did Not Fail Fast Enough When Far End Becomes Non-Responsive (Doc ID 2552930.1)

Last updated on DECEMBER 03, 2020

Applies to:

Oracle Communications Diameter Signaling Router (DSR) - Version DSR 7.3.0 and later
Tekelec

Goal

The Diameter Watchdog Timer (DWR) value was decreased, but the TCP connections did not fail fast enough. Some connections went down within the time defined by the DWR timer, but others remained up for considerably longer.
Sequence of events summary:

Some of the connections took 20-30 seconds to become unavailable. Other connections remained degraded and became unavailable after more than 2 minutes.

Details: sequence of events for a non-responsive peer (DSR as Responder) where the connections went down within 20-34 seconds:

  1. (4 - 8 Seconds Elapsed)
    Upon the 1st Diameter watchdog timer expiration (twinit timer of 6 seconds, +/- 2 secs), with no Diameter traffic or peer DWR received, DSR resets the timer and sends a DWR to the peer.
  2. (8 - 16 Seconds Elapsed)
    Upon the 2nd consecutive Diameter watchdog timer expiration (twinit timer of 6 seconds, +/- 2 secs), with no Diameter traffic, peer DWR, or DWA received, DSR resets the timer.
    • DSR generates an 8006 event log: ‘Peer failed to send DWA before timeout’
  3. (12 - 24 Seconds Elapsed)
    Upon the 3rd consecutive Diameter watchdog timer expiration (twinit timer of 6 seconds, +/- 2 secs), with no Diameter traffic, peer DWR, or delayed DWA received, DSR:
    • Generates an 8006 event log: ‘Peer failed transport watchdog’
    • Generates an 8005 event log: ‘Operational state changed. Operational State=Degraded->Unavailable, Congestion Level=CL98->CL99, Congestion Source=TransportCongestion->State’
    • Begins the internal process of making the Diameter peer connection unavailable for selection in the route list.
  4. (20 - 34 Seconds Elapsed: approx. 8-10 additional seconds)
    DSR has finished marking the Diameter peer connection as unavailable and routes around the down Diameter peer connection in the route list.
      22051: (Critical) Peer Unavailable
      22055: (Minor) Non-Preferred Route Group In Use
      22101: (Major) Connection Unavailable
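
The elapsed-time windows in the steps above follow directly from the twinit value and its jitter. The short Python sketch below reproduces that arithmetic. The constant and function names are illustrative only and do not correspond to DSR configuration fields; the 8-10 second teardown allowance is the approximate figure quoted in step 4.

```python
from typing import Tuple

TWINIT_S = 6            # watchdog timer value quoted in the article
JITTER_S = 2            # +/- jitter quoted in the article
TEARDOWN_S = (8, 10)    # approximate extra time quoted in step 4


def expiry_window(n: int) -> Tuple[int, int]:
    """Earliest and latest elapsed seconds at the n-th consecutive expiration."""
    return n * (TWINIT_S - JITTER_S), n * (TWINIT_S + JITTER_S)


def unavailable_window(expirations: int = 3) -> Tuple[int, int]:
    """Elapsed seconds until the connection is fully marked unavailable."""
    lo, hi = expiry_window(expirations)
    return lo + TEARDOWN_S[0], hi + TEARDOWN_S[1]


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"expiration {n}: {expiry_window(n)} seconds elapsed")
    print(f"unavailable: {unavailable_window()} seconds elapsed")
```

With the values from this case, three expirations span 12-24 seconds and the teardown allowance brings the total to the 20-34 seconds observed.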

Details: sequence of events for a non-responsive peer (DSR as Responder) where the connections went down after more than 2 minutes:

  1. The peer stops responding.
  2. DSR still forwards request messages, which queue up in the Diameter connection buffer and then in the TCP buffer.
  3. Messages time out after 2 seconds in the Diameter connection buffer and are cleared out.
  4. The TCP buffer remains full for 2 minutes 20 seconds.
  5. Finally, the connection determines that it is not working and clears out the TCP buffer.
  6. The watchdog process kicks in and takes the connection down.
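
Step 3 of this sequence can be pictured with a toy model of a per-connection pending queue that discards entries older than 2 seconds. This is only an illustrative sketch: the class and method names are hypothetical and it does not represent DSR's internal implementation. It models the application-level buffer only; the behaviour of the TCP buffer in steps 4 and 5 is outside its scope.

```python
from collections import deque
from typing import Deque, List, Tuple

PENDING_TIMEOUT_S = 2.0  # from the article: messages time out after 2 seconds


class PendingQueue:
    """Toy model of a per-connection message buffer (hypothetical, not DSR code)."""

    def __init__(self, timeout_s: float = PENDING_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self._items: Deque[Tuple[float, str]] = deque()

    def enqueue(self, now: float, msg: str) -> None:
        """Record a message together with the time it was queued."""
        self._items.append((now, msg))

    def purge_expired(self, now: float) -> List[str]:
        """Remove and return messages queued for at least timeout_s seconds."""
        expired: List[str] = []
        while self._items and now - self._items[0][0] >= self.timeout_s:
            expired.append(self._items.popleft()[1])
        return expired

    def __len__(self) -> int:
        return len(self._items)
```

In this model, a message queued at t=0 is discarded by a purge at t=2.0, while one queued at t=1.5 survives until t=3.5.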

Based on these sequences of events, please answer the following questions:

  1. Why did some connections go to CL3 and others to CL98?
  2. Was going directly into CL98 congestion, rather than into CL3 congestion first, a factor in delaying the start of the expected inactivity/twinit timer sequence of events?
  3. Is there anything we can configure to make this failover faster?

Solution

