[OpenSIPS-Devel] [ opensips-Bugs-3585606 ] TCP Deadlock

Tue Dec 4 00:33:19 CET 2012

Bugs item #3585606, was opened at 2012-11-08 20:47
Message generated for change (Comment added) made by dmsanders
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3585606&group_id=232389

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: core
Group: 1.8.x
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: David Sanders (dmsanders)
Assigned to: Bogdan-Andrei Iancu (bogdan_iancu)
Summary: TCP Deadlock

Initial Comment:
There is a serious deadlock issue when using TCP with OpenSIPS (1.8.0-tls). I found this paper which has the same conclusion (but is discussing OpenSER circa 2008): http://www.cs.rice.edu/CS/Architecture/docs/ram-ispass08.pdf

I'll quote the relevant part of Section 6:

This can lead to deadlock in the following situation. When a
worker process requests a connection from
the supervisor process, it then blocks waiting to receive that
file descriptor. If, at the same time, the supervisor process
blocks waiting to send a new connection to the same worker
(since the buffer at the receiver is full), the two processes
will deadlock. Once the supervisor process deadlocks, no
other worker can make progress either, as they will quickly
need their own connections from the supervisor process.
Similarly, no new connections will be accepted. This clearly
illustrates that in an event-driven server, one must be careful
to only read from sockets when the event mechanism says
there is something to read and only write to sockets when
the event mechanism says there is space to write.
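
To illustrate the rule in the last sentence of that quote, here is a minimal sketch, not OpenSIPS code. It assumes a Unix datagram socket pair stands in for the supervisor-to-worker channel; the names try_send, fill and drain and the "fd-handoff" payload are made up for the example. The sender writes only when select() reports the socket writable, so a full queue yields a "try later" result instead of a blocked process.

```python
import errno
import select
import socket

# Error codes that mean "no room right now" on a non-blocking send.
_WOULD_BLOCK = {errno.EAGAIN, errno.EWOULDBLOCK, errno.ENOBUFS}


def try_send(sock, payload):
    """Send payload only if select() reports sock writable; never block."""
    _, writable, _ = select.select([], [sock], [], 0)
    if not writable:
        return False
    try:
        sock.send(payload)
        return True
    except OSError as exc:
        if exc.errno in _WOULD_BLOCK:
            return False
        raise


def fill(sock, payload, cap=100_000):
    """Send until the channel reports full; return how many sends succeeded."""
    sent = 0
    while sent < cap and try_send(sock, payload):
        sent += 1
    return sent


def drain(sock):
    """Read every queued datagram without blocking; return the count read."""
    count = 0
    while select.select([sock], [], [], 0)[0]:
        sock.recv(4096)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical stand-ins for the supervisor and one worker.
    supervisor, worker = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    supervisor.setblocking(False)
    print("queued before full:", fill(supervisor, b"fd-handoff"))
    print("send while full:", try_send(supervisor, b"fd-handoff"))
    print("drained:", drain(worker))
    print("send after drain:", try_send(supervisor, b"fd-handoff"))
```

With a blocking send() in place of try_send, the supervisor would stall at the "send while full" step until the worker reads, which is the half of the deadlock the paper describes.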

I can reliably reproduce this deadlock with any number of TCP children. Interestingly, it seems to happen faster with a larger number of children. Under constant load, once the main TCP process deadlocks, all the children do as well.

It seems to be rate related. Using SIPp to drive TCP traffic to an OpenSIPS server, I see no deadlock at 50 registers/second. However, if I increase the traffic load, a deadlock occurs within 30 seconds. My theory is that if the TCP children can't process a message and reply faster than messages are coming in (in this case, faster than one every 20 ms), then the deadlock will occur.
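
The arithmetic behind that 20 ms figure can be sketched as follows. This is only a simplified model of the theory, assuming a single consumer with a fixed per-message service time and a bounded queue of hypothetical length; it does not model the descriptor-passing deadlock itself, and the function names are invented for the example.

```python
def inter_arrival_ms(rate_per_s):
    """Average gap between messages, in milliseconds, at a given rate."""
    return 1000.0 / rate_per_s


def seconds_until_full(arrival_per_s, service_ms, queue_len):
    """Time for a bounded queue to fill, or None if the consumer keeps up.

    Assumes one consumer that needs service_ms per message (a hypothetical
    figure) and a queue that holds queue_len messages.
    """
    service_per_s = 1000.0 / service_ms
    backlog_growth = arrival_per_s - service_per_s
    if backlog_growth <= 0:
        return None  # consumer is at least as fast as the arrivals
    return queue_len / backlog_growth


if __name__ == "__main__":
    print(inter_arrival_ms(50))
    print(seconds_until_full(50, 20, 10))
    print(seconds_until_full(60, 20, 10))
```

In this model, anything above the break-even rate fills the queue eventually, and a longer queue only delays that point.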

For completeness, the GDB backtrace output of the deadlocked processes when running two TCP children is attached.

----------------------------------------------------------------------

>Comment By: David Sanders (dmsanders)
Date: 2012-12-03 15:33

Message:
Until this can be fixed, I've found that the problem can be mitigated by
raising "net.unix.max_dgram_qlen" via sysctl. It defaults to 10, and I've
seen the issue greatly reduced after increasing it to 100.
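
For anyone wanting to check the current value before changing it, here is a small sketch that reads the setting from procfs. It assumes a Linux host where the sysctl is exposed at /proc/sys/net/unix/max_dgram_qlen; the helper name read_max_dgram_qlen is made up for the example. Changing the value needs root, for instance with "sysctl -w net.unix.max_dgram_qlen=100".

```python
from pathlib import Path

# Assumed procfs location of the net.unix.max_dgram_qlen sysctl on Linux.
QLEN_PATH = Path("/proc/sys/net/unix/max_dgram_qlen")


def read_max_dgram_qlen(path=QLEN_PATH):
    """Return the configured queue length, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        return None


if __name__ == "__main__":
    value = read_max_dgram_qlen()
    if value is None:
        print("net.unix.max_dgram_qlen is not readable on this system")
    else:
        print("net.unix.max_dgram_qlen =", value)
```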

----------------------------------------------------------------------

Comment By: David Sanders (dmsanders)
Date: 2012-11-23 22:21

Message:
Hi Bogdan,

Is there any news on this?

Could you give an estimate of when this could be fixed?
Before the end of the year, or not until 1.9?

Any information on a timeline would help me figure out how to proceed with
my project in a timely manner.

Thanks,
- David 

----------------------------------------------------------------------

Comment By: Bogdan-Andrei Iancu (bogdan_iancu)
Date: 2012-11-09 09:40

Message:
Hi David - thank you for the report - I will look into it ASAP!

Regards,
Bogdan

----------------------------------------------------------------------

Comment By: David Sanders (dmsanders)
Date: 2012-11-08 20:49

Message:
I took the liberty of upgrading this to a higher priority bug since it can
completely deadlock TCP traffic for a server if the call rate gets too
high.

----------------------------------------------------------------------
