[OpenSIPS-Devel] [ opensips-Bugs-3542814 ] Lockup when using dialog pings

SourceForge.net noreply at sourceforge.net
Wed Aug 8 16:44:49 CEST 2012


Bugs item #3542814, was opened at 2012-07-11 16:22
Message generated for change (Settings changed) made by bogdan_iancu
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3542814&group_id=232389

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: modules
Group: 1.8.x
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Ryan Bullock (rrb3942)
Assigned to: Vladut-Stefan Paiu (vladut-paiu)
Summary: Lockup when using dialog pings

Initial Comment:
After enabling dialog pings toward the caller (via create_dialog("PB")) on a fairly busy system, I have experienced a couple of instances where opensips locks up and stops processing messages. During this time CPU usage also skyrockets.

At debug=3 nothing interesting seems to show up in the logs.

Restarting opensips leaves it operational for a while, but it can lock up again. Removing the "P" option from create_dialog stops the lockups altogether.

ping_interval was set to 60 seconds, timeout_avp is used and set in the script.

Seems like there might be a deadlock/livelock issue with the dialog pings.

Opensips build information:
version: opensips 1.8.0-notls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: 2:9084M
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 14:07:51 Jun 13 2012 with gcc 4.4.6


Sorry if the details are light, but not much information was generated (logs stopped at the time of the lock as well), other than messages not being processed and high cpu usage.

----------------------------------------------------------------------

Comment By: Ryan Bullock (rrb3942)
Date: 2012-07-31 14:57

Message:
Looks like that fixed it.

Put opensips under heavy load in the lab and it ran clean for 6 hours.
Normally I would get the deadlock almost immediately or after just a few
minutes of running traffic.

I will be able to test it in production next week as well, but this looks
good.

Thanks for all the help.

----------------------------------------------------------------------

Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2012-07-31 06:21

Message:
Hello,

Can you please update to the latest 1.8 branch ?
I have just committed a fix for this. Let me know if the issue still
happens.

Regards,
Vlad

----------------------------------------------------------------------

Comment By: Ryan Bullock (rrb3942)
Date: 2012-07-30 14:34

Message:
Ok, I think I see why this is happening.

get_timeout_dlgs() FIRST grabs the ping_timer lock and SECOND grabs the
lock on the current dialog.

When a timeout happens unref_dlg() FIRST grabs the lock on the current
dialog. remove_ping_timer() is then called and grabs the ping_timer lock
SECOND.

Therefore it is possible for one process to hold the ping_timer lock and
be stuck trying to lock the current dialog, while another process has
already locked the current dialog (for teardown) and is attempting to
get the ping_timer lock.
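
To illustrate the lock-order inversion described above, here is a minimal
sketch of one common remedy: having both code paths take the two locks in the
same order. It is not taken from the OpenSIPS sources and is not necessarily
what the committed fix does. The names ping_timer_lock, dialog_lock,
timer_scan(), teardown() and run_lock_order_demo() are hypothetical stand-ins,
and POSIX threads stand in for OpenSIPS processes sharing locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ping_timer lock and one dialog's lock. */
static pthread_mutex_t ping_timer_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dialog_lock     = PTHREAD_MUTEX_INITIALIZER;

static int  dialog_on_ping_list = 1; /* guarded by ping_timer_lock */
static long scans_done = 0;          /* guarded by ping_timer_lock */

#define ITERATIONS 100000

/* Timer side: ping_timer lock FIRST, dialog lock SECOND, as in the report. */
static void *timer_scan(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&ping_timer_lock);
        pthread_mutex_lock(&dialog_lock);
        scans_done++;
        pthread_mutex_unlock(&dialog_lock);
        pthread_mutex_unlock(&ping_timer_lock);
    }
    return NULL;
}

/* Teardown side: takes the locks in the SAME order as the timer side,
 * so neither thread can hold one lock while waiting on the other. */
static void *teardown(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&ping_timer_lock);
        pthread_mutex_lock(&dialog_lock);
        dialog_on_ping_list = 0;
        pthread_mutex_unlock(&dialog_lock);
        pthread_mutex_unlock(&ping_timer_lock);
    }
    return NULL;
}

/* Runs both sides concurrently; returns the number of completed scans,
 * or -1 if a thread could not be created. */
long run_lock_order_demo(void)
{
    pthread_t scanner, remover;

    scans_done = 0;
    dialog_on_ping_list = 1;
    if (pthread_create(&scanner, NULL, timer_scan, NULL) != 0)
        return -1;
    if (pthread_create(&remover, NULL, teardown, NULL) != 0) {
        pthread_join(scanner, NULL);
        return -1;
    }
    pthread_join(scanner, NULL);
    pthread_join(remover, NULL);
    return scans_done;
}
```

Swapping the two pthread_mutex_lock() calls in teardown() recreates the
inverted order described above, and the demo may then hang with each thread
holding one lock and waiting for the other.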

Hope this helps.

----------------------------------------------------------------------

Comment By: Ryan Bullock (rrb3942)
Date: 2012-07-30 14:12

Message:
Attached a full backtrace for the process doing get_timeout_dlgs(), which
seems to be where the lockup is occurring. The deadlock seems to occur when
this function calls dlg_lock_dlg(current);

----------------------------------------------------------------------

Comment By: Ryan Bullock (rrb3942)
Date: 2012-07-27 10:38

Message:
Hey Vlad,

Been doing some stress testing in a small test lab to get this to happen
and have had some good success.

I attached the log with the applied patch that you requested.

Since I can get this to happen in a lab now, I should be able to turn on
some more logging if that would help.

Regards,

Ryan

----------------------------------------------------------------------

Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2012-07-19 05:58

Message:
Hello Ryan,

Could you please try the attached patch ? It should give out some more
information about the dead-lock.
The patch prints some info to the logging facility, so when OpenSIPS locks
up, please send the log file.

Regards,
Vlad

----------------------------------------------------------------------

Comment By: Ryan Bullock (rrb3942)
Date: 2012-07-18 09:56

Message:
Full bt added from a runaway process.

I updated opensips as well, new version information:

version: opensips 1.8.0-notls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, DISABLE_NAGLE, USE_MCAST, SHM_MEM,
SHM_MMAP, PKG_MALLOC, DBG_QM_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16,
MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: 2:9141M
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 13:06:46 Jul 13 2012 with gcc 4.4.6

----------------------------------------------------------------------

Comment By: Ryan Bullock (rrb3942)
Date: 2012-07-13 13:35

Message:
Probably not useful: I got a bt from a process, but unfortunately gdb couldn't
load the debug symbols for opensips. Trying to figure out why that is and will
try to catch it again and provide a better backtrace.

----------------------------------------------------------------------

Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2012-07-12 01:44

Message:
Hello,

When OpenSIPS gets stuck again at 100% CPU, can you please get the PID of an
OpenSIPS process that's stuck and do
      gdb [path_to_opensips_binary] [pid]
inside gdb do
      bt full
and paste here the output so I can further debug this.

Regards,
Vlad

----------------------------------------------------------------------
