[OpenSIPS-Devel] [ opensips-Bugs-3570495 ] crashes in the dialog module with opensips 1.8.1-tls
SourceForge.net
noreply at sourceforge.net
Tue Oct 2 15:30:26 CEST 2012
Bugs item #3570495, was opened at 2012-09-21 05:26
Message generated for change (Comment added) made by dragosoancea
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3570495&group_id=232389
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Dragos Oancea (dragosoancea)
Assigned to: Nobody/Anonymous (nobody)
Summary: crashes in the dialog module with opensips 1.8.1-tls
Initial Comment:
Hi ,
The setup where I am experiencing these crashes has the following particularities:
1. TLS
My TLS config looks like this:
listen = tls:212.162.x.x:5061
listen = tcp:172.20.x.x:5060
# tls params
disable_tls = no
tls_verify_server = 1
tls_verify_client = 0
tls_require_client_certificate = 0
tls_method = TLSv1
tls_certificate = "/etc/pki/CA/certs/mydomain.crt"
tls_private_key = "/etc/pki/CA/private/mydomain.key"
tls_ca_list = "/etc/pki/CA/certs/ca.crt"
# tcp parameters
tcp_connection_lifetime=1860
open_files_limit=8192
tcp_max_connections=4096
2. SIP ping for NAT bypassing
I create the dialog with
create_dialog("BPp") , sending OPTIONS to both sides .
I also call fix_nated_contact() on requests and replies.
3. I have calls where a protocol change is necessary , like UDP->TLS , TLS->UDP , TLS->TCP .
I use:
modparam("rr", "enable_double_rr", 1)
4. I use the same database for multiple SIP proxies.
Unfortunally opensips was not compiled with debugging symbols.
Here 1st crash ( I saw this one ALOT of times) :
Core was generated by `/usr/sbin/opensips -P /var/run/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
Missing separate debuginfos, use: debuginfo-install opensips-1.8.1-3.el6.x86_64
(gdb) bt full
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
No symbol table info available.
#1 0x000000000047d4ca in start_timer_processes ()
No symbol table info available.
#2 0x000000000042a814 in main ()
No symbol table info available.
Core was generated by `/usr/sbin/opensips -P /var/run/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
Missing separate debuginfos, use: debuginfo-install opensips-1.8.1-3.el6.x86_64
(gdb) i r
rax 0x7f15e430a388 139731999433608
rbx 0x1 1
rcx 0x7f15e43020a0 139731999400096
rdx 0x400 1024
rsi 0x1 1
rdi 0x200000009 8589934601
rbp 0x7f15e4792570 0x7f15e4792570
rsp 0x7fff372daaa0 0x7fff372daaa0
r8 0x7f15e4792558 139732004185432
r9 0x6 6
r10 0x4000 16384
r11 0x0 0
r12 0x7f15e47926e8 139732004185832
r13 0x200000009 8589934601
r14 0x769660 7771744
r15 0x734cc8 7556296
rip 0x7f160948c7e8 0x7f160948c7e8 <dlg_ping_routine+280>
eflags 0x10202 [ IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) bt
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
#1 0x000000000047d4ca in start_timer_processes ()
#2 0x000000000042a814 in main ()
(gdb) bt full
#0 0x00007f160948c7e8 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
No symbol table info available.
#1 0x000000000047d4ca in start_timer_processes ()
No symbol table info available.
#2 0x000000000042a814 in main ()
No symbol table info available.
(gdb)
this crash happends from time to time , unrelated to the load apparently.
machine configuration:
===================
Centos 6.3
Linux somedomain.com 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 18:58:52 BST 2012 x86_64 x86_64 x86_64 GNU/Linux
opensips -V
version: opensips 1.8.1-tls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, USE_TLS, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: unknown
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 17:04:54 Sep 8 2012 with gcc 4.4.6
openssl version:
openssl-1.0.0-20.el6_2.5.x86_64
======================
-------------------------------------------------------------------
-------------------------------------------------------------------
Here is the 2nd crash - it could be related to the first. (this is on another box, with a
slightly different config file (it listens to UDP as well)
I did not see this showing like below in gdb on opensips 1.7.2 (it never used to crash in unref_dlg , only in dlg_ping_routing())
Core was generated by `/usr/sbin/opensips -P /var/run/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fe93ab05491 in unref_dlg () from /usr/lib64/opensips/modules/dialog.so
Missing separate debuginfos, use: debuginfo-install opensips-1.8.1-3.el6.x86_64
(gdb) bt full
#0 0x00007fe93ab05491 in unref_dlg () from /usr/lib64/opensips/modules/dialog.so
No symbol table info available.
#1 0x00007fe93ab0e7f5 in dlg_ping_routine () from /usr/lib64/opensips/modules/dialog.so
No symbol table info available.
#2 0x000000000047d4ca in start_timer_processes ()
No symbol table info available.
#3 0x000000000042a814 in main ()
No symbol table info available.
machine configuration (almost the same config as 2st crash):
===============
Centos 6.3
Linux somedomain.com 2.6.32-279.2.1.el6.x86_64 #1 SMP Fri Jul 20 01:55:29 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
opensips -V
version: opensips 1.8.1-tls (x86_64/linux)
flags: STATS: Off, USE_IPV6, USE_TCP, USE_TLS, DISABLE_NAGLE, USE_MCAST, SHM_MEM, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
svnrevision: unknown
@(#) $Id: main.c 8772 2012-03-08 11:16:13Z bogdan_iancu $
main.c compiled on 17:04:54 Sep 8 2012 with gcc 4.4.6
opessl version:
openssl-1.0.0-20.el6_2.5.x86_64
================
syslog of the crashed process:
http://pastebin.com/F3hu1WY5
please notice something that looks like a memory leak in opensips:
CRITICAL:dialog:unref_dlg: bogus ref -1788757169 with cnt 1 for dlg 0x7fe915eccd30 [0:0] with clid '' and tags 'EDH-RSA-DES-CBC3-SHA' ''
tags "EDH-RSA-DES-CBC3-SHA" ? these are ssl ciphers. how could they be related to the dialog module ?
One particular thing related to the second bug is that just before it crashed, opensips was showing alot of messages like this,
coming everytime from the same remote unknown client , but with different source port:
ERROR:core:tls_accept: New TLS connection from 82.113.x.x:43574 failed to accept: rejected by client
ERROR:core:tls_accept: New TLS connection from 82.113.x.x:61586 failed to accept: rejected by client
I have core files, let me know if I can upload them somewhere.
Cheers,
Dragos
----------------------------------------------------------------------
>Comment By: Dragos Oancea (dragosoancea)
Date: 2012-10-02 06:30
Message:
Hi
Yes, both patches were applied at the same time.
The only thing changed is the ping_timeout option which was 20 seconds, now
is 40 seconds.
But I am not sure this is relevant.
It is looking good so far. I had no crash but I also had to restart
opensips yesterday because I made a script change to disable tcp keepalive
which was draining out the battery of our mobile devices.
I also noticed that tcp keepalive is enabled by default in opensips 1.8.1 ,
the tcp socket is created with the SO_KEEPALIVE option regardless of the
state of the tcp_keepalive switch of the script.
I will keep you updated.
Cheers,
Dragos Oancea
----------------------------------------------------------------------
Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2012-10-01 05:45
Message:
Hello,
Ok,cool. Is this with both patches applied ( dlg patch + TLS patch ) ?
Please keep us up to date if things remain stable in time.
Regards,
Vlad
----------------------------------------------------------------------
Comment By: Dragos Oancea (dragosoancea)
Date: 2012-10-01 05:00
Message:
Hi
Thank you for the feedback.
I applied the patch three days ago. Looks good so far.
Cheers,
Dragos
----------------------------------------------------------------------
Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2012-09-28 00:42
Message:
Hello,
Also, for the TLS memory corruption, can you please try the patch for this
trunk revision :
http://opensips.svn.sourceforge.net/viewvc/opensips?view=revision&revision=9268
It added support for proper locking within the openssl library. It's not
yet back-ported to the 1.8 branch, as we felt it needed a little bit more
testing.
Regards,
Vlad
----------------------------------------------------------------------
Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2012-09-28 00:32
Message:
Hello,
Can you apply the attached patch, and see if you still experience these
issues ?
Thanks and Regards,
Vlad
----------------------------------------------------------------------
Comment By: Dragos Oancea (dragosoancea)
Date: 2012-09-27 07:55
Message:
Hi
It happened again.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fe93ab05491 in unlink_unsafe_dlg (dlg=0x7fe915eccd30, cnt=1) at
dlg_hash.c:706
706 dlg->next->prev = dlg->prev;
(gdb) bt full
#0 0x00007fe93ab05491 in unlink_unsafe_dlg (dlg=0x7fe915eccd30, cnt=1) at
dlg_h
ash.c:706
No locals.
#1 unref_dlg (dlg=0x7fe915eccd30, cnt=1) at dlg_hash.c:740
d_entry = 0x7fe915c3f990
__FUNCTION__ = "unref_dlg"
#2 0x00007fe93ab0e7f5 in dlg_ping_routine (ticks=<value optimized out>,
attr=<v
alue optimized out>) at dlg_timer.c:529
expired = <value optimized out>
it = 0x0
curr = 0x0
dlg = 0x7fe915eccd30
__FUNCTION__ = "dlg_ping_routine"
#3 0x000000000047d4ca in timer_ticker () at timer.c:355
t = 0x7fe93c19c810
#4 run_timer_process () at timer.c:425
multiple = 10
cnt = <value optimized out>
tv = {tv_sec = 0, tv_usec = 0}
#5 start_timer_processes () at timer.c:522
tpl = 0x7fe93c20aa60
pid = <value optimized out>
__FUNCTION__ = "start_timer_processes"
#6 0x000000000042a814 in main_loop (argc=<value optimized out>,
argv=<value optimized out>) at main.c:933
i = <value optimized out>
pid = <value optimized out>
si = 0x0
startup_done = 0x0
chd_rank = 0
rc = <value optimized out>
load_p = 0x0
#7 main (argc=<value optimized out>, argv=<value optimized out>) at
main.c:1520
cfg_log_stderr = <value optimized out>
cfg_stream = <value optimized out>
c = <value optimized out>
r = <value optimized out>
tmp = 0x1 <Address 0x1 out of bounds>
tmp_len = <value optimized out>
port = <value optimized out>
proto = <value optimized out>
options = 0x4eb5a8 "f:cCm:M:b:l:n:N:rRvdDETSVhw:t:u:g:P:G:W:o:"
ret = -1
seed = 1183817865
rfd = <value optimized out>
__FUNCTION__ = "main"
This function is called when Opensips would try to end the dialog sending
BYE to both sides, due to the fact it did not receive a reply to OPTIONS
request that it's sending.
These are the comments of a good samaritan on IRC (freenode / #opensips) ,
in regard to this bug :
.......
> how many process have the run_timer_process ?
> Only one or multiple ?
> the dialog may have been deleted after get_timeout_dlgs() which hold the
lock for lock_get(ping_timer->lock);
> OR
> /* no longer reffed in list */
> unref_dlg(dlg,1);
.......
> how many process have the run_timer_process ?
> Only one or multiple ?
> the dialog may have been deleted after get_timeout_dlgs() which hold the
lock for lock_get(ping_timer->lock);
> OR
> /* no longer reffed in list */
> unref_dlg(dlg,1);
> /* dlg is still reffed in TM callback. deletion from memory
> * will happen only on 408 timeout */
> so that if you ping a leg
> and there is no answer
> and while you called get_timeout_dlgs() you received a 408
> the dialog may be destroyed
> if it happens randomly when a pinged leg doesn't answer then it might be
related to a race condition with the 408 (or any other path that destroy
the dialog)
> (the dialog extracted from the ping list)
..........
PS: Thank you for the comments, these might really help!
Cheers,
Dragos
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3570495&group_id=232389
More information about the Devel
mailing list