[OpenSIPS-Devel] [ opensips-Bugs-3182321 ] segfault in f_malloc.c, chains from dialog module

SourceForge.net noreply at sourceforge.net
Tue Jun 28 10:12:23 CEST 2011


Bugs item #3182321, was opened at 2011-02-15 16:54
Message generated for change (Comment added) made by bogdan_iancu
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3182321&group_id=232389

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: modules
Group: 1.6.x
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Robert Smith (denodaeus)
Assigned to: Bogdan-Andrei Iancu (bogdan_iancu)
Summary: segfault in f_malloc.c, chains from dialog module

Initial Comment:
We've had several crashes now, R7081 (this is pretty much 1.6.3 pulled the day of release with a couple of bug fix patches in textops and loadbalancer applied):

It seems to be fairly consistently happening around free_dlg_dlg in dlg_hash.c which is triggering it -- we have several cores showing this behavior.

What are the next steps for debugging this?  This is on a production machine and we haven't been able to reproduce it elsewhere, so we are somewhat sensitive to load issues while trying to troubleshoot this.

Thanks.



Program terminated with signal 11, Segmentation fault.
#0  0x00000000004a2bcf in fm_insert_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:155
155			(*f)->prev = &(frag->u.nxt_free);
(gdb) bt full
#0  0x00000000004a2bcf in fm_insert_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:155
        f = 0x2b875eba4070
        hash = 3
#1  fm_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:460
        f = 0x2b875f41be40
        n = <value optimized out>
        __FUNCTION__ = "fm_free"
#2  0x00002b875e14dcde in free_dlg_dlg () at dlg_hash.c:168
        i = 0
#3  destroy_dlg_table () at dlg_hash.c:234
        dlg = 0x2b875f319a80
        i = 5
#4  0x00002b875e13a651 in mod_destroy () at dialog.c:742
No locals.
#5  0x0000000000476ff4 in destroy_modules () at sr_module.c:370
        t = 0x785328
        foo = 0x785258
#6  0x00000000004252e0 in cleanup (show_status=1) at main.c:336
No locals.
#7  0x00000000004261fb in handle_sigs () at main.c:533
        chld = 0
        chld_status = 139
        i = <value optimized out>
        do_exit = 1
        __FUNCTION__ = "handle_sigs"
#8  0x000000000042a5a9 in main_loop (argc=3, argv=0x4) at main.c:913
        i = 4
        pid = <value optimized out>
        si = 0x0
        startup_done = 0x0
        chd_rank = 16
        __FUNCTION__ = "main_loop"
#9  main (argc=3, argv=0x4) at main.c:1388
        cfg_log_stderr = <value optimized out>
        cfg_stream = 0x831e010
        c = <value optimized out>
        r = <value optimized out>
        tmp = 0x4fc447 "H\215\005\262\333#"
        tmp_len = <value optimized out>
        port = <value optimized out>
        proto = <value optimized out>
        ret = <value optimized out>
        seed = 586369539
        rfd = <value optimized out>
        __FUNCTION__ = "main"
(gdb) 


(gdb) frame 0
#0  0x00000000004a2bcf in fm_insert_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:155
155			(*f)->prev = &(frag->u.nxt_free);
(gdb) list
150		
151		/*insert it here*/
152		frag->prev = f;
153		frag->u.nxt_free=*f;
154		if( *f )
155			(*f)->prev = &(frag->u.nxt_free);
156	
157		*f=frag;
158		qm->free_hash[hash].no++;
159		free_plus(qm , frag->size);
(gdb) info locals
f = 0x2b875eba4070
hash = 3
(gdb) print *f
$1 = (struct fm_frag *) 0x2b865f235290
(gdb) print *(*f)
Cannot access memory at address 0x2b865f235290
(gdb) frame 1
#1  fm_free (qm=0x2b875eba4000, p=<value optimized out>) at mem/f_malloc.c:460
460		fm_insert_free(qm, f);
(gdb) print f
$2 = (struct fm_frag *) 0x2b875f41be40
(gdb) list
455			goto join;
456		}
457	
458	no_join:
459	
460		fm_insert_free(qm, f);
461	}
462	
463	
464	#ifdef DBG_F_MALLOC
(gdb) 


And a second crash here a few hours after restarting from above:

Core was generated by `/usr/local/opensips/sbin/opensips -u opensips'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004a2bcf in fm_insert_free (qm=0x2aca30c14000, p=<value optimized out>) at mem/f_malloc.c:155
155			(*f)->prev = &(frag->u.nxt_free);
(gdb) bt full
#0  0x00000000004a2bcf in fm_insert_free (qm=0x2aca30c14000, p=<value optimized out>) at mem/f_malloc.c:155
        f = 0x2aca30c14070
        hash = 3
#1  fm_free (qm=0x2aca30c14000, p=<value optimized out>) at mem/f_malloc.c:460
        f = 0x2aca315c1718
        n = <value optimized out>
        __FUNCTION__ = "fm_free"
#2  0x00002aca301bdcde in free_dlg_dlg () at dlg_hash.c:168
        i = 0
#3  destroy_dlg_table () at dlg_hash.c:234
        dlg = 0x2aca31692c70
        i = 0
#4  0x00002aca301aa651 in mod_destroy () at dialog.c:742
No locals.
#5  0x0000000000476ff4 in destroy_modules () at sr_module.c:370
        t = 0x785328
        foo = 0x785258
#6  0x00000000004252e0 in cleanup (show_status=1) at main.c:336
No locals.
#7  0x00000000004261fb in handle_sigs () at main.c:533
        chld = 0
        chld_status = 139
        i = <value optimized out>
        do_exit = 1
        __FUNCTION__ = "handle_sigs"
#8  0x000000000042a5a9 in main_loop (argc=3, argv=0x4) at main.c:913
        i = 4
        pid = <value optimized out>
        si = 0x0
        startup_done = 0x0
        chd_rank = 16
        __FUNCTION__ = "main_loop"
#9  main (argc=3, argv=0x4) at main.c:1388
        cfg_log_stderr = <value optimized out>
        cfg_stream = 0x14cfb010
        c = <value optimized out>
        r = <value optimized out>
        tmp = 0x4fc447 "H\215\005\262\333#"


----------------------------------------------------------------------

>Comment By: Bogdan-Andrei Iancu (bogdan_iancu)
Date: 2011-06-28 11:12

Message:
Hi Robert,

Do you still experience the crash with the latest opensips version ?

Regards,
Bogdan

----------------------------------------------------------------------

Comment By: Robert Smith (denodaeus)
Date: 2011-03-04 17:05

Message:
(gdb) bt
#0  fm_remove_free (qm=0x2ade7185d000, p=0x2ade71fd2450, size=<value
optimized out>) at mem/f_malloc.c:175
#1  fm_realloc (qm=0x2ade7185d000, p=0x2ade71fd2450, size=<value optimized
out>) at mem/f_malloc.c:530
#2  0x00002ade70dee4b4 in shm_realloc (dlg=0x2ade722260c8,
tag=0x7fffbd4d4240, rr=0x7fffbd4d40f0, contact=0x7fffbd4d4100,
cseq=0x7fffbd4d4110, sock=0x784710) at ../../mem/shm_mem.h:212
#3  dlg_add_leg_info (dlg=0x2ade722260c8, tag=0x7fffbd4d4240,
rr=0x7fffbd4d40f0, contact=0x7fffbd4d4100, cseq=0x7fffbd4d4110,
sock=0x784710) at dlg_hash.c:309
#4  0x00002ade70de472c in init_leg_info (dlg=0x2ade722260c8, msg=0x7a3968,
t=<value optimized out>, tag=0x7fffbd4d4240) at dlg_handlers.c:242
#5  0x00002ade70de9a6b in push_reply_in_dialog (t=0x2ade7205c1d8,
type=<value optimized out>, param=<value optimized out>) at
dlg_handlers.c:297
#6  dlg_onreply (t=0x2ade7205c1d8, type=<value optimized out>,
param=<value optimized out>) at dlg_handlers.c:342
#7  0x00002ade6f4ed8db in run_trans_callbacks (type=16,
trans=0x2ade7205c1d8, req=0x2ade71ed8ac0, rpl=0x7a3968, code=<value
optimized out>) at t_hooks.c:208
#8  0x00002ade6f508107 in relay_reply (t=0x2ade7205c1d8, p_msg=0x7a3968,
branch=<value optimized out>, msg_status=183, cancel_bitmap=0x7fffbd4d44e8)
at t_reply.c:1147
#9  0x00002ade6f508990 in reply_received (p_msg=0x7a3968) at
t_reply.c:1494
#10 0x0000000000421e98 in forward_reply (msg=0x7a3968) at forward.c:561
#11 0x00000000004570c2 in receive_msg (
    buf=0x758f60 "XXXXXXXXXXX"..., len=1208, rcv_info=0x7fffbd4d45d0) at
receive.c:200
#12 0x000000000049be14 in udp_rcv_loop () at udp_server.c:492
#13 0x000000000042a65d in main_loop (argc=3, argv=<value optimized out>)
at main.c:818
#14 main (argc=3, argv=<value optimized out>) at main.c:1388


In this circumstance, it looks like in dlg_add_leg_info, the callID in
dialog is getting overwritten:

print dlg->callid.s
$10 = 0x2ade72226178
"d71c112bbc80838a0410c89217c4968d at 10.2.1.68sip:6785008449 at 10.2.1.68sip:19186128830 at my.sipprovider.comgent:
OpenSIPS SIP (1.6.3-notls (x86_64/linux))\r\n\r\n\r\n\r\nngP\002"



----------------------------------------------------------------------

Comment By: Robert Smith (denodaeus)
Date: 2011-03-03 17:41

Message:
I have the full contents of memdump=2 in a logfile (it's quite large) that
I can send privately if it's needed.  I don't know if this is providing a
lot of detail about the crash though, it's pretty ambiguously to the memory
manager.

----------------------------------------------------------------------

Comment By: Robert Smith (denodaeus)
Date: 2011-03-03 17:40

Message:
Putting memory support in, and only memdumping (level 2 as log level is
currently 3 and this is a production machine) is causing a sigabrt to
occur.  The pertinent info:

/usr/local/opensips/sbin/opensips[27869]: CRITICAL:core:qm_debug_frag: 
qm_*: prev. fragm. tail overwritten(c0c0c0c0,
abcdefe5)[0x2b71c0960470:0x2b71c09604a0]!

<aidanna> seems like the fragment is here: 
2011-03-03T15:01:29.496041+00:00 /usr/local/opensips/sbin/opensips[27846]: 
   17533. N  address=0x2b71c09604a0 frag=0x2b71c0960470 size=40 used=1
<aidanna> 2011-03-03T15:01:29.496049+00:00
/usr/local/opensips/sbin/opensips[27846]:             alloc'd from proxy.c:
hostent_shm_cpy(83)


----------------------------------------------------------------------

Comment By: Bogdan-Andrei Iancu (bogdan_iancu)
Date: 2011-02-18 22:06

Message:
Hi Robert,

Thanks for the info. It seams to be a generic memory corruption which is a
a bit impossible to trace only from core file.

The best way to trace this is by enabling and running with the memory
debugger:

1. edit Makefile.defs file and remove from the DEFS string the F_MALLOC
define (delete "-DF_MALLOC" line) and add the DBG_QM_MALLOC define (insert
a "-DDBG_QM_MALLOC" line).

2. recompile everything

3. set memlog=6 and memdump=1 in your configuration script (to get the
memory dump).

Start opensips and if memory corruption is detected, opensips will stop
generating core file and a memory dump.

Regards,
Bogdan

----------------------------------------------------------------------

Comment By: Robert Smith (denodaeus)
Date: 2011-02-18 19:09

Message:
Hi bogdan -- new crash file is attached, uac_crash.corefile.txt.

Thanks,

----------------------------------------------------------------------

Comment By: Bogdan-Andrei Iancu (bogdan_iancu)
Date: 2011-02-18 16:53

Message:
Hi Robert,

to get both core file,  you can just set the /proc/sys/kernel/core_pattern
to some name which will be unique per process.  
Or just set /proc/sys/kernel/core_uses_pid to '1', so you'll get a
core.PID file for every process that crashed.

And post the backtrace of the first corefile.

Regards,
Bogdan

----------------------------------------------------------------------

Comment By: Robert Smith (denodaeus)
Date: 2011-02-16 11:47

Message:
The shutdown was a sig11 which caused the attendant process to exit on
sig15.  I'll try to look at the root cause, but I originally thought this
was off of the sip receiver process.

----------------------------------------------------------------------

Comment By: Bogdan-Andrei Iancu (bogdan_iancu)
Date: 2011-02-15 22:59

Message:
Hi Robert,

The traces you posted show a crash in the shutdown procedure - can you
check in the opensips logs to see if the shutdown was a normal one or a
result of another crash? I try to exclude the possibility of a double
coredump (one of the opensips procs crashes a generates a cores -> entire
opensips starts to shutdown a crashes again -> second corefile,
overwritting the first one).

Regards,
Bogdan

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3182321&group_id=232389



More information about the Devel mailing list