<html><body>
<p>Hi Bogdan,<br>
<br>
I think there is likely another race condition in the tm timer code. The problem is when the same struct timer_link is used on two or more timer lists. For example, uac->request.fr_timer is used on both FR_TIMER_LIST and FR_INV_TIMER_LIST, and the call on t_reply.c:1532 moves it between timer lists. The problem is that the FR_TIMER_LIST is not locked while the move takes place. That is, set_timer() locks the list it is moving to, but not the list it is moving from. The thus call to remove_timer_unsafe() is not guarded by the appropriate lock.<br>
<br>
I found this while searching for the earlier problem. I added check into set_timer():
<ul> if (new_tl->timer_list==DETACHED_LIST){<br>
LM_CRIT("set_timer for %d list called on a \"detached\" "<br>
"timer -- ignoring: %p\n", list_id, new_tl);<br>
goto end;<br>
}<br>
/* make sure I'm not already on a list */<br>
if (new_tl->timer_list!=0 && new_tl->timer_list!=list) {<br>
LM_CRIT("set_timer for %d list called while on other list %d", list_id, new_tl->timer_list->id);<br>
abort();<br>
}<br>
remove_timer_unsafe( new_tl );<br>
<br>
insert_timer_unsafe( list, new_tl, timeout +<br>
((timer_id2type[list_id]==UTIME_TYPE)?get_uticks():get_ticks()));<br>
end:<br>
</ul>
And I got:<br>
<br>
Oct 26 21:37:37 org-sip01 /var/run/openser/opensips-proxy[9708]: CRITICAL:tm:set_timer: set_timer for 1 list called while on other list 0<br>
<br>
Regards,<br>
Kennard<br>
<br>
<br>
<img width="16" height="16" src="cid:1__=07BBFD5ADF89E6128f9e8a93df9@logitech.com" border="0" alt="Inactive hide details for Bogdan-Andrei Iancu ---10/25/2010 06:35:04 AM---Done! This bug is affecting trunk only."><font color="#424282">Bogdan-Andrei Iancu ---10/25/2010 06:35:04 AM---Done! This bug is affecting trunk only.</font><br>
<br>
<font size="2" color="#5F5F5F">From:        </font><font size="2">Bogdan-Andrei Iancu <bogdan@voice-system.ro></font><br>
<font size="2" color="#5F5F5F">To:        </font><font size="2">OpenSIPS users mailling list <users@lists.opensips.org></font><br>
<font size="2" color="#5F5F5F">Date:        </font><font size="2">10/25/2010 06:35 AM</font><br>
<font size="2" color="#5F5F5F">Subject:        </font><font size="2">Re: [OpenSIPS-Users] opensips tm timer core dump</font><br>
<font size="2" color="#5F5F5F">Sent by:        </font><font size="2">users-bounces@lists.opensips.org</font><br>
<hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br>
<br>
<br>
<tt>Done!<br>
<br>
This bug is affecting trunk only.<br>
<br>
Regards,<br>
Bogdan<br>
<br>
Bogdan-Andrei Iancu wrote:<br>
> Hi Kennard,<br>
><br>
> You are perfectly right - there was a change in moving the setT(cell) <br>
> upper in the function (we needed it in the callbacks), and this bug <br>
> looks like a side effect.<br>
><br>
> I will make the fix asap !<br>
><br>
> Thanks and regards,<br>
> Bogdan<br>
><br>
> Kennard_White@logitech.com wrote:<br>
> <br>
>> Hi Bogdan,<br>
>><br>
>> I believe I found the problem. When sip_msg_cloner() within <br>
>> build_cell() fails due to out-of-mem, and dangling pointer to the cell <br>
>> is left in the global transaction pointer. Later on the post_cb() code <br>
>> attempts to clean this up, and "resurrects" the now-free memory, and <br>
>> in particular puts it on a wait timer. My guess is that later on this <br>
>> memory is allocated into a new transaction which eventually uses the <br>
>> same wait timer link, and the two threads fight it out.<br>
>><br>
>> Adding a set_t(0) fixes the problem. While looking, I believe I found <br>
>> a memory leak in uac creation code that is also triggered by out of <br>
>> memory conditions. Please see attached patch.<br>
>><br>
>> The "problem" also manifest itself as a core dump with <br>
>> insert_timer_unsafe. Once I disabled optimization, I saw multiple <br>
>> variations within this function. The most common was that ptr==tl <br>
>> after the search. The sequence of events for this is: new cell <br>
>> created, stored into global T, free'd, added to wait-list by cleanup <br>
>> code from global T, memory re-allocated into new cell, timer link <br>
>> zero'd, and then added again to wait list. At least that is my best guess.<br>
>><br>
>> Regards,<br>
>> Kennard<br>
>><br>
>> /(See attached file: opensips-tm-cell.patch)/<br>
>><br>
>> Inactive hide details for Bogdan-Andrei Iancu ---10/22/2010 07:02:36 <br>
>> AM---Hi Kennard, I suppose the bt is the same ? do you <br>
>> stiBogdan-Andrei Iancu ---10/22/2010 07:02:36 AM---Hi Kennard, I <br>
>> suppose the bt is the same ? do you still have the core file ?<br>
>><br>
>> From: Bogdan-Andrei Iancu <bogdan@voice-system.ro><br>
>> To: OpenSIPS users mailling list <users@lists.opensips.org><br>
>> Date: 10/22/2010 07:02 AM<br>
>> Subject: Re: [OpenSIPS-Users] opensips tm timer core dump<br>
>> Sent by: users-bounces@lists.opensips.org<br>
>><br>
>> ------------------------------------------------------------------------<br>
>><br>
>><br>
>><br>
>> Hi Kennard,<br>
>><br>
>> I suppose the bt is the same ? do you still have the core file ?<br>
>><br>
>> Regards,<br>
>> Bogdan<br>
>><br>
>> Kennard_White@logitech.com wrote:<br>
>> <br>
>>> Hi Bodgen,<br>
>>><br>
>>> I replicated the error. Unfortunately the entire insert_timer_unsafe<br>
>>> and been in-lined and little is available:<br>
>>><br>
>>> Program terminated with signal 11, Segmentation fault.<br>
>>> #0 0x00007f8b8356c2c2 in insert_timer_unsafe (new_tl=0x7f8b7a54e310,<br>
>>> list_id=WT_TIMER_LIST, ext_timeout=<value optimized out>) at timer.c:731<br>
>>> 731 timer.c: No such file or directory.<br>
>>> in timer.c<br>
>>> (gdb) print tl<br>
>>> $1 = <value optimized out><br>
>>> (gdb) print *tl<br>
>>> Cannot access memory at address 0x0<br>
>>> (gdb) print ptr<br>
>>> $2 = <value optimized out><br>
>>> (gdb) print *ptr<br>
>>> Cannot access memory at address 0x0<br>
>>> (gdb) print *new_tl<br>
>>> No symbol "new_tl" in current context.<br>
>>> (gdb) up<br>
>>> #1 set_1timer (new_tl=0x7f8b7a54e310, list_id=WT_TIMER_LIST,<br>
>>> ext_timeout=<value optimized out>) at timer.c:904<br>
>>> 904 in timer.c<br>
>>> (gdb) print *new_tl<br>
>>> $3 = {next_tl = 0x0, prev_tl = 0x0, ld_tl = 0x0, time_out = 0,<br>
>>> timer_list = 0x0, deleted = 0}<br>
>>> (gdb) print list<br>
>>> $4 = <value optimized out><br>
>>> (gdb) print timeout<br>
>>> $5 = 32<br>
>>> (gdb) print new_tl<br>
>>> $6 = (struct timer_link *) 0x7f8b7a54e310<br>
>>><br>
>>> I'll keep the core for a while -- please let me know if there is<br>
>>> anything else I can try.<br>
>>><br>
>>> Thanks,<br>
>>> Kennard<br>
>>><br>
>>> Inactive hide details for Bogdan-Andrei Iancu ---10/08/2010 04:40:47<br>
>>> AM---Hi Kennard, Ok, keep the core next time :)Bogdan-Andrei Iancu<br>
>>> ---10/08/2010 04:40:47 AM---Hi Kennard, Ok, keep the core next time :)<br>
>>><br>
>>> From: Bogdan-Andrei Iancu <bogdan@voice-system.ro><br>
>>> To: OpenSIPS users mailling list <users@lists.opensips.org><br>
>>> Date: 10/08/2010 04:40 AM<br>
>>> Subject: Re: [OpenSIPS-Users] opensips tm timer core dump<br>
>>> Sent by: users-bounces@lists.opensips.org<br>
>>><br>
>>> ------------------------------------------------------------------------<br>
>>><br>
>>><br>
>>><br>
>>> Hi Kennard,<br>
>>><br>
>>> Ok, keep the core next time :)<br>
>>><br>
>>> Regards,<br>
>>> Bogdan<br>
>>><br>
>>> Kennard_White@logitech.com wrote:<br>
>>> <br>
>>>> Hi Bogden,<br>
>>>><br>
>>>> Thanks for explaining the child processes involved -- I misunderstood<br>
>>>> what was happening.<br>
>>>><br>
>>>> Unfortunately, I don't have the core anymore. My recollection is that<br>
>>>> I couldn't print anything useful due to compiler optimization. That<br>
>>>> said, this should re-create pretty easily, and I'll get more dumps<br>
>>>> next time it happens.<br>
>>>><br>
>>>> Regards,<br>
>>>> Kennard<br>
>>>><br>
>>>> Inactive hide details for Bogdan-Andrei Iancu ---10/05/2010 01:41:38<br>
>>>> AM---Hi Kennard, The core was generated by process 22255:Bogdan-Andrei<br>
>>>> Iancu ---10/05/2010 01:41:38 AM---Hi Kennard, The core was generated<br>
>>>> by process 22255:<br>
>>>><br>
>>>> From: Bogdan-Andrei Iancu <bogdan@voice-system.ro><br>
>>>> To: OpenSIPS users mailling list <users@lists.opensips.org><br>
>>>> Date: 10/05/2010 01:41 AM<br>
>>>> Subject: Re: [OpenSIPS-Users] opensips tm timer core dump<br>
>>>> Sent by: users-bounces@lists.opensips.org<br>
>>>><br>
>>>><br>
>>>> <br>
>> ------------------------------------------------------------------------<br>
>> <br>
>>>><br>
>>>> Hi Kennard,<br>
>>>><br>
>>>> The core was generated by process 22255:<br>
>>>> [22238]: INFO:core:handle_sigs: child process 22255 exited by a<br>
>>>> signal 11<br>
>>>><br>
>>>> and this process also reported mem problems:<br>
>>>> [22255]: ERROR:tm:new_t: out of mem<br>
>>>><br>
>>>> Can you print the "tl" or "ptr" variables in frame 0?<br>
>>>><br>
>>>> Regards,<br>
>>>> Bogdan<br>
>>>><br>
>>>> Kennard_White@logitech.com wrote:<br>
>>>> <br>
>>>>> Running against opensips HEAD, I got a segfault in the tm timer <br>
>>>>> <br>
>> code.<br>
>> <br>
>>>>> I believe this is triggered by running out of shared memory.<br>
>>>>><br>
>>>>><br>
>>>>> The stack trace:<br>
>>>>><br>
>>>>> (gdb) where<br>
>>>>> #0 0x00007fe8f8d96212 in insert_timer_unsafe (new_tl=0x7fe8f66337b0,<br>
>>>>> list_id=WT_TIMER_LIST, ext_timeout=<value optimized out>) at<br>
>>>>> <br>
>>> timer.c:731<br>
>>> <br>
>>>>> #1 set_1timer (new_tl=0x7fe8f66337b0, list_id=WT_TIMER_LIST,<br>
>>>>> ext_timeout=<value optimized out>) at timer.c:904<br>
>>>>> #2 0x00007fe8f8d78ac8 in t_release_transaction <br>
>>>>> <br>
>> (trans=0x7fe8f6633730)<br>
>> <br>
>>>>> at t_funcs.c:122<br>
>>>>> #3 0x00007fe8f8d808e5 in t_unref (p_msg=<value optimized out>)<br>
>>>>> at t_lookup.c:1152<br>
>>>>> #4 0x0000000000483ae5 in exec_post_req_cb ()<br>
>>>>> #5 0x000000000046c1e4 in receive_msg ()<br>
>>>>> #6 0x00000000004bc77c in udp_rcv_loop ()<br>
>>>>> #7 0x000000000042de9c in main ()<br>
>>>>><br>
>>>>> The offending code (I believe):<br>
>>>>> if (tl->time_out==ptr->time_out) {<br>
>>>>> tl->ld_tl = ptr->ld_tl<br>
>>>>> ptr->ld_tl = 0;<br>
>>>>> tl->ld_tl->ld_tl = tl; <-- SEG FAULT HERE (according to trace)<br>
>>>>> } else {<br>
>>>>> tl->ld_tl = tl;<br>
>>>>> }<br>
>>>>><br>
>>>>> Unfortunately, due to optimization I cannot dump anything <br>
>>>>> <br>
>> useful, and<br>
>> <br>
>>>>> I'm not convinced the actual fault is on the line indicated. <br>
>>>>> <br>
>> Note that<br>
>> <br>
>>>>> the core dump is not one of the processes that reported out of <br>
>>>>> <br>
>> memory.<br>
>> <br>
>>>>> Maybe one of the other processes left the timer list in a corrupt<br>
>>>>> <br>
>>> state?<br>
>>> <br>
>>>>> The log file:<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22255]:<br>
>>>>> ERROR:tm:sip_msg_cloner: no more share memory<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22255]:<br>
>>>>> ERROR:tm:new_t: out of mem<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22255]:<br>
>>>>> ERROR:tm:t_newtran: new_t failed<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22254]:<br>
>>>>> WARNING:core:fm_malloc: Not enough free memory, will atempt<br>
>>>>> <br>
>>>> defragmenation<br>
>>>> <br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22254]:<br>
>>>>> ERROR:tm:sip_msg_cloner: no more share memory<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22254]:<br>
>>>>> ERROR:tm:new_t: out of mem<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22254]:<br>
>>>>> ERROR:tm:t_newtran: new_t failed<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22238]:<br>
>>>>> INFO:core:handle_sigs: child process 22255 exited by a signal 11<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22238]:<br>
>>>>> INFO:core:handle_sigs: core was generated<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22238]:<br>
>>>>> INFO:core:handle_sigs: terminating due to SIGCHLD<br>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22256]:<br>
>>>>> INFO:core:sig_usr: signal 15 received<br>
>>>>><br>
>>>>><br>
>>>>> <br>
>>> ------------------------------------------------------------------------<br>
>>> <br>
>>>>> _______________________________________________<br>
>>>>> Users mailing list<br>
>>>>> Users@lists.opensips.org<br>
>>>>> </tt><tt><a href="http://lists.opensips.org/cgi-bin/mailman/listinfo/users">http://lists.opensips.org/cgi-bin/mailman/listinfo/users</a></tt><tt><br>
>>>>> <br>
>>>>> <br>
>>>> --<br>
>>>> Bogdan-Andrei Iancu<br>
>>>> OpenSIPS Bootcamp<br>
>>>> 15 - 19 November 2010, Edison, New Jersey, USA<br>
>>>> </tt><tt>www.voice-system.ro</tt><tt><br>
>>>><br>
>>>><br>
>>>> _______________________________________________<br>
>>>> Users mailing list<br>
>>>> Users@lists.opensips.org<br>
>>>> </tt><tt><a href="http://lists.opensips.org/cgi-bin/mailman/listinfo/users">http://lists.opensips.org/cgi-bin/mailman/listinfo/users</a></tt><tt><br>
>>>><br>
>>>><br>
>>>> <br>
>> ------------------------------------------------------------------------<br>
>> <br>
>>>> _______________________________________________<br>
>>>> Users mailing list<br>
>>>> Users@lists.opensips.org<br>
>>>> </tt><tt><a href="http://lists.opensips.org/cgi-bin/mailman/listinfo/users">http://lists.opensips.org/cgi-bin/mailman/listinfo/users</a></tt><tt><br>
>>>> <br>
>>>> <br>
>>> --<br>
>>> Bogdan-Andrei Iancu<br>
>>> OpenSIPS Bootcamp<br>
>>> 15 - 19 November 2010, Edison, New Jersey, USA<br>
>>> </tt><tt>www.voice-system.ro</tt><tt><br>
>>><br>
>>><br>
>>> _______________________________________________<br>
>>> Users mailing list<br>
>>> Users@lists.opensips.org<br>
>>> </tt><tt><a href="http://lists.opensips.org/cgi-bin/mailman/listinfo/users">http://lists.opensips.org/cgi-bin/mailman/listinfo/users</a></tt><tt><br>
>>><br>
>>> ------------------------------------------------------------------------<br>
>>><br>
>>> _______________________________________________<br>
>>> Users mailing list<br>
>>> Users@lists.opensips.org<br>
>>> </tt><tt><a href="http://lists.opensips.org/cgi-bin/mailman/listinfo/users">http://lists.opensips.org/cgi-bin/mailman/listinfo/users</a></tt><tt><br>
>>> <br>
>>> <br>
>> -- <br>
>> Bogdan-Andrei Iancu<br>
>> OpenSIPS Bootcamp<br>
>> 15 - 19 November 2010, Edison, New Jersey, USA<br>
>> </tt><tt>www.voice-system.ro</tt><tt><br>
>><br>
>><br>
>> _______________________________________________<br>
>> Users mailing list<br>
>> Users@lists.opensips.org<br>
>> </tt><tt><a href="http://lists.opensips.org/cgi-bin/mailman/listinfo/users">http://lists.opensips.org/cgi-bin/mailman/listinfo/users</a></tt><tt><br>
>><br>
>> ------------------------------------------------------------------------<br>
>><br>
>> _______________________________________________<br>
>> Users mailing list<br>
>> Users@lists.opensips.org<br>
>> </tt><tt><a href="http://lists.opensips.org/cgi-bin/mailman/listinfo/users">http://lists.opensips.org/cgi-bin/mailman/listinfo/users</a></tt><tt><br>
>> <br>
>> <br>
><br>
><br>
> <br>
<br>
<br>
-- <br>
Bogdan-Andrei Iancu<br>
OpenSIPS Bootcamp<br>
15 - 19 November 2010, Edison, New Jersey, USA<br>
</tt><tt>www.voice-system.ro</tt><tt><br>
<br>
<br>
_______________________________________________<br>
Users mailing list<br>
Users@lists.opensips.org<br>
</tt><tt><a href="http://lists.opensips.org/cgi-bin/mailman/listinfo/users">http://lists.opensips.org/cgi-bin/mailman/listinfo/users</a></tt><tt><br>
</tt><br>
</body></html>