[OpenSIPS-Devel] [ opensips-Bugs-2860109 ] segfault caused by usrloc:db_timer_udomain when db goes away

SourceForge.net noreply at sourceforge.net
Fri Sep 25 09:54:54 CEST 2009


Bugs item #2860109, was opened at 2009-09-16 19:28
Message generated for change (Comment added) made by bogdan_iancu
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=2860109&group_id=232389

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: modules
Group: None
Status: Open
Resolution: Fixed
Priority: 7
Private: No
Submitted By: Phil D'Amore (ptdamore)
Assigned to: Bogdan-Andrei Iancu (bogdan_iancu)
Summary: segfault caused by usrloc:db_timer_udomain when db goes away

Initial Comment:
db_timer_udomain only initializes the keys for the delete query if there isn't an existing prepared statement  (my_ps == NULL).  This works as long as the connection doesn't go away for a long period of time.  If the db goes away for 2*timer_interval, the stored statement context is no longer valid, and the underlying db_delete function needs to re-create it from the original query.  However, since the keys aren't populated, you get a segfault when there is an attempt to read them.

The patch attached removes the conditional setup of the keys in db_timer_udomain.

This was with the mysql driver, but it seems like good practice for any driver.

----------------------------------------------------------------------

>Comment By: Bogdan-Andrei Iancu (bogdan_iancu)
Date: 2009-09-25 10:54

Message:
That is true - even if the context is not destroyed (to keep the re-init
info), the stmt pointer must be set to null to allow re-init in the
future....

I made an additional fix, so if you could test it again, it will be
helpful.

Thanks and regards,
Bogdan 

----------------------------------------------------------------------

Comment By: Phil D'Amore (ptdamore)
Date: 2009-09-24 19:11

Message:
This prevents the crash, but it seems to create a new issue.  Now the
statement doesn't get re-init'd.  The syslog looks like this:

Sep 24 11:22:04 sipprx-01 /usr/sbin/opensips[8680]:
INFO:db_mysql:db_mysql_do_pr
epared_query: disconect event for 0x8089e0 
Sep 24 11:22:04 sipprx-01 /usr/sbin/opensips[8680]:
INFO:db_mysql:reset_all_stat
ements: reseting all statements on connection: (0x8088e0) 0x8089e0 
Sep 24 11:22:04 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:db_mysql:db_mysql_conn
ect: driver error(2003): Can't connect to MySQL server on '10.4.20.3'
(111) 
Sep 24 11:22:04 sipprx-01 last message repeated 2 times
Sep 24 11:22:04 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:usrloc:db_timer_udomai
n: failed to delete from table location 
Sep 24 11:22:04 sipprx-01 /usr/sbin/opensips[8680]: ERROR:usrloc:timer:
synchron
izing cache failed 
Sep 24 11:22:34 sipprx-01 /usr/sbin/opensips[8680]:
INFO:db_mysql:re_init_statem
ent:  query  is <delete from location where expires<? AND expires!=?>,
ptr=(nil)
 
Sep 24 11:22:34 sipprx-01 /usr/sbin/opensips[8680]:
INFO:db_mysql:re_init_statem
ent: disconect event for 0x8089e0 
Sep 24 11:22:34 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:db_mysql:db_mysql_conn
ect: driver error(2003): Can't connect to MySQL server on '10.4.20.3'
(111) 
Sep 24 11:22:34 sipprx-01 last message repeated 2 times
Sep 24 11:22:34 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:db_mysql:re_init_state
ment: failed while mysql_stmt_prepare() 
Sep 24 11:22:34 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:db_mysql:db_mysql_do_p
repared_query: failed to re-init statement! 
Sep 24 11:22:34 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:usrloc:db_timer_udomai
n: failed to delete from table location 
Sep 24 11:22:34 sipprx-01 /usr/sbin/opensips[8680]: ERROR:usrloc:timer:
synchron
izing cache failed 


<here's where it would have crashed before, but now it keeps going>

Sep 24 11:23:04 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:db_mysql:db_mysql_do_p
repared_query: mysql_stmt_bind_param() failed: Statement not prepared 
Sep 24 11:23:04 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:db_mysql:db_mysql_do_p
repared_query: param 0 was found as type 12 
Sep 24 11:23:04 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:db_mysql:db_mysql_do_p
repared_query: param 1 was found as type 12 
Sep 24 11:23:04 sipprx-01 /usr/sbin/opensips[8680]:
ERROR:usrloc:db_timer_udomai
n: failed to delete from table location 
Sep 24 11:23:04 sipprx-01 /usr/sbin/opensips[8680]: ERROR:usrloc:timer:
synchron
izing cache failed 


It looks like since ctx->stmt is now no longer NULL, the statement is not
re-init'd in db_mysql_do_prepared_query:

if ( ctx->stmt==NULL && re_init_statement(conn, pq_ptr, ctx)!=0 )

re_init_statement doesn't happen any more, so binding the parameters
fails.  This becomes an unrecoverable condition.  Opensips is still
running, but any statements that are in this state are perpetually useless
now, and will always cause an error.

----------------------------------------------------------------------

Comment By: Bogdan-Andrei Iancu (bogdan_iancu)
Date: 2009-09-22 18:30

Message:
Thanks for the debugging - I see what is happening, and indeed there is a
bug. In re_init_statement(), if the prepared failed, the context must not
be destroyed - maybe next time you can connect to the server. The context
is to be destroyed only if the statement cannot be initialized. 

So, the case when the looping is terminated (all 2 cycles) and no success,
the error must not be triggered, but simply failure - there is a return
missing after the for{}.

Please update from SVN and let me know if work ok (no crash and reconnect
also) - see revision 6177.

Regards,
bogdan

----------------------------------------------------------------------

Comment By: Phil D'Amore (ptdamore)
Date: 2009-09-17 22:24

Message:
As promised:

(gdb) frame
#0  0x00000000004d4578 in db_print_where (_c=0x808900, _b=0x76357b "", 
    _l=65509, _k=0x7fff06e92610, _o=0x7fff06e92600, _v=0x7fff06e925c0,
_n=2, 
    val2str=0x2b5aa3c2c270 <db_mysql_val2str>) at db/db_ut.c:284
284				ret = snprintf(_b + len, _l - len, "%.*s%s", 
(gdb) bt
#0  0x00000000004d4578 in db_print_where (_c=0x808900, _b=0x76357b "", 
    _l=65509, _k=0x7fff06e92610, _o=0x7fff06e92600, _v=0x7fff06e925c0,
_n=2, 
    val2str=0x2b5aa3c2c270 <db_mysql_val2str>) at db/db_ut.c:284
#1  0x00000000004d2135 in db_do_delete (_h=0x808900, _k=0x7fff06e92610, 
    _o=0x7fff06e92600, _v=0x7fff06e925c0, _n=2, 
    val2str=0x2b5aa3c2c270 <db_mysql_val2str>, 
    submit_query=0x2b5aa3c24830 <db_mysql_submit_dummy_query>)
    at db/db_query.c:213
#2  0x00002b5aa3c2a119 in db_mysql_delete (_h=0x808900, _k=0x7fff06e92610,

    _o=0x7fff06e92600, _v=0x7fff06e925c0, _n=2) at dbase.c:883
#3  0x00002b5aa46c300f in db_timer_udomain (_d=0x2b5aa82a77d8) at
udomain.c:620
#4  0x00002b5aa46b9ac8 in synchronize_all_udomains () at dlist.c:557
#5  0x00002b5aa46c9f0e in timer (ticks=8423680, param=0x76357b) at
ul_mod.c:384
#6  0x0000000000486f13 in start_timer_processes () at timer.c:266
#7  0x000000000042987a in main (argc=8, argv=0x7fff06e92868) at main.c:819

The full code for the relevant line is:

284				ret = snprintf(_b + len, _l - len, "%.*s%s", 
285					_k[i]->len, _k[i]->s, _o[i]);

And:

(gdb) print i
$3 = 0
(gdb) print _k[i]
$4 = (const db_key_t) 0x0

It's also worth noting that _o is really not initialized either, but since
it's not being dereferenced, it wouldn't be involved here.

Also for your viewing pleasure, here is the relevant stuff generated from
this in the syslog, starting from the point where I took my db server down.
 My usrloc sync timer interval is 30 seconds...

Sep 17 15:13:18 sipprx-01 /usr/sbin/opensips[20149]:
INFO:db_mysql:db_mysql_do_p
repared_query: disconect event for 0x808a00 
Sep 17 15:13:18 sipprx-01 /usr/sbin/opensips[20149]:
INFO:db_mysql:reset_all_sta
tements: reseting all statements on connection: (0x808900) 0x808a00 
Sep 17 15:13:18 sipprx-01 /usr/sbin/opensips[20149]:
ERROR:db_mysql:db_mysql_con
nect: driver error(2003): Can't connect to MySQL server on '10.2.0.68'
(111) 
Sep 17 15:13:18 sipprx-01 last message repeated 2 times
Sep 17 15:13:18 sipprx-01 /usr/sbin/opensips[20149]:
ERROR:usrloc:db_timer_udoma
in: failed to delete from table location 
Sep 17 15:13:18 sipprx-01 /usr/sbin/opensips[20149]: ERROR:usrloc:timer:
synchro
nizing cache failed 
Sep 17 15:13:48 sipprx-01 /usr/sbin/opensips[20149]:
INFO:db_mysql:re_init_state
ment:  query  is <delete from location where expires<? AND expires!=?>,
ptr=(nil
) 
Sep 17 15:13:48 sipprx-01 /usr/sbin/opensips[20149]:
INFO:db_mysql:re_init_state
ment: disconect event for 0x808a00 
Sep 17 15:13:48 sipprx-01 /usr/sbin/opensips[20149]:
ERROR:db_mysql:db_mysql_con
nect: driver error(2003): Can't connect to MySQL server on '10.2.0.68'
(111) 
Sep 17 15:13:48 sipprx-01 last message repeated 2 times
Sep 17 15:13:48 sipprx-01 /usr/sbin/opensips[20149]:
ERROR:db_mysql:re_init_stat
ement: failed while mysql_stmt_prepare() 
Sep 17 15:13:48 sipprx-01 /usr/sbin/opensips[20149]:
ERROR:db_mysql:db_mysql_do_
prepared_query: failed to re-init statement! 
Sep 17 15:13:48 sipprx-01 /usr/sbin/opensips[20149]:
ERROR:usrloc:db_timer_udoma
in: failed to delete from table location 
Sep 17 15:13:48 sipprx-01 /usr/sbin/opensips[20149]: ERROR:usrloc:timer:
synchro
nizing cache failed 
Sep 17 15:14:18 sipprx-01 kernel: opensips[20149]: segfault at
0000000000000008 
rip 00000000004d4578 rsp 00007fff06e924a0 error 4
Sep 17 15:14:18 sipprx-01 media-dispatcher[6993]:
[OpenSIPSControlProtocol,1194,
] Connection to OpenSIPS lost: Connection was closed cleanly.
Sep 17 15:14:18 sipprx-01 /usr/sbin/opensips[20123]:
INFO:core:handle_sigs: chil
d process 20149 exited by a signal 11 
Sep 17 15:14:18 sipprx-01 /usr/sbin/opensips[20123]:
INFO:core:handle_sigs: core
 was generated 
Sep 17 15:14:18 sipprx-01 /usr/sbin/opensips[20123]:
INFO:core:handle_sigs: term
inating due to SIGCHLD 

Thanks,
Phil

----------------------------------------------------------------------

Comment By: Phil D'Amore (ptdamore)
Date: 2009-09-17 22:01

Message:
Hi Bogdan...

I was getting this error in 1.5.3.  Sorry, I should have been explicit
with this before, but I was in a bit too much of a hurry.  1.5.3 has
dbase.c from r5886 which seems to include all changes on the 1.5 branch to
date.

I see some of the changes you might have been referring to and I believe
this is a different problem.  I see that the prep_stmt struct from db_mysql
does in fact cache the statement, which also puzzled me as I looked at this
problem.  Now that I've thought about it, my fix is probably not optimal,
and db_mysql could be doing more to use that cached statement.  The problem
I saw came up in the db_mysql_delete function, although all the db_mysql_OP
functions seem to be structured the same way.

Here's how I'm seeing it:

When the connection goes away, the re_init_statement function removes the
offending statement from the prep_stmt->stmts linked list.  Since, at least
in the case of the usrloc timer task, there is only one statement, the
result is that prep_stmt->stmts is now NULL.  As a result, the next call to
db_mysql_delete calls has_stmt_ctx, which now returns a false value.  This
triggers a call to db_do_delete, which expects the keys to be populated.  I
believe that since the keys are not populated (after all, as far as the
timer function is concerned my_ps is not NULL, so it didn't bother to
populate the keys), db_do_delete winds up exploding.

All of my test servers are now running my patch, but I'll back one out and
reproduce this and get a stack trace you can look at.

Thanks,
Phil

----------------------------------------------------------------------

Comment By: Bogdan-Andrei Iancu (bogdan_iancu)
Date: 2009-09-16 20:52

Message:
That is rather strange as the prepare statements (in mysql module) do catch
the original string query that created the statement and re-use it each
time the statement needs to be re-init. So, the keys are not actually
needed any more (after the original init).

Indeed, there was a problem , but not related to keys, but to the table -
actually this bug was fixed on SVN 1.5 branch last week, so if you update
it should work.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=2860109&group_id=232389



More information about the Devel mailing list