[OpenSIPS-Users] opensips HA resource script (for Heartbeat)
Bogdan-Andrei Iancu
bogdan at voice-system.ro
Tue Dec 28 15:45:07 CET 2010
Hi Iñaki,
Iñaki Baz Castillo wrote:
> 2010/12/28 Alexandr A. Alexandrov <shurrman at gmail.com>:
>
>> Hi, All.
>>
>> This is an issue of writing a correct script, nothing more. :-)
>> There are several possibilities, starting from a simple process lookup (like
>> pgrep -f opensips) and ending with using MI from such a script.
>>
>
> No, this is a bug in opensips itself since, when running daemonized,
> the process returns 0 even if the daemonized (main) process fails to
> start (due to any module configuration error).
>
> Any exotic check you add after executing the binary is just a
> workaround. Any service/daemon MUST return an accurate exit status
> code, so other applications (e.g. HA) can rely on such a value.
>
>
You may call it a design bug - the current return code reflects only the
pre-daemonize init, without including the child init, for example.
To be honest, so far I have successfully used the pid file info to check
whether my opensips started properly or not - but maybe this kind of test
is not suitable in all cases.
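
A minimal sketch of such a pid file check, written in Python only for
illustration. The function names and the pid file path in the usage note
are assumptions, not something shipped with opensips. It reads the PID
from the file and probes the process with signal 0, which delivers
nothing and only reports whether the process exists.

```python
import os


def pid_from_file(path):
    """Return the PID stored in the pid file, or None if it is missing or invalid."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pid):
    """Return True if a process with this PID currently exists."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def started_ok(pid_file):
    """Hypothetical helper: True if the pid file names a live process."""
    return is_running(pid_from_file(pid_file))
```

An HA script could call started_ok("/var/run/opensips.pid") (path assumed)
after launching the binary. This only says that the process exists at
that instant, so a main process that is still waiting on a slow module
init would pass the check and could still die later.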
>
>>> This makes OpenSIPS not valid for full HA environment, so be careful.
>>>
>> I will make my opensips valid
>>
>
> Can I ask how? Imagine your "dbaliases" module accesses a different
> database, and such database server is "protected" with iptables
> dropping any incoming TCP connection.
>
> You run opensips and the module "dbaliases" tries to establish the
> connection with the DB server. It could take a LONG time until it raises
> a timeout error (maybe minutes). After such time the main process
> dies, but before such moment the main process was still running. If
> your "valid" init/LSB/OCF script checks the process status 5 seconds
> after calling the binary, it would return SUCCESS status (while in
> fact, opensips will die soon). No perfect workaround here. The daemon
> itself MUST return a real and accurate code.
>
>
>
I'm not 100% convinced that this change will totally fix the problem -
even if we make the initial process report a correct and relevant
return code, what will happen if this return code comes only after
minutes, following some DNS/DB queries done by module init functions?
Is it still useful to have the return code after 2 minutes?
> NOTE: A way to improve it (in OpenSIPS code):
>
> When invoking "opensips", the parent process opens a PIPE for reading,
> and the daemonized process opens it for writing. The parent process
> waits until the daemonized process writes into the PIPE (it writes its
> status, which the parent process then returns as its exit code). This
> is already implemented in Kamailio/SIP-router.
>
As far as I understand this will partially fix the problem, by
addressing the errors reported by module init functions. The errors
generated by the child init functions will not be caught by the parent
process, so more or less we are back to square one :) ... there is a need
for something more extensive, to get reporting from all opensips
processes (daemonize, worker processes, timer, aux procs)...
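
The pipe idea quoted above can be sketched as follows. This is an
illustration in Python under stated assumptions, not the opensips or
Kamailio implementation: start_with_status, init and the timeout value
are hypothetical names, and the child here stands in for the daemonized
main process only. The parent waits on the pipe for one status byte and
treats a timeout, or a pipe closed without a status, as a failure.

```python
import os
import select


def start_with_status(init, timeout=10.0):
    """Fork a child that runs init() and reports one status byte via a pipe.

    Returns 0 if the child reported success, 1 on reported failure,
    on timeout, or if the child went away without writing anything.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stands in for the daemonized main process.
        status = 1  # assume failure until init() says otherwise
        try:
            os.close(read_fd)
            if init():
                status = 0
            os.write(write_fd, bytes([status]))
        except BaseException:
            status = 1
        finally:
            # A real daemon would continue into its main loop here;
            # the sketch simply exits and never returns to the caller.
            os._exit(status)

    # Parent: wait for the status byte, but not forever.
    os.close(write_fd)
    try:
        ready, _, _ = select.select([read_fd], [], [], timeout)
        data = os.read(read_fd, 1) if ready else b""
    finally:
        os.close(read_fd)
    return data[0] if data else 1
```

A real parent would pass the returned value straight to sys.exit(), so
the init/LSB/OCF script sees it as the exit code. The timeout makes the
trade-off explicit: with a short timeout a slow DB or DNS lookup is
reported as a failure, with a long one the caller blocks for that long.
Errors raised later, in child init of worker, timer or aux processes,
are not covered by this single pipe.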
Regards,
Bogdan
--
Bogdan-Andrei Iancu
OpenSIPS Event - expo, conf, social, bootcamp
2 - 4 February 2011, ITExpo, Miami, USA
www.voice-system.ro