[OpenSIPS-Users] opensips HA resource script (for Heartbeat)

Bogdan-Andrei Iancu bogdan at voice-system.ro
Tue Dec 28 15:45:07 CET 2010


Hi Iñaki,

Iñaki Baz Castillo wrote:
> 2010/12/28 Alexandr A. Alexandrov <shurrman at gmail.com>:
>   
>> Hi, All.
>>
>> This is an issue of writing a correct script, nothing more. :-)
>> There are several possibilities, starting from a simple process lookup (like
>> pgrep -f opensips) and ending with using MI from such a script.
>>     
>
> No, this is a bug in opensips itself since, when running daemonized,
> the process returns 0 even if the daemonized (main) process fails to
> start (due to any module configuration error).
>
> Any exotic check you add after executing the binary is just a
> workaround. Any service/daemon MUST return an accurate exit status
> code, so other applications (e.g. HA) can rely on such a value.
>
>   
You may call it a design bug - the current return code reflects only the
pre-daemonize init and does not include, for example, the child init.

To be honest, so far I have successfully used the pid file info to check
whether my opensips started properly or not - but maybe this kind of test
is not suitable in all cases.
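
For illustration, a minimal sketch of such a pid file check, written in
Python. The function name pid_file_alive and the example path are
assumptions made for this sketch, not something shipped with opensips.

```python
import os

def pid_file_alive(pid_file):
    """Return True if pid_file names a process that currently exists."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False          # no pid file, unreadable, or not a number
    if pid <= 0:
        return False          # never signal a process group by accident
    try:
        os.kill(pid, 0)       # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False          # stale pid file, the process is gone
    except PermissionError:
        return True           # process exists but belongs to another user
    return True

# Hypothetical usage; the path depends on how opensips was started:
# pid_file_alive("/var/run/opensips.pid")
```

Such a check only says that the process exists at the moment it runs, so
it cannot tell whether opensips will still fail later during init.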

>   
>>> This makes OpenSIPS not valid for full HA environment, so be careful.
>>>       
>> I will make my opensips valid
>>     
>
> Can I ask how? Imagine your "dbaliases" module accesses a different
> database, and such database server is "protected" with iptables
> dropping any incoming TCP connection.
>
> You run opensips and the module "dbaliases" tries to establish the
> connection with the DB server. It could take a LONG time until it raises
> a timeout error (maybe minutes). After such time the main process
> dies, but before such moment the main process was still running. If
> your "valid" init/LSB/OCF script checks the process status 5 seconds
> after calling the binary, it would return SUCCESS status (while in
> fact, opensips will die soon). No perfect workaround here. The daemon
> itself MUST return a real and accurate code.
>
>
>   
I'm not 100% convinced that this change will totally fix the problem -
even if we make the initial process report a correct and relevant
return code, what will happen if this return code comes after minutes,
following some DNS/DB queries done by module init functions? Is it
still useful to have the return code after 2 minutes?

> NOTE: A way to improve it (in OpenSIPS code):
>
> When invoking "opensips", the parent process opens a PIPE for reading,
> and the daemonized process opens it for writing. The parent process
> waits until the daemonized process writes into the PIPE (it writes its
> status, which the parent process then returns as its exit code). This
> is already implemented in Kamailio/SIP-router.
>   

As far as I understand, this will only partially fix the problem, by
addressing the errors reported by module init functions. The errors
generated by the child init functions will not be caught by the parent
process, so more or less we are back to square one :) ... There is a need
for something more extensive, to get reporting from all opensips
processes (daemonize, worker processes, timer, aux procs)...
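
To make the pipe idea concrete, here is a minimal sketch in Python (the
real thing would be C inside the daemon). It is not the Kamailio or
OpenSIPS code; start_with_status_pipe and init_fn are hypothetical names,
with init_fn standing in for the module init step.

```python
import os

def start_with_status_pipe(init_fn):
    """Fork a child, run init_fn in it, return the status the child reports."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stands in for the daemonized process.
        os.close(read_fd)
        status = 1
        try:
            status = int(init_fn()) & 0xFF       # 0 means init succeeded
            os.write(write_fd, bytes([status]))  # report status to the parent
        finally:
            # A real daemon would keep running after a successful report;
            # this sketch simply exits.
            os._exit(status)
    # Parent: block until the child reports or closes its end of the pipe.
    os.close(write_fd)
    data = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)   # reap the short-lived child of this sketch only
    if not data:
        return 1         # child died before writing anything
    return data[0]
```

The parent returns whatever was written, or 1 if the pipe closed with
nothing in it. Anything that fails after the write (child init, for
example) is invisible to the parent, and the read blocks for as long as
init_fn takes.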

Regards,
Bogdan


-- 
Bogdan-Andrei Iancu
OpenSIPS Event - expo, conf, social, bootcamp
2 - 4 February 2011, ITExpo, Miami,  USA
www.voice-system.ro