[OpenSIPS-Users] A behavior of Clusterer module during networking issues

Liviu Chircu liviu at opensips.org
Sat Apr 11 11:11:23 EST 2020


On 09.04.2020 15:03, Donat Zenichev wrote:
> I have a question and it's mostly theoretical.
> It relates to the Clusterer module and its behavior.
>
> Will the Clusterer module solve this contradiction on its own?
> And if so, which side is given precedence?
>
> The other way around would be to manually re-activate all services
> once the cluster resumes normal operation (all nodes are present).
> This would give us a guarantee that the shared tag is only active
> on one of the sides.

Hi, Donat!

A very good question and one that we had to answer ourselves when we 
came up with the current design.  To begin with, in your scenario, for 
all OpenSIPS 2.4+ clusterer versions, after the link between the nodes 
comes back online, you will have the following:

* node A: ACTIVE state (holds the VIP), sharing tag state: ACTIVE (1)
* node B: BACKUP state, sharing tag state: ACTIVE (1)

The main reason behind this inconsistent state is that we did not 
provide an MI command to force a sharing tag to BACKUP (0), which could 
otherwise be triggered on node B's transition from ACTIVE -> BACKUP once 
the link is restored.  Recovery from this state therefore does not happen 
automatically - you have to provide handling for this scenario yourself 
(see the last section).
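
For example, a quick way to confirm you are in this ACTIVE/ACTIVE state 
is to dump the sharing tags on both nodes.  Below is a minimal sketch 
(just an illustration, not something from the docs) which does that over 
the MI interface, assuming OpenSIPS 3.x with the httpd + mi_http modules 
loaded and the hypothetical addresses/ports shown; "clusterer_list_shtags" 
is the MI command that reports each tag's local state:

import json
import urllib.request

# MI endpoints of the two nodes -- hypothetical addresses, adjust to
# your setup (httpd listen address + mi_http root)
NODES = {
    "node A": "http://10.0.0.1:8888/mi",
    "node B": "http://10.0.0.2:8888/mi",
}

def mi_call(url, method, params=None):
    # OpenSIPS 3.x MI over HTTP speaks JSON-RPC 2.0
    payload = {"jsonrpc": "2.0", "id": 1, "method": method}
    if params is not None:
        payload["params"] = params
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

for name, url in NODES.items():
    # after the link recovers, both nodes will report the tag you use
    # for the VIP as active -- the inconsistency described above
    print(name, mi_call(url, "clusterer_list_shtags"))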

Reasoning behind this design
----------------------------

Ultimately, our priority was not to get into solving consensus problems, 
Paxos algorithms, etc.  What we wanted was a robust active/backup 
solution which you could flip back and forth with ease, thus achieving 
both High-Availability and easy maintenance.  By not providing a 
"set_sharing_tag vip 0" command, we _avoid_ the situation where, due to 
a developer error, both tags end up being BACKUP (0)!  In such a 
scenario, there would be no more CDRs and you would be able to push 
unlimited CPS/CC through that instance, since all call profile counters 
stay at 0.  Neither instance takes responsibility for any call running 
through it, so a lot of data would be lost.

On the flip side, in a scenario where both tags are stuck in the ACTIVE 
(1) state, you would get: duplicated CDRs (along with perhaps some DB 
error logs due to conflicting unique keys) and possibly double-counted 
calls, leading to a reduction of the maximum supported CC/CPS.  Assuming 
the platform wasn't even at 50% of its limits to begin with, the latter 
has zero impact on the live system.  Thinking it through, this didn't 
sound that bad at all to us: no data loss, at the expense of a few error 
logs and possibly some tightened call limits.

So you can see that we went for a design which minimizes the errors 
developers can make and protects the overall system.  The platform will 
keep working decently, regardless of network conditions or how the 
tag-managing MI commands are sent or abused.

How to automatically recover from the ACTIVE/ACTIVE sharing tag state
---------------------------------------------------------------------

Given that the "clusterer_shtag_set_active" [1] MI command issued to a 
node forces all other nodes to transition that tag from ACTIVE -> BACKUP, 
you could enhance your system with logic that sends this command to the 
opposite node any time a node's VIP performs the ACTIVE -> BACKUP 
transition.  This should fix the original problem, where both tags end 
up in the ACTIVE state because the link between the nodes was temporarily 
down, without either of the OpenSIPS instances themselves being down.
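
To illustrate the idea (again, just a sketch, not something we ship), 
here is how such a hook could look, assuming OpenSIPS 3.x with the MI 
interface exposed over HTTP (httpd + mi_http) on the peer node, a sharing 
tag named "vip" in cluster 1, and that your VIP manager (e.g. keepalived's 
notify_backup script) runs it on the node that has just gone BACKUP; the 
address and tag name are made up:

import json
import sys
import urllib.request

PEER_MI_URL = "http://10.0.0.2:8888/mi"   # MI endpoint of the *other* node
SHARING_TAG = "vip/1"                     # "tag_name/cluster_id" format

def mi_call(url, method, params):
    # OpenSIPS 3.x MI over HTTP speaks JSON-RPC 2.0; the tag is passed
    # as a positional parameter here -- double-check the exact parameter
    # layout against the docs for your version [1]
    payload = {"jsonrpc": "2.0", "id": 1, "method": method,
               "params": params}
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # forcing the tag ACTIVE on the peer makes it broadcast the change,
    # so the local copy of the tag falls back to BACKUP (0)
    try:
        print(mi_call(PEER_MI_URL, "clusterer_shtag_set_active",
                      [SHARING_TAG]))
    except OSError as err:
        # peer unreachable (e.g. the link is still down) -- nothing to do
        print("could not reach peer MI:", err, file=sys.stderr)

The same MI call can also be issued by hand (e.g. via opensips-cli) if 
you want to test the behavior before wiring it into the VIP failover hook.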

PS: we haven't implemented the above ^ ourselves yet, but it should work 
in theory :)  Let me know if it works for you, should you decide to plug 
this rare issue in your setup!

Best regards,

[1]: 
https://opensips.org/docs/modules/3.1.x/clusterer#mi_clusterer_shtag_set_active

-- 
Liviu Chircu
www.twitter.com/liviuchircu | www.opensips-solutions.com

OpenSIPS Summit, Amsterdam, May 2020
   www.opensips.org/events
