[OpenSIPS-Users] A behavior of Clusterer module during networking issues

Donat Zenichev donat.zenichev at gmail.com
Sat Apr 11 16:42:09 EST 2020


Hello Liviu!
And first of all thank you for your detailed explanation.

Now I completely understand the approach that was taken when developing
this feature of the Clusterer.
It looks logical to me, and it leaves less room for human error.

What I did for now is a clustering super-structure (which works apart from
OpenSIPS), and here is what happens when both nodes see each other again
(i.e. when recovering from networking issues):

The shared IP remains on the Master side, and a one-shot systemd service
runs "clusterer_shtag_set_active" on the Master right away,
thus forcing the stand-by node back into the backup state.
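
For reference, here is roughly what that one-shot unit ends up executing.
This is only a minimal sketch, assuming OpenSIPS 3.x with the MI exposed
over JSON-RPC (httpd + mi_http modules) on 127.0.0.1:8888/mi and a sharing
tag named "vip" in cluster 1 - the tag name, port and the script itself
are just my choices, so adjust them to your own setup:

#!/usr/bin/env python3
# Re-assert the sharing tag on the Master right after the cluster link
# recovers; run by the one-shot systemd unit on the node that still
# holds the shared IP.
# Assumptions: MI over JSON-RPC (httpd + mi_http, OpenSIPS 3.x) on
# 127.0.0.1:8888/mi; sharing tag "vip" in cluster 1.
import json
import urllib.request

MI_URL = "http://127.0.0.1:8888/mi"
SHTAG = "vip/1"  # <tag_name>/<cluster_id>

def mi(method, params=None):
    """Send one MI command over the JSON-RPC interface, return the reply."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method}
    if params is not None:
        payload["params"] = params
    req = urllib.request.Request(
        MI_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode())

if __name__ == "__main__":
    # Forcing the tag ACTIVE here makes the other cluster members drop
    # their copy of the tag to BACKUP, which is exactly what we want
    # after a split-brain recovery.
    print(mi("clusterer_shtag_set_active", {"tag": SHTAG}))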

For now this scheme works perfectly.
I may come up with a more robust solution later; if that happens, I will
share my experience.
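
In the meantime, here is a rough sketch of where I would start for that
more robust variant - essentially the "notify the opposite node" idea
Liviu describes below, hooked into whatever manages the VIP (a keepalived
notify script, in my case) whenever the local VIP goes ACTIVE -> BACKUP.
Same assumptions as above, and the peer address is purely hypothetical:

#!/usr/bin/env python3
# Called by the VIP manager (e.g. a keepalived notify script) when the
# local VIP transitions ACTIVE -> BACKUP: tell the *other* node to take
# the sharing tag, so both tags never stay ACTIVE after a link flap.
# Assumptions: MI over JSON-RPC (httpd + mi_http, OpenSIPS 3.x) reachable
# on the peer at port 8888; sharing tag "vip" in cluster 1; the peer
# address below is hypothetical.
import json
import urllib.request

PEER_MI_URL = "http://10.0.0.2:8888/mi"  # the opposite node's MI endpoint
SHTAG = "vip/1"                          # <tag_name>/<cluster_id>

def activate_shtag_on_peer():
    # Setting the tag ACTIVE on the peer forces the local copy of the
    # tag into BACKUP, so only one side keeps it after the transition.
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "clusterer_shtag_set_active",
        "params": {"tag": SHTAG},
    }
    req = urllib.request.Request(
        PEER_MI_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode())

if __name__ == "__main__":
    print(activate_shtag_on_peer())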

Have a nice day!


On Sat, Apr 11, 2020 at 2:13 PM Liviu Chircu <liviu at opensips.org> wrote:

> On 09.04.2020 15:03, Donat Zenichev wrote:
> > I have a question, and it's almost theoretical.
> > It relates to the Clusterer module and its behavior.
> >
> > Will the Clusterer module solve this contradiction on its own?
> > And if so, to which side is precedence given?
> >
> > The other way around would be to manually re-activate all services
> > once the whole cluster resumes normal operation (all nodes are
> > present). That would guarantee that the shared tag only gets
> > activated on one of the sides.
>
> Hi, Donat!
>
> A very good question and one that we had to answer ourselves when we
> came up with the current design.  To begin with, in your scenario, for
> all OpenSIPS 2.4+ clusterer versions, after the link between the nodes
> comes back online, you will have the following:
>
> * node A: ACTIVE state (holds the VIP), sharing tag state: ACTIVE (1)
> * node B: BACKUP state, sharing tag state: ACTIVE (1)
>
> The main reason behind this inconsistent state is that we did not
> provide an MI command to force a sharing tag to BACKUP (0), which could
> have been triggered on node B's transition from ACTIVE -> BACKUP once
> the link is restored.  So recovering from this state will not happen
> automatically - you have to provide handling for this scenario as well
> (see the last paragraph).
>
> Reasoning behind this design
> ----------------------------
>
> Ultimately, our priority was not to get into solving consensus problems,
> Paxos algorithms, etc.  What we wanted was a robust active/backup
> solution which you could flip back and forth with ease, thus achieving
> both High-Availability and easy maintenance.  By not providing a
> "set_sharing_tag vip 0" command, we _avoid_ the situation where, due to
> a developer error, both tags end up being BACKUP (0)!!  In such a
> scenario there would be no more CDRs, and you would be able to run
> unlimited CPS/CC through that instance, since all call profile counters
> would be equal to 0.  Neither instance takes responsibility for any call
> running through it, so a lot of data would be lost.
>
> On the flip side, in a scenario where both tags are stuck in the ACTIVE
> (1) state, you would have duplicated CDRs (along with maybe some DB
> error logs due to conflicting unique keys) and possibly double-counted
> calls, reducing the maximum supported CC/CPS.  Assuming the platform
> wasn't even at 50% of its limits to begin with, the latter has zero
> impact on the live system.  Thinking about it, this didn't sound that
> bad to us at all: no data loss, at the expense of a few error logs and
> possibly some tightened call limits.
>
> So you can see that we went for a design which minimizes any errors that
> the developers can make, and protects the overall system.  The platform
> will work decently, regardless of network conditions or how the
> tag-managing MI commands are sent or abused.
>
> How to automatically recover from the ACTIVE/ACTIVE sharing tag state
> ---------------------------------------------------------------------
>
> Given that the "clusterer_shtag_set_active" [1] MI command issued to a
> node will force all other nodes to transition from ACTIVE -> BACKUP, you
> could enhance your system with a logic that sends this command to the
> opposite node any time a node's VIP performs the ACTIVE -> BACKUP
> transition.  This should fix the original problem, where both tags end
> up in the ACTIVE state due to the link between the nodes being temporarily
> down, without either OpenSIPS instance necessarily being down.
>
> PS: we haven't implemented the above ^ ourselves yet, but it should work
> in theory :) let me know if it works for you if you do decide to plug
> this rare issue for your setup!
>
> Best regards,
>
> [1]:
>
> https://opensips.org/docs/modules/3.1.x/clusterer#mi_clusterer_shtag_set_active
>
> --
> Liviu Chircu
> www.twitter.com/liviuchircu | www.opensips-solutions.com
>
> OpenSIPS Summit, Amsterdam, May 2020
>    www.opensips.org/events
>
>
> _______________________________________________
> Users mailing list
> Users at lists.opensips.org
> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>


-- 

Best regards,
Donat Zenichev