[OpenSIPS-Users] dialog-dlg-repl capability State=not synced in a 2-node cluster, but dialog-prof-repl State=Ok

Hari Radhakrishnan harirkrishnan at gmail.com
Mon Jun 3 16:09:01 EDT 2019


Hi Opensips Team,

We are facing an issue with a 2-node opensips cluster in opensips-2.4
running on Ubuntu.
We would be really grateful if you could help with this. We have spent a
few days putting
DBG statements in the code, but didn't get anywhere significant.

The setup uses db_text module, with two separate files (having same data),
in each node.
The opensips instances are running on AWS EC2 in the same subnet (security
groups are
enabled).

With this setup (please see clusterer_list_cap output below), I get this.
Ideally, we only
require *dialog_replication_cluster* enabled. But if
*profile_replication_cluster* is not enabled,
the dialogs do not get replicated.
    Cluster:: 1

        Capability:: dialog-dlg-repl State=not synced

        Capability:: dialog-prof-repl State=Ok

This is an Active-Passive setup, with keepalived enabled to switch-over the
Virtual IP.

When I enable INFO logs, I see these messages on startup -

*2019-06-04T05:14:53.909986+10:00 edge-opensips
/usr/local/sbin/opensips[6519]: ERROR:dialog:dlg_replicate_profiles: Failed
to replicate dialog profile*

*2019-06-04T05:14:54.011064+10:00 edge-opensips
/usr/local/sbin/opensips[6519]: ERROR:dialog:dlg_replicate_profiles: All
destinations in cluster: 1 are down or probing*

*2019-06-04T05:14:54.011096+10:00 edge-opensips
/usr/local/sbin/opensips[6519]: ERROR:dialog:dlg_replicate_profiles: Failed
to replicate dialog profile*

*2019-06-04T05:14:54.013319+10:00 edge-opensips
/usr/local/sbin/opensips[6414]: INFO:core:probe_max_sock_buff: using snd
buffer of 416 kb*

*2019-06-04T05:14:54.013336+10:00 edge-opensips
/usr/local/sbin/opensips[6414]: INFO:core:init_sock_keepalive: TCP
keepalive enabled on socket 8*

*2019-06-04T05:14:54.014216+10:00 edge-opensips
/usr/local/sbin/opensips[6461]: INFO:clusterer:handle_internal_msg: Node
[2] is UP*


When I enable DBG logs, these messages appear.
We saw that *dialog-dlg-repl*  capability remains in *CAP_SYNC_PENDING
state. *


*2019-06-04T05:39:52.583634+10:00 edge-opensips
/usr/local/sbin/opensips[8867]: DBG:clusterer:cl_register_cap: Registered
capability: dialog-prof-repl*

*2019-06-04T05:39:52.583651+10:00 edge-opensips
/usr/local/sbin/opensips[8867]: DBG:clusterer:cl_register_cap: Registered
capability: dialog-dlg-repl*

*2019-06-04T05:39:52.583659+10:00 edge-opensips
/usr/local/sbin/opensips[8867]: DBG:clusterer:cl_request_sync: requesting
dialog-dlg-repl sync in cluster 1*

*2019-06-04T05:39:52.583668+10:00 edge-opensips
/usr/local/sbin/opensips[8867]: DBG:clusterer:cl_request_sync: failed to
find sync source*

The following logs keep running all the time (right after startup, without
any calls).
It appears that it is retrying to send some message but is stuck in a big
loop
due to a state which does not get changed. Similar logs are seen on the
other
node as well.

*2019-06-04T05:39:53.599460+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: INFO:clusterer:handle_internal_msg: Node
[2] is UP*

*2019-06-04T05:39:53.599467+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:clusterer:match_node: matched node 2*

*2019-06-04T05:39:53.599483+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:tcp_conn_get: con found in state 0*

*2019-06-04T05:39:53.599485+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:tcp_conn_get: tcp connection found
(0x7feee3585a98), acquiring fd*

*2019-06-04T05:39:53.599492+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:tcp_conn_get: c= 0x7feee3585a98,
n=16, Usock=303*

*2019-06-04T05:39:53.599545+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:tcp_conn_get: after receive_fd: c=
0x7feee3585a98 n=8 fd=304*

*2019-06-04T05:39:53.599552+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:proto_bin_send: sending via
fd 304...*

*2019-06-04T05:39:53.599560+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:async_tsend_stream: Async
successful write from first try on 0x7feee3585a98*

*2019-06-04T05:39:53.599562+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:proto_bin_send: after write:
c= 0x7feee3585a98 n/len=48/48 fd=304*

*2019-06-04T05:39:53.599565+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:clusterer:msg_send_retry: sent bin
packet to node [2]*

*2019-06-04T05:39:53.599574+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_read_req: tcp_read_req
end*

*2019-06-04T05:39:53.599689+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_read_req: Using the
global ( per process ) buff*

*2019-06-04T05:39:53.599698+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_parse_headers: received a
COMPLETE message*

*2019-06-04T05:39:53.599700+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_handle_req: completely
received a message*

*2019-06-04T05:39:53.599702+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_handle_req: Nothing more
to read on TCP conn 0x7feee35857e0, currently in state 0*

*2019-06-04T05:39:53.599706+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:call_callbacks: binary Packet CMD:
8. Capability: clusterer-internal*

*2019-06-04T05:39:53.599709+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:clusterer:bin_rcv_cl_packets: received
clusterer message from: 10.100.0.142:41737 <http://10.100.0.142:41737> with
source id: 2 and cluster id: 1*

*2019-06-04T05:39:53.599711+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:clusterer:handle_internal_msg: Received
capability update with source [2]*

*2019-06-04T05:39:53.599713+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:bin_pop_str: Popped:
'dialog-dlg-repl' [15]*

*2019-06-04T05:39:53.599715+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:bin_pop_str: Popped:
'dialog-prof-repl' [16]*

*2019-06-04T05:39:53.599718+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_read_req: tcp_read_req
end*

*2019-06-04T05:39:53.624271+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_read_req: Using the
global ( per process ) buff*

*2019-06-04T05:39:53.624280+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_parse_headers: received a
COMPLETE message*

*2019-06-04T05:39:53.624282+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_handle_req: completely
received a message*

*2019-06-04T05:39:53.624284+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_handle_req: Nothing more
to read on TCP conn 0x7feee35857e0, currently in state 0*

*2019-06-04T05:39:53.624288+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:call_callbacks: binary Packet CMD:
4. Capability: dialog-prof-repl*

*2019-06-04T05:39:53.624291+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:clusterer:bin_rcv_mod_packets: received
bin packet from: 10.100.0.142:41737 <http://10.100.0.142:41737> with source
id: 2 and cluster id: 1*

*2019-06-04T05:39:53.624295+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:core:bin_pop_str: Popped: 'inbound' [7]*

*2019-06-04T05:39:53.624298+10:00 edge-opensips
/usr/local/sbin/opensips[8919]: DBG:proto_bin:bin_read_req: tcp_read_req
end*

*2019-06-04T05:39:53.699828+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:clusterer:match_node: matched node 2*

*2019-06-04T05:39:53.699852+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:core:tcp_conn_get: con found in state 0*

*2019-06-04T05:39:53.699855+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:core:tcp_conn_get: tcp connection found
(0x7feee3585a98), acquiring fd*

*2019-06-04T05:39:53.699863+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:core:tcp_conn_get: c= 0x7feee3585a98,
n=16, Usock=429*

*2019-06-04T05:39:53.599515+10:00 edge-opensips
/usr/local/sbin/opensips[8984]: message repeated 2 times: [
DBG:core:handle_worker: read response= 7feee3585a98, 1, fd -1 from 52
(8919)]*

*2019-06-04T05:39:53.699885+10:00 edge-opensips
/usr/local/sbin/opensips[8984]: DBG:core:handle_worker: read response=
7feee3585a98, 1, fd -1 from 116 (8983)*

*2019-06-04T05:39:53.699899+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:core:tcp_conn_get: after receive_fd: c=
0x7feee3585a98 n=8 fd=8*

*2019-06-04T05:39:53.699910+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:proto_bin:proto_bin_send: sending via
fd 8...*

*2019-06-04T05:39:53.699952+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:proto_bin:async_tsend_stream: Async
successful write from first try on 0x7feee3585a98*

*2019-06-04T05:39:53.699980+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:proto_bin:proto_bin_send: after write:
c= 0x7feee3585a98 n/len=61/61 fd=8*

*2019-06-04T05:39:53.699994+10:00 edge-opensips
/usr/local/sbin/opensips[8983]: DBG:clusterer:msg_send_retry: sent bin
packet to node [2]*



===================== SETUP ============================

# salt -G hosttype:edge-opensips cmd.run 'opensipsctl fifo clusterer_list'

edge-opensips.10-100-0-142:

    Cluster:: 1

        Node:: 1 DB_ID=-1 URL=bin:10.100.0.141:5555 Enabled=1 Link_state=Up
    Next_hop=1 Description=none

edge-opensips.10-100-0-141:

    Cluster:: 1

        Node:: 2 DB_ID=-1 URL=bin:10.100.0.142:5555 Enabled=1 Link_state=Up
    Next_hop=2 Description=none


# salt -G hosttype:edge-opensips cmd.run 'opensipsctl fifo
clusterer_list_topology'

edge-opensips.10-100-0-141:

    Cluster:: 1

        Node:: 1 Neighbours=2

        Node:: 2 Neighbours=1

edge-opensips.10-100-0-142:

    Cluster:: 1

        Node:: 2 Neighbours=1

        Node:: 1 Neighbours=2



# salt -G hosttype:edge-opensips cmd.run 'opensipsctl fifo
clusterer_list_cap'

edge-opensips.10-100-0-142:

    Cluster:: 1

        Capability:: dialog-dlg-repl State=not synced

        Capability:: dialog-prof-repl State=Ok

edge-opensips.10-100-0-141:

    Cluster:: 1

        Capability:: dialog-dlg-repl State=not synced

        Capability:: dialog-prof-repl State=Ok


# salt -G hosttype:edge-opensips cmd.run 'netstat -anp | grep 5555'

edge-opensips.10-100-0-141:

    tcp        0      0 10.100.0.141:5555       0.0.0.0:*
LISTEN      6409/opensips

    tcp        0      0 10.100.0.141:38272      10.100.0.142:5555
 ESTABLISHED
6526/opensips

    tcp        0      0 10.100.0.141:5555       10.100.0.142:38355
 ESTABLISHED
6461/opensips

edge-opensips.10-100-0-142:

    tcp        0      0 10.100.0.142:5555       0.0.0.0:*
LISTEN      29658/opensips

    tcp        0      0 10.100.0.142:5555       10.100.0.141:38272
 ESTABLISHED
29710/opensips

    tcp        0      0 10.100.0.142:38355      10.100.0.141:5555
 ESTABLISHED
29775/opensips

===========================================================


******************* CONFIG ************************

loadmodule "proto_bin.so"

modparam("proto_bin", "bin_port", 5555)

modparam("proto_bin", "bin_send_timeout", 200)


# CLUSTERER

loadmodule "clusterer.so"

modparam("clusterer", "db_mode", 0)

modparam("clusterer", "db_url", "text:///etc/opensips/db/")

modparam("clusterer", "db_table", "clusterer")

modparam("clusterer", "ping_interval", 4)

modparam("clusterer", "ping_timeout", 1000)   # tried setting higher value

modparam("clusterer", "node_timeout", 60)

modparam("clusterer", "current_id", 1)

modparam("clusterer", "current_info", "cluster_id=1,node_id=1,url=bin:
10.100.0.141:5555")

modparam("clusterer", "neighbor_info", "cluster_id=1,node_id=2,url=bin:
10.100.0.142:5555")


## Dialog Module

loadmodule "dialog.so"

modparam("dialog", "dlg_match_mode", 1) #  DID_FALLBACK best method for now

modparam("dialog", "db_mode", 0)

modparam("dialog", "db_url", "text:///etc/opensips/db/")

modparam("dialog", "profiles_no_value", "inbound/b")

# For dialog replication

modparam("dialog", "profile_replication_cluster", 1)

modparam("dialog", "dialog_replication_cluster", 1)

modparam("dialog", "dlg_sharing_tag", "vip=active")  # set to "vip=backup"
in the other node


---------------------------------


# DB_TEXT

# cat /etc/opensips/db/clusterer

id(int,auto) cluster_id(int) node_id(int) url(string) state(int)
no_ping_retries(int) priority(int) sip_addr(string,null) flags(string,null)
description(string,null)

10:1:1:bin\:10.100.0.141\:5555:1:3:50:NULL:NULL:primary

20:1:2:bin\:10.100.0.142\:5555:1:3:50:NULL:NULL:secondary



**************************************************


best regards,

Hari
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.opensips.org/pipermail/users/attachments/20190604/1486c0cf/attachment-0001.html>


More information about the Users mailing list