[OpenSIPS-Devel] Allocating TCP workers to process requests

Bogdan-Andrei Iancu bogdan at opensips.org
Mon Jan 20 10:47:53 EST 2020


Dan,

Based on what you have so far it is hard to spot the problem.

Regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
   https://www.opensips-solutions.com
OpenSIPS Summit, Amsterdam, May 2020
   https://www.opensips.org/events/Summit-2020Amsterdam/
OpenSIPS Bootcamp, Miami, March 2020
   https://opensips.org/training/OpenSIPS_Bootcamp_2020/

On 1/20/20 4:50 PM, Dan Pascu wrote:
> On 20 Jan 2020, at 10:12, Bogdan-Andrei Iancu wrote:
>
>> Hi Dan,
>>
>> If you are able to reproduce the issue, then while OpenSIPS is "blocked" run "trap" and "mi get_statistics load:" from the cli, so we can see exactly what is going on. If possible, run these commands at 1-minute intervals (considering the 7.5-minute blocking time you mentioned).
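>>
>> For reference, something along these lines from a shell should do it (the exact opensips-cli invocation may differ a bit between versions):
>>
>>     opensips-cli -x trap
>>     opensips-cli -x mi get_statistics load:
>>
>> The "trap" command should collect backtraces of all the OpenSIPS processes, while the second one dumps the per-process load statistics.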
> I observed this behavior when my main DNS server accidentally died. In order to reproduce it, I'd have to deliberately kill it, which I don't want to do as this is the main DNS server for all the domains we use, not just some test server. The only other alternative would be to set up a whole parallel platform with separate domains, name servers and proxies just to run this test, which is something I do not have the time for.
>
> I posted this hoping that someone could take a look at the TCP worker allocation and spot if there is something wrong with it.
>
>> Regards,
>>
>> Bogdan-Andrei Iancu
>>
>> OpenSIPS Founder and Developer
>>   https://www.opensips-solutions.com
>> OpenSIPS Summit, Amsterdam, May 2020
>>   https://www.opensips.org/events/Summit-2020Amsterdam/
>> OpenSIPS Bootcamp, Miami, March 2020
>>   https://opensips.org/training/OpenSIPS_Bootcamp_2020/
>>
>> On 1/17/20 3:27 PM, Dan Pascu wrote:
>>> On 17 Jan 2020, at 12:19, Bogdan-Andrei Iancu wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> So, 29 of the SUBSCRIBE requests are forwarded via TCP to another server, let's call it S1.
>>>>
>>>> Now, looking again at the blocking stage:  if the UAC tries to send new traffic via the same TCP conn, it is clear it will block, as the conn, on the OpenSIPS side, is blocked by the DNS queries.
>>> Figured that much.
>>>
>>>> Still, opening a new TCP conn should do it - the conn will be assigned to the next TCP worker process.
>>> That's what I thought, but not what I observed. During the busy window nothing worked except the RLS trying to send using 1 worker, while 4 other workers were idle and no other incoming traffic was processed.
>>>
>>>> I think it is very unlikely (still not impossible) to have this new conn assigned also to the same proc as the first one (the blocked one). Have you tried checking the internal load of the TCP processes ("get_statistics load:")?
>>> It's in my original post. The load was 100%/100%/77% during the busy window.
>>>
>>> My question is why it is still possible (even if unlikely) to allocate a job to this TCP worker that is already busy. When a request comes in and a worker is selected to process it, isn't that worker marked as busy, so that nothing else uses it until it finishes processing the current request, at which point it should be released back to the pool and only then become available for something else?
>>>
>>>> Regards,
>>>>
>>>> Bogdan-Andrei Iancu
>>>>
>>>> OpenSIPS Founder and Developer
>>>>   https://www.opensips-solutions.com
>>>> OpenSIPS Summit, Amsterdam, May 2020
>>>>   https://www.opensips.org/events/Summit-2020Amsterdam/
>>>> OpenSIPS Bootcamp, Miami, March 2020
>>>>   https://opensips.org/training/OpenSIPS_Bootcamp_2020/
>>>>
>>>> On 1/16/20 5:23 PM, Dan Pascu wrote:
>>>>> On 15 Jan 2020, at 15:52, Bogdan-Andrei Iancu wrote:
>>>>>
>>>>>> Hi Dan,
>>>>>>
>>>>>> OK, so the Presence Server is on a different instance than the RLS server and the routing between them is DNS-based - thanks for clarifying this for me.
>>>>> No. The server handles both RLS and presence, but the contacts for the account are not on this server; they are handled by remote servers. So when the RLS subscribe comes in, it finds the 30 contacts, out of which only 1 is local. The other 29 are on some other server. But overall it doesn't matter: because dns=yes is in the config, even for the local contact it will still do DNS lookups before forwarding it to itself.
>>>>>
>>>>>> Again, as a first step, I'm trying to understand the scenario you have (including how RLS works) so I can understand what is going on inside of OpenSIPS.
>>>>>>
>>>>>> So we have the 30 SUBSCRIBE requests being sent out from one TCP worker process. And due to the DNS failure, this takes 7.5 minutes. And you claim that during these 7.5 minutes OpenSIPS was not able to handle any other traffic? If so, same questions:
>>>>>>      - was the remaining traffic coming via the same TCP conn as the originally received SUBSCRIBE?
>>>>> Yes and no. Initially it was yes, because it was just a client connected over TLS and sending all requests over that. So when I attempted to send an INVITE it was sent over the same TLS connection. But when that failed, I tried to change my outbound proxy to TCP and also tried to dial the address as sip:user at domain;transport=tcp, which should have forced a separate TCP connection. But neither worked; the only thing that worked was to use UDP (either through the outbound proxy or explicitly in the URI).
>>>>>
>>>>>>      - if NO on the above, was the remaining traffic also RLS related?
>>>>> No. It was INVITE/REGISTER.
>>>>>
>>>>>>      - if NO on the above, was the remaining traffic also TCP/TLS, or was it UDP?
>>>>> TCP/TLS didn't work at all during the 7.5-minute busy window; UDP worked.
>>>>>
>>>>>> Just as a fact, there is no kind of dependency/synchronization between the TCP workers that would block all the other procs while one of them is busy.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Bogdan-Andrei Iancu
>>>>>>
>>>>>> OpenSIPS Founder and Developer
>>>>>>   https://www.opensips-solutions.com
>>>>>> OpenSIPS Summit, Amsterdam, May 2020
>>>>>>   https://www.opensips.org/events/Summit-2020Amsterdam/
>>>>>> OpenSIPS Bootcamp, Miami, March 2020
>>>>>>   https://opensips.org/training/OpenSIPS_Bootcamp_2020/
>>>>>>
>>>>>> On 1/15/20 1:13 PM, Dan Pascu wrote:
>>>>>>> On 15 Jan 2020, at 12:07, Bogdan-Andrei Iancu wrote:
>>>>>>>
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> I'm not so familiar with how the RLS works, so bear with me while I try to understand the scenario.
>>>>>>>>
>>>>>>>> In your case, the initial subscribe (to RLS), from the endpoint, will be received by opensips via one of the TCP workers (the alg for distributing the new TCP conns to a TCP worker is rather simple, filling in the workers from first to last, so yes, the first conn will always land on the first worker).
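>>>>>>>>
>>>>>>>> Just to illustrate the idea (a simplified sketch only, not the actual OpenSIPS source; the names here are made up for illustration): conceptually the selection scans the workers in order and keeps the least busy one, so on an idle server the first worker always wins the tie:
>>>>>>>>
>>>>>>>>   #include <stdio.h>
>>>>>>>>
>>>>>>>>   struct tcp_worker { int pid; int busy; };
>>>>>>>>
>>>>>>>>   /* pick the first worker with the lowest "busy" count;
>>>>>>>>      strict '<' keeps ties on the earlier (lower index) worker */
>>>>>>>>   static int pick_worker(const struct tcp_worker *w, int n)
>>>>>>>>   {
>>>>>>>>       int best = 0;
>>>>>>>>       for (int i = 1; i < n; i++)
>>>>>>>>           if (w[i].busy < w[best].busy)
>>>>>>>>               best = i;
>>>>>>>>       return best;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   int main(void)
>>>>>>>>   {
>>>>>>>>       struct tcp_worker w[5] = { {101, 0}, {102, 0}, {103, 0}, {104, 0}, {105, 0} };
>>>>>>>>       printf("idle server: conn goes to worker %d\n", pick_worker(w, 5));   /* prints 0 */
>>>>>>>>       w[0].busy = 3;  /* the 1st worker (index 0) is stuck handling its jobs */
>>>>>>>>       printf("1st worker busy: conn goes to worker %d\n", pick_worker(w, 5)); /* prints 1 */
>>>>>>>>       return 0;
>>>>>>>>   }
>>>>>>>>
>>>>>>>> So a brand new connection made while the first worker is blocked should land on one of the idle workers.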
>>>>>>>>
>>>>>>>> Now, while handling this initial SUBSCRIBE (in the RLS module), the module will fire the 30 new SUBSCRIBE requests (to the real entities) - these SUBSCRIBEs (I assume) will go via loopback to OpenSIPS again, as presence server - is this correct?
>>>>>>> No. All (except maybe one) will go to a different server (this is a test server and all contacts are in different domains). But even for the one that is local it still does the DNS lookup, because I have dns = yes in the settings, which forces DNS lookups even for local domains.
>>>>>>>
>>>>>>>> If so, is there any DNS lookup involved while these SUBSCRIBEs are looping? (I would say no, as RLS uses the `presence_server` param, but it can be an FQDN I guess :-/ )
>>>>>>> Yes. See above. There is a DNS lookup involved for every destination (3 lookups actually: NAPTR, SRV and A). That's why it's blocked for 7.5 minutes out of every 10 minutes.
>>>>>>>
>>>>>>>> Also, is this looping done via TCP (between the opensips RLS and the opensips Presence Server)?
>>>>>>> Yes. TLS actually.
>>>>>>>
>>>>>>> In case I didn't convey my message well, I want to reiterate that I do not see this as an RLS issue (except maybe that RLS could be optimized to use different workers for the different SUBSCRIBEs it sends out, instead of sending all of them from just one worker). The problem I see is that while this 1st worker is occupied for 7.5 minutes out of 10 with RLS, no incoming request is processed, as if it is allocated to the same worker that is already busy with RLS. RLS in this case is just a useful trigger that highlights the problem because it keeps 1 worker busy 75% of the time.
>>>>>>>
>>>>>>> My question is why no other incoming requests are processed during the 7.5 minutes while RLS keeps the 1st TCP worker busy (because of DNS failures), yet I see incoming requests processed when I'm in the 2.5-minute window when RLS is idle and not using TCP worker 1. To me it looks like some allocation problem with the TCP workers, as there are 4 idle workers, yet the incoming messages are not allocated to any of them.
>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Bogdan-Andrei Iancu
>>>>>>>>
>>>>>>>> OpenSIPS Founder and Developer
>>>>>>>>   https://www.opensips-solutions.com
>>>>>>>> OpenSIPS Summit, Amsterdam, May 2020
>>>>>>>>   https://www.opensips.org/events/Summit-2020Amsterdam/
>>>>>>>> OpenSIPS Bootcamp, Miami, March 2020
>>>>>>>>   https://opensips.org/training/OpenSIPS_Bootcamp_2020/
>>>>>>>>
>>>>>>>> On 1/10/20 1:59 PM, Dan Pascu wrote:
>>>>>>>>> I noticed some unexpected behavior related to how TCP workers are allocated to process requests. This was highlighted during a DNS outage, due to how opensips was configured.
>>>>>>>>>
>>>>>>>>> Here are the relevant bits from my configuration to lay out the context:
>>>>>>>>>
>>>>>>>>> I listen on UDP, TCP and TLS and I start 5 UDP and 5 TCP worker processes, but allow them to grow up to 25 based on load:
>>>>>>>>>
>>>>>>>>> listen = udp:IP:5060
>>>>>>>>> listen = tcp:IP:5060
>>>>>>>>> listen = tls:IP:5061
>>>>>>>>>
>>>>>>>>> auto_scaling_profile = SIP_WORKERS
>>>>>>>>>       scale up to 25 on 80% for 4 cycles within 5
>>>>>>>>>       scale down to 5 on 20% for 10 cycles
>>>>>>>>>
>>>>>>>>> tcp_workers = 5 use_auto_scaling_profile SIP_WORKERS
>>>>>>>>> udp_workers = 5 use_auto_scaling_profile SIP_WORKERS
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> DNS is configured to use only 1 server and only 1 attempt with a timeout of 5 seconds per request:
>>>>>>>>>
>>>>>>>>> dns                   = yes
>>>>>>>>> rev_dns               = no
>>>>>>>>> dns_use_search_list   = no
>>>>>>>>> disable_dns_blacklist = yes
>>>>>>>>> dns_retr_time         = 5
>>>>>>>>> dns_retr_no           = 1
>>>>>>>>> dns_servers_no        = 1
>>>>>>>>>
>>>>>>>>> This means that every time a domain is looked up but the DNS server is down, it will do 3 requests (NAPTR, SRV and A) and each will take 5 seconds to time out. In other words, a DNS lookup for a domain will time out after 15 seconds.
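>>>>>>>>>
>>>>>>>>> (With 30 contacts, that adds up to 30 x 15 s = 450 s, i.e. roughly 7.5 minutes of DNS timeouts per RLS refresh cycle.)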
>>>>>>>>>
>>>>>>>>> I have 1 device that connects over TLS and registers an account that uses RLS and has 30 contacts stored.
>>>>>>>>>
>>>>>>>>> Now the event was that the main DNS server was down, and because of my configuration I didn't fall back to the secondary one from resolv.conf, so all DNS requests failed.
>>>>>>>>>
>>>>>>>>> During this time I noticed that whenever RLS kicked in it would attempt to send SUBSCRIBEs to the 30 contacts and fail for each of them, and the whole thing would take approximately 7.5 minutes, during which time it would always use the 1st TCP worker, whose load and 1-minute load would increase to 100% while the 10-minute load would stay at 77%. This was in line with the fact that RLS was triggered every 10 minutes and spent 7.5 minutes stuck in DNS timeouts, so it was busy approximately 75% of the time.
>>>>>>>>>
>>>>>>>>> The fact that RLS always used TCP worker 1 is not unexpected as the SIP device I mentioned was the only one connected to the proxy and the only one sending requests, so the proxy was mostly idle doing RLS every 10 minutes, besides the occasional REGISTER/SUBSCRIBE from the device.
>>>>>>>>>
>>>>>>>>> The unexpected behavior is that during the 7.5 minutes when RLS tried to send SUBSCRIBEs to the contacts, any REGISTER or INVITE received by the proxy would not be processed. They seem to be scheduled on the same 1st TCP worker that is already loaded at 100% from the ongoing RLS processing. I never see any log message from my script about processing the REGISTER or INVITE and they just time out on the client. If I send the REGISTER or INVITE during the 2.5 minutes when RLS is not trying to send SUBSCRIBEs to the contacts, then I see the REGISTER and INVITE being processed and logged by the script, but the INVITE also fails due to the DNS failure.
>>>>>>>>>
>>>>>>>>> If I change my outbound proxy to prefer UDP, then I see the REGISTER and INVITE being processed, but if I use TCP or TLS I do not see them being processed unless I'm in the 2.5-minute window when the proxy is not doing RLS (actually I never checked, but it's possible that the requests that arrived in the 7.5-minute window were processed and logged once the RLS processing window ended; I never waited that long though, as they always timed out on the client in 30 seconds).
>>>>>>>>>
>>>>>>>>> Now I can understand that RLS does all of this in a single worker (it does a database lookup for the contacts and then loops over all of them trying to send a SUBSCRIBE for each), even though it could be argued that it could be optimized to delegate each outgoing SUBSCRIBE to a different worker.
>>>>>>>>>
>>>>>>>>> What puzzles me is why opensips allocates the incoming requests it receives to a TCP worker that is already busy and shows a 100% load in opensips-cli, while it has 4 other TCP workers that are completely idle. Or, if my conclusion is wrong, what exactly happens during the 7.5 minutes when RLS uses TCP worker 1 trying to send out the SUBSCRIBEs and failing, such that no incoming request is processed by the other 4 idle TCP workers and it just times out?
>>>>>>>>>
>>>>>>>>> That is not to say that I never see the other TCP workers' pids in syslog, but they appear very rarely and the idle workers never seem to be used during the 7.5-minute busy window when the 1st worker is 100% loaded. So some worker allocation does seem to happen when processing multiple incoming requests that arrive in parallel, but while RLS is sending out the SUBSCRIBEs it never seems to try to use the idle workers for incoming requests.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Devel mailing list
>>>>>>>>> Devel at lists.opensips.org
>>>>>>>>> http://lists.opensips.org/cgi-bin/mailman/listinfo/devel
>>>>>>> --
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> --
>>>>> Dan
>>>>>
>>>>>
>>>>>
>>>>>
>>> --
>>> Dan
>>>
>>>
>>>
>>>
>
> --
> Dan
>
>
>
>



