[OpenSIPS-Users] Autoscaler in 3.2.x

Mon Sep 12 09:53:41 UTC 2022

Hi Yury,

Could you capture a trap output while the processes are at 100% CPU, 
before everything dies?
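
For reference, one way to catch that moment could look like the sketch
below. Only "opensips-cli -x trap" comes from this thread; the 90%
threshold, the output path and the hot_pids helper are assumptions made
for illustration, and "ps -C" assumes a Linux (procps) ps.

```shell
#!/bin/sh
# Sketch only: the threshold, output path and helper name are assumptions.
THRESHOLD="${THRESHOLD:-90}"
OUT="${OUT:-/tmp/opensips-stuck.$(date +%s).txt}"

# Read "pid pcpu" lines on stdin and print the PIDs whose CPU
# percentage is at or above the value given as $1.
hot_pids() {
    awk -v t="$1" '$2 + 0 >= t { print $1 }'
}

# One "pid pcpu" line per running opensips process (empty if none).
snapshot=$(ps -C opensips -o pid=,pcpu= 2>/dev/null)

if [ -n "$(printf '%s\n' "$snapshot" | hot_pids "$THRESHOLD")" ]; then
    # Save per-process CPU and memory figures, then take the trap.
    ps -C opensips -o pid,pcpu,pmem,rss,args > "$OUT"
    opensips-cli -x trap
fi
```

Note that ps reports CPU usage averaged over each process's lifetime, so
this check suits processes that get stuck shortly after a restart; run
from a loop or from cron, it would take the trap as soon as a worker
crosses the threshold.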

Best regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
   https://www.opensips-solutions.com
OpenSIPS Summit 27-30 Sept 2022, Athens
   https://www.opensips.org/events/Summit-2022Athens/

On 9/12/22 11:12 AM, Yury Kirsanov wrote:
> Hi Bogdan,
> We've run into another issue. This time I was restarting the OpenSIPS 
> server during busy hours, when about 2500 SIP devices were registering 
> and making calls (the number of dialogs was only around 100-200, but 
> there were a lot of packets), and I was unable to restart OpenSIPS 
> successfully. Some processes got stuck almost immediately at 100% 
> load, then started to consume more and more memory; after eating up 
> all the memory they died and OpenSIPS stopped processing SIP packets.
>
> I believe it's similar to the autoscaler issue: in this case I had 
> only 16 UDP workers and 16 TCP workers, so it took longer for 
> OpenSIPS to run into the problem, whereas with the autoscaler on it 
> wasn't able to open that many processes at once, so the currently 
> active ones got stuck very quickly and the crash happened almost 
> immediately.
>
> I'm running a localhost REDIS cache to store where to proxy each SIP 
> packet to. If there's no record for a SIP device, I query a REST 
> server and cache its response. The REST server load was no more than 
> 25% during the restart, when all SIP devices were urgently trying to 
> re-connect to OpenSIPS, so I don't think they are the problem.
>
> I'm using async REST calls and believe there should be no issues with 
> my configuration script, even though it runs a lot of nested routes 
> due to the async REST requests. Hopefully I didn't forget an 'exit' 
> statement anywhere, but if that were the case the OpenSIPS service 
> would be locking up at any time.
>
> OpenSIPS itself is running as a virtual machine on a VMware host, and 
> I could see it consuming up to 100% CPU of a 40-core host when it was 
> locking up. The VMware readiness value for the VM was also spiking to 
> 1500ms during these lock-ups, meaning that the VM was waiting for 
> cores to free up before it could get any CPU time.
>
> The only way out of this situation for me was to run multiple OpenSIPS 
> VMs and spread the load between them. No matter what I tried, I wasn't 
> able to get OpenSIPS running fine again, even though it had been 
> working perfectly for more than a week in this configuration and 
> under the same load; in that period, though, I had only started or 
> restarted it during night hours when there were no active calls.
>
> I'm happy to share my configuration file with you privately if required.
>
> Hope this helps!
>
> Thanks and best regards,
> Yury.
>
> On Wed, Sep 7, 2022 at 9:54 PM Bogdan-Andrei Iancu 
> <bogdan at opensips.org> wrote:
>
>     Hi Yury,
>
>     Thanks for the detailed info here - let me do a review of some code
>     and run some tests, as at this point I have a good idea of the
>     direction to dig into.
>
>     I will update here.
>
>     Best regards,
>
>     Bogdan-Andrei Iancu
>
>     OpenSIPS Founder and Developer
>        https://www.opensips-solutions.com
>     OpenSIPS Summit 27-30 Sept 2022, Athens
>        https://www.opensips.org/events/Summit-2022Athens/
>
>     On 9/6/22 11:24 AM, Yury Kirsanov wrote:
>>     Hi Bogdan,
>>     Yes, I'm listening on all types of sockets including UDP, TCP and
>>     TLS on the outside public interface and then forward traffic into
>>     internal LAN via UDP only.
>>
>>     Previously it was getting stuck quite easily, now I had to wait
>>     for a while before this actually happened. I've routed part of my
>>     customers to this server to obtain this result so I will have to
>>     do that again.
>>
>>     As soon as I see one of the processes stuck I'll run the trap
>>     command and send you all the details, including process load, ps
>>     output and so on.
>>
>>     For now I had to switch autoscaling off and just create many
>>     listeners. Do I understand correctly that I need to restart
>>     OpenSIPS in order to apply autoscaling profiles and reload-routes
>>     is not sufficient?
>>
>>     Also, do I need separate UDP profiles for public and private
>>     interfaces? And do I need to apply an autoscaling profile just to a
>>     socket, or do I need to specify udp_workers or tcp_workers with the
>>     autoscaler too?
>>
>>     Thanks and best regards,
>>     Yury.
>>
>>     On Tue, 6 Sept 2022, 18:18 Bogdan-Andrei Iancu,
>>     <bogdan at opensips.org> wrote:
>>
>>         Hi Yury,
>>
>>         Thanks for the info. I see that the stuck process (24) is an
>>         auto-scaled one (based on its id). Do you have SIP traffic
>>         from UDP to TCP, or are you doing some HEP capturing for SIP?
>>         I saw a recent similar report where a UDP auto-scaled worker
>>         got stuck when trying to do some communication with the TCP
>>         main/manager process (in order to handle a TCP operation).
>>
>>         BTW, any chance you could run "opensips-cli -x trap" when you
>>         have that stuck process, just to see where it is stuck? And is
>>         it hard to reproduce? I may ask you to extract some
>>         information from the running process...
>>
>>         Regards,
>>
>>         Bogdan-Andrei Iancu
>>
>>         OpenSIPS Founder and Developer
>>            https://www.opensips-solutions.com
>>         OpenSIPS Summit 27-30 Sept 2022, Athens
>>            https://www.opensips.org/events/Summit-2022Athens/
>>
>>         On 9/3/22 6:54 PM, Yury Kirsanov wrote:
>>
>
