[OpenSIPS-Users] CPU 100% with TCP

Mon Oct 29 14:19:18 EDT 2018

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due 
to a mishandled error). Before disabling HEP, try disabling the async 
support for HEP.
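
A minimal sketch of that change, assuming the proto_hep module's
hep_async parameter (not named elsewhere in this thread) is what controls
asynchronous HEP sends in your 2.4.2 build:

```cfg
# Assumption: hep_async defaults to 1 (enabled); 0 switches the HEP
# TCP connect/write operations to blocking mode.
modparam("proto_hep", "hep_async", 0)
```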

If you claim that the same 100% CPU happens with HEP + UDP, send me a 
trap for that too, as in the previous case, the deadlock was exclusively 
HEP + TCP related.

Anyhow, as the original trap showed a deadlock, the next step will be to 
recompile with the DBG_LOCK option, which enables extra code to 
debug/troubleshoot locking-related issues. Are you able to do that?

Regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
   http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
   http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/26/2018 04:14 PM, Ben Newlin wrote:
>
> Bogdan,
>
> Actually, yes we do. Looking back I can see these errors just before 
> the issue occurs:
>
> Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!
> Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed
> Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send
> Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe
>
> I will try disabling HEP and see if we can reproduce.
>
> Just for information, I have been reproducing the issue in our testing 
> environment, which uses TCP for HEP. However, the issue is also occurring 
> in our production environment, which is still using UDP for HEP.
>
> Ben Newlin
>
> *From: *Bogdan-Andrei Iancu <bogdan at opensips.org>
> *Date: *Friday, October 26, 2018 at 3:06 AM
> *To: *Ben Newlin <Ben.Newlin at genesys.com>, OpenSIPS users mailling 
> list <users at lists.opensips.org>
> *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
>
> Hi Ben,
>
> Thank you for the info.
>
> It looks like the processes get stuck on a HEP-related internal lock. 
> Do you see any HEP-related errors in your logs prior to the deadlock?
>
> Also, as a PoC, could you disable HEP tracing to see if the problem 
> goes away?
>
> Thanks,
>
>
> Bogdan-Andrei Iancu
>
> On 10/24/2018 10:18 PM, Ben Newlin wrote:
>
>     Bogdan,
>
>     I have run the command but the output was too large for pastebin
>     so I have sent it to you directly.
>
>     Ben Newlin
>
>     *From: *Bogdan-Andrei Iancu <bogdan at opensips.org>
>     *Date: *Wednesday, October 24, 2018 at 5:17 AM
>     *To: *OpenSIPS users mailling list <users at lists.opensips.org>,
>     Ben Newlin <Ben.Newlin at genesys.com>
>     *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
>
>     Hi Ben,
>
>     Could you run "opensipsctl trap" ?
>
>     Regards,
>
>
>     Bogdan-Andrei Iancu
>
>     On 10/24/2018 12:56 AM, Ben Newlin wrote:
>
>         Hi,
>
>         We have implemented TCP recently and are performing TCP<->UDP
>         translation on one of our proxy types. This proxy only exists
>         for that purpose; there are no DB queries, REST calls, or
>         anything like that. It is designed to be very fast and high
>         throughput.
>
>         Recently we have found that when the remote endpoint of a TCP
>         connection is lost, i.e. the server goes down, while under
>         moderate load, OpenSIPS quickly reaches 100% CPU and becomes
>         unresponsive. When this occurs, the “top” command shows that
>         between 30% and 90% of CPU time is in System (kernel) space,
>         and each OpenSIPS TCP process shows many times its normal CPU
>         usage. We are running OpenSIPS 2.4.2 on Amazon Linux.
>
>         I obtained as much information as I could using ps, strace,
>         and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce
>         the failure consistently by removing a server during call
>         traffic.
>
>         A few things I noticed:
>
>           * The number of running threads reported by OpenSIPS doesn’t
>             align with our configuration, copied here:
>
>         ####### Global Parameters #########
>         children=32
>
>         #// Allow 503 to pass back to Control
>         disable_503_translation=yes
>
>         #// Even though we are not receiving HEP,
>         #// this listener is required by OpenSIPS
>         #// in order to use the proto_hep module. :/
>         listen=hep_tcp:10.32.40.245:9061 use_children 1
>
>         #// Configure the listeners
>         listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
>         listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
>
>         #// Transaction Module
>         loadmodule "tm.so"
>         modparam("tm", "restart_fr_on_each_reply", 0)
>         modparam("tm", "timer_partitions", 8)
>         modparam("tm", "onreply_avp_mode", 1)
>         modparam("tm", "wt_timer", 10)
>
>         According to the documentation, if “tcp_children” is not set
>         then the value of “children” will be used [1], but we have set
>         “children” to 32 and only have the default 8 TCP processes.
>         Also, we appear to have only 1 timer process, although we have
>         set the number of timer partitions to 8.
>
>           * The server that is terminated was using TCP connections
>             exclusively, but all of the CPU seems to be in the UDP
>             threads. The one I looked at appeared to be handling a
>             CANCEL to one of the calls that was active and was
>             attempting to send it out via TCP. I’m not sure why it
>             would be trying to relay the CANCEL as no 100 Trying had
>             been received from the server. I have noticed that in 2.x
>             OpenSIPS will now send CANCELs for transactions even when
>             100 Trying was not received. Is that intentional? RFC 3261
>             states that no CANCEL should be sent unless a provisional
>             response has been received.
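>
>         A minimal sketch of a possible workaround for the first point,
>         assuming that setting the tcp_children core parameter
>         explicitly (rather than relying on the documented fallback to
>         children) yields the intended number of TCP processes; this is
>         not confirmed anywhere in the thread:
>
>         ```cfg
>         children=32
>         # Assumption: set explicitly, since the documented fallback to
>         # "children" was not observed on this 2.4.2 install.
>         tcp_children=32
>         ```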
>
>         Any assistance with this would be appreciated.
>
>         [1] -
>         http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66
>
>         Ben Newlin