[OpenSIPS-Users] CPU 100% with TCP

Bogdan-Andrei Iancu bogdan at opensips.org
Thu Nov 1 13:29:40 EDT 2018


Hi Ben,

First, make sure the DBG_LOCK option is compiled in: run "opensips -V" 
and check the flags in the output.

Next, when the deadlock occurs, force a SIGSEGV on opensips 
(killall -11 opensips). I need a core file to inspect (assuming that 
runtime inspection with gdb is not possible).
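
The two steps above can be sketched as a small script. This is only an 
illustration, assuming the flag name appears verbatim somewhere in the 
"opensips -V" output and that "pgrep" is available on the host; the 
helper names (has_build_flag, opensips_pids, force_core_dumps) are 
hypothetical and not part of OpenSIPS.

```python
import os
import re
import signal
import subprocess
from typing import List


def has_build_flag(version_output: str, flag: str = "DBG_LOCK") -> bool:
    """Return True if `flag` appears as a whole token in the version text."""
    # Reject matches embedded in a longer identifier (e.g. NO_DBG_LOCK_X).
    pattern = rf"(?<![A-Za-z0-9_]){re.escape(flag)}(?![A-Za-z0-9_])"
    return re.search(pattern, version_output) is not None


def opensips_pids() -> List[int]:
    """PIDs of processes named exactly 'opensips' (empty list if none)."""
    result = subprocess.run(
        ["pgrep", "-x", "opensips"], capture_output=True, text=True
    )
    return [int(pid) for pid in result.stdout.split()]


def force_core_dumps() -> None:
    """Send SIGSEGV (signal 11) to every OpenSIPS process,
    the same effect as `killall -11 opensips`."""
    for pid in opensips_pids():
        os.kill(pid, signal.SIGSEGV)


if __name__ == "__main__":
    version = subprocess.run(["opensips", "-V"], capture_output=True, text=True)
    if has_build_flag(version.stdout + version.stderr):
        print("DBG_LOCK is compiled in")
    else:
        print("DBG_LOCK not found; recompile before reproducing")
```

force_core_dumps() is deliberately not called from the main block: it 
should only be run by hand once the deadlock has actually been 
reproduced.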

Regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
   http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
   http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/31/2018 09:07 PM, Ben Newlin wrote:
>
> Bogdan,
>
> For the first test I have done as you suggested and disabled only 
> async operation for HEP, so it is still using TCP. I will send you the 
> trap info directly as it is too large. I also compiled with the 
> DBG_LOCK option, but I am unsure whether that extra information will 
> be available in the trap output. Do you need something else?
>
> I am now going to switch HEP to use UDP to mirror our production 
> environment and try to reproduce again. Wish me luck! ☺
>
> Ben Newlin
>
> *From: *Bogdan-Andrei Iancu <bogdan at opensips.org>
> *Date: *Monday, October 29, 2018 at 2:19 PM
> *To: *Ben Newlin <Ben.Newlin at genesys.com>, OpenSIPS users mailling 
> list <users at lists.opensips.org>
> *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
>
> Hi Ben,
>
> I checked the error trace and it should not leave any dangling lock 
> (due to a mishandled error). Before disabling HEP, try disabling the 
> async support for HEP.
>
> If you see the same 100% CPU with HEP + UDP, send me a 
> trap for that too, since in the previous case the deadlock was 
> exclusively HEP + TCP related.
>
> Anyhow, as the original trap showed a deadlock, next step will be to 
> recompile with the DBG_LOCK option - this enables extra code to 
> debug/troubleshoot locking related issues - are you able to do it?
>
> Regards,
>
> Bogdan-Andrei Iancu
>
> On 10/26/2018 04:14 PM, Ben Newlin wrote:
>
>     Bogdan,
>
>     Actually, yes we do. Looking back I can see these errors just
>     before the issue occurs:
>
>     Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot
>     send hep message!
>
>     Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to
>     10.32.163.211:9061 for proto hep_tcp/9 failed
>
>     Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send
>
>     Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed
>     first TCP async send : (32) Broken pipe
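
A minimal sketch of how log lines like the ones above could be pulled 
out of a larger syslog file. The regular expressions assume the exact 
layout shown in this excerpt ("[pid] ERROR:proto_hep:function: message"); 
the helper names are hypothetical. The number in parentheses in the last 
line is taken to be the OS errno, and 32 is EPIPE on Linux.

```python
import errno
import re
from typing import Iterable, List, Optional, Tuple

# Sample lines copied from the log excerpt above (unwrapped).
SAMPLE = [
    "Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!",
    "Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed",
    "Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send",
    "Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe",
]

HEP_ERROR = re.compile(
    r"\[(?P<pid>\d+)\]\s+ERROR:proto_hep:(?P<func>\w+):\s*(?P<msg>.*)"
)
ERRNO_IN_MSG = re.compile(r"\((\d+)\)")


def hep_errors(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (pid, function, message) for each proto_hep ERROR line."""
    found = []
    for line in lines:
        match = HEP_ERROR.search(line)
        if match:
            found.append(
                (int(match["pid"]), match["func"], match["msg"].strip())
            )
    return found


def errno_name(message: str) -> Optional[str]:
    """Map a '(NN)' errno number in a message to its symbolic name, if any."""
    match = ERRNO_IN_MSG.search(message)
    if match is None:
        return None
    return errno.errorcode.get(int(match.group(1)))


if __name__ == "__main__":
    for pid, func, msg in hep_errors(SAMPLE):
        print(pid, func, msg, errno_name(msg) or "")
```

Grouping the results by pid and timestamp makes it easier to see 
whether the errors always come right before the CPU spike.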
>
>     I will try disabling HEP and see if we can reproduce.
>
>     Just for information, I have been reproducing the issue in our
>     testing environment which uses TCP for HEP, however the issue is
>     occurring in our production environment as well which is still
>     using UDP for HEP.
>
>     Ben Newlin
>
>     *From: *Bogdan-Andrei Iancu <bogdan at opensips.org>
>     <mailto:bogdan at opensips.org>
>     *Date: *Friday, October 26, 2018 at 3:06 AM
>     *To: *Ben Newlin <Ben.Newlin at genesys.com>
>     <mailto:Ben.Newlin at genesys.com>, OpenSIPS users mailling list
>     <users at lists.opensips.org> <mailto:users at lists.opensips.org>
>     *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
>
>     Hi Ben,
>
>     Thank you for the info.
>
>     It looks like the processes get stuck on a HEP-related internal
>     lock. Do you see any HEP-related errors in your logs prior to
>     the deadlock?
>
>     Also, as a PoC, could you disable HEP tracing to see if the
>     problem goes away?
>
>     Thanks,
>
>     Bogdan-Andrei Iancu
>
>     On 10/24/2018 10:18 PM, Ben Newlin wrote:
>
>         Bogdan,
>
>         I have run the command but the output was too large for
>         pastebin so I have sent it to you directly.
>
>         Ben Newlin
>
>         *From: *Bogdan-Andrei Iancu <bogdan at opensips.org>
>         <mailto:bogdan at opensips.org>
>         *Date: *Wednesday, October 24, 2018 at 5:17 AM
>         *To: *OpenSIPS users mailling list <users at lists.opensips.org>
>         <mailto:users at lists.opensips.org>, Ben Newlin
>         <Ben.Newlin at genesys.com> <mailto:Ben.Newlin at genesys.com>
>         *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
>
>         Hi Ben,
>
>         Could you run "opensipsctl trap" ?
>
>         Regards,
>
>         Bogdan-Andrei Iancu
>
>         On 10/24/2018 12:56 AM, Ben Newlin wrote:
>
>             Hi,
>
>             We have implemented TCP recently and are performing
>             TCP<->UDP translation on one of our proxy types. This
>             proxy only exists for that purpose; there are no DB
>             queries, REST calls, or anything like that. It is designed
>             to be very fast and high throughput.
>
>             Recently we have found that when the remote endpoint of a
>             TCP connection is lost (i.e. the server goes down) while
>             under moderate load, OpenSIPS quickly reaches 100% CPU and
>             becomes unresponsive. When this occurs, the “top” command
>             shows that 30-90% of CPU time is in System (kernel) space,
>             and each OpenSIPS TCP process shows many times its normal
>             CPU usage. We are running OpenSIPS 2.4.2 on Amazon Linux.
>
>             I obtained as much information as I could using ps,
>             strace, and gdb here: https://pastebin.com/JP3DnCqs
>             <https://pastebin.com/JP3DnCqs>. We can reproduce the
>             failure consistently by removing a server during call traffic.
>
>             A few things I noticed:
>
>               * The number of running threads reported by OpenSIPS
>                 doesn’t align with our configuration, copied here:
>
>             ####### Global Parameters #########
>
>             children=32
>
>             #// Allow 503 to pass back to Control
>             disable_503_translation=yes
>
>             #// Even though we are not receiving HEP,
>             #// this listener is required by OpenSIPS
>             #// in order to use the proto_hep module. :/
>             listen=hep_tcp:10.32.40.245:9061 use_children 1
>
>             #// Configure the listeners
>             listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
>             listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
>
>             #// Transaction Module
>             loadmodule "tm.so"
>             modparam("tm", "restart_fr_on_each_reply", 0)
>             modparam("tm", "timer_partitions", 8)
>             modparam("tm", "onreply_avp_mode", 1)
>             modparam("tm", "wt_timer", 10)
>
>             According to the documentation, if “tcp_children” is not
>             set then the value of “children” will be used [1], but we
>             have set “children” to 32 and only have the default 8 TCP
>             processes. Also, we appear to have only 1 timer process,
>             although we have set the number of timer partitions to 8.
>
>               * The server that is terminated was using TCP
>                 connections exclusively, but all of the CPU seems to
>                 be in the UDP threads. The one I looked at appeared to
>                 be handling a CANCEL to one of the calls that was
>                 active and was attempting to send it out via TCP. I’m
>                 not sure why it would be trying to relay the CANCEL as
>                 no 100 Trying had been received from the server. I
>                 have noticed that in 2.x OpenSIPS will now send
>                 CANCELs for transactions even when 100 Trying was not
>                 received. Is that intentional? RFC 3261 states that no
>                 CANCEL should be sent unless a provisional response
>                 has been received.
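
The RFC 3261 rule referred to above (section 9.1) can be modelled as a 
tiny state machine: a CANCEL requested before any provisional response 
is held back and only released once a 1xx arrives, and it is dropped if 
a final response arrives first. This is a sketch of the RFC rule only, 
not of how the OpenSIPS tm module is implemented; the class and method 
names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClientInviteTransaction:
    provisional_received: bool = False
    final_received: bool = False
    cancel_pending: bool = False

    def request_cancel(self) -> bool:
        """Return True if a CANCEL may be sent right now.

        If no provisional response has arrived yet, remember the request
        and return False so the caller waits instead of sending.
        """
        if self.final_received:
            return False  # nothing left to cancel
        if self.provisional_received:
            return True
        self.cancel_pending = True
        return False

    def on_response(self, status: int) -> bool:
        """Record a response; return True if a deferred CANCEL is now due."""
        if 100 <= status < 200:
            self.provisional_received = True
            if self.cancel_pending:
                self.cancel_pending = False
                return True
            return False
        # Final response: any deferred CANCEL is no longer needed.
        self.final_received = True
        self.cancel_pending = False
        return False


if __name__ == "__main__":
    tx = ClientInviteTransaction()
    print(tx.request_cancel())   # no 1xx yet, so the CANCEL is deferred
    print(tx.on_response(100))   # 100 Trying arrives, deferred CANCEL is due
```

Under this model, a proxy that never received 100 Trying from the dead 
server would hold the CANCEL rather than try to send it over TCP.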
>
>             Any assistance with this would be appreciated.
>
>             [1] -
>             http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66
>
>             Ben Newlin
>
