[OpenSIPS-Users] CPU 100% with TCP
Bogdan-Andrei Iancu
bogdan at opensips.org
Thu Nov 1 13:29:40 EDT 2018
Hi Ben,
First be sure you have the DBG_LOCK option compiled in. Do the "opensips
-V" and see the output flags.
Next step will be to force an SIGSEGV to opensips (killall -11 opensips)
when the deadlockoccurs - I need a core file to inspect (assuming that
runtime inspection with gdb is not possible).
Regards,
Bogdan-Andrei Iancu
OpenSIPS Founder and Developer
http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/31/2018 09:07 PM, Ben Newlin wrote:
>
> Bogdan,
>
> For the first test I have done as you suggested and disabled only
> async operation for HEP, so it is still using TCP. I will send you the
> trap info directly as it is too large. I also compiled with the
> DBG_LOCK option, but am unsure whether that extra information will be
> available in the trap output or do you need something else?
>
> I am now going to switch HEP to use UDP to mirror our production
> environment and try to reproduce again. Wish me luck! ☺
>
> Ben Newlin
>
> *From: *Bogdan-Andrei Iancu <bogdan at opensips.org>
> *Date: *Monday, October 29, 2018 at 2:19 PM
> *To: *Ben Newlin <Ben.Newlin at genesys.com>, OpenSIPS users mailling
> list <users at lists.opensips.org>
> *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
>
> Hi Ben,
>
> I checked the error trace and it should not leave any dangling lock
> (due mishandled error). Before disabling HEP, try to disable the async
> support for HEP.
>
> If you claim that the same 100% CPU happens with HEP + UDP, send me a
> trap for that too, as in the previous case, the deadlock was
> exclusively HEP + TCP related.
>
> Anyhow, as the original trap showed a deadlock, next step will be to
> recompile with the DBG_LOCK option - this enables extra code to
> debug/troubleshoot locking related issues - are you able to do it?
>
> Regards,
>
> Bogdan-Andrei Iancu
> OpenSIPS Founder and Developer
> http://www.opensips-solutions.com
> OpenSIPS Bootcamp 2018
> http://opensips.org/training/OpenSIPS_Bootcamp_2018/
>
> On 10/26/2018 04:14 PM, Ben Newlin wrote:
>
> Bogdan,
>
> Actually, yes we do. Looking back I can see these errors just
> before the issue occurs:
>
> Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot
> send hep message!
>
> Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to
> 10.32.163.211:9061 for proto hep_tcp/9 failed
>
> Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send
>
> Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed
> first TCP async send : (32) Broken pipe
>
> I will try disabling HEP and see if we can reproduce.
>
> Just for information, I have been reproducing the issue in our
> testing environment which uses TCP for HEP, however the issue is
> occurring in our production environment as well which is still
> using UDP for HEP.
>
> Ben Newlin
>
> *From: *Bogdan-Andrei Iancu <bogdan at opensips.org>
> <mailto:bogdan at opensips.org>
> *Date: *Friday, October 26, 2018 at 3:06 AM
> *To: *Ben Newlin <Ben.Newlin at genesys.com>
> <mailto:Ben.Newlin at genesys.com>, OpenSIPS users mailling list
> <users at lists.opensips.org> <mailto:users at lists.opensips.org>
> *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
>
> Hi Ben,
>
> Thank you for the info.
>
> It looks like the processes get stuck into a HEP related internal
> lock - do you see any HEP related errors in your logs, prior to
> the dead-lock ?
>
> Also, as PoC, could you disabled HEP tracing to see if the problem
> goes away ?
>
> Thanks,
>
>
>
> Bogdan-Andrei Iancu
>
>
>
> OpenSIPS Founder and Developer
>
> http://www.opensips-solutions.com
>
> OpenSIPS Bootcamp 2018
>
> http://opensips.org/training/OpenSIPS_Bootcamp_2018/
>
> On 10/24/2018 10:18 PM, Ben Newlin wrote:
>
> Bogdan,
>
> I have run the command but the output was too large for
> pastebin so I have sent it to you directly.
>
> Ben Newlin
>
> *From: *Bogdan-Andrei Iancu <bogdan at opensips.org>
> <mailto:bogdan at opensips.org>
> *Date: *Wednesday, October 24, 2018 at 5:17 AM
> *To: *OpenSIPS users mailling list <users at lists.opensips.org>
> <mailto:users at lists.opensips.org>, Ben Newlin
> <Ben.Newlin at genesys.com> <mailto:Ben.Newlin at genesys.com>
> *Subject: *Re: [OpenSIPS-Users] CPU 100% with TCP
>
> Hi Ben,
>
> Could you run "opensipsctl trap" ?
>
> Regards,
>
>
>
> Bogdan-Andrei Iancu
>
>
>
> OpenSIPS Founder and Developer
>
> http://www.opensips-solutions.com
>
> OpenSIPS Bootcamp 2018
>
> http://opensips.org/training/OpenSIPS_Bootcamp_2018/
>
> On 10/24/2018 12:56 AM, Ben Newlin wrote:
>
> Hi,
>
> We have implemented TCP recently and are performing
> TCP<->UDP translation on one of our proxy types. This
> proxy only exists for that purpose; there are no DB
> queries, REST calls, or anything like that. It is designed
> to be very fast and high throughput.
>
> Recently we have found that when the remote endpoint of a
> TCP connection is lost, i.e. the server goes down, while
> under moderate load OpenSIPS quickly reaches 100% CPU and
> becomes unresponsive. When this occurs, the “top” command
> shows that between 30-90% CPU is in System (kernel) space,
> and each OpenSIPS TCP process shows many times the normal
> CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.
>
> I obtained as much information as I could using ps,
> strace, and gdb here: https://pastebin.com/JP3DnCqs
> <https://pastebin.com/JP3DnCqs>. We can reproduce the
> failure consistently by removing a server during call traffic.
>
> A few things I noticed:
>
> * The number of running threads reported by OpenSIPS
> doesn’t align with our configuration, copied here:
>
> ####### Global Parameters #########
>
> children=32
>
> #// Allow 503 to pass back to Control
>
> disable_503_translation=yes
>
> #// Even though we are not receiving HEP,
>
> #// this listener is required by OpenSIPS
>
> #// in order to use the proto_hep module. :/
>
> listen=hep_tcp:10.32.40.245:9061 use_children 1
>
> #// Configure the listeners
>
> listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
>
> listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX
>
> #// Transaction Module
>
> loadmodule "tm.so"
>
> modparam("tm", "restart_fr_on_each_reply", 0)
>
> modparam("tm", "timer_partitions", 8)
>
> modparam("tm", "onreply_avp_mode", 1)
>
> modparam("tm", "wt_timer", 10)
>
> According to the documentation if “tcp_children” is not
> set then the value of “children” will be used [1], but we
> have set “children” to 32 and only have the default 8 TCP
> processes. Also we appear to only have 1 timer process,
> although we have set the number of timer partitions to 8.
>
> * The server that is terminated was using TCP
> connections exclusively, but all of the CPU seems to
> be in the UDP threads. The one I looked at appeared to
> be handling a CANCEL to one of the calls that was
> active and was attempting to send it out via TCP. I’m
> not sure why it would be trying to relay the CANCEL as
> no 100 Trying had been received from the server. I
> have noticed that in 2.x OpenSIPS will now send
> CANCELs for transactions even when 100 Trying was not
> received. Is that intentional? RFC 3261 states that no
> CANCEL should be sent unless a provisional response
> has been received.
>
> Any assistance with this would be appreciated.
>
> [1] -
> http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66
>
> Ben Newlin
>
>
>
>
>
>
> _______________________________________________
>
> Users mailing list
>
> Users at lists.opensips.org <mailto:Users at lists.opensips.org>
>
> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>
>
>
>
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.opensips.org/pipermail/users/attachments/20181101/baea014c/attachment-0001.html>
More information about the Users
mailing list