[OpenSIPS-Users] crashing in 2.2.2
Richard Robson
rrobson at greenlightcrm.com
Thu Mar 9 06:05:21 EST 2017
Hi Bogdan,
Yup its under load.
There aren't any errors apart from the WARNING:core:utimer_ticker:
utimer task <tm-utimer> already scheduled for
The live script doesn't have the failover on the 404/50X responses and
copes with the load no problem and never crashes.
It's only when I try and do the failover with the use_next_gw and the
load ramps up to about 1/4 of the normal load. So in testing a made a
few calls and its works fine, but when I put it live its starts crashing
at about 09:15 when the users start coming on line, but our load is
highest after 11:00am so there is load, but not a large amount. It
starts crashing at around 50 concurrent calls and maybe 5 or 6 cps.
I can reproduce it on our test server, but it will disrupt traffic, so
i'd rather do that out of hours, but if I sipp to an invalid number I
can reproduce it but all the cores look similar to me too and have the
reply message in them but i'm no expert at decoding the back traces.
I can get some more Cores for you but I suspect that they will all be
similar. I would have thought though, that would make the debuging easier?
Let me know id you need anything else from me.
Regards,
Richard
On 07/03/2017 21:10, Bogdan-Andrei Iancu wrote:
> Hi Richard,
>
> Sorry for the slow reply - is this crash happing only under some
> ++load ? do you see any errors from OpenSIPS prior to the crash ?
>
> I noticed that the backtraces in the corefiles are similar - how easy
> is for you to reproduce it ?
>
> Regards,
>
> Bogdan-Andrei Iancu
> OpenSIPS Founder and Developer
> http://www.opensips-solutions.com
>
> OpenSIPS Summit May 2017 Amsterdam
> http://www.opensips.org/events/Summit-2017Amsterdam.html
>
> On 03/07/2017 12:28 PM, Richard Robson wrote:
>> Hi,
>>
>> I've gone over the script and as far as I can see its working as
>> expected until the traffic remps up and then opensips crashes.
>>
>> cores:
>> http://pastebin.com/CgN0h40K
>> http://pastebin.com/ay5TS8zD
>> http://pastebin.com/PGn3AqmU
>>
>> Regards,
>>
>> Richard
>>
>> On 06/03/2017 12:14, Richard Robson wrote:
>>> Hi<
>>>
>>> I've tested this on the latest 2.2.3 with the same results.
>>>
>>> http://pastebin.com/Uixb3v8G
>>>
>>> there were a few of these in the logsd too just before the crash:
>>> Mar 5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079270 ms), it may overlap..
>>> Mar 5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079360 ms), it may overlap..
>>> Mar 5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079460 ms), it may overlap..
>>> Mar 5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079560 ms), it may overlap..
>>> Mar 5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079660 ms), it may overlap..
>>> Mar 5 22:02:28 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079760 ms), it may overlap..
>>>
>>>
>>> Regards,
>>>
>>> Richard
>>>
>>>
>>>
>>> On 03/03/2017 13:15, Richard Robson wrote:
>>>> More cores
>>>>
>>>> http://pastebin.com/MXW2VBhi
>>>> http://pastebin.com/T7JFAP2U
>>>> http://pastebin.com/u44aaVpWquit
>>>> http://pastebin.com/SFKKcGxE
>>>> http://pastebin.com/dwSgMsJi
>>>> http://pastebin.com/9HdGLm96
>>>>
>>>> I've put 2.2.3 on the dev box now and will try to replicate on that
>>>> box, but its difficult to replicate the traffic artificially. I'll
>>>> try to replicate the fault on the dev box over the weekend. I cant
>>>> do it on the live gateways because it will affect customer traffic.
>>>>
>>>> Regards,
>>>>
>>>> Richard
>>>>
>>>>
>>>> On 03/03/2017 11:28, Richard Robson wrote:
>>>>> I've revisited the gateway failover mechanism I had developed in
>>>>> order to re route calls to the next gateway on 500's due to
>>>>> capacity on the gateways we are using.
>>>>>
>>>>> we have 3 gateways from one carrier and one from another. The 3
>>>>> have 4 cps and will return a 503 or 500 if we breach this. The
>>>>> single gateway from the other carrier has plenty of capacity and
>>>>> should not be a problem so we want to catch this . and route to
>>>>> the next gateway.
>>>>>
>>>>> We are counting the CPS and channel limits and are routing to the
>>>>> next gateway if we exceed the limit set, but There are still
>>>>> occasions where a 5XX is generated, which results in a rejected call.
>>>>>
>>>>>
>>>>> We want to stop these rejected calls and therefore want to
>>>>> implement the failover mechanism for the 5XX responses. For 6
>>>>> months we have been failing over if we think the counts are to
>>>>> high on any one gateway without a problem. But when I implement a
>>>>> failover on a 5XX response opensips starts crashing.
>>>>>
>>>>> It's difficult to generate artificial traffic to mimic the real
>>>>> traffic, but I've not had a problem with the script in testing.
>>>>> Last night I rolled out the new script but by 09:15 this morning
>>>>> opensips started crashing 10 times in 5 minutes. This was as the
>>>>> traffic ramped up. I rolled back the script and it restarted OK
>>>>> and has not crashed since. Therefore the Failover Mechanism in the
>>>>> script is where the crash is happening
>>>>>
>>>>> Core dump: http://pastebin.com/CqnESCm4
>>>>>
>>>>> I'll add more dumps later
>>>>>
>>>>> Regards,
>>>>>
>>>>> Richard
>>>>>
>>>>>
>>>>> this is the failure route catching the 5XX
>>>>>
>>>>> failure_route[dr_fo] {
>>>>> xlog (" [dr] Recieved reply to method $rm From: $fd, $fn,
>>>>> $ft, $fu, $fU, $si, $sp, To: $ru");
>>>>> if (t_was_cancelled()) {
>>>>> xlog("[dr]call cancelled by internal caller");
>>>>> rtpengine_manage();
>>>>> do_accounting("db", "cdr|missed");
>>>>> exit;
>>>>> }
>>>>>
>>>>> if ( t_check_status("[54]03")) {
>>>>> route(relay_failover);
>>>>> }
>>>>> if ( t_check_status("500")) {
>>>>> route(relay_failover);
>>>>> }
>>>>>
>>>>> do_accounting("db", "cdr|missed");
>>>>> rtpengine_manage();
>>>>> exit;
>>>>> }
>>>>>
>>>>> This is the route taken on the failure
>>>>>
>>>>>
>>>>> route[relay_failover]{
>>>>>
>>>>> if (use_next_gw()) {
>>>>> xlog("[relay_failover-route] Selected Gateway is
>>>>> $rd");
>>>>> $avp(trunkratelimit)=$(avp(attrs){s.select,0,:});
>>>>> $avp(trunkchannellimit)=$(avp(attrs){s.select,1,:});
>>>>>
>>>>> ####### check channel limit ######
>>>>> get_profile_size("outbound","$rd","$var(size)");
>>>>> xlog("[relay_failover-route] Selected Gateway is
>>>>> $rd var(size) = $var(size)");
>>>>> xlog("[relay_failover-route] Selected Gateway is
>>>>> $rd avp(trunkcalllimit) = $avp(trunkchannellimit)");
>>>>> xlog("[relay_failover-route] Selected Gateway is
>>>>> $rd result = ( $var(size) > $avp(trunkchannellimit))");
>>>>> if ( $(var(size){s.int}) >
>>>>> $(avp(trunkchannellimit){s.int})) {
>>>>> xlog("[relay_failover-route] Trunk $rd
>>>>> exceeded $avp(trunkchannellimit) concurrent calls $var(size)");
>>>>> route(relay_failover);
>>>>> }
>>>>> } else {
>>>>> send_reply("503", "Gateways Exhusted");
>>>>> exit;
>>>>> }
>>>>>
>>>>> ##### We need to check Rate Limiting #######
>>>>> if (!rl_check("$rd", "$(avp(trunkratelimit){s.int})",
>>>>> "TAILDROP")) { # Check Rate limit $avp needs changing
>>>>> rl_dec_count("$rd"); # decrement the counter since
>>>>> we've not "used" one
>>>>> xlog("[ratelimiter-route] [Max CPS:
>>>>> $(avp(trunkratelimit){s.int}) Current CPS: $rl_count($rd)] Call
>>>>> to: $rU from: $fU CPS exceeded, delaying");
>>>>> $avp(initial_time)=($Ts*1000)+($Tsm/1000);
>>>>> async(usleep("200000"),relay_failover_delay);
>>>>> xlog ("Should not get here!!!! after async requst");
>>>>> } else {
>>>>> xlog ("[relay_outbound-route] [Max CPS:
>>>>> $avp(trunkratelimit) Current CPS: $rl_count($rd)] Call to: $rU
>>>>> from: $fU not ratelimited");
>>>>> }
>>>>>
>>>>> t_on_failure("dr_fo");
>>>>> do_accounting("db", "cdr|missed");
>>>>> rtpengine_manage();
>>>>> if (!t_relay()) {
>>>>> xlog("[relay-route] ERROR: Unable to relay");
>>>>> send_reply("500","Internal Error");
>>>>> exit;
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
--
Richard Robson
Greenlight Support
01382 843843
support at greenlightcrm.com
More information about the Users
mailing list