<div><div dir="auto">Maybe you are hitting the max connections? How many connections are there when it starts to show those errors?</div></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 5 Jun 2020 at 01:06, Calvin Ellison <<a href="mailto:calvin.ellison@voxox.com">calvin.ellison@voxox.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">> A) Is the LRN database located locally on the OpenSIPs box or is it remote?<br>

<br>

We are using an F5 BIG-IP to proxy a pool of database servers.<br>

OpenSIPS is showing two connection-related errors:<br>

<br>

Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:<br>

ERROR:db_mysql:db_mysql_connect: driver error(2013): Lost connection<br>

to MySQL server at 'reading authorization packet', system error: 110<br>

Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:<br>

ERROR:db_mysql:db_mysql_new_connection: initial connect failed<br>

Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:<br>

ERROR:core:db_init_async: failed to open new DB connection on<br>

mysql://XXXX:XXXX@10.0.5.38:0/<br>

Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:<br>

INFO:db_mysql:db_mysql_async_raw_query: Failed to open new connection<br>

(current: 1 + 8). Running in sync mode!<br>

Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:<br>

INFO:db_mysql:switch_state_to_disconnected: disconnect event for<br>

0x7f8903f16d10<br>

Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:<br>

INFO:db_mysql:reset_all_statements: resetting all statements on<br>

connection: (0x7f8903f16bb0) 0x7f8903f16d10<br>

Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:<br>

INFO:db_mysql:connect_with_retry: re-connected successful for<br>

0x7f8903f16d10<br>

<br>

Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:<br>

ERROR:db_mysql:db_mysql_connect: driver error(2003): Can't connect to<br>

MySQL server on '10.0.5.38' (110)<br>

Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:<br>

ERROR:db_mysql:db_mysql_new_connection: initial connect failed<br>

Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:<br>

ERROR:core:db_init_async: failed to open new DB connection on<br>

mysql://XXXX:XXXX@10.0.5.38:0/<br>

Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:<br>

INFO:db_mysql:db_mysql_async_raw_query: Failed to open new connection<br>

(current: 1 + 10). Running in sync mode!<br>

Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:<br>

INFO:db_mysql:switch_state_to_disconnected: disconnect event for<br>

0x7f8903f16d10<br>

Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:<br>

INFO:db_mysql:reset_all_statements: resetting all statements on<br>

connection: (0x7f8903f16bb0) 0x7f8903f16d10<br>

Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:<br>

INFO:db_mysql:connect_with_retry: re-connected successful for<br>

0x7f8903f16d10<br>

<br>

MariaDB is also showing an error from its perspective:<br>

<br>

2020-06-04 23:40:27 64783 [Warning] Aborted connection 64783 to db:<br>

'unconnected' user: 'anonymous' host: '8.38.42.13' (Got timeout<br>

reading communication packets)<br>

<br>

> B) Have you tried only doing sync database queries? Async introduces some overhead, and I'm not sure if it causes extra database connections to be created. When using sync there is a connection per child process that stays up.<br>

<br>

Using synchronous mode appeared to be causing context switching issues<br>

under heavy load. We specifically moved to async for this reason and<br>

that appeared to reduce the CPU load dramatically. From the docs:<br>

<br>

"Using the asynchronous, "suspend-resume" logic instead of forking a<br>

large number of processes in order to scale also has the advantage of<br>

optimizing system resource usage, increasing its maximal throughput.<br>

By requiring less processes to complete the same amount of work in the<br>

same amount of time, process context switching is minimized and<br>

overall CPU usage is improved. Less processes will also eat up less<br>

system memory."<br>

<br>

I've been tweaking each of the configuration settings I've mentioned,<br>

but without any clear path forward. Would 3.x provide any solutions?<br>

<br>

Is it possible to have too many children or timer partitions, and<br>

starve opensips with context switches? Would that cause connection<br>

issues?<br>

<br>

> C) Does the database have enough memory to contain the LRN and DNC datasets fully in memory? The extra latency for the non-cache hits sent to the database may stack up if the database has to hit disk.<br>

<br>

The DB reports query response times of about 0.001s and doesn't show any sign<br>

of strain. I'm not personally familiar with the TokuDB engine, but I'm<br>

led to believe the entire dataset is in memory. I have two DBAs<br>

triple-checking things. It's possible we're hitting a max connections or open<br>

files limit that's set too low. Sometimes our peak hours include<br>

spikes as well.<br>

<br>

> D) How many child processes are you using now? If you are hitting 100% you may need to increase them.<br>

<br>

Only one hits 100% initially, then they topple over after that. This<br>

seems to be related to the intermittent database connection errors.<br>

We'll see what raising the max connections and ulimits on the server<br>

does. I've also backed off on children and increased the async<br>

connection pool size to result in the same number of total maximum<br>

connections. Presumably this will reduce context switches and timer<br>

delays.<br>
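<br>

A rough sketch of that arithmetic, assuming each child holds one sync<br>

connection plus up to a pool's worth of async connections (one reading of the<br>

"current: 1 + 8" log line, not confirmed behavior). The function name and the<br>

numbers below are hypothetical, not the real settings:<br>

```python
def max_db_connections(children, async_pool_size, sync_per_child=1):
    """Upper bound on DB connections if every child fills its async pool.

    Assumption: each child keeps `sync_per_child` sync connection(s) and
    opens at most `async_pool_size` extra async connections.
    """
    return children * (sync_per_child + async_pool_size)


# Hypothetical example: halving the children while growing the pool
# keeps the worst-case connection count the same.
before = max_db_connections(children=32, async_pool_size=8)
after = max_db_connections(children=16, async_pool_size=17)
print(before, after)
```

If that model holds, the result is the figure to compare against the<br>

database's max connections limit and against anything the F5 enforces.<br>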

<br>

> E) Are your memcached processes using heavy cpu? If you are caching multiple lists, I've found it helps to use unique memcached instance per list.<br>

<br>

All of the various SIP dips are the same db stored procedure with many<br>

fields in the response. Those fields are cached as a CSV string, so<br>

any cached dip can be used by any other kind of dip. The same call is<br>

likely to use multiple dips, so we should only hit the DB once per<br>

call regardless of how many different dips we apply.<br>

<br>

> F) Look for memory related log messages. If the memory starts getting exhausted you will see defrag messages. This will chew up available computation cycles.<br>

<br>

Both opensips servers and the database have plenty of free memory. How<br>

do I know how much shared and process memory to use? I see warnings<br>

about the reactor size shrinking to a percentage of the process memory<br>

but have no idea what that implies.<br>

<br>


</blockquote></div></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Regards,</div><div><br></div>David Villasmil<div>email: <a href="mailto:david.villasmil.work@gmail.com" target="_blank">david.villasmil.work@gmail.com</a></div><div>phone: +34669448337</div></div></div>