[OpenSIPS-Users] Fine tuning high CPS and mysql queries

Fri Jun 5 00:06:00 EST 2020

> A) Is the LRN database located locally on the OpenSIPs box or is it remote?

It's remote: we are using an F5 BIG-IP to proxy a pool of database
servers. OpenSIPS is showing two connection-related errors:

Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:
ERROR:db_mysql:db_mysql_connect: driver error(2013): Lost connection
to MySQL server at 'reading authorization packet', system error: 110
Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:
ERROR:db_mysql:db_mysql_new_connection: initial connect failed
Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:
ERROR:core:db_init_async: failed to open new DB connection on
mysql://XXXX:XXXX@10.0.5.38:0/
Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:
INFO:db_mysql:db_mysql_async_raw_query: Failed to open new connection
(current: 1 + 8). Running in sync mode!
Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:
INFO:db_mysql:switch_state_to_disconnected: disconnect event for
0x7f8903f16d10
Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:
INFO:db_mysql:reset_all_statements: resetting all statements on
connection: (0x7f8903f16bb0) 0x7f8903f16d10
Jun  4 10:41:48 TC-521 /usr/sbin/opensips[12318]:
INFO:db_mysql:connect_with_retry: re-connected successful for
0x7f8903f16d10

Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:
ERROR:db_mysql:db_mysql_connect: driver error(2003): Can't connect to
MySQL server on '10.0.5.38' (110)
Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:
ERROR:db_mysql:db_mysql_new_connection: initial connect failed
Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:
ERROR:core:db_init_async: failed to open new DB connection on
mysql://XXXX:XXXX@10.0.5.38:0/
Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:
INFO:db_mysql:db_mysql_async_raw_query: Failed to open new connection
(current: 1 + 10). Running in sync mode!
Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:
INFO:db_mysql:switch_state_to_disconnected: disconnect event for
0x7f8903f16d10
Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:
INFO:db_mysql:reset_all_statements: resetting all statements on
connection: (0x7f8903f16bb0) 0x7f8903f16d10
Jun  4 10:44:29 TC-521 /usr/sbin/opensips[12342]:
INFO:db_mysql:connect_with_retry: re-connected successful for
0x7f8903f16d10
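
To see how often each failure occurs over time, the driver error codes
can be tallied from the log. Below is a minimal Python sketch, not
anything OpenSIPS ships. It assumes the log text is available as a
string; the function name and the sample text are only for
illustration. The two labels are the MySQL client error names for
2013 and 2003. Both errors above carry system error 110, which on
Linux is ETIMEDOUT.

```python
import re
from collections import Counter

# MySQL client error names for the two codes seen in the log above.
MYSQL_CLIENT_ERRORS = {
    2003: "CR_CONN_HOST_ERROR",  # could not connect to the server
    2013: "CR_SERVER_LOST",      # connection lost during the exchange
}

# Matches the "driver error(NNNN)" fragment printed by db_mysql.
DRIVER_ERROR = re.compile(r"driver error\((\d+)\)")


def tally_driver_errors(log_text: str) -> Counter:
    """Count occurrences of each driver error code in the log text."""
    return Counter(int(code) for code in DRIVER_ERROR.findall(log_text))


# Sample text copied from the two error bursts above.
sample = (
    "ERROR:db_mysql:db_mysql_connect: driver error(2013): Lost connection "
    "to MySQL server at 'reading authorization packet', system error: 110\n"
    "ERROR:db_mysql:db_mysql_connect: driver error(2003): Can't connect to "
    "MySQL server on '10.0.5.38' (110)\n"
)

for code, count in sorted(tally_driver_errors(sample).items()):
    print(code, MYSQL_CLIENT_ERRORS.get(code, "unknown"), count)
```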

MariaDB is also showing an error from its perspective:

2020-06-04 23:40:27 64783 [Warning] Aborted connection 64783 to db:
'unconnected' user: 'anonymous' host: '8.38.42.13' (Got timeout
reading communication packets)

> B) Have you tried only doing sync database queries? Async introduces some overhead, and I'm not sure if it causes extra database connections to be created. When using sync there is a connection per child process that stays up.

Using synchronous mode appeared to be causing context switching issues
under heavy load. We specifically moved to async for this reason and
that appeared to reduce the CPU load dramatically. From the docs:

"Using the asynchronous, "suspend-resume" logic instead of forking a
large number of processes in order to scale also has the advantage of
optimizing system resource usage, increasing its maximal throughput.
By requiring less processes to complete the same amount of work in the
same amount of time, process context switching is minimized and
overall CPU usage is improved. Less processes will also eat up less
system memory."

I've been tweaking each of the configuration settings I've mentioned,
but without any clear path forward. Would 3.x provide any solutions?

Is it possible to have too many children or timer partitions, and
starve OpenSIPS with context switches? Would that cause connection
issues?

> C) Does the database have enough memory to contain the LRN and DNC datasets fully in memory? The extra latency for the non-cache hits sent to the database may stack up if the database has to hit disk.

The DB reports query response times of about 0.001s and doesn't show
any sign of strain. I'm not personally familiar with the TokuDB
engine, but I'm led to believe the entire dataset is in memory. I have
two DBAs triple-checking things. It's possible we're hitting a max
connections or open files limit that's set too low. Our peak hours
sometimes include spikes as well.

> D) How many child processes are you using now? If you are hitting 100% you may need to increase them.

Only one hits 100% initially, then the rest topple over after that.
This seems to be related to the intermittent database connection
errors. We'll see what raising the max connections and ulimits on the
server does. I've also backed off on children and increased the async
connection pool size so that the total maximum number of connections
stays the same. Presumably this will reduce context switches and timer
delays.
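
The trade-off between children and pool size is simple arithmetic.
Here is a rough Python sketch of it. It assumes each worker holds one
sync connection plus up to N async connections, which is how I read
the "(current: 1 + 8)" in the log, though I have not confirmed that.
The child counts and function names are hypothetical, not our
production values.

```python
def max_db_connections(children: int, async_pool_size: int) -> int:
    """Upper bound on DB connections, assuming each worker holds one
    sync connection plus up to async_pool_size async connections."""
    return children * (1 + async_pool_size)


def pool_size_for_budget(budget: int, children: int) -> int:
    """Largest per-worker async pool size that keeps the total number
    of connections within the given budget."""
    if children <= 0:
        raise ValueError("children must be positive")
    return max(budget // children - 1, 0)


# Hypothetical example: halve the children, keep the same ceiling.
before = max_db_connections(children=32, async_pool_size=8)
new_pool = pool_size_for_budget(before, children=16)
after = max_db_connections(children=16, async_pool_size=new_pool)
print(before, new_pool, after)
```

Whatever the real numbers are, the resulting ceiling has to fit under
both the database's max connections setting and the open files limit.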

> E) Are your memcached processes using heavy cpu? If you are caching multiple lists, I've found it helps to use unique memcached instance per list.

All of the various SIP dips call the same DB stored procedure, which
returns many fields in its response. Those fields are cached as a CSV
string, so any cached dip can be used by any other kind of dip. The
same call is likely to use multiple dips, so we should only hit the DB
once per call regardless of how many different dips we apply.
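
The caching scheme can be sketched in Python as follows. This is an
illustration only, not our actual OpenSIPS script: the field names,
the class and the fake stored procedure are all hypothetical, and a
plain dict stands in for memcached.

```python
from typing import Callable, Dict

# Hypothetical field order of the stored procedure's response.
FIELDS = ("lrn", "dnc", "ocn", "state")


class DipCache:
    """Caches the whole CSV row per number so every dip type shares it."""

    def __init__(self, query_db: Callable[[str], str]) -> None:
        self._query_db = query_db
        self._store: Dict[str, str] = {}  # stands in for memcached
        self.db_hits = 0

    def lookup(self, number: str, field: str) -> str:
        row = self._store.get(number)
        if row is None:
            # Cache miss: one DB round trip fetches every field at once.
            row = self._query_db(number)
            self.db_hits += 1
            self._store[number] = row
        return dict(zip(FIELDS, row.split(",")))[field]


def fake_stored_procedure(number: str) -> str:
    """Hypothetical stand-in for the real stored procedure."""
    return "15555550100,0,1234,TX"


cache = DipCache(fake_stored_procedure)
lrn = cache.lookup("15555550199", "lrn")  # miss, goes to the DB
dnc = cache.lookup("15555550199", "dnc")  # hit, served from the cache
print(lrn, dnc, cache.db_hits)
```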

> F) Look for memory related log messages. If the memory starts getting exhausted you will see defrag messages. This will chew up available computation cycles.

Both OpenSIPS servers and the database have plenty of free memory. How
do I know how much shared and process memory to allocate? I see
warnings about the reactor size shrinking to a percentage of the
process memory, but have no idea what that implies.