<div dir="ltr"><div>A) Is the LRN database located locally on the OpenSIPS box or is it remote?</div><div></div><div>B) Have you tried using only sync database queries? Async introduces some overhead, and I'm not sure whether it causes extra database connections to be created. With sync, each child process holds one connection that stays up.</div><div>C) Does the database server have enough memory to hold the LRN and DNC datasets fully in memory? The extra latency for cache misses sent to the database may stack up if the database has to hit disk.</div><div>D) How many child processes are you using now? If they are hitting 100% CPU you may need to increase the number of children.<br></div><div>E) Are your memcached processes using heavy CPU? If you are caching multiple lists, I've found it helps to use a separate memcached instance per list.<br></div><div><div>F) Look for memory-related log messages. If memory starts getting exhausted you will see defrag messages, and the defragmentation will chew up CPU cycles.<br></div><div><br></div><div>- Jon Abrams<br></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 4, 2020 at 2:17 PM Calvin Ellison &lt;<a href="mailto:calvin.ellison@voxox.com">calvin.ellison@voxox.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">The scenario is INVITE -> MySQL query -> non-200 final response. No<br>
calls are connected here, only dipping things like LRN, Do Not Call,<br>
and Wireless/Landline. A similar service runs on a second port,<br>
specific to a different kind of traffic and dip. We're using async<br>
avp_db_query and memcached, with about 3:1 cache hits.<br>
<br>
Our target is up to 10,000 CPS across two opensips servers, which are<br>
dual-CPU Xeon E5620 with 48G RAM. Both run memcached, and each<br>
server uses both memcached instances to share a distributed cache thanks<br>
to this:<br>
'modparam("cachedb_memcached","cachedb_url","memcached:lrn://lrn-d,lrn-e/")'.<br>
At a glance there are over 200 million total cached items, distributed<br>
nearly equally.<br>
<br>
The issue is that individual child processes start getting stuck at<br>
100% CPU. Logs indicate connection failures to the MySQL database<br>
causing children to run in sync mode, and there are warnings about<br>
delayed timer jobs tm-timer and blcore-expire. Eventually, the service<br>
becomes unresponsive. Restarting opensips restores service and the<br>
children return to single-digit CPU utilization, but eventually,<br>
children get stuck again.<br>
<br>
I'm not certain if the issue is on the database server, or if the<br>
opensips servers are overloaded, or if the config is just not right<br>
yet.<br>
<br>
Is there an established method for fine-tuning these things?<br>
shared memory<br>
process memory<br>
children<br>
db_max_async_connections<br>
listen=... use_children<br>
modparam("tm", "timer_partitions", ?)<br>
<br>
What else is worth considering?<br>
Does a child ever return to async mode after running in sync mode?<br>
How do I know when my servers have reached their limit?<br>
opensips.cfg is available on request.<br>
<br>
version: opensips 2.4.7 (x86_64/linux)<br>
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC,<br>
F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT<br>
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16,<br>
MAX_URI_SIZE 1024, BUF_SIZE 65535<br>
poll method support: poll, epoll, sigio_rt, select.<br>
git revision: 9e1fcc915<br>
main.c compiled on with gcc 7<br>
<br>
*re-built using dpkg-buildpackage including the patch to support DB<br>
floating point types:<br>
<a href="https://opensips.org/pipermail/users/2020-March/042528.html" rel="noreferrer" target="_blank">https://opensips.org/pipermail/users/2020-March/042528.html</a><br>
<br>
$ lsb_release -d<br>
Description: Ubuntu 18.04.4 LTS<br>
<br>
$ uname -a<br>
Linux TC-521 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:09:48 UTC<br>
2020 x86_64 x86_64 x86_64 GNU/Linux<br>
<br>
$ free -mw<br>
total used free shared buffers cache available<br>
Mem: 48281 1085 337 87 1729 45128 46551<br>
<br>
$ lscpu<br>
Architecture: x86_64<br>
CPU op-mode(s): 32-bit, 64-bit<br>
Byte Order: Little Endian<br>
CPU(s): 16<br>
On-line CPU(s) list: 0-15<br>
Thread(s) per core: 2<br>
Core(s) per socket: 4<br>
Socket(s): 2<br>
NUMA node(s): 2<br>
Vendor ID: GenuineIntel<br>
CPU family: 6<br>
Model: 44<br>
Model name: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz<br>
Stepping: 2<br>
CPU MHz: 2527.029<br>
BogoMIPS: 4788.05<br>
Virtualization: VT-x<br>
L1d cache: 32K<br>
L1i cache: 32K<br>
L2 cache: 256K<br>
L3 cache: 12288K<br>
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14<br>
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15<br>
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr<br>
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe<br>
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts<br>
rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq<br>
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca<br>
sse4_1 sse4_2 popcnt aes lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow<br>
vnmi flexpriority ept vpid dtherm ida arat flush_l1d<br>
<br>
Regards,<br>
<br>
Calvin Ellison<br>
Senior Voice Operations Engineer<br>
<a href="mailto:calvin.ellison@voxox.com" target="_blank">calvin.ellison@voxox.com</a><br>
<br>
</blockquote></div>
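<div>For B) and D), a back-of-the-envelope estimate can help decide whether the child count is even in the right range. The sketch below is only an illustration: it takes the 10,000 CPS target, the two servers, and the roughly 3:1 hit ratio from the original message, assumes the load splits evenly between servers, and uses made-up database latencies (measure your own). The function names are hypothetical, and Little's law (busy workers = arrival rate x time per request) is my own choice of model for children blocked on sync queries.</div>

```python
import math


def db_query_rate(total_cps, hits_per_miss, servers):
    """Cache-miss queries per second reaching MySQL from one server.

    Assumes traffic is split evenly across servers and that a ratio of
    N:1 means N cache hits for every miss.
    """
    miss_fraction = 1.0 / (hits_per_miss + 1.0)
    return total_cps * miss_fraction / servers


def busy_children(query_rate, db_latency_s):
    """Little's law: average number of children blocked on a sync query."""
    return math.ceil(query_rate * db_latency_s)


per_server = db_query_rate(total_cps=10000, hits_per_miss=3, servers=2)
print("DB queries/s per server:", per_server)

# Hypothetical latencies in milliseconds; replace with measured values.
for latency_ms in (1, 5, 20):
    blocked = busy_children(per_server, latency_ms / 1000.0)
    print(latency_ms, "ms per query ties up about", blocked, "children")
```

<div>This counts only the children waiting on the database on average. Children also need time for SIP parsing and the memcached lookups, and bursts will exceed the average, so treat the result as a floor rather than a target.</div>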