[OpenSIPS-Users] Fine tuning high CPS and mysql queries

Tue Jun 16 21:33:35 EST 2020

Hi Calvin,

I'm really glad you were able to get things sorted out, and I apologise 
if the thread got testy. I do appreciate your follow-up, which I think 
will benefit readers looking for similar answers.

A few inline thoughts:

On 6/15/20 4:04 PM, Calvin Ellison wrote:

> I attempted to reproduce the original breakdown around 3000 CPS using 
> the default 212992 byte receive buffer and could not, which tells me I 
> broke a cardinal rule of load testing and changed more than one thing at 
> a time. Also, don't do load testing when tired. I suspect that I had 
> also made a change to the sipp scenario recv/sched loops, or I had 
> unknowingly broken something while checking out the tuned package.

In several decades of doing backend systems programming, I've not found 
tuning Linux kernel defaults to be generally fruitful for improving 
throughput to any non-trivial degree. The defaults are sensible for 
almost all use-cases, all the more so given modern hardware and 
multi-core processors and the rest.

This is in sharp contrast to the conservative defaults some applications 
(e.g. Apache, MySQL) ship with on many distributions. I think the idea 
behind such conservative settings is to constrain the application so 
that in the event of a DDoS or similar event, it does not take over all 
available hardware resources, which would impede response and resolution.

But on the kernel settings, the only impactful changes I have ever seen 
are minor adjustments to slightly improve very niche server load 
problems of a rather global nature (e.g. related to I/O scheduling, NIC 
issues, storage, etc). This wasn't that kind of scenario.

In most respects, it just follows from first principles and Occam's 
Razor, IMHO. There's no reason for kernels to ship tuned so 
conservatively as to deny average users something on the order of 
_several times_ more performance from their hardware, and any effort to 
do that would be readily apparent and, it stands to reason, staunchly 
opposed. It therefore also stands to reason that there isn't some silver 
bullet or magic setting that unlocks multiplicative performance gains, if 
only one knows the secret sauce or thinks to tweak it. If such a tweak 
existed, it would long since have been folded into the defaults, absent a 
clear and persuasive basis for such an artificial and contrived limit to 
exist. I cannot conceive of what such a basis would look like, and I'd 
like to think that's not just a failure of imagination.

Or in other words, it goes with the commonsensical intuition that "if it 
seems too good to be true, it is." The fundamentals of the application, 
and to a lesser but still very significant extent the hardware (in terms 
of its relative homogeneity nowadays), determine the overwhelming 
majority of the performance characteristics, and matter far more than 
anything one can tweak.

> I deeply appreciate Alex's insistence that I was wrong and to keep 
> digging. I am happy to retract my claim regarding "absolutely terrible 
> sysctl defaults". Using synchronous/blocking DB queries, the 8-core 
> server reached 14,000 CPS, at which point I declared it fixed and went 
> to bed. It could probably go higher: there's only one DB query with a 
> <10ms response time, Memcache for the query response, and some logic to 
> decide how to respond. There's only a single non-200 final response, so 
> it's probably as minimalist as it gets.

I would agree that with such a minimal call processing loop, given a 
generous number of CPU cores you shouldn't be terribly limited.

> If anyone else is trying to tune their setup, I think Alex's advice to 
> "not run more than 2 * (CPU threads) [children]" is the best place to 
> start. I had inherited this project from someone else's work under 
> version 1.11 and they had used 128 children. They were using remote DB 
> servers with much higher latency than the local DBs we have today, so 
> that might have been the reason. Or they were just wrong to begin with.

Aye. Barring a workload consisting of exceptionally high-latency 
blocking service queries, there's really not a valid reason to ever have 
that many child processes, and even if one does have such a workload, 
there are plenty of reasons to attack the fundamental latency problem 
rather than working around it with more child processes.
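
To put rough numbers on that, here is a back-of-the-envelope sketch in 
Python using Little's Law (workers kept busy = arrival rate x time each 
call holds a worker). The function names and the example figures (a 1 ms 
local query, a 100 ms remote one, 1,000 CPS) are illustrative 
assumptions of mine, not measurements from this thread; only the 14,000 
CPS, the 8 cores and the 2 * (CPU threads) rule come from the discussion.

```python
def busy_workers(cps, hold_ms):
    """Little's Law: average workers occupied = arrival rate * hold time.

    cps     -- calls per second arriving
    hold_ms -- milliseconds each call keeps one worker blocked/busy
    Integer ceiling division avoids floating-point rounding surprises.
    """
    return (cps * hold_ms + 999) // 1000


def children_cap(cpu_threads):
    """The rule of thumb quoted above: no more than 2 * (CPU threads)."""
    return 2 * cpu_threads


if __name__ == "__main__":
    cap = children_cap(8)  # the 8-core server from the thread
    # Hypothetical workloads, purely for illustration.
    scenarios = [
        ("fast local query", 14000, 1),
        ("slow remote query", 1000, 100),
    ]
    for label, cps, hold_ms in scenarios:
        need = busy_workers(cps, hold_ms)
        verdict = "fits under" if need <= cap else "exceeds"
        print(f"{label}: ~{need} busy workers, {verdict} the cap of {cap}")
```

With those made-up figures, the fast case keeps about 14 workers busy, 
under the cap of 16, while the slow case wants about 100. That is the 
only kind of situation in which something like 128 children starts to 
look explicable, and even then the better fix is usually the latency.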

With the proviso that I am not an expert in modern-day OpenSIPS 
concurrency innards, the common OpenSER heritage prescribes a preforked 
worker process pool with SysV shared memory for inter-process 
communication (IPC). Like any shared memory space, this requires mutex 
locking so that multiple threads (in this case, processes) don't 
access/modify the same data structures at the same time* in ways that 
step on one another. Because every process holds and waits on these 
locks, this model works well when there aren't very many processes and 
their path to execution is mostly clear and not especially volatile, and 
when as little data is shared as possible. If you add a lot of 
processes, then there's a lot of fighting among them for internal locks 
and for CPU time, even if the execution cycle per se is fairly 
efficient. If you have 16 cores and 128 child processes, those processes 
are going to be fighting for those cores if they execute efficiently, 
while suffering from some amount of internal concurrency gridlock if 
they are not executing efficiently. Thus, 128 is for almost all cases 
very far beyond the sweet spot.
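
As a toy illustration of that model (and emphatically not OpenSIPS 
code), here is a minimal Python sketch of forked worker processes 
updating one counter in shared memory under a mutex. The names `worker` 
and `run_pool` are hypothetical, and I'm assuming a POSIX system where 
the "fork" start method is available. Every increment has to take the 
same lock, so adding processes adds contenders for that lock rather than 
adding throughput on the shared part of the work.

```python
import multiprocessing as mp


def worker(counter, iterations):
    """Each child bumps the shared counter, taking the mutex every time."""
    for _ in range(iterations):
        with counter.get_lock():  # only one process may be in here at once
            counter.value += 1


def run_pool(children, iterations):
    """Fork `children` workers that all share one lock-protected integer."""
    ctx = mp.get_context("fork")  # assumption: POSIX, preforked-style workers
    counter = ctx.Value("i", 0)   # integer in shared memory, with its own lock
    procs = [
        ctx.Process(target=worker, args=(counter, iterations))
        for _ in range(children)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value


if __name__ == "__main__":
    # Without the lock, concurrent read-modify-write cycles could lose
    # updates; with it, the total is exact but every update is serialised.
    print(run_pool(4, 1000))
```

The result is always children * iterations, but only because the 
processes take turns. Timing this with far more children than cores 
would be one way to see the contention for yourself; I have not included 
numbers because they vary with the machine.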

By analogy, think of a large multi-lane highway where almost all cars 
travel at more or less a constant speed, and, vitally, almost always 
stay in their lane, only very seldom making a lane change. As anyone who 
has ever been stuck in traffic knows, small speed changes by individual 
actors or small groups of cars can set off huge compression waves that 
have impact for miles back, and lane changes also have accordion 
effects. It's not a perfect analogy by any means, but it kind of conveys 
some sense of the general problem of contention. You really want to keep 
the "lanes" clear and eliminate all possible sources of friction, 
variance, and overlap.

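* "At the same time" in the sense that matters for exclusion: overlapping 
access to the same data, whether the processes are truly running in 
parallel on separate cores or merely interleaved on one.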

> The Description for Asynchronous Statements is extremely tempting and 
> was what started me down that path; it might be missing a qualification 
> that Async can be an improvement for slow blocking operations, but the 
> additional overhead may be a disadvantage for very fast blocking 
> operations.

There is indeed a certain amount of overhead in pushing data between 
multiple threads, the locking of shared data structures involved in 
doing so, etc. For slow, blocking operations, there's nevertheless an 
advantage, but if the operations return quickly, in many cases all that 
"async" stuff is just extra overhead.
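
A deliberately crude cost model makes the trade-off concrete. In this 
Python sketch, a synchronous worker pays the CPU cost of a call plus the 
time it sits blocked, while an async worker pays the CPU cost plus a 
fixed suspend/resume overhead instead of the wait. The function names 
and all the microsecond figures are assumptions for illustration, not 
measurements of OpenSIPS.

```python
def sync_rate(cpu_us, wait_us):
    """Calls/sec for one worker that blocks for wait_us on every call."""
    return 1_000_000 / (cpu_us + wait_us)


def async_rate(cpu_us, overhead_us):
    """Calls/sec for one worker that never blocks, but pays overhead_us of
    extra bookkeeping (suspend, resume, locking) on every call."""
    return 1_000_000 / (cpu_us + overhead_us)


if __name__ == "__main__":
    cpu_us, overhead_us = 100, 50  # hypothetical per-call costs
    for wait_us in (10_000, 20):   # a slow query vs. a very fast one
        s = sync_rate(cpu_us, wait_us)
        a = async_rate(cpu_us, overhead_us)
        better = "async" if a > s else "sync"
        print(f"wait={wait_us}us: sync={s:.0f}/s async={a:.0f}/s -> {better}")
```

In this model async wins exactly when its per-call overhead is smaller 
than the blocking time it hides, which is the qualification Calvin 
suggests the documentation is missing.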

Asynchronous tricks which deputise notification of the availability of 
further work or I/O to the kernel can be pretty efficient, just because 
life in kernel space is pretty efficient. But async execution in 
user-space requires user-space contrivances that suffer from all the 
problems of user-space in turn, so the economics can be really 
different. Mileage, of course, varies greatly with the implementation 
details.
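
For anyone unfamiliar with the first kind of "async", here is a minimal 
Python sketch of deputising readiness notification to the kernel, using 
the standard `selectors` module (which picks epoll, kqueue or similar 
where available). The function name `wait_and_read` and the payload are 
hypothetical; this shows the general technique, not how OpenSIPS 
implements its async engine.

```python
import selectors
import socket


def wait_and_read(timeout=1.0):
    """Ask the kernel to report when a socket is readable, then read it."""
    sel = selectors.DefaultSelector()
    sender, receiver = socket.socketpair()
    try:
        receiver.setblocking(False)
        sel.register(receiver, selectors.EVENT_READ)

        # Nothing has been sent yet, so a non-waiting poll reports no events.
        idle_events = sel.select(timeout=0)

        sender.sendall(b"INVITE")

        # The process sleeps in the kernel here instead of spinning or
        # blocking on one particular socket.
        ready = sel.select(timeout=timeout)
        data = b"".join(key.fileobj.recv(64) for key, _ in ready)
        return len(idle_events), data
    finally:
        sel.close()
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(wait_and_read())
```

The bookkeeping of who is waiting on what lives in the kernel in this 
case. The user-space flavour has to maintain the equivalent state, 
queues and locks itself, which is where the different economics come 
from.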

-- Alex

-- 
Alex Balashov | Principal | Evariste Systems LLC

Tel: +1-706-510-6800 / +1-800-250-5920 (toll-free)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/