[OpenSIPS-Users] Fine tuning high CPS and MySQL queries
Alex Balashov
abalashov at evaristesys.com
Tue Jun 16 21:33:35 EST 2020
Hi Calvin,
I'm really glad you were able to get things sorted out, and I apologise
if the thread got testy. I do appreciate your follow-up, which I think
will benefit readers looking for similar answers.
A few inline thoughts:
On 6/15/20 4:04 PM, Calvin Ellison wrote:
> I attempted to reproduce the original breakdown around 3000 CPS using
> the default 212992 byte receive buffer and could not, which tells me I
> broke a cardinal rule of load testing and changed more than one thing at
> a time. Also, don't do load testing when tired. I suspect that I had
> also made a change to the sipp scenario recv/sched loops, or I had
> unknowingly broken something while checking out the tuned package.
In several decades of doing backend systems programming, I've not found
tuning Linux kernel defaults to be generally fruitful for improving
throughput to any non-trivial degree. The defaults are sensible for
almost all use-cases, all the more so given modern hardware and
multi-core processors and the rest.
This is in sharp contrast to the conservative defaults some applications
(e.g. Apache, MySQL) ship with on many distributions. I think the idea
behind such conservative settings is to constrain the application so
that in the event of a DDoS or similar event, it does not take over all
available hardware resources, which would impede response and resolution.
But on the kernel settings front, the only impactful changes I have
ever seen are minor adjustments that slightly improve rather niche,
system-wide server load problems (e.g. related to I/O scheduling, NIC
issues, storage, etc.). This wasn't that kind of scenario.
In most respects, it just follows from first principles and Occam's
Razor, IMHO. There's no reason for kernels to ship tuned unnecessarily
conservatively to deny average users something on the order of _several
times'_ more performance from their hardware, and any effort to do that
would be readily apparent and, it stands to reason, staunchly opposed.
It therefore also stands to reason that there isn't some silver bullet
or magic setting that unlocks multiplicative performance gains for
whoever knows the secret sauce or thinks to tweak it--for the simple
reason that if such a tweak existed, the artificial and contrived limit
it compensates for would long since have been rationalised away, absent
a clear and persuasive basis for that limit to exist. I cannot conceive
of what such a basis would look like, and I'd like to think that's not
just a failure of imagination.
Or in other words, it goes with the commonsensical, "If it seems too
good to be true, it is," intuition. The basic fundamentals of the
application, and to a lesser but still very significant extent the
hardware (in terms of its relative homogeneity nowadays), determine
99.9% of the performance characteristics, and matter a thousand times
more than literally anything one can tweak.
> I deeply appreciate Alex's insistence that I was wrong and to keep
> digging. I am happy to retract my claim regarding "absolutely terrible
> sysctl defaults". Using synchronous/blocking DB queries, the 8-core
> server reached 14,000 CPS, at which point I declared it fixed and went
> to bed. It could probably go higher: there's only one DB query with a
> <10ms response time, Memcache for the query response, and some logic to
> decide how to respond. There's only a single non-200 final response, so
> it's probably as minimalist as it gets.
I would agree that with such a minimal call processing loop, given a
generous number of CPU cores you shouldn't be terribly limited.
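For illustration only, I'd imagine the processing loop looks something
in the neighbourhood of this opensips.cfg sketch (3.x-style syntax; the
cache group, table and AVP names are all made up on my end, and it
assumes avpops plus a cachedb backend such as cachedb_memcached are
loaded):

    route {
        # Try the cache first; fall back to the single quick (<10 ms)
        # DB lookup and cache the result.
        if (!cache_fetch("memcached", "route:$rU", $avp(dest))) {
            avp_db_query("SELECT dest FROM routing WHERE prefix = '$rU'",
                "$avp(dest)");
            cache_store("memcached", "route:$rU", "$avp(dest)", 60);
        }

        # Some logic to decide how to respond, with a single non-200
        # final response on the failure path.
        if ($avp(dest) != NULL) {
            send_reply(200, "OK");
        } else {
            send_reply(404, "Not Found");
        }
        exit;
    }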
> If anyone else is trying to tune their setup, I think Alex's advice to
> "not run more than 2 * (CPU threads) [children]" is the best place to
> start. I had inherited this project from someone else's work under
> version 1.11 and they had used 128 children. They were using remote DB
> servers with much higher latency than the local DBs we have today, so
> that might have been the reason. Or they were just wrong to begin with.
Aye. Barring a workload dominated by exceptionally high-latency
blocking service queries, there's really no valid reason to ever run
that many child processes, and even if one does have such a workload,
there are plenty of reasons to attack the underlying latency problem
itself rather than working around it with more child processes.
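To put a number on that guideline: on an 8-core/16-thread box, it comes
out to something like the following in opensips.cfg. (A sketch, not a
drop-in config; the parameter is called 'children' in the 1.x/2.x
lineage, where it applies per UDP listening interface in older
versions, and is renamed 'udp_workers' in 3.x.)

    # 2 * (CPU threads) on an 8-core/16-thread machine; a starting
    # point to refine under load testing, not a magic number.
    children=16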
With the proviso that I am not an expert in modern-day OpenSIPS
concurrency innards, the common OpenSER heritage prescribes a preforked
worker process pool with SysV shared memory for inter-process
communication (IPC). Like any shared memory space, this requires mutex
locking so that multiple threads (in this case, processes) don't
access/modify the same data structures at the same time* in ways that
step on the others. Because every process holds and waits on these
locks, this model works well when there aren't very many processes and
their path to execution is mostly clear and not especially volatile, and
when as little data is shared as possible. If you add a lot of
processes, then there's a lot of fighting among them for internal locks
and for CPU time, even if the execution cycle per se is fairly
efficient. If you have 16 cores and 128 child processes, those processes
are going to be fighting for those cores if they execute efficiently,
while suffering from some amount of internal concurrency gridlock if
they are not executing efficiently. Thus, 128 is for almost all cases
very far beyond the sweet spot.
By analogy, think of a large multi-lane highway where almost all cars
travel at more or less a constant speed, and, vitally, almost always
stay in their lane, only very seldom making a lane change. As anyone who
has ever been stuck in traffic knows, small speed changes by individual
actors or small groups of cars can set off huge compression waves that
have impact for miles back, and lane changes also have accordion
effects. It's not a perfect analogy by any means, but it kind of conveys
some sense of the general problem of contention. You really want to keep
the "lanes" clear and eliminate all possible sources of friction,
variance, and overlap.
* For exclusion purposes; of course, there's no such thing as truly
simultaneous execution.
> The Description for Asynchronous Statements is extremely tempting and
> was what started me down that path; it might be missing a qualification
> that Async can be an improvement for slow blocking operations, but the
> additional overhead may be a disadvantage for very fast blocking
> operations.
There is indeed a certain amount of overhead in pushing data around
multiple threads, the locking of shared data structures involved in
doing so, etc. For slow, blocking operations, there's nevertheless an
advantage, but if the operations aren't especially blocking, in many
cases all that "async" stuff is just extra overhead.
Asynchronous tricks which deputise notification of the availability of
further work or I/O to the kernel can be pretty efficient, just because
life in kernel space is pretty efficient. But async execution in
user-space requires user-space contrivances that suffer from all the
problems of user-space in turn, so the economics can be really
different. Mileage of course greatly varies with the implementation details.
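To make the trade-off concrete, here's a hedged sketch of the two
styles in OpenSIPS script (3.x syntax; the query and names are
hypothetical, and it assumes avpops with an async-capable DB driver
such as db_mysql). The synchronous form ties up the worker for the full
round-trip; the async form suspends the transaction and resumes in a
separate route, which only pays for itself when the query is genuinely
slow:

    # Variant 1, synchronous: the worker blocks until the DB answers.
    route {
        avp_db_query("SELECT credit FROM subscriber WHERE username = '$fU'",
            "$avp(credit)");
        # ... decide how to respond ...
    }

    # Variant 2, asynchronous: the worker is released while the query
    # runs; execution resumes in route[db_resume] on completion.
    route {
        async(avp_db_query("SELECT credit FROM subscriber WHERE username = '$fU'",
            "$avp(credit)"), db_resume);
    }

    route[db_resume] {
        # ... decide how to respond ...
    }

(The two main route blocks are alternatives, of course, not meant to
coexist in one config.)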
-- Alex
--
Alex Balashov | Principal | Evariste Systems LLC
Tel: +1-706-510-6800 / +1-800-250-5920 (toll-free)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/