[OpenSIPS-Devel] Question on tm timers handling

Ovidiu Sas osas at voipembedded.com
Wed Apr 13 16:11:02 UTC 2022


Hi Bogdan,

A quick fix here would be to add a new tm param to control the use of the lock.
By default, we should keep the existing behaviour: serial processing of the tm timers.
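
For illustration only, the param plus a conditional lock could look
roughly like this (the name "timer_lock" is just a placeholder for the
new param, not an existing option):

    /* hypothetical tm param, defaulting to the current serialized behaviour */
    static int tm_timer_lock = 1;

    static param_export_t params[] = {
        /* ... existing tm params ... */
        {"timer_lock", INT_PARAM, &tm_timer_lock},
        {0, 0, 0}
    };

    /* inside timer_routine()/utimer_routine() the lock becomes conditional */
    if (tm_timer_lock)
        lock_get(&timertable[(long)set].ex_lock);
    /* detach and run the expired timer lists, exactly as today */
    if (tm_timer_lock)
        lock_release(&timertable[(long)set].ex_lock);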

-ovidiu

On Tue, Apr 5, 2022 at 10:25 AM Bogdan-Andrei Iancu <bogdan at opensips.org> wrote:
>
> Hi Ovidiu,
>
> Thanks for the joint debugging session here, very useful. The
> bottleneck was traced by Ovidiu to some locking added about 6 years
> ago [1] in order to fix some race conditions.
>
> I guess that story needs to be re-evaluated ;)
>
>
> [1]
> https://github.com/OpenSIPS/opensips/commit/ea6e1fa7fed29efed6057894a8cc4054308160b3
>
> Regards,
>
> Bogdan-Andrei Iancu
>
> OpenSIPS Founder and Developer
>    https://www.opensips-solutions.com
> OpenSIPS eBootcamp 23rd May - 3rd June 2022
>    https://opensips.org/training/OpenSIPS_eBootcamp_2022/
>
> On 4/5/22 4:26 PM, Ovidiu Sas wrote:
> > Hello Bogdan,
> >
> > No typos here :)
> >
> > Let me rephrase with some examples:
> >   - at time n, both tm-timer and tm-utimer are scheduled
> >   - tm-timer is scheduled first, followed by tm-utimer
> >   - tm-timer starts running (grabbing the .ex_lock) while tm-utimer is
> > waiting for the .ex_lock to be released by tm-timer
> >   - tm-timer takes 150ms to complete its task
> >   - at time n + 100ms, tm-utimer is scheduled again, but because the
> > initial run of the tm-utimer is not completed (still waiting for the
> > .ex_lock), the warning is issued
> >   - after 50ms (time is now n + 150ms) the tm-timer job is completed,
> > then the tm-utimer (that was scheduled at time n) takes its turn and
> > finishes in 10ms.
> >   - now the time is n + 160ms and both tm-timer and tm-utimer jobs are done
> >   - at time n + 200ms, tm-utimer is scheduled again and all is back to normal
> >
> > Hope this clarifies how tm-timer is interfering with tm-utimer.
> > The timer that is taking a long time to be processed is the WT-TIMER
> > inside tm-timer and this is triggering the warning on the tm-utimer.
> > The tm-utimer works fine, but we have this misleading warning on it
> > because of the tm-timer.
> >
> > -ovidiu
> >
> > On Tue, Apr 5, 2022 at 3:59 AM Bogdan-Andrei Iancu <bogdan at opensips.org> wrote:
> >> Hi Ovidiu,
> >>
> >> A first quick note :). You mentioned tm_utimer as the problematic
> >> one - this is the 100ms-based timer, used ONLY for the outbound
> >> retransmissions. This conflicts with your later finding on WT_TIMER,
> >> which is actually on tm_timer, not tm_utimer.
> >>
> >> So, just some typos here, or? :)
> >>
> >> Best regards,
> >>
> >> Bogdan-Andrei Iancu
> >>
> >> OpenSIPS Founder and Developer
> >>     https://www.opensips-solutions.com
> >> OpenSIPS eBootcamp 23rd May - 3rd June 2022
> >>     https://opensips.org/training/OpenSIPS_eBootcamp_2022/
> >>
> >> On 4/4/22 8:16 AM, Ovidiu Sas wrote:
> >>> Hello all,
> >>>
> >>> The tm module handles all its internal timers via two handlers:
> >>>    - timer_routine (second-based timers)
> >>>    - utimer_routine (100ms-based timers)
> >>> Each of these routines handles 4 different timers.
> >>> Both routines are very similar in functionality and no timer is
> >>> handled by both of them.
> >>> Because both routines are protected by the same lock
> >>> (timertable[(long)set].ex_lock), these two routines cannot run in
> >>> parallel (assuming that we have only one set, i.e. a single
> >>> timer_partition).
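> >>>
> >>> In very simplified pseudo-C (not the exact code), both handlers boil
> >>> down to something like this, which is why they end up serialized:
> >>>
> >>>    void timer_routine(unsigned int ticks, void *set)   /* 1s timers */
> >>>    {
> >>>        lock_get(&timertable[(long)set].ex_lock);
> >>>        /* detach and run the 4 second-based timer lists */
> >>>        lock_release(&timertable[(long)set].ex_lock);
> >>>    }
> >>>
> >>>    void utimer_routine(unsigned int ticks, void *set)  /* 100ms timers */
> >>>    {
> >>>        lock_get(&timertable[(long)set].ex_lock);  /* same lock => serialized */
> >>>        /* detach and run the 4 100ms-based timer lists */
> >>>        lock_release(&timertable[(long)set].ex_lock);
> >>>    }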
> >>>
> >>> In my testing, I noticed that the tm_utimer routine has difficulties
> >>> running smoothly.
> >>> After doing more testing and some profiling, it looks like the culprit
> >>> is the WT_TIMER.
> >>> For around 10-15K records in the WT_TIMER detached timer list, we
> >>> spend around 3ms to create the list and 200-300ms in
> >>> run_handler_for_each. Because of this, the tm_utimer (which is
> >>> scheduled to run every 100ms) is blocked by the lock on the first run,
> >>> and on the second run the scheduler detects that the previous run is
> >>> still in progress (waiting for the lock) and therefore issues the
> >>> famous "already scheduled" warning.
> >>>
> >>> The check_and_split_time_list function has its own locks and then each
> >>> handler operates on its own list (with locks for dealing with cells),
> >>> so why do we have the timertable[(long)set].ex_lock?
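> >>>
> >>> In other words, each handler already looks conceptually like this,
> >>> with its own finer-grained locking (simplified):
> >>>
> >>>    /* per timer list, without needing the table-wide ex_lock */
> >>>    detached = check_and_split_time_list(list, ticks); /* takes the list lock */
> >>>    run_handler_for_each(detached, handler);           /* per-cell locking */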
> >>>
> >>> I removed the lock, tested with a single timer_partition, then with
> >>> two timer_partitions, and the performance increased dramatically. Is
> >>> there a reason for keeping this lock, or is it something that was
> >>> inherited and nobody bothered to check why and remove it?
> >>>
> >>> Thanks,
> >>> Ovidiu
> >>>
> >
>


-- 
VoIP Embedded, Inc.
http://www.voipembedded.com


