[OpenSIPS-Devel] Question on tm timers handling

Bogdan-Andrei Iancu bogdan at opensips.org
Tue Apr 5 14:25:18 UTC 2022


Hi Ovidiu,

Thanks for the joint debugging session here, very useful. The
bottleneck was traced by Ovidiu to some locking added about 6 years
ago [1] in order to fix some race conditions.

I guess that story needs to be re-evaluated ;)


[1] 
https://github.com/OpenSIPS/opensips/commit/ea6e1fa7fed29efed6057894a8cc4054308160b3

Regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
   https://www.opensips-solutions.com
OpenSIPS eBootcamp 23rd May - 3rd June 2022
   https://opensips.org/training/OpenSIPS_eBootcamp_2022/

On 4/5/22 4:26 PM, Ovidiu Sas wrote:
> Hello Bogdan,
>
> No typos here :)
>
> Let me rephrase with some examples:
>   - at time n, both tm-timer and tm-utimer are scheduled
>   - tm-timer is scheduled first, followed by tm-utimer
>   - tm-timer starts running (grabbing the .ex_lock) while tm-utimer is
> waiting for the .ex_lock to be released by tm-timer
>   - tm-timer takes 150ms to complete its task
>   - at time n + 100ms, tm-utimer is scheduled again, but because the
> initial run of the tm-utimer is not completed (still waiting for the
> .ex_lock), the warning is issued
>   - after 50ms (time is now n + 150ms) the tm-timer job is completed,
> then the tm-utimer (that was scheduled at time n) takes its turn and
> finishes in 10ms.
>   - now the time is n + 160ms and both tm-timer and tm-utimer jobs are done
>   - at time n + 200ms, tm-utimer is scheduled again and all is back to normal
>
> Hope this clarifies how tm-timer interferes with tm-utimer.
> The timer that takes a long time to process is the WT_TIMER inside
> tm-timer, and that is what triggers the warning on the tm-utimer.
> The tm-utimer itself works fine, but we get this misleading warning
> on it because of the tm-timer.
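>
> As a minimal standalone sketch of that timeline (plain pthreads as a
> stand-in for the OpenSIPS timer processes and locks - not actual tm
> code; the durations are the ones from the example above):
>
>     #include <pthread.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     static pthread_mutex_t ex_lock = PTHREAD_MUTEX_INITIALIZER;
>     static volatile int utimer_done = 0;
>
>     static void *tm_timer_job(void *arg)
>     {
>         (void)arg;
>         pthread_mutex_lock(&ex_lock);  /* grabs .ex_lock at time n */
>         usleep(150 * 1000);            /* long WT_TIMER run: 150ms */
>         pthread_mutex_unlock(&ex_lock);
>         return NULL;
>     }
>
>     static void *tm_utimer_job(void *arg)
>     {
>         (void)arg;
>         pthread_mutex_lock(&ex_lock);  /* blocked until n + 150ms  */
>         usleep(10 * 1000);             /* the real work: only 10ms */
>         pthread_mutex_unlock(&ex_lock);
>         utimer_done = 1;               /* done at ~n + 160ms       */
>         return NULL;
>     }
>
>     int main(void)
>     {
>         pthread_t t1, t2;
>
>         pthread_create(&t1, NULL, tm_timer_job, NULL);  /* time n */
>         usleep(1 * 1000);           /* tm-timer is scheduled first */
>         pthread_create(&t2, NULL, tm_utimer_job, NULL); /* time n */
>
>         usleep(100 * 1000);  /* time n + 100ms: the next tm-utimer */
>         if (!utimer_done)    /* slot arrives, previous run pending */
>             fprintf(stderr, "WARNING: timer job <tm-utimer> "
>                     "already scheduled\n");
>
>         pthread_join(t1, NULL);
>         pthread_join(t2, NULL);
>         return 0;
>     }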
>
> -ovidiu
>
> On Tue, Apr 5, 2022 at 3:59 AM Bogdan-Andrei Iancu <bogdan at opensips.org> wrote:
>> Hi Ovidiu,
>>
>> A first quick note :). You mentioned the tm_utimer as the problematic
>> one - this is the 100ms-based timer, used ONLY for the outbound
>> retransmissions. This conflicts with your later finding on WT_TIMER,
>> which is actually on tm_timer, not tm_utimer.
>>
>> So, are these just some typos, or...? :)
>>
>> Best regards,
>>
>> Bogdan-Andrei Iancu
>>
>> OpenSIPS Founder and Developer
>>     https://www.opensips-solutions.com
>> OpenSIPS eBootcamp 23rd May - 3rd June 2022
>>     https://opensips.org/training/OpenSIPS_eBootcamp_2022/
>>
>> On 4/4/22 8:16 AM, Ovidiu Sas wrote:
>>> Hello all,
>>>
>>> The tm module handles all its internal timers via two handlers:
>>>    - timer_routine (second based timers)
>>>    - utimer_routine (100ms based timers)
>>> Each of these routines handles 4 different timers.
>>> Both routines are very similar in functionality and there is no timer
>>> that is handled by both routines.
>>> Because both routines are protected by the same lock
>>> (timertable[(long)set].ex_lock), these two routines cannot run in
>>> parallel (assuming that we have only one set, i.e. a single
>>> timer_partition).
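>>>
>>> Roughly, the pattern is the following (a simplified sketch with
>>> stand-in types, not the exact tm code; timer_routine,
>>> utimer_routine and timertable[...].ex_lock are the module's real
>>> names, everything else here is a stub):
>>>
>>>     #include <pthread.h>
>>>
>>>     /* stubs, just enough to show the shape; the real types and
>>>        lock primitives live in the tm module and the core */
>>>     typedef unsigned long long utime_t;
>>>     struct timer_table { pthread_mutex_t ex_lock; };
>>>     static struct timer_table timertable[1] =
>>>         {{ PTHREAD_MUTEX_INITIALIZER }};
>>>     #define lock_get(l)     pthread_mutex_lock(l)
>>>     #define lock_release(l) pthread_mutex_unlock(l)
>>>
>>>     /* both handlers serialize on the same per-set ex_lock, so
>>>        the 1s jobs and the 100ms jobs can never run in parallel */
>>>     void timer_routine(unsigned int ticks, void *set)
>>>     {
>>>         (void)ticks;
>>>         lock_get(&timertable[(long)set].ex_lock);
>>>         /* ... handle the 4 second-based timer lists ... */
>>>         lock_release(&timertable[(long)set].ex_lock);
>>>     }
>>>
>>>     void utimer_routine(utime_t uticks, void *set)
>>>     {
>>>         (void)uticks;
>>>         lock_get(&timertable[(long)set].ex_lock);
>>>         /* ... handle the 4 100ms-based timer lists ... */
>>>         lock_release(&timertable[(long)set].ex_lock);
>>>     }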
>>>
>>> In my testing, I noticed that the tm_utimer routine has difficulties
>>> running smoothly.
>>> After doing more testing and some profiling, it looks like the culprit
>>> is the WT_TIMER.
>>> For around 10-15K records in the WT_TIMER detached timer list, we
>>> spend around 3ms to create the list and 200-300ms in
>>> run_handler_for_each. Because of this, the tm_utimer (which is
>>> scheduled to run every 100ms) is blocked by the lock on its first
>>> run; on the second run, the scheduler detects that the previous run
>>> is still running (waiting for the lock) and therefore issues the
>>> famous "already scheduled" warning.
>>>
>>> The check_and_split_time_list function has its own locks, and then
>>> each handler operates on its own list (with locks for dealing with
>>> cells), so why do we have the timertable[(long)set].ex_lock?
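>>>
>>> To illustrate the question, this is roughly the per-list flow (a
>>> sketch with approximated signatures; handle_one_list is just a
>>> hypothetical wrapper for illustration):
>>>
>>>     struct timer_link;                 /* chain of expired cells */
>>>     typedef void (timer_handler_f)(struct timer_link *);
>>>
>>>     /* declared here only for the sketch; the real versions live
>>>        in the tm module */
>>>     struct timer_link *check_and_split_time_list(void *list,
>>>             unsigned int ticks);
>>>     void run_handler_for_each(struct timer_link *tl,
>>>             timer_handler_f *f);
>>>
>>>     void handle_one_list(void *list, unsigned int ticks,
>>>             timer_handler_f *f)
>>>     {
>>>         /* detach the expired cells under the list's own lock:
>>>            the fast part, ~3ms for 10-15K records */
>>>         struct timer_link *tl =
>>>             check_and_split_time_list(list, ticks);
>>>
>>>         /* walk the now-private detached list: the slow part
>>>            (200-300ms for WT_TIMER), touching only per-cell
>>>            locks - no shared ex_lock seems needed for this */
>>>         if (tl)
>>>             run_handler_for_each(tl, f);
>>>     }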
>>>
>>> I removed the lock, tested with a single timer_partition, then with
>>> two timer_partitions, and the performance increased dramatically. Is
>>> there a reason for keeping this lock, or is it something that was
>>> inherited and nobody ever checked whether it could be removed?
>>>
>>> Thanks,
>>> Ovidiu
>>>