[OpenSIPS-Devel] [opensips] dialog table corrupted when out of shared memory (#311)

miko95 notifications at github.com
Fri Aug 22 11:50:54 CEST 2014


Hi,

We recently hit a strange issue that corrupted the dialog table on our servers after an instance ran out of shared memory.

The following memory errors appeared in the logs:

```
xxxxxxx[6439]: WARNING:core:fm_malloc: Not enough free memory, will attempt defragmentation.
xxxxxxx[6439]: ERROR:tm:sip_msg_cloner: no more share memory
xxxxxxx[6439]: ERROR:tm:new_t: out of mem
xxxxxxx[6439]: ERROR:tm:t_newtran: new_t failed

xxxxxxx[6450]: WARNING:core:fm_malloc: Not enough free memory, will attempt defragmentation.
xxxxxxx[6450]: ERROR:tm:build_local: no more share memory
xxxxxxx[6450]: ERROR:tm:send_ack: failed to build ACK
xxxxxxx[6450]: ERROR:tm:reply_received: failed to send ACK (local=no)
xxxxxxx[6450]: ERROR:dialog:push_reply_in_dialog: missing TAG param in TO hdr :-/
```

After that, the dialog table contained a large number of duplicated dialogs. I am sure these duplicates were added after the memory errors occurred. Analyzing the table, I found that most calls had one initial dialog created from the script (with a timestamp matching its creation time) and multiple duplicates of it (with timestamps later than the first error). The duplicates hold the same data as the initial dialog except for the id (auto increment), timeout and timestamp columns. Note that the dlg_id column of the duplicates is identical to that of the initial dialog.
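To make the pattern concrete, here is a minimal sketch of how such duplicates can be spotted by grouping on dlg_id. It uses an in-memory SQLite table as a stand-in for the real database; the table layout, the `created` column name and the sample values are assumptions for illustration only, not the actual OpenSIPS schema or our data.

```python
import sqlite3

# Hypothetical, simplified stand-in for the dialog table: only the columns
# mentioned in this report are modelled, and the names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dialog ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " dlg_id INTEGER NOT NULL,"
    " callid TEXT NOT NULL,"
    " created INTEGER NOT NULL)"
)
rows = [
    (1001, "call-a", 100),  # initial dialog created from the script
    (1001, "call-a", 250),  # duplicate written after the memory error
    (1001, "call-a", 260),  # another duplicate, same dlg_id, new id
    (2002, "call-b", 120),  # healthy dialog without duplicates
]
conn.executemany(
    "INSERT INTO dialog (dlg_id, callid, created) VALUES (?, ?, ?)", rows
)


def find_duplicates(conn):
    """Return (dlg_id, copies, first_created) for each dlg_id stored more than once."""
    return conn.execute(
        "SELECT dlg_id, COUNT(*), MIN(created) FROM dialog"
        " GROUP BY dlg_id HAVING COUNT(*) > 1 ORDER BY dlg_id"
    ).fetchall()


duplicates = find_duplicates(conn)
```

Each returned row names a dlg_id that appears several times, how many copies exist, and the earliest timestamp, which would correspond to the initial dialog.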

We had only just started applying the schema change that adds the new dlg_id column, so the id column was still present and defined as the primary key. Since dlg_id wasn't defined as the primary key, inserting duplicated dialogs didn't generate any error on the database side.
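As a rough illustration of why the database stayed silent, the sketch below shows that a uniqueness constraint on dlg_id would have rejected the second insert. It again uses an in-memory SQLite table with a hypothetical two-column layout; the index name and the `insert_dialog` helper are made up for this example, and other database engines report the violation with their own error types.

```python
import sqlite3

# Hypothetical reduced table: auto-increment id plus dlg_id, as in our
# half-migrated schema, but with a unique index added on dlg_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dialog ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " dlg_id INTEGER NOT NULL)"
)
conn.execute("CREATE UNIQUE INDEX dialog_dlg_id_idx ON dialog (dlg_id)")


def insert_dialog(conn, dlg_id):
    """Try to insert a dialog row; return False if dlg_id already exists."""
    try:
        conn.execute("INSERT INTO dialog (dlg_id) VALUES (?)", (dlg_id,))
        return True
    except sqlite3.IntegrityError:
        # The unique index refuses a second row with the same dlg_id.
        return False


first = insert_dialog(conn, 1001)   # initial dialog
second = insert_dialog(conn, 1001)  # attempted duplicate
stored = conn.execute("SELECT COUNT(*) FROM dialog").fetchone()[0]
```

With only the auto-increment id as the key, both inserts succeed, which matches what we saw.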

At first I thought the duplicated dialogs really existed in memory, but in that case they would have different dlg_id values (the hash_entry would be the same because the CallID is the same, but the hash_id would differ).
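The reasoning can be sketched as follows, assuming dlg_id packs hash_entry into the upper 32 bits and hash_id into the lower 32 bits of one integer. That layout and the `make_dlg_id` helper are assumptions for illustration; I have not copied them from the OpenSIPS source.

```python
def make_dlg_id(hash_entry: int, hash_id: int) -> int:
    """Combine the two values into one identifier (assumed 32/32-bit layout)."""
    return ((hash_entry & 0xFFFFFFFF) << 32) | (hash_id & 0xFFFFFFFF)


# Same CallID, so same hash_entry; a second in-memory dialog would get
# a different hash_id within that entry.
initial = make_dlg_id(hash_entry=42, hash_id=7)
in_memory_copy = make_dlg_id(hash_entry=42, hash_id=8)

same_entry = (initial >> 32) == (in_memory_copy >> 32)
same_dlg_id = initial == in_memory_copy
```

Two dialogs that really coexisted in memory would share the entry part but not the full dlg_id, whereas the rows in our table carry an identical dlg_id.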

This scenario is very bad: our monitoring system detected that opensips was no longer responding and tried to restart it. However, the table held more than 300K dialogs (only around 5K of them good), so the load_dialog_from_db function that runs at startup took too much time and memory. During that time opensips couldn't respond to incoming requests, so the monitoring system kept restarting it again and again.

I examined the code to understand what may have caused the duplication but didn't find anything. I am sure the timer process added each of the duplicated dialogs, since the auto_increment primary key is different for each of them.

Regards,
Mickael

---
Reply to this email directly or view it on GitHub:
https://github.com/OpenSIPS/opensips/issues/311