[OpenSIPS-Devel] [ opensips-Patches-3042432 ] [pua] add a "publish" queue to avoid deadlock

SourceForge.net noreply at sourceforge.net
Thu Nov 25 11:30:56 CET 2010


Patches item #3042432, was opened at 2010-08-10 13:55
Message generated for change (Settings changed) made by anca_vamanu
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086412&aid=3042432&group_id=232389

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: modules
Group: None
>Status: Closed
>Resolution: Accepted
Priority: 5
Private: No
Submitted By: Christophe Sollet (csollet)
Assigned to: Anca Vamanu (anca_vamanu)
Summary: [pua] add a "publish" queue to avoid deadlock

Initial Comment:
Hi,

I'm running into frequent deadlocks with the pua and pua_usrloc modules, because pua
takes a lock on a presentity and relies on a tm callback to release it.
Unfortunately, the timer process can easily deadlock itself when the reply never
arrives. For example:

TimerProcess : pua::HashT_clean --> update_pua : locks presentity A and starts a new transaction
TimerProcess : usrloc --> pua_usrloc --> ul_publish --> send_publish : blocks trying to lock presentity A

If the new transaction never gets a reply, the tm timer never gets a chance to generate a local 408, because the timer process is already blocked waiting for the lock.
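
To make the pattern concrete, here is a rough stand-alone illustration using
plain pthreads (the names are invented for the example; this is not the actual
pua code):

/* Illustration only: one process takes a lock and counts on an asynchronous
 * callback, run by that same process, to release it.
 * Build with: gcc -pthread -o deadlock deadlock.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t presentity_lock = PTHREAD_MUTEX_INITIALIZER;

/* plays the role of update_pua(): lock and return, expecting a later
 * tm reply callback (run by this same timer process) to unlock */
static void fake_update_pua(void)
{
    pthread_mutex_lock(&presentity_lock);
    printf("presentity locked, waiting for the tm callback to unlock\n");
}

/* plays the role of ul_publish()/send_publish() on the same presentity */
static void fake_ul_publish(void)
{
    printf("trying to lock the same presentity...\n");
    pthread_mutex_lock(&presentity_lock);  /* blocks forever: the only   */
    printf("never reached\n");             /* process that could unlock  */
}                                          /* it is the one stuck here   */

int main(void)
{
    fake_update_pua();  /* the reply never arrives, the lock stays held */
    fake_ul_publish();  /* the timer process now deadlocks on itself    */
    return 0;
}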

I've tried to work around this by running hashT_clean & db_update in a separate process, without any more success. I suspect it would only work if every timer routine that can end up calling send_publish() ran in its own process.

The attached patch tries to solve this by adding a publish queue to the presentity: if a publish is already waiting for a reply, the new one is put on hold and will be sent later by the tm callback. This allows update_pua() and send_publish() to leave the presentity unlocked while tm waits for the reply.
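
In rough stand-alone form the idea looks like this (field and function names
are made up for the illustration; the attached patch differs in the details):

/* Sketch of the publish queue idea; not the attached patch itself. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct pending_publ {
    char body[64];
    struct pending_publ *next;
};

struct presentity {
    int publish_in_flight;            /* a PUBLISH is awaiting its reply  */
    struct pending_publ *queue;       /* publishes parked in the meantime */
};

static void send_publish_now(struct presentity *p, const char *body)
{
    p->publish_in_flight = 1;
    printf("PUBLISH sent: %s\n", body);   /* the real code starts a tm uac */
}

/* called from send_publish()/update_pua(): never waits on an in-flight
 * transaction, so no lock has to be held across the tm request */
static void queue_or_send(struct presentity *p, const char *body)
{
    if (p->publish_in_flight) {
        struct pending_publ *n = malloc(sizeof *n);
        struct pending_publ **tail = &p->queue;
        if (!n)
            return;
        strncpy(n->body, body, sizeof n->body - 1);
        n->body[sizeof n->body - 1] = '\0';
        n->next = NULL;
        while (*tail)
            tail = &(*tail)->next;    /* append, keep the publish order */
        *tail = n;
    } else {
        send_publish_now(p, body);
    }
}

/* tm reply callback: the slot is free again, flush one queued PUBLISH */
static void publ_reply_cb(struct presentity *p)
{
    p->publish_in_flight = 0;
    if (p->queue) {
        struct pending_publ *n = p->queue;
        p->queue = n->next;
        send_publish_now(p, n->body);
        free(n);
    }
}

int main(void)
{
    struct presentity pres = { 0, NULL };
    queue_or_send(&pres, "state A");  /* goes out immediately              */
    queue_or_send(&pres, "state B");  /* parked, reply for A still pending */
    publ_reply_cb(&pres);             /* reply for A arrives, B is sent    */
    return 0;
}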

I'm not sure I understand all the side effects this change could have, or even whether it is the right way to do it. At least it solved the deadlock issue...

Regards,
Christophe.


----------------------------------------------------------------------

Comment By: Anca Vamanu (anca_vamanu)
Date: 2010-08-10 18:54

Message:
Hi.

I will commit my implementation. It is different from the one posted by
Christophe, as I have taken out the second lock completely.
Unfortunately I have still noticed some issues with it that I have not been
able to fix yet. I am leaving on vacation until Monday, which is why I am
putting it out now as a proof of concept. I still need to fix a bug I saw in
my tests, and I will do that next week.

Thanks and regards,
Anca

----------------------------------------------------------------------

Comment By: Alex Hermann (axlh)
Date: 2010-08-10 18:09

Message:
The patch seems to solve another issue I was about to tackle. I do
have one small cleanup for you:

The attached patch introduces a lock_try primitive. With it you can attempt
to take a lock; if the lock is unavailable, the function returns an error
code instead of blocking. With this function you can eliminate the waiting_tm
field, because waiting requests are already indicated by a non-NULL
next_to_publish field.

(attaching seems impossible with sf.net again, so here it is inline)

Index: opensips-trunk/lock_ops.h
===================================================================
--- opensips-trunk.orig/lock_ops.h	2010-08-10 17:08:58.000000000 +0200
+++ opensips-trunk/lock_ops.h	2010-08-10 17:09:00.000000000 +0200
@@ -81,6 +81,7 @@
 }
 
 #define lock_get(lock) get_lock(lock)
+#define lock_try(lock) tsl(lock)
 #define lock_release(lock) release_lock(lock)
 
 #elif defined USE_PTHREAD_MUTEX
@@ -97,6 +98,7 @@
 }
 
 #define lock_get(lock) pthread_mutex_lock(lock)
+#define lock_try(lock) pthread_mutex_trylock(lock)
 #define lock_release(lock) pthread_mutex_unlock(lock)
 
 
@@ -115,6 +117,7 @@
 }
 
 #define lock_get(lock) sem_wait(lock)
+#define lock_try(lock) sem_trywait(lock)
 #define lock_release(lock) sem_post(lock)
 
 
@@ -194,6 +197,27 @@
 	}
 }
 
+inline static int lock_try(gen_lock_t* lock)
+{
+	struct sembuf sop;
+
+	sop.sem_num=0;
+	sop.sem_op=-1; /* down */
+	sop.sem_flg=IPC_NOWAIT; /* don't block if mutex is unavailable */
+tryagain:
+	if (semop(*lock, &sop, 1)==-1){
+		if (errno==EAGAIN){
+			return -1;
+		} else if (errno==EINTR){
+			LM_DBG("signal received while testing a mutex\n");
+			goto tryagain;
+		} else {
+			LM_CRIT("%s (%d)\n", strerror(errno), errno);
+		}
+	}
+	return 0;
+}
+
 inline static void lock_release(gen_lock_t* lock)
 {
 	struct sembuf sop;


----------------------------------------------------------------------

Comment By: Alex Hermann (axlh)
Date: 2010-08-10 15:59

Message:
Can't the issue be solved by running the db and cleanup routines in a
separate timer process? That seems cleaner than sharing the timer process tm
is using, as it avoids introducing large delays for tm.

Of course, the locking order must be verified and fixed as well.

----------------------------------------------------------------------

Comment By: Alex Hermann (axlh)
Date: 2010-08-10 14:58

Message:
It seems it is pua locking day. I have been busy this morning too, trying
to fix the pua locking mess.

The patch from the OP seems to contain the same issue I was trying to fix:
nested locking and unlocking in the wrong order:
+		LM_DBG("Try to get hash lock [%d]\n", presentity->hash_index);
+		lock_get(&HashT->p_records[presentity->hash_index].lock);
+		LM_DBG("Got hash lock [%d]\n", presentity->hash_index);
+		LM_DBG("Try to get presentity lock %p\n", presentity);
+		lock_get(&presentity->publ_lock);
+		LM_DBG("Got presentity lock %p\n", presentity);

---- Nested locks must be released in the reverse order of acquisition. Below,
that order is violated:

+		lock_release(&HashT->p_records[presentity->hash_index].lock);
+		LM_DBG("Released hash lock [%d]\n", presentity->hash_index);


I tried to solve it by introducing a lock_try primitive. If taking the
publish lock fails while the hash lock is held, the code releases the hash
lock, sleeps, and retries: it re-acquires the hash lock first and then tries
the publish lock again...
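
Reduced to plain pthreads, the loop I have in mind looks roughly like this
(in pua the two mutexes would be the hash-slot lock and the publish lock):

/* Illustration of the try/back-off loop, reduced to plain pthreads. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t publ_lock = PTHREAD_MUTEX_INITIALIZER;

/* always take hash_lock first; only *try* publ_lock while holding it */
static void lock_both_in_order(void)
{
    for (;;) {
        pthread_mutex_lock(&hash_lock);
        if (pthread_mutex_trylock(&publ_lock) == 0)
            return;                         /* got both, order preserved */
        /* publ_lock is busy: drop hash_lock too, so whoever holds
         * publ_lock can finish, then retry with the same ordering */
        pthread_mutex_unlock(&hash_lock);
        usleep(1000);
    }
}

static void unlock_both_in_reverse(void)
{
    pthread_mutex_unlock(&publ_lock);       /* inner lock first */
    pthread_mutex_unlock(&hash_lock);
}

int main(void)
{
    lock_both_in_order();
    printf("both locks held, taken in a consistent order\n");
    unlock_both_in_reverse();
    return 0;
}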

The basic idea I have is to use the hash lock only for serializing
manipulation of the presentity struct, and the publish lock only for
serializing the order of the publishes, instead of mixing their functions.

I was in the process of cleaning up my patches and making them apply to the
current trunk version when I saw this ticket fly by. Unfortunately, I have
been hacking on a somewhat old version, so the cleanup is not finished yet.


Further note about the presented solution: isn't doing all the presence
stuff in the tm timer going to overload the timer and introduce lags?


----------------------------------------------------------------------

Comment By: Anca Vamanu (anca_vamanu)
Date: 2010-08-10 14:21

Message:
Hi Christophe,

I have also done an implementation based on the same solution - except that I
do not keep a queue, only the last send_publish request, since only the last
state matters. I am in the final stages with it; it is already tested quite
thoroughly and should be ready to commit very soon.
I will also look at your patch now to see what your idea was and whether my
version can be improved.
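
Schematically, the difference is just a single pending slot instead of a
queue (names here are invented, not necessarily what I will commit):

/* Sketch only: keep a single pending request instead of a queue. */
struct publ_info;                                /* opaque publish data */
void send_publish_now(struct publ_info *publ);   /* hypothetical sender */

struct presentity {
    int publish_in_flight;
    struct publ_info *pending;        /* at most one parked PUBLISH */
};

/* an intermediate state is obsolete as soon as a newer one arrives,
 * so a newer pending request simply replaces the older one */
void remember_or_send(struct presentity *p, struct publ_info *publ)
{
    if (p->publish_in_flight) {
        p->pending = publ;
    } else {
        p->publish_in_flight = 1;
        send_publish_now(publ);
    }
}

/* tm reply callback: flush the remembered state, if any */
void publ_reply_done(struct presentity *p)
{
    p->publish_in_flight = 0;
    if (p->pending) {
        struct publ_info *last = p->pending;
        p->pending = NULL;
        p->publish_in_flight = 1;
        send_publish_now(last);
    }
}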

Thank you,
Anca  

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086412&aid=3042432&group_id=232389


