From 16e1b7a1b7f7ffd8a18713e83c8cd72c9ce48e07 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Fri, 29 Nov 2013 16:41:00 -0500 Subject: Fix assorted race conditions in the new timeout infrastructure. Prevent handle_sig_alarm from losing control partway through due to a query cancel (either an asynchronous SIGINT, or a cancel triggered by one of the timeout handler functions). That would at least result in failure to schedule any required future interrupt, and might result in actual corruption of timeout.c's data structures, if the interrupt happened while we were updating those. We could still lose control if an asynchronous SIGINT arrives just as the function is entered. This wouldn't break any data structures, but it would have the same effect as if the SIGALRM interrupt had been silently lost: we'd not fire any currently-due handlers, nor schedule any new interrupt. To forestall that scenario, forcibly reschedule any pending timer interrupt during AbortTransaction and AbortSubTransaction. We can avoid any extra kernel call in most cases by not doing that until we've allowed LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events. Another hazard is that some platforms (at least Linux and *BSD) block a signal before calling its handler and then unblock it on return. When we longjmp out of the handler, the unblock doesn't happen, and the signal is left blocked indefinitely. Again, we can fix that by forcibly unblocking signals during AbortTransaction and AbortSubTransaction. These latter two problems do not manifest when the longjmp reaches postgres.c, because the error recovery code there kills all pending timeout events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal mask is restored. So errors thrown outside any transaction should be OK already, and cleaning up in AbortTransaction and AbortSubTransaction should be enough to fix these issues. (We're assuming that any code that catches a query cancel error and doesn't re-throw it will do at least a subtransaction abort to clean up; but that was pretty much required already by other subsystems.) Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when disabling that event: if a lock timeout interrupt happened after the lock was granted, the ensuing query cancel is still going to happen at the next CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user cancel. Per reports from Dan Wood. Back-patch to 9.3 where the new timeout handling infrastructure was introduced. We may at some point decide to back-patch the signal unblocking changes further, but I'll desist from that until we hear actual field complaints about it. --- src/backend/tcop/postgres.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) (limited to 'src/backend/tcop/postgres.c') diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 3d74654c827..e6428b69f43 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3803,6 +3803,13 @@ PostgresMain(int argc, char *argv[], * always active, we have at least some chance of recovering from an error * during error recovery. (If we get into an infinite loop thereby, it * will soon be stopped by overflow of elog.c's internal state stack.) + * + * Note that we use sigsetjmp(..., 1), so that this function's signal mask + * (to wit, UnBlockSig) will be restored when longjmp'ing to here. This + * is essential in case we longjmp'd out of a signal handler on a platform + * where that leaves the signal blocked. It's not redundant with the + * unblock in AbortTransaction() because the latter is only called if we + * were inside a transaction. */ if (sigsetjmp(local_sigjmp_buf, 1) != 0) @@ -3823,11 +3830,17 @@ PostgresMain(int argc, char *argv[], /* * Forget any pending QueryCancel request, since we're returning to - * the idle loop anyway, and cancel any active timeout requests. + * the idle loop anyway, and cancel any active timeout requests. (In + * future we might want to allow some timeout requests to survive, but + * at minimum it'd be necessary to do reschedule_timeouts(), in case + * we got here because of a query cancel interrupting the SIGALRM + * interrupt handler.) Note in particular that we must clear the + * statement and lock timeout indicators, to prevent any future plain + * query cancels from being misreported as timeouts in case we're + * forgetting a timeout cancel. */ - QueryCancelPending = false; disable_all_timeouts(false); - QueryCancelPending = false; /* again in case timeout occurred */ + QueryCancelPending = false; /* second to avoid race condition */ /* * Turn off these interrupts too. This is only needed here and not in -- cgit v1.2.3