postgresql - postgresql mirror

	Commit message (Collapse)	Author	Age
*	Fix bogus "out of memory" reports in tuplestore.c.	Tom Lane	2015-08-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The tuplesort/tuplestore memory management logic assumed that the chunk allocation overhead for its memtuples array could not increase when increasing the array size. This is and always was true for tuplesort, but we (I, I think) blindly copied that logic into tuplestore.c without noticing that the assumption failed to hold for the much smaller array elements used by tuplestore. Given rather small work_mem, this could result in an improper complaint about "unexpected out-of-memory situation", as reported by Brent DeSpain in bug #13530. The easiest way to fix this is just to increase tuplestore's initial array size so that the assumption holds. Rather than relying on magic constants, though, let's export a #define from aset.c that represents the safe allocation threshold, and make tuplestore's calculation depend on that. Do the same in tuplesort.c to keep the logic looking parallel, even though tuplesort.c isn't actually at risk at present. This will keep us from breaking it if we ever muck with the allocation parameters in aset.c. Back-patch to all supported versions. The error message doesn't occur pre-9.3, not so much because the problem can't happen as because the pre-9.3 tuplestore code neglected to check for it. (The chance of trouble is a great deal larger as of 9.3, though, due to changes in the array-size-increasing strategy.) However, allowing LACKMEM() to become true unexpectedly could still result in less-than-desirable behavior, so let's patch it all the way back.
*	Fix a PlaceHolderVar-related oversight in star-schema planning patch.	Tom Lane	2015-08-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In commit b514a7460d9127ddda6598307272c701cbb133b7, I changed the planner so that it would allow nestloop paths to remain partially parameterized, ie the inner relation might need parameters from both the current outer relation and some upper-level outer relation. That's fine so long as we're talking about distinct parameters; but the patch also allowed creation of nestloop paths for cases where the inner relation's parameter was a PlaceHolderVar whose eval_at set included the current outer relation and some upper-level one. That does not work. In principle we could allow such a PlaceHolderVar to be evaluated at the lower join node using values passed down from the upper relation along with values from the join's own outer relation. However, nodeNestloop.c only supports simple Vars not arbitrary expressions as nestloop parameters. createplan.c is also a few bricks shy of being able to handle such cases; it misplaces the PlaceHolderVar parameters in the plan tree, which is why the visible symptoms of this bug are "plan should not reference subplan's variable" and "failed to assign all NestLoopParams to plan nodes" planner errors. Adding the necessary complexity to make this work doesn't seem like it would be repaid in significantly better plans, because in cases where such a PHV exists, there is probably a corresponding join order constraint that would allow a good plan to be found without using the star-schema exception. Furthermore, adding complexity to nodeNestloop.c would create a run-time penalty even for plans where this whole consideration is irrelevant. So let's just reject such paths instead. Per fuzz testing by Andreas Seltenreich; the added regression test is based on his example query. Back-patch to 9.2, like the previous patch.
*	Cap wal_buffers to avoid a server crash when it's set very large.	Robert Haas	2015-08-04
\| \| \| \| \| \| \| \| \| \|	It must be possible to multiply wal_buffers by XLOG_BLCKSZ without overflowing int, or calculations in StartupXLOG will go badly wrong and crash the server. Avoid that by imposing a maximum value on wal_buffers. This will be just under 2GB, assuming the usual value for XLOG_BLCKSZ. Josh Berkus, per an analysis by Andrew Gierth.
*	Fix incorrect order of lock file removal and failure to close() sockets.	Tom Lane	2015-08-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit c9b0cbe98bd783e24a8c4d8d8ac472a494b81292 accidentally broke the order of operations during postmaster shutdown: it resulted in removing the per-socket lockfiles after, not before, postmaster.pid. This creates a race-condition hazard for a new postmaster that's started immediately after observing that postmaster.pid has disappeared; if it sees the socket lockfile still present, it will quite properly refuse to start. This error appears to be the explanation for at least some of the intermittent buildfarm failures we've seen in the pg_upgrade test. Another problem, which has been there all along, is that the postmaster has never bothered to close() its listen sockets, but has just allowed them to close at process death. This creates a different race condition for an incoming postmaster: it might be unable to bind to the desired listen address because the old postmaster is still incumbent. This might explain some odd failures we've seen in the past, too. (Note: this is not related to the fact that individual backends don't close their client communication sockets. That behavior is intentional and is not changed by this patch.) Fix by adding an on_proc_exit function that closes the postmaster's ports explicitly, and (in 9.3 and up) reshuffling the responsibility for where to unlink the Unix socket files. Lock file unlinking can stay where it is, but teach it to unlink the lock files in reverse order of creation.
*	Fix race condition that lead to WALInsertLock deadlock with commit_delay.	Heikki Linnakangas	2015-08-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a call to WaitForXLogInsertionsToFinish() returned a value in the middle of a page, and another backend then started to insert a record to the same page, and then you called WaitXLogInsertionsToFinish() again, the second call might return a smaller value than the first call. The problem was in GetXLogBuffer(), which always updated the insertingAt value to the beginning of the requested page, not the actual requested location. Because of that, the second call might return a xlog pointer to the beginning of the page, while the first one returned a later position on the same page. XLogFlush() performs two calls to WaitXLogInsertionsToFinish() in succession, and holds WALWriteLock on the second call, which can deadlock if the second call to WaitXLogInsertionsToFinish() blocks. Reported by Spiros Ioannou. Backpatch to 9.4, where the more scalable WALInsertLock mechanism, and this bug, was introduced.
*	Fix some planner issues with degenerate outer join clauses.	Tom Lane	2015-08-01
\| \| \| \| \| \| \| \| \| \| \| \| \|	An outer join clause that didn't actually reference the RHS (perhaps only after constant-folding) could confuse the join order enforcement logic, leading to wrong query results. Also, nested occurrences of such things could trigger an Assertion that on reflection seems incorrect. Per fuzz testing by Andreas Seltenreich. The practical use of such cases seems thin enough that it's not too surprising we've not heard field reports about it. This has been broken for a long time, so back-patch to all active branches.
*	Fix an oversight in checking whether a join with LATERAL refs is legal.	Tom Lane	2015-07-31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In many cases, we can implement a semijoin as a plain innerjoin by first passing the righthand-side relation through a unique-ification step. However, one of the cases where this does NOT work is where the RHS has a LATERAL reference to the LHS; that makes the RHS dependent on the LHS so that unique-ification is meaningless. joinpath.c understood this, and so would not generate any join paths of this kind ... but join_is_legal neglected to check for the case, so it would think that we could do it. The upshot would be a "could not devise a query plan for the given query" failure once we had failed to generate any join paths at all for the bogus join pair. Back-patch to 9.3 where LATERAL was added.
*	Avoid some zero-divide hazards in the planner.	Tom Lane	2015-07-30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Although I think on all modern machines floating division by zero results in Infinity not SIGFPE, we still don't want infinities running around in the planner's costing estimates; too much risk of that leading to insane behavior. grouping_planner() failed to consider the possibility that final_rel might be known dummy and hence have zero rowcount. (I wonder if it would be better to set a rows estimate of 1 for dummy relations? But at least in the back branches, changing this convention seems like a bad idea, so I'll leave that for another day.) Make certain that get_variable_numdistinct() produces a nonzero result. The case that can be shown to be broken is with stadistinct < 0.0 and small ntuples; we did not prevent the result from rounding to zero. For good luck I applied clamp_row_est() to all the nonconstant return values. In ExecChooseHashTableSize(), Assert that we compute positive nbuckets and nbatch. I know of no reason to think this isn't the case, but it seems like a good safety check. Per reports from Piotr Stefaniak. Back-patch to all active branches.
*	Reduce chatter from signaling of autovacuum workers.	Tom Lane	2015-07-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Don't print a WARNING if we get ESRCH from a kill() that's attempting to cancel an autovacuum worker. It's possible (and has been seen in the buildfarm) that the worker is already gone by the time we are able to execute the kill, in which case the failure is harmless. About the only plausible reason for reporting such cases would be to help debug corrupted lock table contents, but this is hardly likely to be the most important symptom if that happens. Moreover issuing a WARNING might scare users more than is warranted. Also, since sending a signal to an autovacuum worker is now entirely a routine thing, and the worker will log the query cancel on its end anyway, reduce the message saying we're doing that from LOG to DEBUG1 level. Very minor cosmetic cleanup as well. Since the main practical reason for doing this is to avoid unnecessary buildfarm failures, back-patch to all active branches.
*	Disable ssl renegotiation by default.	Andres Freund	2015-07-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While postgres' use of SSL renegotiation is a good idea in theory, it turned out to not work well in practice. The specification and openssl's implementation of it have lead to several security issues. Postgres' use of renegotiation also had its share of bugs. Additionally OpenSSL has a bunch of bugs around renegotiation, reported and open for years, that regularly lead to connections breaking with obscure error messages. We tried increasingly complex workarounds to get around these bugs, but we didn't find anything complete. Since these connection breakages often lead to hard to debug problems, e.g. spuriously failing base backups and significant latency spikes when synchronous replication is used, we have decided to change the default setting for ssl renegotiation to 0 (disabled) in the released backbranches and remove it entirely in 9.5 and master.. Author: Michael Paquier, with changes by me Discussion: 20150624144148.GQ4797@alap3.anarazel.de Backpatch: 9.0-9.4; 9.5 and master get a different patch
*	Remove an unsafe Assert, and explain join_clause_is_movable_into() better.	Tom Lane	2015-07-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	join_clause_is_movable_into() is approximate, in the sense that it might sometimes return "false" when actually it would be valid to push the given join clause down to the specified level. This is okay ... but there was an Assert in get_joinrel_parampathinfo() that's only safe if the answers are always exact. Comment out the Assert, and add a bunch of commentary to clarify what's going on. Per fuzz testing by Andreas Seltenreich. The added regression test is a pretty silly query, but it's based on his crasher example. Back-patch to 9.2 where the faulty logic was introduced.
*	Don't assume that PageIsEmpty() returns true on an all-zeros page.	Heikki Linnakangas	2015-07-27
\| \| \| \| \| \| \| \|	It does currently, and I don't see us changing that any time soon, but we don't make that assumption anywhere else. Per Tom Lane's suggestion. Backpatch to 9.2, like the previous patch that added this assumption.
*	Reuse all-zero pages in GIN.	Heikki Linnakangas	2015-07-27
\| \| \| \| \| \| \| \| \| \|	In GIN, an all-zeros page would be leaked forever, and never reused. Just add them to the FSM in vacuum, and they will be reinitialized when grabbed from the FSM. On master and 9.5, attempting to access the page's opaque struct also caused an assertion failure, although that was otherwise harmless. Reported by Jeff Janes. Backpatch to all supported versions.
*	Fix handling of all-zero pages in SP-GiST vacuum.	Heikki Linnakangas	2015-07-27
\| \| \| \| \| \| \| \| \| \| \| \|	SP-GiST initialized an all-zeros page at vacuum, but that was not WAL-logged, which is not safe. You might get a torn page write, when it gets flushed to disk, and end-up with a half-initialized index page. To fix, leave it in the all-zeros state, and add it to the FSM. It will be initialized when reused. Also don't set the page-deleted flag when recycling an empty page. That was also not WAL-logged, and a torn write of that would cause the page to have an invalid checksum. Backpatch to 9.2, where SP-GiST indexes were added.
*	Make entirely-dummy appendrels get marked as such in set_append_rel_size.	Tom Lane	2015-07-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The planner generally expects that the estimated rowcount of any relation is at least one row, unless it has been proven empty by constraint exclusion or similar mechanisms, which is marked by installing a dummy path as the rel's cheapest path (cf. IS_DUMMY_REL). When I split up allpaths.c's processing of base rels into separate set_base_rel_sizes and set_base_rel_pathlists steps, the intention was that dummy rels would get marked as such during the "set size" step; this is what justifies an Assert in indxpath.c's get_loop_count that other relations should either be dummy or have positive rowcount. Unfortunately I didn't get that quite right for append relations: if all the child rels have been proven empty then set_append_rel_size would come up with a rowcount of zero, which is correct, but it didn't then do set_dummy_rel_pathlist. (We would have ended up with the right state after set_append_rel_pathlist, but that's too late, if we generate indexpaths for some other rel first.) In addition to fixing the actual bug, I installed an Assert enforcing this convention in set_rel_size; that then allows simplification of a couple of now-redundant tests for zero rowcount in set_append_rel_size. Also, to cover the possibility that third-party FDWs have been careless about not returning a zero rowcount estimate, apply clamp_row_est to whatever an FDW comes up with as the rows estimate. Per report from Andreas Seltenreich. Back-patch to 9.2. Earlier branches did not have the separation between set_base_rel_sizes and set_base_rel_pathlists steps, so there was no intermediate state where an appendrel would have had inconsistent rowcount and pathlist. It's possible that adding the Assert to set_rel_size would be a good idea in older branches too; but since they're not under development any more, it's likely not worth the trouble.
*	Fix off-by-one error in calculating subtrans/multixact truncation point.	Heikki Linnakangas	2015-07-23
\| \| \| \| \| \| \| \| \| \|	If there were no subtransactions (or multixacts) active, we would calculate the oldestxid == next xid. That's correct, but if next XID happens to be on the next pg_subtrans (pg_multixact) page, the page does not exist yet, and SimpleLruTruncate will produce an "apparent wraparound" warning. The warning is harmless in this case, but looks very alarming to users. Backpatch to all supported versions. Patch and analysis by Thomas Munro.
*	Fix add_rte_to_flat_rtable() for recent feature additions.	Tom Lane	2015-07-21
\| \| \| \| \| \| \| \| \| \| \|	The TABLESAMPLE and row security patches each overlooked this function, though their errors of omission were opposite: RLS failed to zero out the securityQuals field, leading to wasteful copying of useless expression trees in finished plans, while TABLESAMPLE neglected to add a comment saying that it intentionally isn't deleting the tablesample subtree. There probably should be a similar comment about ctename, too. Back-patch as appropriate.
*	Make WaitLatchOrSocket's timeout detection more robust.	Tom Lane	2015-07-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the previous coding, timeout would be noticed and reported only when poll() or socket() returned zero (or the equivalent behavior on Windows). Ordinarily that should work well enough, but it seems conceivable that we could get into a state where poll() always returns a nonzero value --- for example, if it is noticing a condition on one of the file descriptors that we do not think is reason to exit the loop. If that happened, we'd be in a busy-wait loop that would fail to terminate even when the timeout expires. We can make this more robust at essentially no cost, by deciding to exit of our own accord if we compute a zero or negative time-remaining-to-wait. Previously the code noted this but just clamped the time-remaining to zero, expecting that we'd detect timeout on the next loop iteration. Back-patch to 9.2. While 9.1 had a version of WaitLatchOrSocket, it was primitive compared to later versions, and did not guarantee reliable detection of timeouts anyway. (Essentially, this is a refinement of commit 3e7fdcffd6f77187, which was back-patched only as far as 9.2.)
*	Fix a low-probability crash in our qsort implementation.	Tom Lane	2015-07-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It's standard for quicksort implementations, after having partitioned the input into two subgroups, to recurse to process the smaller partition and then handle the larger partition by iterating. This method guarantees that no more than log2(N) levels of recursion can be needed. However, Bentley and McIlroy argued that checking to see which partition is smaller isn't worth the cycles, and so their code doesn't do that but just always recurses on the left partition. In most cases that's fine; but with worst-case input we might need O(N) levels of recursion, and that means that qsort could be driven to stack overflow. Such an overflow seems to be the only explanation for today's report from Yiqing Jin of a SIGSEGV in med3_tuple while creating an index of a couple billion entries with a very large maintenance_work_mem setting. Therefore, let's spend the few additional cycles and lines of code needed to choose the smaller partition for recursion. Also, fix up the qsort code so that it properly uses size_t not int for some intermediate values representing numbers of items. This would only be a live risk when sorting more than INT_MAX bytes (in qsort/qsort_arg) or tuples (in qsort_tuple), which I believe would never happen with any caller in the current core code --- but perhaps it could happen with call sites in third-party modules? In any case, this is trouble waiting to happen, and the corrected code is probably if anything shorter and faster than before, since it removes sign-extension steps that had to happen when converting between int and size_t. In passing, move a couple of CHECK_FOR_INTERRUPTS() calls so that it's not necessary to preserve the value of "r" across them, and prettify the output of gen_qsort_tuple.pl a little. Back-patch to all supported branches. The odds of hitting this issue are probably higher in 9.4 and up than before, due to the new ability to allocate sort workspaces exceeding 1GB, but there's no good reason to believe that it's impossible to crash older branches this way.
*	Fix spelling error	Magnus Hagander	2015-07-16
\| \| \| \|	David Rowley
*	AIX: Link the postgres executable with -Wl,-brtllib.	Noah Misch	2015-07-15
\| \| \| \| \| \| \| \| \|	This allows PostgreSQL modules and their dependencies to have undefined symbols, resolved at runtime. Perl module shared objects rely on that in Perl 5.8.0 and later. This fixes the crash when PL/PerlU loads such modules, as the hstore_plperl test suite does. Module authors can link using -Wl,-G to permit undefined symbols; by default, linking will fail as it has. Back-patch to 9.0 (all supported versions).
*	Fix postmaster's handling of a startup-process crash.	Tom Lane	2015-07-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ordinarily, a failure (unexpected exit status) of the startup subprocess should be considered fatal, so the postmaster should just close up shop and quit. However, if we sent the startup process a SIGQUIT or SIGKILL signal, the failure is hardly "unexpected", and we should attempt restart; this is necessary for recovery from ordinary backend crashes in hot-standby scenarios. I attempted to implement the latter rule with a two-line patch in commit 442231d7f71764b8c628044e7ce2225f9aa43b67, but it now emerges that that patch was a few bricks shy of a load: it failed to distinguish the case of a signaled startup process from the case where the new startup process crashes before reaching database consistency. That resulted in infinitely respawning a new startup process only to have it crash again. To handle this properly, we really must track whether we have sent the current startup process a kill signal. Rather than add yet another ad-hoc boolean to the postmaster's state, I chose to unify this with the existing RecoveryError flag into an enum tracking the startup process's state. That seems more consistent with the postmaster's general state machine design. Back-patch to 9.0, like the previous patch.
*	Fix logical decoding bug leading to inefficient reopening of files.	Andres Freund	2015-07-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When spilling transaction data to disk a simple typo caused the output file to be closed and reopened for every serialized change. That happens to not have a huge impact on linux, which is why it probably wasn't noticed so far, but on windows that appears to trigger actual disk writes after every change. Not fun. The bug fortunately does not have any impact besides speed. A change could end up being in the wrong segment (last instead of next), but since we read all files to the end, that's just ugly, not really problematic. It's not a problem to upgrade, since transaction spill files do not persist across restarts. Bug: #13484 Reported-By: Olivier Gosseaume Discussion: 20150703090217.1190.63940@wrigleys.postgresql.org Backpatch to 9.4, where logical decoding was added.
*	Don't call PageGetSpecialPointer() on page until it's been initialized.	Heikki Linnakangas	2015-06-30
\| \| \| \| \| \| \| \| \| \|	After calling XLogInitBufferForRedo(), the page might be all-zeros if it was not in page cache already. btree_xlog_unlink_page initialized the page correctly, but it called PageGetSpecialPointer before initializing it, which would lead to a corrupt page at WAL replay, if the unlinked page is not in page cache. Backpatch to 9.4, the bug came with the rewrite of B-tree page deletion.
*	Back-patch some minor bug fixes in GUC code.	Tom Lane	2015-06-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In 9.4, fix a 9.4.1 regression that allowed multiple entries for a PGC_POSTMASTER variable to cause bogus complaints in the postmaster log. (The issue here was that commit bf007a27acd7b2fb unintentionally reverted 3e3f65973a3c94a6, which suppressed any duplicate entries within ParseConfigFp. Back-patch the reimplementation just made in HEAD, which makes use of an "ignore" field to prevent application of superseded items.) Add missed failure check in AlterSystemSetConfigFile(). We don't really expect ParseConfigFp() to fail, but that's not an excuse for not checking. In both 9.3 and 9.4, remove mistaken assignment to ConfigFileLineno that caused line counting after an include_dir directive to be completely wrong.
*	Fix comment for GetCurrentIntegerTimestamp().	Kevin Grittner	2015-06-28
\| \| \| \| \| \|	The unit of measure is microseconds, not milliseconds. Backpatch to 9.3 where the function and its comment were added.
*	Revoke incorrectly applied patch version	Simon Riggs	2015-06-27
\|
*	Avoid hot standby cancels from VAC FREEZE	Simon Riggs	2015-06-27
\| \| \| \| \| \| \| \| \| \| \| \|	VACUUM FREEZE generated false cancelations of standby queries on an otherwise idle master. Caused by an off-by-one error on cutoff_xid which goes back to original commit. Backpatch to all versions 9.0+ Analysis and report by Marco Nenciarini Bug fix by Simon Riggs
*	Fix a couple of bugs with wal_log_hints.	Heikki Linnakangas	2015-06-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Replay of the WAL record for setting a bit in the visibility map contained an assertion that a full-page image of that record type can only occur with checksums enabled. But it can also happen with wal_log_hints, so remove the assertion. Unlike checksums, wal_log_hints can be changed on the fly, so it would be complicated to figure out if it was enabled at the time that the WAL record was generated. 2. wal_log_hints has the same effect on the locking needed to read the LSN of a page as data checksums. BufferGetLSNAtomic() didn't get the memo. Backpatch to 9.4, where wal_log_hints was added.
*	Allow background workers to connect to no particular database.	Robert Haas	2015-06-25
\| \| \| \| \| \| \|	The documentation claims that this is supported, but it didn't actually work. Fix that. Reported by Pavel Stehule; patch by me.
*	Fix the logic for putting relations into the relcache init file.	Tom Lane	2015-06-25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit f3b5565dd4e59576be4c772da364704863e6a835 was a couple of bricks shy of a load; specifically, it missed putting pg_trigger_tgrelid_tgname_index into the relcache init file, because that index is not used by any syscache. However, we have historically nailed that index into cache for performance reasons. The upshot was that load_relcache_init_file always decided that the init file was busted and silently ignored it, resulting in a significant hit to backend startup speed. To fix, reinstantiate RelationIdIsInInitFile() as a wrapper around RelationSupportsSysCache(), which can know about additional relations that should be in the init file despite being unknown to syscache.c. Also install some guards against future mistakes of this type: make write_relcache_init_file Assert that all nailed relations get written to the init file, and make load_relcache_init_file emit a WARNING if it takes the "wrong number of nailed relations" exit path. Now that we remove the init files during postmaster startup, that case should never occur in the field, even if we are starting a minor-version update that added or removed rels from the nailed set. So the warning shouldn't ever be seen by end users, but it will show up in the regression tests if somebody breaks this logic. Back-patch to all supported branches, like the previous commit.
*	Improve inheritance_planner()'s performance for large inheritance sets.	Tom Lane	2015-06-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit c03ad5602f529787968fa3201b35c119bbc6d782 introduced a planner performance regression for UPDATE/DELETE on large inheritance sets. It required copying the append_rel_list (which is of size proportional to the number of inherited tables) once for each inherited table, thus resulting in O(N^2) time and memory consumption. While it's difficult to avoid that in general, the extra work only has to be done for append_rel_list entries that actually reference subquery RTEs, which inheritance-set entries will not. So we can buy back essentially all of the loss in cases without subqueries in FROM; and even for those, the added work is mainly proportional to the number of UNION ALL subqueries. Back-patch to 9.2, like the previous commit. Tom Lane and Dean Rasheed, per a complaint from Thomas Munro.
*	Improve multixact emergency autovacuum logic.	Andres Freund	2015-06-21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously autovacuum was not necessarily triggered if space in the members slru got tight. The first problem was that the signalling was tied to values in the offsets slru, but members can advance much faster. Thats especially a problem if old sessions had been around that previously prevented the multixact horizon to increase. Secondly the skipping logic doesn't work if the database was restarted after autovacuum was triggered - that knowledge is not preserved across restart. This is especially a problem because it's a common panic-reaction to restart the database if it gets slow to anti-wraparound vacuums. Fix the first problem by separating the logic for members from offsets. Trigger autovacuum whenever a multixact crosses a segment boundary, as the current member offset increases in irregular values, so we can't use a simple modulo logic as for offsets. Add a stopgap for the second problem, by signalling autovacuum whenver ERRORing out because of boundaries. Discussion: 20150608163707.GD20772@alap3.anarazel.de Backpatch into 9.3, where it became more likely that multixacts wrap around.
*	Add missing check for wal_debug GUC.	Andres Freund	2015-06-21
\| \| \| \| \| \| \| \| \| \|	9a20a9b2 added a new elog(), enabled when WAL_DEBUG is defined. The other WAL_DEBUG dependant messages check for the wal_debug GUC, but this one did not. While at it replace 'upto' with 'up to'. Discussion: 20150610110253.GF3832@alap3.anarazel.de Backpatch to 9.4, the first release containing 9a20a9b2.
*	Fix failure to copy setlocale() return value.	Noah Misch	2015-06-20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	POSIX permits setlocale() calls to invalidate any previous setlocale() return values, but commit 5f538ad004aa00cf0881f179f0cde789aad4f47e neglected to account for setlocale(LC_CTYPE, NULL) doing so. The effect was to set the LC_CTYPE environment variable to an unintended value. pg_perm_setlocale() sets this variable to assist PL/Perl; without it, Perl would undo PostgreSQL's locale settings. The known-affected configurations are 32-bit, release builds using Visual Studio 2012 or Visual Studio 2013. Visual Studio 2010 is unaffected, as were all buildfarm-attested configurations. In principle, this bug could leave the wrong LC_CTYPE in effect after PL/Perl use, which could in turn facilitate problems like corrupt tsvector datums. No known platform experiences that consequence, because PL/Perl on Windows does not use this environment variable. The bug has been user-visible, as early postmaster failure, on systems with Windows ANSI code page set to CP936 for "Chinese (Simplified, PRC)" and probably on systems using other multibyte code pages. (SetEnvironmentVariable() rejects values containing character data not valid under the Windows ANSI code page.) Back-patch to 9.4, where the faulty commit first appeared. Reported by Didi Hu and 林鹏程. Reviewed by Tom Lane, though this fix strategy was not his first choice.
*	Fix thinko in comment (launcher -> worker)	Alvaro Herrera	2015-06-20
\|
*	In immediate shutdown, postmaster should not exit till children are gone.	Tom Lane	2015-06-19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adjusts commit 82233ce7ea42d6ba519aaec63008aff49da6c7af so that the postmaster does not exit until all its child processes have exited, even if the 5-second timeout elapses and we have to send SIGKILL. There is no great value in having the postmaster process quit sooner, and doing so can mislead onlookers into thinking that the cluster is fully terminated when actually some child processes still survive. This effect might explain recent test failures on buildfarm member hamster, wherein we failed to restart a cluster just after shutting it down with "pg_ctl stop -m immediate". I also did a bit of code review/beautification, including fixing a faulty use of the Max() macro on a volatile expression. Back-patch to 9.4. In older branches, the postmaster never waited for children to exit during immediate shutdowns, and changing that would be too much of a behavioral change.
*	Clamp autovacuum launcher sleep time to 5 minutes	Alvaro Herrera	2015-06-19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This avoids the problem that it might go to sleep for an unreasonable amount of time in unusual conditions like the server clock moving backwards an unreasonable amount of time. (Simply moving the server clock forward again doesn't solve the problem unless you wake up the autovacuum launcher manually, say by sending it SIGHUP). Per trouble report from Prakash Itnal in https://www.postgresql.org/message-id/CAHC5u79-UqbapAABH2t4Rh2eYdyge0Zid-X=Xz-ZWZCBK42S0Q@mail.gmail.com Analyzed independently by Haribabu Kommi and Tom Lane.
*	Fix corner case in autovacuum-forcing logic for multixact wraparound.	Robert Haas	2015-06-19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since find_multixact_start() relies on SimpleLruDoesPhysicalPageExist(), and that function looks only at the on-disk state, it's possible for it to fail to find a page that exists in the in-memory SLRU that has not been written yet. If that happens, SetOffsetVacuumLimit() will erroneously decide to force emergency autovacuuming immediately. We should probably fix find_multixact_start() to consider the data cached in memory as well as on the on-disk state, but that's no excuse for SetOffsetVacuumLimit() to be stupid about the case where it can no longer read the value after having previously succeeded in doing so. Report by Andres Freund.
*	Improve error message and hint for ALTER COLUMN TYPE can't-cast failure.	Tom Lane	2015-06-12
\| \| \| \| \| \| \| \| \| \|	We already tried to improve this once, but the "improved" text was rather off-target if you had provided a USING clause. Also, it seems helpful to provide the exact text of a suggested USING clause, so users can just copy-and-paste it when needed. Per complaint from Keith Rarick and a suggestion from Merlin Moncure. Back-patch to 9.2 where the current wording was adopted.
*	Report more information if pg_perm_setlocale() fails at startup.	Tom Lane	2015-06-09
\| \| \| \| \| \|	We don't know why a few Windows users have seen this fail, but the taciturnity of the error message certainly isn't helping debug it. Let's at least find out which LC category isn't working.
*	Allow HotStandbyActiveInReplay() to be called in single user mode.	Andres Freund	2015-06-08
\| \| \| \| \| \| \| \| \| \| \| \|	HotStandbyActiveInReplay, introduced in 061b079f, only allowed WAL replay to happen in the startup process, missing the single user case. This buglet is fairly harmless as it only causes problems when single user mode in an assertion enabled build is used to replay a btree vacuum record. Backpatch to 9.2. 061b079f was backpatched further, but the assertion was not.
*	Use a safer method for determining whether relcache init file is stale.	Tom Lane	2015-06-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we invalidate the relcache entry for a system catalog or index, we must also delete the relcache "init file" if the init file contains a copy of that rel's entry. The old way of doing this relied on a specially maintained list of the OIDs of relations present in the init file: we made the list either when reading the file in, or when writing the file out. The problem is that when writing the file out, we included only rels present in our local relcache, which might have already suffered some deletions due to relcache inval events. In such cases we correctly decided not to overwrite the real init file with incomplete data --- but we still used the incomplete initFileRelationIds list for the rest of the current session. This could result in wrong decisions about whether the session's own actions require deletion of the init file, potentially allowing an init file created by some other concurrent session to be left around even though it's been made stale. Since we don't support changing the schema of a system catalog at runtime, the only likely scenario in which this would cause a problem in the field involves a "vacuum full" on a catalog concurrently with other activity, and even then it's far from easy to provoke. Remarkably, this has been broken since 2002 (in commit 786340441706ac1957a031f11ad1c2e5b6e18314), but we had never seen a reproducible test case until recently. If it did happen in the field, the symptoms would probably involve unexpected "cache lookup failed" errors to begin with, then "could not open file" failures after the next checkpoint, as all accesses to the affected catalog stopped working. Recovery would require manually removing the stale "pg_internal.init" file. To fix, get rid of the initFileRelationIds list, and instead consult syscache.c's list of relations used in catalog caches to decide whether a relation is included in the init file. This should be a tad more efficient anyway, since we're replacing linear search of a list with ~100 entries with a binary search. It's a bit ugly that the init file contents are now so directly tied to the catalog caches, but in practice that won't make much difference. Back-patch to all supported branches.
*	Fix incorrect order of database-locking operations in InitPostgres().	Tom Lane	2015-06-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We should set MyProc->databaseId after acquiring the per-database lock, not beforehand. The old way risked deadlock against processes trying to copy or delete the target database, since they would first acquire the lock and then wait for processes with matching databaseId to exit; that left a window wherein an incoming process could set its databaseId and then block on the lock, while the other process had the lock and waited in vain for the incoming process to exit. CountOtherDBBackends() would time out and fail after 5 seconds, so this just resulted in an unexpected failure not a permanent lockup, but it's still annoying when it happens. A real-world example of a use-case is that short-duration connections to a template database should not cause CREATE DATABASE to fail. Doing it in the other order should be fine since the contract has always been that processes searching the ProcArray for a database ID must hold the relevant per-database lock while searching. Thus, this actually removes the former race condition that required an assumption that storing to MyProc->databaseId is atomic. It's been like this for a long time, so back-patch to all active branches.
*	Cope with possible failure of the oldest MultiXact to exist.	Robert Haas	2015-06-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recent commits, mainly b69bf30b9bfacafc733a9ba77c9587cf54d06c0c and 53bb309d2d5a9432d2602c93ed18e58bd2924e15, introduced mechanisms to protect against wraparound of the MultiXact member space: the number of multixacts that can exist at one time is limited to 2^32, but the total number of members in those multixacts is also limited to 2^32, and older code did not take care to enforce the second limit, potentially allowing old data to be overwritten while it was still needed. Unfortunately, these new mechanisms failed to account for the fact that the code paths in which they run might be executed during recovery or while the cluster was in an inconsistent state. Also, they failed to account for the fact that users who used pg_upgrade to upgrade a PostgreSQL version between 9.3.0 and 9.3.4 might have might oldestMultiXid = 1 in the control file despite the true value being larger. To fix these problems, first, avoid unnecessarily examining the mmembers of MultiXacts when the cluster is not known to be consistent. TruncateMultiXact has done this for a long time, and this patch does not fix that. But the new calls used to prevent member wraparound are not needed until we reach normal running, so avoid calling them earlier. (SetMultiXactIdLimit is actually called before InRecovery is set, so we can't rely on that; we invent our own multixact-specific flag instead.) Second, make failure to look up the members of a MultiXact a non-fatal error. Instead, if we're unable to determine the member offset at which wraparound would occur, postpone arming the member wraparound defenses until we are able to do so. If we're unable to determine the member offset that should force autovacuum, force it continuously until we are able to do so. If we're unable to deterine the member offset at which we should truncate the members SLRU, log a message and skip truncation. An important consequence of these changes is that anyone who does have a bogus oldestMultiXid = 1 value in pg_control will experience immediate emergency autovacuuming when upgrading to a release that contains this fix. The release notes should highlight this fact. If a user has no pg_multixact/offsets/0000 file, but has oldestMultiXid = 1 in the control file, they may wish to vacuum any tables with relminmxid = 1 prior to upgrading in order to avoid an immediate emergency autovacuum after the upgrade. This must be done with a PostgreSQL version 9.3.5 or newer and with vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age set to 0. This patch also adds an additional log message at each database server startup, indicating either that protections against member wraparound have been engaged, or that they have not. In the latter case, once autovacuum has advanced oldestMultiXid to a sane value, the message indicating that the guards have been engaged will appear at the next checkpoint. A few additional messages have also been added at the DEBUG1 level so that the correct operation of this code can be properly audited. Along the way, this patch fixes another, related bug in TruncateMultiXact that has existed since PostgreSQL 9.3.0: when no MultiXacts exist at all, the truncation code looks up NextMultiXactId, which doesn't exist yet. This can lead to TruncateMultiXact removing every file in pg_multixact/offsets instead of keeping one around, as it should. This in turn will cause the database server to refuse to start afterwards. Patch by me. Review by Álvaro Herrera, Andres Freund, Noah Misch, and Thomas Munro.
*	pgindent run on access/transam/multixact.c	Alvaro Herrera	2015-06-04
\| \| \| \| \| \| \| \| \|	This file has been patched over and over, and the differences to master caused by pgindent are annoying enough that it seems saner to make the older branches look the same. Backpatch to 9.3, which is as far back as backpatching of bugfixes is necessary.
*	Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.	Tom Lane	2015-06-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the inner side of a nestloop SEMI or ANTI join is an indexscan that uses all the join clauses as indexquals, it can be presumed that both matched and unmatched outer rows will be processed very quickly: for matched rows, we'll stop after fetching one row from the indexscan, while for unmatched rows we'll have an indexscan that finds no matching index entries, which should also be quick. The planner already knew about this, but it was nonetheless charging for at least one full run of the inner indexscan, as a consequence of concerns about the behavior of materialized inner scans --- but those concerns don't apply in the fast case. If the inner side has low cardinality (many matching rows) this could make an indexscan plan look far more expensive than it actually is. To fix, rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we don't add the inner scan cost until we've inspected the indexquals, and then we can add either the full-run cost or just the first tuple's cost as appropriate. Experimentation with this fix uncovered another problem: add_path and friends were coded to disregard cheap startup cost when considering parameterized paths. That's usually okay (and desirable, because it thins the path herd faster); but in this fast case for SEMI/ANTI joins, it could result in throwing away the desired plain indexscan path in favor of a bitmap scan path before we ever get to the join costing logic. In the many-matching-rows cases of interest here, a bitmap scan will do a lot more work than required, so this is a problem. To fix, add a per-relation flag consider_param_startup that works like the existing consider_startup flag, but applies to parameterized paths, and set it for relations that are the inside of a SEMI or ANTI join. To make this patch reasonably safe to back-patch, care has been taken to avoid changing the planner's behavior except in the very narrow case of SEMI/ANTI joins with inner indexscans. There are places in compare_path_costs_fuzzily and add_path_precheck that are not terribly consistent with the new approach, but changing them will affect planner decisions at the margins in other cases, so we'll leave that for a HEAD-only fix. Back-patch to 9.3; before that, the consider_startup flag didn't exist, meaning that the second aspect of the patch would be too invasive. Per a complaint from Peter Holzer and analysis by Tomas Vondra.
*	Remove special cases for ETXTBSY from new fsync'ing logic.	Tom Lane	2015-05-29
\| \| \| \| \| \| \| \| \| \|	The argument that this is a sufficiently-expected case to be silently ignored seems pretty thin. Andres had brought it up back when we were still considering that most fsync failures should be hard errors, and it probably would be legit not to fail hard for ETXTBSY --- but the same is true for EROFS and other cases, which is why we gave up on hard failures. ETXTBSY is surely not a normal case, so logging the failure seems fine from here.
*	Fix fsync-at-startup code to not treat errors as fatal.	Tom Lane	2015-05-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious regression, namely that if its scan of the data directory came across any un-fsync-able files, it would fail and thereby prevent database startup. Worse yet, symlinks to such files also caused the problem, which meant that crash restart was guaranteed to fail on certain common installations such as older Debian. After discussion, we agreed that (1) failure to start is worse than any consequence of not fsync'ing is likely to be, therefore treat all errors in this code as nonfatal; (2) we should not chase symlinks other than those that are expected to exist, namely pg_xlog/ and tablespace links under pg_tblspc/. The latter restriction avoids possibly fsync'ing a much larger part of the filesystem than intended, if the user has left random symlinks hanging about in the data directory. This commit takes care of that and also does some code beautification, mainly moving the relevant code into fd.c, which seems a much better place for it than xlog.c, and making sure that the conditional compilation for the pre_sync_fname pass has something to do with whether pg_flush_data works. I also relocated the call site in xlog.c down a few lines; it seems a bit silly to be doing this before ValidateXLOGDirectoryStructure(). The similar logic in initdb.c ought to be made to match this, but that change is noncritical and will be dealt with separately. Back-patch to all active branches, like the prior commit. Abhijit Menon-Sen and Tom Lane
*	Fix pg_get_functiondef() to print a function's LEAKPROOF property.	Tom Lane	2015-05-28
\| \| \| \| \| \| \|	Seems to have been an oversight in the original leakproofness patch. Per report and patch from Jeevan Chalke. In passing, prettify some awkward leakproof-related code in AlterFunction.