path: root/src/backend/access
* Fix possible pg_basebackup failure on standby with "include WAL". (Robert Haas, 2016-10-27)

  If a restartpoint flushed no dirty buffers, it could fail to update the minimum recovery point, leading to a minimum recovery point prior to the starting REDO location. perform_base_backup() would interpret that as meaning that no WAL files at all needed to be included in the backup, failing an internal sanity check. To fix, have restartpoints always update the minimum recovery point to just after the checkpoint record itself, so that the file (or files) containing the checkpoint record will always be included in the backup.

  Code by Amit Kapila, per a design suggestion by me, with some additional work on the code comment by me. Test case by Michael Paquier. Report by Kyotaro Horiguchi.
* Preserve commit timestamps across clean restart. (Alvaro Herrera, 2016-10-24)

  An oversight in setting the boundaries of known commit timestamps during startup caused old commit timestamps to become inaccessible after a server restart.

  Author and reporter: Julien Rouhaud
  Review, test code: Craig Ringer
* Fix WAL-logging of FSM and VM truncation. (Heikki Linnakangas, 2016-10-19)

  When a relation is truncated, it is important that the FSM is truncated as well. Otherwise, after recovery, the FSM can return a page that has been truncated away, leading to errors like:

    ERROR: could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes

  We were using MarkBufferDirtyHint() to dirty the buffer holding the last remaining page of the FSM, but during recovery, that might in fact not dirty the page, and the FSM update might be lost. To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty() requires us to do WAL-logging ourselves, to protect from a torn page, if checksumming is enabled.

  Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log when checksumming is enabled.

  Analysis by Pavan Deolasee.
  Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>
  Backpatch to 9.3, where we got data checksums.
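  For illustration, a minimal sketch in C of the pattern this fix describes, assuming the backend's bufmgr and xlog APIs; fsm_buf and the elided truncation logic are placeholders, not the literal patch:

  ```c
  /* Sketch only: assumes a PostgreSQL backend context; fsm_buf is the
   * buffer holding the last remaining FSM page after truncation. */
  START_CRIT_SECTION();

  /* ... zero out the tail of the FSM page here ... */

  /*
   * MarkBufferDirtyHint() may silently skip dirtying the page during
   * recovery; MarkBufferDirty() never does, so the update cannot be lost.
   */
  MarkBufferDirty(fsm_buf);

  /*
   * Having modified the page, a torn write would be flagged as corruption
   * when checksums are enabled, so emit a full-page image ourselves.
   */
  if (XLogHintBitIsNeeded())
      log_newpage_buffer(fsm_buf, false);

  END_CRIT_SECTION();
  ```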
* Fix outdated comments: the GiST search queue is not an RBTree anymore. (Heikki Linnakangas, 2016-09-20)

  The GiST search queue has been implemented as a pairing heap rather than as a red-black tree since 9.5 (commit e7032610). I neglected these comments in that commit.
* Fix locking a tuple updated by an aborted (sub)transaction. (Alvaro Herrera, 2016-09-09)

  When heap_lock_tuple decides to follow the update chain, it tried to also lock any version of the tuple that was created by an update that was subsequently rolled back. This is pointless, since for all intents and purposes that tuple exists no more; and moreover it causes misbehavior, as reported independently by Marko Tiikkaja and Marti Raudsepp: some SELECT FOR UPDATE/SHARE queries may fail to return the tuples, and assertion-enabled builds crash.

  Fix by having heap_lock_updated_tuple test the xmin and return success immediately if the tuple was created by an aborted transaction. The condition where tuples become invisible occurs when an updated tuple chain is followed by heap_lock_updated_tuple, which reports the problem as HeapTupleSelfUpdated to its caller heap_lock_tuple, which in turn propagates that code outwards, possibly leading the calling code (ExecLockRows) to believe that the tuple exists no longer.

  Backpatch to 9.3. Only on 9.5 and newer does this lead to a visible failure, because of commit 27846f02c176; before that, heap_lock_tuple skips the whole dance when the tuple is already locked by the same transaction, because of the ancient HeapTupleSatisfiesUpdate behavior. Still, the buggy condition may also exist in more convoluted scenarios involving concurrent transactions, so it seems safer to fix the bug in the old branches too.

  Discussion:
  https://www.postgresql.org/message-id/CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
  https://www.postgresql.org/message-id/48d3eade-98d3-8b9a-477e-1a8dc32a724d@joh.to
* Fix corrupt GIN_SEGMENT_ADDITEMS WAL records on big-endian hardware. (Tom Lane, 2016-09-03)

  computeLeafRecompressWALData() tried to produce a uint16 WAL log field by memcpy'ing the first two bytes of an int-sized variable. That accidentally works on little-endian hardware, but not at all on big-endian. Replay then thinks it's looking at an ADDITEMS action with zero entries, and reads the first two bytes of the first TID therein as the next segno/action, typically leading to "unexpected GIN leaf action" errors during replay. Even if replay failed to crash, the resulting GIN index page would surely be incorrect. To fix, just declare the variable as uint16 instead.

  Per bug #14295 from Spencer Thomason (much thanks to Spencer for turning his problem into a self-contained test case). This likely also explains a previous report of the same symptom from Bernd Helmle. Back-patch to 9.4 where the problem was introduced (by commit 14d02f0bb).

  Discussion: <20160826072658.15676.7628@wrigleys.postgresql.org>
  Possible-Report: <2DA7350F7296B2A142272901@eje.land.credativ.lan>
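  To see why the memcpy only "works" on little-endian machines, here is a standalone C demonstration of the generic hazard (not the GIN code itself):

  ```c
  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  int
  main(void)
  {
      int      wide = 513;        /* 0x00000201: fits easily in 16 bits */
      uint16_t field;

      /* Buggy pattern: copy the first two bytes of an int-sized variable.
       * On little-endian those are the low-order bytes (513); on big-endian
       * they are the high-order bytes, silently corrupting the field. */
      memcpy(&field, &wide, sizeof(uint16_t));
      printf("first-two-bytes copy: %u\n", (unsigned) field);

      /* Fixed pattern: declare the source with the intended width, so the
       * copy is byte-order-agnostic. */
      uint16_t narrow = (uint16_t) wide;
      memcpy(&field, &narrow, sizeof(uint16_t));
      printf("uint16 source copy:   %u\n", (unsigned) field);
      return 0;
  }
  ```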
* Prevent starting a standalone backend with standby_mode on. (Tom Lane, 2016-08-31)

  This can't really work because standby_mode expects there to be more WAL arriving, which there will not ever be because there's no WAL receiver process to fetch it. Moreover, if standby_mode is on then hot standby might also be turned on, causing even more strangeness because that expects read-only sessions to be executing in parallel. Bernd Helmle reported a case where btree_xlog_delete_get_latestRemovedXid got confused, but rather than band-aiding individual problems it seems best to prevent getting anywhere near this state in the first place. Back-patch to all supported branches.

  In passing, also fix some omissions of errcodes in other ereport's in readRecoveryCommandFile().

  Michael Paquier (errcode hacking by me)
  Discussion: <00F0B2CEF6D0CEF8A90119D4@eje.credativ.lan>
* Fix pg_xlogdump so that it handles a cross-page XLP_FIRST_IS_CONTRECORD record. (Fujii Masao, 2016-08-29)

  Previously pg_xlogdump failed to dump the contents of the WAL file if the file starts with a continuation WAL record which spans more than one page. Since pg_xlogdump assumed that the continuation record always fits on a page, it could not find the valid WAL record to start reading from in that case. This patch changes pg_xlogdump so that it can handle a continuation WAL record which crosses a page boundary and find the valid record to start reading from.

  Back-patch to 9.3 where pg_xlogdump was introduced.

  Author: Pavan Deolasee
  Reviewed-By: Michael Paquier and Craig Ringer
  Discussion: CABOikdPsPByMiG6J01DKq6om2+BNkxHTPkOyqHM2a4oYwGKsqQ@mail.gmail.com
* Fix potential memory leakage from HandleParallelMessages(). (Tom Lane, 2016-08-26)

  HandleParallelMessages leaked memory into the caller's context. Since it's called from ProcessInterrupts, there is basically zero certainty as to what CurrentMemoryContext is, which means we could be leaking into long-lived contexts. Over the processing of many worker messages that would grow to be a problem. Things could be even worse than just a leak, if we happened to service the interrupt while ErrorContext is current: elog.c thinks it can reset that on its own whim, possibly yanking storage out from under HandleParallelMessages.

  Give HandleParallelMessages its own dedicated context instead, which we can reset during each call to ensure there's no accumulation of wasted memory.

  Discussion: <16610.1472222135@sss.pgh.pa.us>
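  The shape of the fix, as a hedged sketch against the 9.5-era MemoryContext API (the function body is a placeholder; only the context handling is the point):

  ```c
  static MemoryContext hpm_context = NULL;   /* illustrative name */

  static void
  handle_parallel_messages_sketch(void)
  {
      MemoryContext oldcontext;

      if (hpm_context == NULL)    /* first time through? */
          hpm_context = AllocSetContextCreate(TopMemoryContext,
                                              "HandleParallelMessages",
                                              ALLOCSET_DEFAULT_MINSIZE,
                                              ALLOCSET_DEFAULT_INITSIZE,
                                              ALLOCSET_DEFAULT_MAXSIZE);
      else
          MemoryContextReset(hpm_context);   /* discard last call's garbage */

      oldcontext = MemoryContextSwitchTo(hpm_context);

      /* ... drain and dispatch worker messages; any palloc() here goes
       * into hpm_context, not whatever context the interrupt arrived in ... */

      MemoryContextSwitchTo(oldcontext);
  }
  ```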
* Fix small query-lifespan memory leak in bulk updates. (Tom Lane, 2016-08-24)

  When there is an identifiable REPLICA IDENTITY index on the target table, heap_update leaks the id_attrs bitmapset. That's not many bytes, but it adds up over enough rows, since the code typically runs in a query-lifespan context. Bug introduced in commit e55704d8b, which did a rather poor job of cloning the existing use-pattern for RelationGetIndexAttrBitmap().

  Per bug #14293 from Zhou Digoal. Back-patch to 9.4 where the bug was introduced.

  Report: <20160824114320.15676.45171@wrigleys.postgresql.org>
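  A hedged sketch of the leak and its one-line fix; the surrounding heap_update logic is elided:

  ```c
  /* Computed per updated row, in a query-lifespan memory context. */
  Bitmapset  *id_attrs = RelationGetIndexAttrBitmap(relation,
                                               INDEX_ATTR_BITMAP_IDENTITY_KEY);

  /* ... compare old and new tuple on the identity-key columns ... */

  bms_free(id_attrs);     /* the missing call: without it, each updated row
                           * leaves a small bitmapset behind until the query
                           * ends */
  ```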
* Fix deletion of speculatively inserted TOAST on conflict. (Andres Freund, 2016-08-17)

  INSERT .. ON CONFLICT runs a pre-check of the possible conflicting constraints before performing the actual speculative insertion. In case the inserted tuple included TOASTed columns, the ON CONFLICT condition would be handled correctly if the conflict was caught by the pre-check; but if two transactions entered the speculative insertion phase at the same time, one would have to re-try, and the code for aborting a speculative insertion did not handle deleting the speculatively inserted TOAST datums correctly.

  TOAST deletion would fail with "ERROR: attempted to delete invisible tuple", as we attempted to remove the TOAST tuples using simple_heap_delete, which reasoned that the given tuples should not be visible to the command that wrote them.

  This commit updates the heap_abort_speculative() function, which aborts the conflicting tuple, to delete the associated TOAST datums itself via toast_delete. Like before, the inserted TOAST rows are not marked as being speculative.

  This commit also adds an isolationtester spec test, exercising the relevant code path. Unfortunately 9.5 cannot handle two waiting sessions, and thus cannot execute this test.

  Reported-By: Viren Negi, Oskari Saarenmaa
  Author: Oskari Saarenmaa, edited a bit by me
  Bug: #14150
  Discussion: <20160519123338.12513.20271@wrigleys.postgresql.org>
  Backpatch: 9.5, where ON CONFLICT was introduced
* In B-tree page deletion, clean up properly after page deletion failure. (Tom Lane, 2016-08-06)

  In _bt_unlink_halfdead_page(), we might fail to find an immediate left sibling of the target page, perhaps because of corruption of the page sibling links. The code intends to cope with this by just abandoning the deletion attempt; but what actually happens is that it fails outright due to releasing the same buffer lock twice. (And error recovery masks a second problem, which is possible leakage of a pin on another page.)

  Seems to have been introduced by careless refactoring in commit efada2b8e. Since there are multiple cases to consider, let's make releasing the buffer lock in the failure case the responsibility of _bt_unlink_halfdead_page() not its caller.

  Also, avoid fetching the leaf page's left-link again after we've dropped lock on the page. This is probably harmless, but it's not exactly good coding practice.

  Per report from Kyotaro Horiguchi. Back-patch to 9.4 where the faulty code was introduced.

  Discussion: <20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp>
* Block interrupts during HandleParallelMessages(). (Tom Lane, 2016-08-02)

  As noted by Alvaro, there are CHECK_FOR_INTERRUPTS() calls in the shm_mq.c functions called by HandleParallelMessages(). I believe they're all unreachable since we always pass nowait = true, but it doesn't seem like a great idea to assume that no such call will ever be reachable from HandleParallelMessages(). If that did happen, there would be a risk of a recursive call to HandleParallelMessages(), which it does not appear to be designed for --- for example, there's nothing that would prevent out-of-order processing of received messages. And certainly such cases cannot easily be tested. So let's prevent it by holding off interrupts for the duration of the function.

  Back-patch to 9.5 which contains identical code.

  Discussion: <14869.1470083848@sss.pgh.pa.us>
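  The guard itself is tiny; a hedged sketch (the message loop is a placeholder):

  ```c
  static void
  handle_parallel_messages_guarded(void)   /* illustrative name */
  {
      HOLD_INTERRUPTS();

      /* ... read from each worker's shm_mq with nowait = true; if a
       * CHECK_FOR_INTERRUPTS() is reached in here, ProcessInterrupts()
       * refuses to run while InterruptHoldoffCount > 0, so this function
       * cannot be re-entered recursively ... */

      RESUME_INTERRUPTS();
  }
  ```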
* Sync 9.5 version of access/transam/parallel.c with HEAD. (Tom Lane, 2016-08-02)

  This back-patches commit a5fe473ad (notably, marking ParallelMessagePending as volatile, which is not particularly optional). I also back-patched some previous cosmetic changes to remove unnecessary diffs between the two branches. I'm unsure how much of this code is actually reachable in 9.5, but to the extent that it is reachable, it needs to be maintained, and minimizing cross-branch diffs will make that easier.
* Fix torn-page, unlogged xid and further risks from heap_update(). (Andres Freund, 2016-07-15)

  When heap_update needs to look for a page for the new tuple version, because the current one doesn't have sufficient free space, or when columns have to be processed by the tuple toaster, it has to release the lock on the old page during that. Otherwise there'd be lock ordering and lock nesting issues. To prevent concurrent sessions from trying to update / delete / lock the tuple while the page's content lock is released, the tuple's xmax is set to the current session's xid.

  That unfortunately was done without any WAL logging, thereby violating the rule that no XIDs may appear on disk without a corresponding WAL record. If the database were to crash / fail over when the page level lock is released, and some activity led to the page being written out to disk, the xid could end up being reused; potentially leading to the row becoming invisible.

  There might be additional risks from not having t_ctid point at the tuple itself, without having set the appropriate lock infomask fields.

  To fix, compute the appropriate xmax/infomask combination for locking the tuple, and perform WAL logging using the existing XLOG_HEAP_LOCK record. That allows the fix to be backpatched.

  This issue has existed for a long time. There appear to have been partial attempts at preventing dangers, but these never have fully been implemented, and were removed a long time ago, in 11919160 (cf. HEAP_XMAX_UNLOGGED).

  In master / 9.6, there's an additional issue, namely that the visibilitymap's freeze bit isn't reset at that point yet. Since that's a new issue, introduced only in a892234f830, that'll be fixed in a separate commit.

  Author: Masahiko Sawada and Andres Freund
  Reported-By: Different aspects by Thomas Munro, Noah Misch, and others
  Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
  Backpatch: 9.1/all supported versions
* Make HEAP_LOCK/HEAP2_LOCK_UPDATED replay reset HEAP_XMAX_INVALID. (Andres Freund, 2016-07-15)

  0ac5ad5 started to compress infomask bits in WAL records. Unfortunately the replay routines for XLOG_HEAP_LOCK/XLOG_HEAP2_LOCK_UPDATED forgot to reset the HEAP_XMAX_INVALID (and some other) hint bits.

  Luckily that's not problematic in the majority of cases, because after a crash/on a standby row locks aren't meaningful. Unfortunately that does not hold true in the presence of prepared transactions. This means that after a crash, or after promotion, row level locks held by a prepared, but not yet committed, transaction might not be enforced.

  Discussion: 20160715192319.ubfuzim4zv3rqnxv@alap3.anarazel.de
  Backpatch: 9.3, the oldest branch on which 0ac5ad5 is present.
* Avoid serializability errors when locking a tuple with a committed update. (Alvaro Herrera, 2016-07-15)

  When key-share locking a tuple that has been not-key-updated, and the update is a committed transaction, in some cases we raised serializability errors:

    ERROR: could not serialize access due to concurrent update

  Because the key-share doesn't conflict with the update, the error is unnecessary and inconsistent with the case that the update hasn't committed yet. This causes problems for some usage patterns, even if it can be claimed that it's sufficient to retry the aborted transaction: given a steady stream of updating transactions and a long locking transaction, the long transaction can be starved indefinitely despite multiple retries.

  To fix, we recognize that HeapTupleSatisfiesUpdate can return HeapTupleUpdated when an updating transaction has committed, and that we need to deal with that case exactly as if it were a non-committed update: verify whether the two operations conflict, and if not, carry on normally. If they do conflict, however, there is a difference: in the HeapTupleBeingUpdated case we can just sleep until the concurrent transaction is gone, while in the HeapTupleUpdated case this is not possible and we must raise an error instead.

  Per trouble report from Olivier Dony. In addition to a couple of test cases that verify the changed behavior, I added a test case to verify the behavior that remains unchanged, namely that errors are raised when an update that modifies the key is used. That must still generate serializability errors. One pre-existing test case changes behavior; per discussion, the new behavior is actually the desired one.

  Discussion:
  https://www.postgresql.org/message-id/560AA479.4080807@odoo.com
  https://www.postgresql.org/message-id/20151014164844.3019.25750@wrigleys.postgresql.org

  Backpatch to 9.3, where the problem appeared.
* Fix GiST index build for NaN values in geometric types. (Tom Lane, 2016-07-14)

  GiST index build could go into an infinite loop when presented with boxes (or points, circles or polygons) containing NaN component values. This happened essentially because the code assumed that x == x is true for any "double" value x; but it's not true for NaNs. The looping behavior was not the only problem though: we also attempted to sort the items using simple double comparisons. Since NaNs violate the trichotomy law, qsort could (in principle at least) get arbitrarily confused and mess up the sorting of ordinary values as well as NaNs. And we based splitting choices on box size calculations that could produce NaNs, again resulting in undesirable behavior.

  To fix, replace all comparisons of doubles in this logic with float8_cmp_internal, which is NaN-aware and is careful to sort NaNs consistently, higher than any non-NaN. Also rearrange the box size calculation to not produce NaNs; instead it should produce an infinity for a box with NaN on one side and not-NaN on the other.

  I don't by any means claim that this solves all problems with NaNs in geometric values, but it should at least make GiST index insertion work reliably with such data. It's likely that the index search side of things still needs some work, and probably regular geometric operations too. But with this patch we're laying down a convention for how such cases ought to behave.

  Per bug #14238 from Guang-Dih Lei. Back-patch to 9.2; the code used before commit 7f3bd86843e5aad8 is quite different and doesn't lock up on my simple test case, nor on the submitter's dataset.

  Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
  Discussion: <28685.1468246504@sss.pgh.pa.us>
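  For reference, a standalone C comparator in the spirit of float8_cmp_internal, restoring the trichotomy that qsort needs by sorting NaNs consistently above every ordinary value (this is the generic technique, not the GiST code):

  ```c
  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  static int
  float8_cmp_sketch(const void *pa, const void *pb)
  {
      double a = *(const double *) pa;
      double b = *(const double *) pb;

      /* NaNs compare equal to each other and greater than any non-NaN,
       * so the comparator again defines a total order. */
      if (isnan(a))
          return isnan(b) ? 0 : 1;
      if (isnan(b))
          return -1;
      if (a < b)
          return -1;
      if (a > b)
          return 1;
      return 0;
  }

  int
  main(void)
  {
      double vals[] = {3.0, NAN, 1.0, 2.0};

      qsort(vals, 4, sizeof(double), float8_cmp_sketch);
      for (int i = 0; i < 4; i++)
          printf("%g ", vals[i]);     /* prints: 1 2 3 nan */
      printf("\n");
      return 0;
  }
  ```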
* Fix handling of multixacts predating pg_upgrade. (Alvaro Herrera, 2016-06-24)

  After pg_upgrade, it is possible that some tuples' Xmax have multixacts corresponding to the old installation; such multixacts cannot have running members anymore. In many code sites we already know not to read them and clobber them silently, but at least when VACUUM tries to freeze a multixact or determine whether one needs freezing, there's an attempt to resolve it to its member transactions by calling GetMultiXactIdMembers, and if the multixact value is "in the future" with regards to the current valid multixact range, an error like this is raised:

    ERROR: MultiXactId 123 has not been created yet -- apparent wraparound

  and vacuuming fails. Per discussion with Andrew Gierth, it is completely bogus to try to resolve multixacts coming from before a pg_upgrade, regardless of where they stand with regards to the current valid multixact range.

  It's possible to get from under this problem by doing SELECT FOR UPDATE of the problem tuples, but if tables are large, this is slow and tedious, so a more thorough solution is desirable.

  To fix, we realize that multixacts in xmax created in 9.2 and previous have a specific bit pattern that is never used in 9.3 and later (we already knew this, per comments and infomask tests sprinkled in various places, but we weren't leveraging this knowledge appropriately). Whenever the infomask of the tuple matches that bit pattern, we just ignore the multixact completely as if Xmax wasn't set; or, in the case of tuple freezing, we act as if an unwanted value is set and clobber it without decoding. This guarantees that no errors will be raised, and that the values will be progressively removed until all tables are clean. Most callers of GetMultiXactIdMembers are patched to recognize directly that the value is a removable "empty" multixact and avoid calling GetMultiXactIdMembers altogether.

  To avoid changing the signature of GetMultiXactIdMembers() in back branches, we keep the "allow_old" boolean flag but rename it to "from_pgupgrade"; if the flag is true, we always return an empty set instead of looking up the multixact. (I suppose we could remove the argument in the master branch, but I chose not to do so in this commit.)

  This was broken all along, but the error-facing message appeared first because of commit 8e9a16ab8f7f and was partially fixed in a25c2b7c4db3. This fix, backpatched all the way back to 9.3, goes approximately in the same direction as a25c2b7c4db3 but should cover all cases.

  Bug analysis by Andrew Gierth and Álvaro Herrera.

  A number of public reports match this bug:
  https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.net
  https://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.com
  https://www.postgresql.org/message-id/556439CF.7070109@pscs.co.uk
  https://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.com
  https://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
* Fix building of large (bigger than shared_buffers) hash indexes. (Tom Lane, 2016-06-24)

  When the index is predicted to need more than NBuffers buckets, CREATE INDEX attempts to sort the index entries by hash key before insertion, so as to reduce thrashing. This code path got broken by commit 9f03ca915196dfc8, which overlooked that _hash_form_tuple() is not just an alias for index_form_tuple(). The index got built anyway, but with garbage data, so that searches for pre-existing tuples always failed. Fix by refactoring to separate construction of the indexable data from calling index_form_tuple().

  Per bug #14210 from Daniel Newman. Back-patch to 9.5 where the bug was introduced.

  Report: <20160623162507.17237.39471@wrigleys.postgresql.org>
* Finish up XLOG_HINT renaming. (Alvaro Herrera, 2016-06-17)

  Commit b8fd1a09f3 renamed XLOG_HINT to XLOG_FPI, but neglected two places.

  Backpatch to 9.3, like that commit.
* Fix btree mark/restore bug. (Kevin Grittner, 2016-06-02)

  Commit 2ed5b87f96d473962ec5230fd820abfeaccb2069 introduced a bug in mark/restore, in an attempt to optimize repeated restores to the same page. This caused an assertion failure during a merge join which fed directly from an index scan, although the impact would not be limited to that case. Revert the bad chunk of code from that commit.

  While investigating this bug it was discovered that a particular "paranoia" set of the mark position field would not prevent bad behavior; it would just make it harder to diagnose. Change that into an assertion, which will draw attention to any future problem in that area more directly.

  Backpatch to 9.5, where the bug was introduced.

  Bug #14169 reported by Shinta Koyanagi. Preliminary analysis by Tom Lane identified which commit caused the bug.
* Fix PageAddItem BRIN bug. (Alvaro Herrera, 2016-05-30)

  BRIN was relying on the ability to remove a tuple from an index page, then putting another tuple in the same line pointer. But PageAddItem refuses to add a tuple beyond the first free item past the last used item, and in particular, it rejects an attempt to add an item to an empty page anywhere other than the first line pointer. PageAddItem issues a WARNING and indicates to the caller that it failed, which in turn causes the BRIN calling code to issue a PANIC, so the whole sequence looks like this:

    WARNING: specified item offset is too large
    PANIC: failed to add BRIN tuple

  To fix, create a new function PageAddItemExtended which is like PageAddItem except that the two boolean arguments become a flags bitmap; the "overwrite" and "is_heap" boolean flags in PageAddItem become PAI_OVERWRITE and PAI_IS_HEAP flags in the new function, and a new flag PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN. PageAddItem() retains its original signature, for compatibility with third-party modules (other callers in core code are not modified, either).

  Also, in the belt-and-suspenders spirit, I added a new sanity check in brinGetTupleForHeapBlock to raise an error if a TID found in the revmap is not marked as live by the page header. This causes it to react with "ERROR: corrupted BRIN index" to the bug at hand, rather than a hard crash.

  Backpatch to 9.5.

  Bug reported by Andreas Seltenreich as detected by his handy sqlsmith fuzzer.
  Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
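  A hedged sketch of the described API change (the PAI_* names come from the commit; the body here is a placeholder, not the real bufpage.c code):

  ```c
  #define PAI_OVERWRITE        (1 << 0)
  #define PAI_IS_HEAP          (1 << 1)
  #define PAI_ALLOW_FAR_OFFSET (1 << 2)   /* the new behavior BRIN needs */

  OffsetNumber
  PageAddItemExtended(Page page, Item item, Size size,
                      OffsetNumber offsetNumber, int flags)
  {
      /* ... when PAI_ALLOW_FAR_OFFSET is set, accept an offsetNumber past
       * the first free slot instead of issuing the WARNING ... */
      return InvalidOffsetNumber;     /* placeholder */
  }

  /* Old entry point kept with its original signature for third-party code. */
  OffsetNumber
  PageAddItem(Page page, Item item, Size size,
              OffsetNumber offsetNumber, bool overwrite, bool is_heap)
  {
      return PageAddItemExtended(page, item, size, offsetNumber,
                                 (overwrite ? PAI_OVERWRITE : 0) |
                                 (is_heap ? PAI_IS_HEAP : 0));
  }
  ```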
* Fix bogus comments. (Alvaro Herrera, 2016-05-12)

  Some comments mentioned XLogReplayBuffer, but there's no such function: that was an interim name for a function that got renamed to XLogReadBufferForRedo, before commit 2c03216d831160 was pushed.
* Fix obsolete comment. (Alvaro Herrera, 2016-05-12)
* Fix memory leak and other bugs in ginPlaceToPage() & subroutines. (Tom Lane, 2016-04-20)

  Commit 36a35c550ac114ca turned the interface between ginPlaceToPage and its subroutines in gindatapage.c and ginentrypage.c into a royal mess: page-update critical sections were started in one place and finished in another place not even in the same file, and the very same subroutine might return having started a critical section or not. Subsequent patches band-aided over some of the problems with this design by making things even messier.

  One user-visible resulting problem is memory leaks caused by the need for the subroutines to allocate storage that would survive until ginPlaceToPage calls XLogInsert (as reported by Julien Rouhaud). This would not typically be noticeable during retail index updates. It could be visible in a GIN index build, in the form of memory consumption swelling to several times the commanded maintenance_work_mem.

  Another rather nasty problem is that in the internal-page-splitting code path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before entering the critical section that it's supposed to be cleared in; a failure in between would leave the index in a corrupt state. There were also assorted coding-rule violations with little immediate consequence but possible long-term hazards, such as beginning an XLogInsert sequence before entering a critical section, or calling elog(DEBUG) inside a critical section.

  To fix, redefine the API between ginPlaceToPage() and its subroutines by splitting the subroutines into two parts. The "beginPlaceToPage" subroutine does what can be done outside a critical section, including full computation of the result pages into temporary storage when we're going to split the target page. The "execPlaceToPage" subroutine is called within a critical section established by ginPlaceToPage(), and it handles the actual page update in the non-split code path. The critical section, as well as the XLOG insertion call sequence, are both now always started and finished in ginPlaceToPage(). Also, make ginPlaceToPage() create and work in a short-lived memory context to eliminate the leakage problem. (Since a short-lived memory context had been getting created in the most common code path in the subroutines, this shouldn't cause any noticeable performance penalty; we're just moving the overhead up one call level.)

  In passing, fix a bunch of comments that had gone unmaintained throughout all this klugery.

  Report: <571276DD.5050303@dalibo.com>
* Fix memory leak in GIN index scans. (Tom Lane, 2016-04-15)

  The code had a query-lifespan memory leak when encountering GIN entries that have posting lists (rather than posting trees, ie, there are a relatively small number of heap tuples containing this index key value). With a suitable data distribution this could add up to a lot of leakage.

  Problem seems to have been introduced by commit 36a35c550, so back-patch to 9.4.

  Julien Rouhaud
* Fix typos in comments. (Alvaro Herrera, 2016-03-15)
* Fix memory leak in repeated GIN index searches. (Tom Lane, 2016-03-13)

  Commit d88976cfa1302e8d removed this code from ginFreeScanKeys():

    -    if (entry->list)
    -        pfree(entry->list);

  evidently in the belief that that ItemPointer array is allocated in the keyCtx and so would be reclaimed by the following MemoryContextReset. Unfortunately, it isn't and it won't. It'd likely be a good idea for that to become so, but as a simple and back-patchable fix in the meantime, restore this code to ginFreeScanKeys(). Also, add a similar pfree to where startScanEntry() is about to zero out entry->list. I am not sure if there are any code paths where this change prevents a leak today, but it seems like cheap future-proofing.

  In passing, make the initial allocation of so->entries[] use palloc not palloc0. The code doesn't depend on unused entries being zero; if it did, the array-enlargement code in ginFillScanEntry() would be wrong. So using palloc0 initially can only serve to confuse readers about what the invariant is.

  Per report from Felipe de Jesús Molina Bravo, via Jaime Casanova in <CAJGNTeMR1ndMU2Thpr8GPDUfiHTV7idELJRFusA5UXUGY1y-eA@mail.gmail.com>
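  The restored cleanup, sketched (field names per the commit message; the NULLing is defensive rather than part of the quoted diff):

  ```c
  /* In ginFreeScanKeys(), and similarly before startScanEntry() zeroes
   * the pointer: entry->list is NOT allocated in keyCtx, so the
   * MemoryContextReset that follows will not reclaim it. */
  if (entry->list)
  {
      pfree(entry->list);
      entry->list = NULL;
  }
  ```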
* Avoid unlikely data-loss scenarios due to rename() without fsync. (Andres Freund, 2016-03-09)

  Renaming a file using rename(2) is not guaranteed to be durable in face of crashes. Use the previously added durable_rename()/durable_link_or_rename() in various places where we previously just renamed files.

  Most of the changed call sites are arguably not critical, but it seems better to err on the side of too much durability. The most prominent known case where the previously missing fsyncs could cause data loss is crashes at the end of a checkpoint. After the actual checkpoint has been performed, old WAL files are recycled. When they're filled, their contents are fdatasynced, but we did not fsync the containing directory. An OS/hardware crash in an unfortunate moment could then end up leaving that file with its old name, but new content; WAL replay would thus not replay it.

  Reported-By: Tomas Vondra
  Author: Michael Paquier, Tomas Vondra, Andres Freund
  Discussion: 56583BDD.9060302@2ndquadrant.com
  Backpatch: All supported branches
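  The underlying technique, as a standalone POSIX C sketch (PostgreSQL's durable_rename() does more, e.g. fsyncing the file itself as well; this shows only the directory-fsync step that was missing):

  ```c
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Rename oldpath to newpath, then fsync the containing directory so the
   * new directory entry itself survives a crash. */
  static int
  rename_durable_sketch(const char *oldpath, const char *newpath,
                        const char *dirpath)
  {
      int fd;

      if (rename(oldpath, newpath) < 0)
          return -1;

      fd = open(dirpath, O_RDONLY);
      if (fd < 0)
          return -1;
      if (fsync(fd) < 0)          /* without this, the rename may be lost */
      {
          close(fd);
          return -1;
      }
      return close(fd);
  }
  ```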
* Fix incorrect handling of NULL index entries in indexed ROW() comparisons. (Tom Lane, 2016-03-09)

  An index search using a row comparison such as ROW(a, b) > ROW('x', 'y') would stop upon reaching a NULL entry in the "b" column, ignoring the fact that there might be non-NULL "b" values associated with later values of "a". This happens because _bt_mark_scankey_required() marks the subsidiary scankey for "b" as required, which is just wrong: it's for a column after the one with the first inequality key (namely "a"), and thus can't be considered a required match.

  This bit of brain fade dates back to the very beginnings of our support for indexed ROW() comparisons, in 2006. Kind of astonishing that no one came across it before Glen Takahashi, in bug #14010.

  Back-patch to all supported versions.

  Note: the given test case doesn't actually fail in unpatched 9.1, evidently because the fix for bug #6278 (i.e., stopping at nulls in either scan direction) is required to make it fail. I'm sure I could devise a case that fails in 9.1 as well, perhaps with something involving making a cursor back up; but it doesn't seem worth the trouble.
* Ignore recovery_min_apply_delay until recovery has reached consistent state. (Fujii Masao, 2016-03-06)

  Previously recovery_min_apply_delay was applied even before recovery had reached consistency. This could cause us to wait a long time unexpectedly for read-only connections to be allowed. It's problematic because the standby was useless during that wait time.

  This patch changes recovery_min_apply_delay so that it's applied once the database has reached the consistent state. That is, even if the delay is set, the standby tries to replay WAL records as fast as possible until it has reached consistency.

  Author: Michael Paquier
  Reviewed-By: Julien Rouhaud
  Reported-By: Greg Clough
  Backpatch: 9.4, where recovery_min_apply_delay was added
  Bug: #13770
  Discussion: http://www.postgresql.org/message-id/20151111155006.2644.84564@wrigleys.postgresql.org
* Revert buggy optimization of index scans. (Simon Riggs, 2016-03-03)

  606c0123d627 attempted to reduce cost of index scans using > and < strategies, though got that completely wrong in a few complex cases.

  Revert whole patch until we find a safe optimization.
* Correct StartupSUBTRANS for page wraparound. (Simon Riggs, 2016-02-19)

  StartupSUBTRANS() incorrectly handled cases near the max pageid in the subtrans data structure, which in some cases could lead to errors in startup for Hot Standby. This patch wraps the pageids correctly, avoiding any such errors.

  Identified by exhaustive crash testing by Jeff Janes.

  Jeff Janes
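  The wraparound-safe iteration pattern, as a standalone C sketch (NUM_PAGES stands in for the real SLRU page-space bound; the actual StartupSUBTRANS logic differs in detail):

  ```c
  #include <stdio.h>

  #define NUM_PAGES 32            /* illustrative; the real bound is derived
                                   * from MaxTransactionId */

  /* Visit every page from startPage through endPage, wrapping modulo the
   * page-id space. A plain "for (p = start; p <= end; p++)" loop breaks
   * when endPage has wrapped around past the maximum page id. */
  static void
  zero_pages(int startPage, int endPage)
  {
      int page = startPage;

      for (;;)
      {
          printf("zeroing page %d\n", page);
          if (page == endPage)
              break;
          page = (page + 1) % NUM_PAGES;
      }
  }

  int
  main(void)
  {
      zero_pages(30, 2);          /* visits 30, 31, 0, 1, 2 */
      return 0;
  }
  ```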
* Fix lossy KNN GiST when ordering operator returns non-float8 value. (Teodor Sigaev, 2016-02-02)

  KNN GiST with the recheck flag should return to the executor the same type as the ordering operator; GiST detects this type by looking at the return type of the function which implements the ordering operator. But occasionally the detecting code runs after the ordering operator's function has been replaced by the distance support function. The distance support function always returns float8, so the detecting code gets float8 instead of the actual return type of the ordering operator.

  Built-in opclasses don't have an ordering operator which returns a non-float8 value, so tests are impossible here, at least for now.

  Backpatch to 9.5 where lossy KNN was introduced.

  Author: Alexander Korotkov
  Report by: Artur Zakirov
* Fix misspelled function name in comment. (Heikki Linnakangas, 2016-02-01)
* Fix overly-strict assertions in spgtextproc.c. (Tom Lane, 2016-01-02)

  spg_text_inner_consistent is capable of reconstructing an empty string to pass down to the next index level; this happens if we have an empty string coming in, no prefix, and a dummy node label. (In practice, what is needed to trigger that is insertion of a whole bunch of empty-string values.) Then, we will arrive at the next level with in->level == 0 and a non-NULL (but zero length) in->reconstructedValue, which is valid but the Assert tests weren't expecting it.

  Per report from Andreas Seltenreich. This has no impact in non-Assert builds, so should not be a problem in production, but back-patch to all affected branches anyway.

  In passing, remove a couple of useless variable initializations and shorten the code by not duplicating DatumGetPointer() calls.
* Rename (new|old)estCommitTs to (new|old)estCommitTsXid. (Joe Conway, 2015-12-28)

  The variables newestCommitTs and oldestCommitTs sound as if they are timestamps, but in fact they are the transaction Ids that correspond to the newest and oldest timestamps rather than the actual timestamps. Rename these variables to reflect that they are actually xids: to wit newestCommitTsXid and oldestCommitTsXid respectively. Also modify related code in a similar fashion, particularly the user facing output emitted by pg_controldata and pg_resetxlog.

  Complaint and patch by me, review by Tom Lane and Alvaro Herrera. Backpatch to 9.5 where these variables were first introduced.
* Fix brin_summarize_new_values() to check index type and ownership. (Tom Lane, 2015-12-26)

  brin_summarize_new_values() did not check that the passed OID was for an index at all, much less that it was a BRIN index, and would fail in obscure ways if it wasn't (possibly damaging data first?). It also lacked any permissions test; by analogy to VACUUM, we should only allow the table's owner to summarize.

  Noted by Jeff Janes, fix by Michael Paquier and me
* Fix bug in SetOffsetVacuumLimit() triggered by find_multixact_start() failure. (Andres Freund, 2015-12-14)

  Previously, if find_multixact_start() failed, SetOffsetVacuumLimit() would install 0 into MultiXactState->offsetStopLimit if it previously succeeded.

  Luckily, there are no known cases where find_multixact_start() will return an error in 9.5 and above. But if it were to happen, for example due to filesystem permission issues, it'd be somewhat bad: GetNewMultiXactId() could continue allocating mxids even if close to a wraparound, or it could erroneously stop allocating mxids, even if no wraparound is looming. The wrong value would be corrected the next time SetOffsetVacuumLimit() is called, or by a restart.

  Reported-By: Noah Misch, although this is not his preferred fix
  Discussion: 20151210140450.GA22278@alap3.anarazel.de
  Backpatch: 9.5, where the bug was introduced as part of 4f627f
* Fix commit timestamp initialization. (Alvaro Herrera, 2015-12-11)

  This module needs explicit initialization in order to replay WAL records in recovery, but we had broken this recently following changes to make other (stranger) scenarios work correctly. To fix, rework the initialization sequence so that it always takes place before WAL replay commences for both master and standby.

  I could have gone for a more localized fix that just added a "startup" call for the master server, but it seemed better to restructure the existing callers as well so that the whole thing made more sense. As a drawback, there is more control logic in xlog.c now than previously, but doing otherwise meant passing down the ControlFile flag, which seemed uglier as a whole.

  This also meant adding a check to not re-execute ActivateCommitTs if it had already been called.

  Reported by Fujii Masao.

  Backpatch to 9.5.
* Improve some messages. (Peter Eisentraut, 2015-12-10)
* Fix bug leading to restoring unlogged relations from empty files. (Andres Freund, 2015-12-10)

  At the end of crash recovery, unlogged relations are reset to the empty state, using their init fork as the template. The init fork is copied to the main fork without going through shared buffers. Unfortunately WAL replay so far has not necessarily flushed writes from shared buffers to disk at that point. In normal crash recovery, and before the introduction of 'fast promotions' in fd4ced523 / 9.3, the END_OF_RECOVERY checkpoint flushes the buffers out in time. But with fast promotions that's not the case anymore.

  To fix, force WAL writes targeting the init fork to be flushed immediately (using the new FlushOneBuffer() function). In 9.5+ that flush can centrally be triggered from the code dealing with restoring full page writes (XLogReadBufferForRedoExtended), in earlier releases that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay function.

  Backpatch to 9.1, even if this currently is only known to trigger in 9.3+. Flushing earlier is more robust, and it is advantageous to keep the branches similar.

  Typical symptoms of this bug are errors like 'ERROR: index "..." contains unexpected zero page at block 0' shortly after promoting a node.

  Reported-By: Thom Brown
  Author: Andres Freund and Michael Paquier
  Discussion: 20150326175024.GJ451@alap3.anarazel.de
  Backpatch: 9.1-
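  A hedged sketch of the 9.5-style redo-side fix (placement paraphrased from the commit message):

  ```c
  /* In XLogReadBufferForRedoExtended(), after restoring the page: writes to
   * an init fork must reach disk before the end-of-recovery copy of that
   * fork, which bypasses shared buffers, takes place. */
  if (forknum == INIT_FORKNUM)
      FlushOneBuffer(buffer);
  ```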
* Further tweak commit_timestamp behavior. (Alvaro Herrera, 2015-12-03)

  As pointed out by Fujii Masao, we weren't quite there on a standby behaving sanely: first because we were failing to acquire the correct state in the case where no XLOG_PARAMETER_CHANGE message was sent (because a checkpoint had already happened after the setting was changed in the master, and then the standby was restarted); and second because promoting the standby with the feature enabled failed to activate it if the master had the feature disabled.

  This patch fixes both those misbehaviors hopefully without re-introducing any old problems.

  Also change the hint emitted in a standby together with the error message about the feature being disabled, to make it point out that the place to change the setting is the master. Otherwise, if the setting is already enabled in the standby, it is very confusing to have it say that the setting must be enabled ...

  Authors: Álvaro Herrera, Petr Jelínek.

  Backpatch to 9.5.
* Remove function names from some elog() calls in heapam.c. (Andres Freund, 2015-11-19)

  At least one of the names was, due to a function renaming late in the development of ON CONFLICT, wrong. Since including function names in error messages is against the message style guide anyway, remove them from the messages.

  Discussion: CAM3SWZT8paz=usgMVHm0XOETkQvzjRtAUthATnmaHQQY0obnGw@mail.gmail.com
  Backpatch: 9.5, where ON CONFLICT was introduced
* Message improvements. (Peter Eisentraut, 2015-11-16)
* Pass extra data to bgworkers, and use this to fix parallel contexts. (Robert Haas, 2015-11-05)

  Up until now, the total amount of data that could be passed to a background worker at startup was one datum, which can be as small as 4 bytes on some systems. That's enough to pass a dsm_handle or an array index, but not much else. Add a bgw_extra field to the BackgroundWorker struct, allowing up to 128 bytes to be passed to a new worker on any platform.

  Use this to fix a problem I recently discovered with the parallel context machinery added in 9.5: the master assigns each worker an array index, and each worker subsequently assigns itself an array index, and there's nothing to guarantee that the two sets of indexes match, leading to chaos.

  Normally, I would not back-patch the change to add bgw_extra, since it is basically a feature addition. However, since 9.5 is still in beta and there seems to be no other sensible way to repair the broken parallel context machinery, back-patch to 9.5. Existing background worker code can ignore the bgw_extra field without a problem, but might need to be recompiled since the structure size has changed.

  Report and patch by me. Review by Amit Kapila.
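  A hedged sketch of passing a small argument struct through bgw_extra; MyWorkerArgs and the registration details are illustrative, not from the commit:

  ```c
  typedef struct MyWorkerArgs     /* hypothetical payload */
  {
      int     worker_number;
      char    channel_name[64];
  } MyWorkerArgs;

  static void
  register_sketch_worker(const MyWorkerArgs *args)
  {
      BackgroundWorker worker;

      memset(&worker, 0, sizeof(worker));
      snprintf(worker.bgw_name, BGW_MAXLEN, "sketch worker");

      /* bgw_extra is a flat 128-byte (BGW_EXTRALEN) buffer; anything that
       * fits can be memcpy'd in and read back by the worker. */
      StaticAssertStmt(sizeof(MyWorkerArgs) <= BGW_EXTRALEN,
                       "MyWorkerArgs must fit in bgw_extra");
      memcpy(worker.bgw_extra, args, sizeof(MyWorkerArgs));

      /* ... fill in bgw_flags, bgw_start_time, bgw_main, etc., then
       * register the worker ... */
      RegisterBackgroundWorker(&worker);
  }
  ```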
* Fix serialization anomalies due to race conditions on INSERT. (Kevin Grittner, 2015-10-31)

  On insert the CheckForSerializableConflictIn() test was performed before the page(s) which were going to be modified had been locked (with an exclusive buffer content lock). If another process acquired a relation SIReadLock on the heap and scanned to a page on which an insert was going to occur before the page was so locked, a rw-conflict would be missed, which could allow a serialization anomaly to be missed. The window between the check and the page lock was small, so the bug was generally not noticed unless there was high concurrency with multiple processes inserting into the same table.

  This was reported by Peter Bailis as bug #11732, by Sean Chittenden as bug #13667, and by others.

  The race condition was eliminated in heap_insert() by moving the check down below the acquisition of the buffer lock, which had been the very next statement. Because of the loop locking and unlocking multiple buffers in heap_multi_insert() a check was added after all inserts were completed. The check before the start of the inserts was left because it might avoid a large amount of work to detect a serialization anomaly before performing all of the inserts and the related WAL logging.

  While investigating this bug, other SSI bugs which were even harder to hit in practice were noticed and fixed, an unnecessary check (covered by another check, so redundant) was removed from heap_update(), and comments were improved.

  Back-patch to all supported branches.

  Kevin Grittner and Thomas Munro
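  A hedged sketch of the reordering in heap_insert() (surrounding code elided; RelationGetBufferForTuple returns the target buffer already exclusive-locked):

  ```c
  /* Find (and exclusive-lock) the target page first ... */
  buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
                                     InvalidBuffer, options, bistate,
                                     &vmbuffer, &vmbuffer_other);

  /* ... and only then run the SSI conflict check, so no reader can take a
   * relation-level SIReadLock and scan this page in between. Formerly the
   * check ran before the lock was acquired. */
  CheckForSerializableConflictIn(relation, NULL, buffer);

  /* ... RelationPutHeapTuple(), WAL logging, etc. ... */
  ```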
* Message style improvements. (Peter Eisentraut, 2015-10-28)

  Message style, plurals, quoting, spelling, consistency with similar messages.
* Fix BRIN free space computations. (Alvaro Herrera, 2015-10-27)

  A bug in the original free space computation made it possible to return a page which wasn't actually able to fit the item. Since the insertion code isn't prepared to deal with PageAddItem failing, a PANIC resulted ("failed to add BRIN tuple [to new page]"). Add a macro to encapsulate the correct computation, and use it in brin_getinsertbuffer's callers before calling that routine, to raise an early error.

  I became aware of the possibility of a problem in this area while working on ccc4c074994d734. There's no archived discussion about it, but it's easy to reproduce a problem in the unpatched code with something like:

  ```
  CREATE TABLE t (a text);
  CREATE INDEX ti ON t USING brin (a) WITH (pages_per_range=1);

  for length in `seq 8000 8196`
  do
      psql -f - <<EOF
  TRUNCATE TABLE t;
  INSERT INTO t VALUES ('z'), (repeat('a', $length));
  EOF
  done
  ```

  Backpatch to 9.5, where BRIN was introduced.