aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Fix overflow check and comment in GIN posting list encoding.Heikki Linnakangas2019-08-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comment did not match what the code actually did for integers with the 43rd bit set. You get an integer like that, if you have a posting list with two adjacent TIDs that are more than 2^31 blocks apart. According to the comment, we would store that in 6 bytes, with no continuation bit on the 6th byte, but in reality, the code encodes it using 7 bytes, with a continuation bit on the 6th byte as normal. The decoding routine also handled these 7-byte integers correctly, except for an overflow check that assumed that one integer needs at most 6 bytes. Fix the overflow check, and fix the comment to match what the code actually does. Also fix the comment that claimed that there are 17 unused bits in the 64-bit representation of an item pointer. In reality, there are 64-32-11=21. Fitting any item pointer into max 6 bytes was an important property when this was written, because in the old pre-9.4 format, item pointers were stored as plain arrays, with 6 bytes for every item pointer. The maximum of 6 bytes per integer in the new format guaranteed that we could convert any page from the old format to the new format after upgrade, so that the new format was never larger than the old format. But we hardly need to worry about that anymore, and running into that problem during upgrade, where an item pointer is expanded from 6 to 7 bytes such that the data doesn't fit on a page anymore, is implausible in practice anyway. Backpatch to all supported versions. This also includes a little test module to test these large distances between item pointers, without requiring a 16 TB table. It is not backpatched, I'm including it more for the benefit of future development of new posting list formats. Discussion: https://www.postgresql.org/message-id/33bfc20a-5c86-f50c-f5a5-58e9925d05ff%40iki.fi Reviewed-by: Masahiko Sawada, Alexander Korotkov
* Fix bogus commentAlvaro Herrera2019-08-20
| | | | | Author: Alexander Lakhin Discussion: https://postgr.es/m/20190819072244.GE18166@paquier.xyz
* Fix predicate-locking of HOT updated rows.Heikki Linnakangas2019-08-07
| | | | | | | | | | | | | | | | | | | | | | | In serializable mode, heap_hot_search_buffer() incorrectly acquired a predicate lock on the root tuple, not the returned tuple that satisfied the visibility checks. As explained in README-SSI, the predicate lock does not need to be copied or extended to other tuple versions, but for that to work, the correct, visible, tuple version must be locked in the first place. The original SSI commit had this bug in it, but it was fixed back in 2013, in commit 81fbbfe335. But unfortunately, it was reintroduced a few months later in commit b89e151054. Wising up from that, add a regression test to cover this, so that it doesn't get reintroduced again. Also, move the code that sets 't_self', so that it happens at the same time that the other HeapTuple fields are set, to make it more clear that all the code in the loop operate on the "current" tuple in the chain, not the root tuple. Bug spotted by Andres Freund, analysis and original fix by Thomas Munro, test case and some additional changes to the fix by Heikki Linnakangas. Backpatch to all supported versions (9.4). Discussion: https://www.postgresql.org/message-id/20190731210630.nqhszuktygwftjty%40alap3.anarazel.de
* Fix nbtree metapage cache upgrade bug.Peter Geoghegan2019-07-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 857f9c36cda, which taught nbtree VACUUM to avoid unnecessary index scans, bumped the nbtree version number from 2 to 3, while adding the ability for nbtree indexes to be upgraded on-the-fly. Various assertions that assumed that an nbtree index was always on version 2 had to be changed to accept any supported version (version 2 or 3 on Postgres 11). However, a few assertions were missed in the initial commit, all of which were in code paths that cache a local copy of the metapage metadata, where the index had been expected to be on the current version (no longer version 2) as a generic sanity check. Rather than simply update the assertions, follow-up commit 0a64b45152b intentionally made the metapage caching code update the per-backend cached metadata version without changing the on-disk version at the same time. This could even happen when the planner needed to determine the height of a B-Tree for costing purposes. The assertions only fail on Postgres v12 when upgrading from v10, because they were adjusted to use the authoritative shared memory metapage by v12's commit dd299df8. To fix, remove the cache-only upgrade mechanism entirely, and update the assertions themselves to accept any supported version (go back to using the cached version in v12). The fix is almost a full revert of commit 0a64b45152b on the v11 branch. VACUUM only considers the authoritative metapage, and never bothers with a locally cached version, whereas everywhere else isn't interested in the metapage fields that were added by commit 857f9c36cda. It seems unlikely that this bug has affected any user on v11. Reported-By: Christoph Berg Bug: #15896 Discussion: https://postgr.es/m/15896-5b25e260fdb0b081%40postgresql.org Backpatch: 11-, where VACUUM was taught to avoid unnecessary index scans.
* Fix variable initialization when using buffering build with GiSTMichael Paquier2019-07-10
| | | | | | | | | | | | | | | | This can cause valgrind to complain, as the flag marking a buffer as a temporary copy was not getting initialized. While on it, fill in with zeros newly-created buffer pages. This does not matter when loading a block from a temporary file, but it makes the push of an index tuple into a new buffer page safer. This has been introduced by 1d27dcf, so backpatch all the way down to 9.4. Author: Alexander Lakhin Discussion: https://postgr.es/m/15899-0d24fb273b3dd90c@postgresql.org Backpatch-through: 9.4
* Fix description of WAL record XLOG_BTREE_META_CLEANUPMichael Paquier2019-06-19
| | | | | | | | | | | | | | | This record uses one metadata buffer and registers some data associated to the buffer, but when parsing the record for its description a direct access to the record data was done, but there is none. This leads usually to an incorrect description, but can also cause crashes like in pg_waldump. Instead, fix things so as the parsing uses the data associated to the metadata block. This is an oversight from 3d92796, so backpatch down to 11. Author: Michael Paquier Description: https://postgr.es/m/20190617013059.GA3153@paquier.xyz Backpatch-through: 11
* Avoid spurious deadlocks when upgrading a tuple lockAlvaro Herrera2019-06-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This puts back reverted commit de87a084c0a5, with some bug fixes. When two (or more) transactions are waiting for transaction T1 to release a tuple-level lock, and transaction T1 upgrades its lock to a higher level, a spurious deadlock can be reported among the waiting transactions when T1 finishes. The simplest example case seems to be: T1: select id from job where name = 'a' for key share; Y: select id from job where name = 'a' for update; -- starts waiting for T1 Z: select id from job where name = 'a' for key share; T1: update job set name = 'b' where id = 1; Z: update job set name = 'c' where id = 1; -- starts waiting for T1 T1: rollback; At this point, transaction Y is rolled back on account of a deadlock: Y holds the heavyweight tuple lock and is waiting for the Xmax to be released, while Z holds part of the multixact and tries to acquire the heavyweight lock (per protocol) and goes to sleep; once T1 releases its part of the multixact, Z is awakened only to be put back to sleep on the heavyweight lock that Y is holding while sleeping. Kaboom. This can be avoided by having Z skip the heavyweight lock acquisition. As far as I can see, the biggest downside is that if there are multiple Z transactions, the order in which they resume after T1 finishes is not guaranteed. Backpatch to 9.6. The patch applies cleanly on 9.5, but the new tests don't work there (because isolationtester is not smart enough), so I'm not going to risk it. Author: Oleksii Kliukin Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com Discussion: https://postgr.es/m/2815.1560521451@sss.pgh.pa.us
* Revert "Avoid spurious deadlocks when upgrading a tuple lock"Alvaro Herrera2019-06-16
| | | | | | | | | | | This reverts commits 3da73d6839dc and de87a084c0a5. This code has some tricky corner cases that I'm not sure are correct and not properly tested anyway, so I'm reverting the whole thing for next week's releases (reintroducing the deadlock bug that we set to fix). I'll try again afterwards. Discussion: https://postgr.es/m/E1hbXKQ-0003g1-0C@gemulon.postgresql.org
* Silence compiler warningAlvaro Herrera2019-06-14
| | | | Introduced in de87a084c0a5.
* Avoid spurious deadlocks when upgrading a tuple lockAlvaro Herrera2019-06-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When two (or more) transactions are waiting for transaction T1 to release a tuple-level lock, and transaction T1 upgrades its lock to a higher level, a spurious deadlock can be reported among the waiting transactions when T1 finishes. The simplest example case seems to be: T1: select id from job where name = 'a' for key share; Y: select id from job where name = 'a' for update; -- starts waiting for X Z: select id from job where name = 'a' for key share; T1: update job set name = 'b' where id = 1; Z: update job set name = 'c' where id = 1; -- starts waiting for X T1: rollback; At this point, transaction Y is rolled back on account of a deadlock: Y holds the heavyweight tuple lock and is waiting for the Xmax to be released, while Z holds part of the multixact and tries to acquire the heavyweight lock (per protocol) and goes to sleep; once X releases its part of the multixact, Z is awakened only to be put back to sleep on the heavyweight lock that Y is holding while sleeping. Kaboom. This can be avoided by having Z skip the heavyweight lock acquisition. As far as I can see, the biggest downside is that if there are multiple Z transactions, the order in which they resume after X finishes is not guaranteed. Backpatch to 9.6. The patch applies cleanly on 9.5, but the new tests don't work there (because isolationtester is not smart enough), so I'm not going to risk it. Author: Oleksii Kliukin Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com
* Fix comment on when HOT update is possible.Heikki Linnakangas2019-05-14
| | | | | | | | The conditions listed in this comment have changed several times, and at some point the thing that the "if so" referred to was negated. The text was OK up to 9.6. It was differently wrong in v10, v11 and master, so fix in all those versions.
* Don't leave behind junk nbtree pages during split.Peter Geoghegan2019-05-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 8fa30f906be reduced the elevel of a number of "can't happen" _bt_split() errors from PANIC to ERROR. At the same time, the new right page buffer for the split could continue to be acquired well before the critical section. This was possible because it was relatively straightforward to make sure that _bt_split() could not throw an error, with a few specific exceptions. The exceptional cases were safe because they involved specific, well understood errors, making it possible to consistently zero the right page before actually raising an error using elog(). There was no danger of leaving around a junk page, provided _bt_split() stuck to this coding rule. Commit 8224de4f, which introduced INCLUDE indexes, added code to make _bt_split() truncate away non-key attributes. This happened at a point that broke the rule around zeroing the right page in _bt_split(). If truncation failed (perhaps due to palloc() failure), that would result in an errant right page buffer with junk contents. This could confuse VACUUM when it attempted to delete the page, and should be avoided on general principle. To fix, reorganize _bt_split() so that truncation occurs before the new right page buffer is even acquired. A junk page/buffer will not be left behind if _bt_nonkey_truncate()/_bt_truncate() raise an error. Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com Backpatch: 11-, where INCLUDE indexes were introduced.
* Remove leftover reference to old "flat file" mechanism in a comment.Heikki Linnakangas2019-05-08
| | | | The flat file mechanism was removed in PostgreSQL 9.0.
* Fix division by zero in _bt_vacuum_needs_cleanup()Alexander Korotkov2019-04-15
| | | | | | | | | | | Checks inside _bt_vacuum_needs_cleanup() allow division by zero to happen when metad->btm_last_cleanup_num_heap_tuples == 0. This commit adjusts the expression so that no division by zero might happen. Reported-by: Piotr Stefaniak Discussion: https://postgr.es/m/DB8PR03MB5931C41F7787A95313F08322F22A0%40DB8PR03MB5931.eurprd03.prod.outlook.com Reviewed-by: Masahiko Sawada Backpatch-through: 11
* Avoid counting transaction stats for parallel worker cooperatingAmit Kapila2019-04-10
| | | | | | | | | | | | | | | | | | | | | | transaction. The transaction that is initiated by the parallel worker to cooperate with the actual transaction started by the main backend to complete the query execution should not be counted as a separate transaction. The other internal transactions started and committed by the parallel worker are still counted as separate transactions as we that is what we do in other places like autovacuum. This will partially fix the bloat in transaction stats due to additional transactions performed by parallel workers. For a complete fix, we need to decide how we want to show all the transactions that are started internally for various operations and that is a matter of separate patch. Reported-by: Haribabu Kommi Author: Haribabu Kommi Reviewed-by: Amit Kapila, Jamison Kirk and Rahila Syed Backpatch-through: 9.6 Discussion: https://postgr.es/m/CAJrrPGc9=jKXuScvNyQ+VNhO0FZk7LLAShAJRyZjnedd2D61EQ@mail.gmail.com
* Fix WAL format incompatibility introduced by backpatching of 52ac6cd2d0Alexander Korotkov2019-03-24
| | | | | | | | | | | | | | | | | | | 52ac6cd2d0 added new field to ginxlogDeletePage and was backpatched to 9.4. That led to problems when patched postgres instance applies WAL records generated by non-patched one. WAL records generated by non-patched instance don't contain new field, which patched one is expecting to see. Thankfully, we can distinguish patched and non-patched WAL records by their data size. If we see that WAL record is generated by non-patched instance, we skip processing of new field. This commit comes with some assertions. In particular, if it appears that on some platform struct data size didn't change then static assertion will trigger. Reported-by: Simon Riggs Discussion: https://postgr.es/m/CANP8%2Bj%2BK4whxf7ET7%2BgO%2BG-baC3-WxqqH%3DnV4X2CgfEPA3Yu3g%40mail.gmail.com Author: Alexander Korotkov Reviewed-by: Simon Riggs, Alvaro Herrera Backpatch-through: 9.4
* Fix memory leak in printtup.c.Tom Lane2019-03-18
| | | | | | | | | | | | | | | | Commit f2dec34e1 changed things so that printtup's output stringinfo buffer was allocated outside the per-row temporary context, not inside it. This creates a need to free that buffer explicitly when the temp context is freed, but that was overlooked. In most cases, this is all happening inside a portal or executor context that will go away shortly anyhow, but that's not always true. Notably, the stringinfo ends up getting leaked when JDBC uses row-at-a-time fetches. For a query that returns wide rows, that adds up after awhile. Per bug #15700 from Matthias Otterbach. Back-patch to v11 where the faulty code was added. Discussion: https://postgr.es/m/15700-8c408321a87d56bb@postgresql.org
* Adjust error messagePeter Eisentraut2019-02-11
| | | | | | | | | We usually don't use "namespace" in user-facing error messages. Also, in master this was replaced by another error message referring to "temporary objects", so we might as well use that here to avoid introducing too many variants. Discussion: https://www.postgresql.org/message-id/bbd3f8d9-e3d5-e5aa-4305-7f0121c3fa94@2ndquadrant.com
* Fix heap_getattr() handling of fast defaults.Andres Freund2019-02-06
| | | | | | | | | | | | | | | | | | | | | | | Previously heap_getattr() returned NULL for attributes with a fast default value (c.f. 16828d5c0273), as it had no handling whatsoever for that case. A previous fix, 7636e5c60f, attempted to fix issues caused by this oversight, but just expanding OLD tuples for triggers doesn't actually solve the underlying issue. One known consequence of this bug is that the check for HOT updates can return the wrong result, when a previously fast-default'ed column is set to NULL. Which in turn means that an index over a column with fast default'ed columns might be corrupt if the underlying column(s) allow NULLs. Fix by handling fast default columns in heap_getattr(), remove now superfluous expansion in GetTupleForTrigger(). Author: Andres Freund Discussion: https://postgr.es/m/20190201162404.onngi77f26baem4g@alap3.anarazel.de Backpatch: 11, where fast defaults were introduced
* Avoid possible deadlock while locking multiple heap pages.Amit Kapila2019-02-02
| | | | | | | | | | | | | | | | | To avoid deadlock, backend acquires a lock on heap pages in block number order. In certain cases, lock on heap pages is dropped and reacquired. In this case, the locks are dropped for reading in corresponding VM page/s. The issue is we re-acquire locks in bufferId order whereas the intention was to acquire in blockid order. This commit ensures that we will always acquire locks on heap pages in blockid order. Reported-by: Nishant Fnu Author: Nishant Fnu Reviewed-by: Amit Kapila and Robert Haas Backpatch-through: 9.4 Discussion: https://postgr.es/m/5883C831-2ED1-47C8-BFAC-2D5BAE5A8CAE@amazon.com
* Fix use of dangling pointer in heap_delete() when logging replica identityMichael Paquier2019-02-01
| | | | | | | | | | | | | | When logging the replica identity of a deleted tuple, XLOG_HEAP_DELETE records include references of the old tuple. Its data is stored in an intermediate variable used to register this information for the WAL record, but this variable gets away from the stack when the record gets actually inserted. Spotted by clang's AddressSanitizer. Author: Stas Kelvish Discussion: https://postgr.es/m/085C8825-AD86-4E93-AF80-E26CDF03D1EA@postgrespro.ru Backpatch-through: 9.4
* Fix misc typos in comments.Heikki Linnakangas2019-01-23
| | | | | | Spotted mostly by Fabien Coelho. Discussion: https://www.postgresql.org/message-id/alpine.DEB.2.21.1901230947050.16643@lancre
* Restrict the use of temporary namespace in two-phase transactionsMichael Paquier2019-01-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Attempting to use a temporary table within a two-phase transaction is forbidden for ages. However, there have been uncovered grounds for a couple of other object types and commands which work on temporary objects with two-phase commit. In short, trying to create, lock or drop an object on a temporary schema should not be authorized within a two-phase transaction, as it would cause its state to create dependencies with other sessions, causing all sorts of side effects with the existing session or other sessions spawned later on trying to use the same temporary schema name. Regression tests are added to cover all the grounds found, the original report mentioned function creation, but monitoring closer there are many other patterns with LOCK, DROP or CREATE EXTENSION which are involved. One of the symptoms resulting in combining both is that the session which used the temporary schema is not able to shut down completely, waiting for being able to drop the temporary schema, something that it cannot complete because of the two-phase transaction involved with temporary objects. In this case the client is able to disconnect but the session remains alive on the backend-side, potentially blocking connection backend slots from being used. Other problems reported could also involve server crashes. This is back-patched down to v10, which is where 9b013dc has introduced MyXactFlags, something that this patch relies on. Reported-by: Alexey Bashtanov Author: Michael Paquier Reviewed-by: Masahiko Sawada Discussion: https://postgr.es/m/5d910e2e-0db8-ec06-dd5f-baec420513c3@imap.cc Backpatch-through: 10
* Remove entry tree root conflict checking from GIN predicate lockingAlexander Korotkov2018-12-27
| | | | | | | | | | | | | | According to README we acquire predicate locks on entry tree leafs and posting tree roots. However, when ginFindLeafPage() is going to lock leaf in exclusive mode, then it checks root for conflicts regardless whether it's a entry or posting tree. Assuming that we never place predicate lock on entry tree root (excluding corner case when root is leaf), this check is redundant. This commit removes this check. Now, root conflict checking is controlled by separate argument of ginFindLeafPage(). Discussion: https://postgr.es/m/CAPpHfdv7rrDyy%3DMgsaK-L9kk0AH7az0B-mdC3w3p0FSb9uoyEg%40mail.gmail.com Author: Alexander Korotkov Backpatch-through: 11
* Check for conflicting queries during replay of gistvacuumpage()Alexander Korotkov2018-12-21
| | | | | | | | | | | | | | | | | | | | | | | 013ebc0a7b implements so-called GiST microvacuum. That is gistgettuple() marks index tuples as dead when kill_prior_tuple is set. Later, when new tuple insertion claims page space, those dead index tuples are physically deleted from page. When this deletion is replayed on standby, it might conflict with read-only queries. But 013ebc0a7b doesn't handle this. That may lead to disappearance of some tuples from read-only snapshots on standby. This commit implements resolving of conflicts between replay of GiST microvacuum and standby queries. On the master we implement new WAL record type XLOG_GIST_DELETE, which comprises necessary information. On stable releases we've to be tricky to keep WAL compatibility. Information required for conflict processing is just appended to data of XLOG_GIST_PAGE_UPDATE record. So, PostgreSQL version, which doesn't know about conflict processing, will just ignore that. Reported-by: Andres Freund Diagnosed-by: Andres Freund Discussion: https://postgr.es/m/20181212224524.scafnlyjindmrbe6%40alap3.anarazel.de Author: Alexander Korotkov Backpatch-through: 9.6
* Correct obsolete nbtree recovery comments.Peter Geoghegan2018-12-18
| | | | | | | | | | | | | | | Commit 40dae7ec537, which made the handling of interrupted nbtree page splits more robust, removed an nbtree-specific end-of-recovery cleanup step. This meant that it was no longer possible to complete an interrupted page split during recovery. However, a reference to recovery as a reason for using a NULL stack while inserting into a parent page was missed. Remove the reference. Remove a similar obsolete reference to recovery that was introduced much more recently, as part of the btree fastpath optimization enhancement that made it into Postgres 11 (commit 2b272734, and follow-up commits). Backpatch: 11-, where the fastpath optimization was introduced.
* Improve detection of child-process SIGPIPE failures.Tom Lane2018-12-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ffa4cbd62 added logic to detect SIGPIPE failure of a COPY child process, but it only worked correctly if the SIGPIPE occurred in the immediate child process. Depending on the shell in use and the complexity of the shell command string, we might instead get back an exit code of 128 + SIGPIPE, representing a shell error exit reporting SIGPIPE in the child process. We could just hack up ClosePipeToProgram() to add the extra case, but it seems like this is a fairly general issue deserving a more general and better-documented solution. I chose to add a couple of functions in src/common/wait_error.c, which is a natural place to know about wait-result encodings, that will test for either a specific child-process signal type or any child-process signal failure. Then, adjust other places that were doing ad-hoc tests of this type to use the common functions. In RestoreArchivedFile, this fixes a race condition affecting whether the process will report an error or just silently proc_exit(1): before, that depended on whether the intermediate shell got SIGTERM'd itself or reported a child process failing on SIGTERM. Like the previous patch, back-patch to v10; we could go further but there seems no real need to. Per report from Erik Rijkers. Discussion: https://postgr.es/m/f3683f87ab1701bea5d86a7742b22432@xs4all.nl
* Prevent GIN deleted pages from being reclaimed too earlyAlexander Korotkov2018-12-13
| | | | | | | | | | | | | | | | | | When GIN vacuum deletes a posting tree page, it assumes that no concurrent searchers can access it, thanks to ginStepRight() locking two pages at once. However, since 9.4 searches can skip parts of posting trees descending from the root. That leads to the risk that page is deleted and reclaimed before concurrent search can access it. This commit prevents the risk of above by waiting for every transaction, which might wait to reference this page, to finish. Due to binary compatibility we can't change GinPageOpaqueData to store corresponding transaction id. Instead we reuse page header pd_prune_xid field, which is unused in index pages. Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com Author: Andrey Borodin, Alexander Korotkov Reviewed-by: Alexander Korotkov Backpatch-through: 9.4
* Prevent deadlock in ginRedoDeletePage()Alexander Korotkov2018-12-13
| | | | | | | | | | | | | | | | | | | | On standby ginRedoDeletePage() can work concurrently with read-only queries. Those queries can traverse posting tree in two ways. 1) Using rightlinks by ginStepRight(), which locks the next page before unlocking its left sibling. 2) Using downlinks by ginFindLeafPage(), which locks at most one page at time. Original lock order was: page, parent, left sibling. That lock order can deadlock with ginStepRight(). In order to prevent deadlock this commit changes lock order to: left sibling, page, parent. Note, that position of parent in locking order seems insignificant, because we only lock one page at time while traversing downlinks. Reported-by: Chen Huajun Diagnosed-by: Chen Huajun, Peter Geoghegan, Andrey Borodin Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com Author: Alexander Korotkov Backpatch-through: 9.4
* Fix deadlock in GIN vacuum introduced by 218f51584d5Alexander Korotkov2018-12-13
| | | | | | | | | | | | | | | | | | | | | Before 218f51584d5 if posting tree page is about to be deleted, then the whole posting tree is locked by LockBufferForCleanup() on root preventing all the concurrent inserts. 218f51584d5 reduced locking to the subtree containing page to be deleted. However, due to concurrent parent split, inserter doesn't always holds pins on all the pages constituting path from root to the target leaf page. That could cause a deadlock between GIN vacuum process and GIN inserter. And we didn't find non-invasive way to fix this. This commit reverts VACUUM behavior to lock the whole posting tree before delete any page. However, we keep another useful change by 218f51584d5: the tree is locked only if there are pages to be deleted. Reported-by: Chen Huajun Diagnosed-by: Chen Huajun, Andrey Borodin, Peter Geoghegan Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com Author: Alexander Korotkov, based on ideas from Andrey Borodin and Peter Geoghegan Reviewed-by: Andrey Borodin Backpatch-through: 10
* Make TupleDescInitBuiltinEntry throw error for unsupported types.Tom Lane2018-12-10
| | | | | | | | | Previously, it would just pass back a partially-uninitialized tupdesc, which doesn't seem like a safe or useful behavior. Backpatch to v10 where this code came in. Discussion: https://postgr.es/m/30830.1544384975@sss.pgh.pa.us
* Do not decode TOAST data for table rewritesTomas Vondra2018-11-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During table rewrites (VACUUM FULL and CLUSTER), the main heap is logged using XLOG / FPI records, and thus (correctly) ignored in decoding. But the associated TOAST table is WAL-logged as plain INSERT records, and so was logically decoded and passed to reorder buffer. That has severe consequences with TOAST tables of non-trivial size. Firstly, reorder buffer has to keep all those changes, possibly spilling them to a file, incurring I/O costs and disk space. Secondly, ReoderBufferCommit() was stashing all those TOAST chunks into a hash table, which got discarded only after processing the row from the main heap. But as the main heap is not decoded for rewrites, this never happened, so all the TOAST data accumulated in memory, resulting either in excessive memory consumption or OOM. The fix is simple, as commit e9edc1ba already introduced infrastructure (namely HEAP_INSERT_NO_LOGICAL flag) to skip logical decoding of TOAST tables, but it only applied it to system tables. So simply use it for all TOAST data in raw_heap_insert(). That would however solve only the memory consumption issue - the TOAST changes would still be decoded and added to the reorder buffer, and spilled to disk (although without TOAST tuple data, so much smaller). But we can solve that by tweaking DecodeInsert() to just ignore such INSERT records altogether, using XLH_INSERT_CONTAINS_NEW_TUPLE flag, instead of skipping them later in ReorderBufferCommit(). Review: Masahiko Sawada Discussion: https://www.postgresql.org/message-id/flat/1a17c643-e9af-3dba-486b-fbe31bc1823a%402ndquadrant.com Backpatch: 9.4-, where logical decoding was introduced
* PANIC on fsync() failure.Thomas Munro2018-11-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data. Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files. Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic. Back-patch to all supported releases. There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch. Author: Craig Ringer, with some adjustments by Thomas Munro Reported-by: Craig Ringer Reviewed-by: Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
* Initialize TransactionState and user ID consistently at transaction startMichael Paquier2018-11-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a failure happens when a transaction is starting between the moment the transaction status is changed from TRANS_DEFAULT to TRANS_START and the moment the current user ID and security context flags are fetched via GetUserIdAndSecContext(), or before initializing its basic fields, then those may get reset to incorrect values when the transaction aborts, leaving the session in an inconsistent state. One problem reported is that failing a starting transaction at the first query of a session could cause several kinds of system crashes on the follow-up queries. In order to solve that, move the initialization of the transaction state fields and the call of GetUserIdAndSecContext() in charge of fetching the current user ID close to the point where the transaction status is switched to TRANS_START, where there cannot be any error triggered in-between, per an idea of Tom Lane. This properly ensures that the current user ID, the security context flags and that the basic fields of TransactionState remain consistent even if the transaction fails while starting. Reported-by: Richard Guo Diagnosed-By: Richard Guo Author: Michael Paquier Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAN_9JTxECSb=pEPcb0a8d+6J+bDcOZ4=DgRo_B7Y5gRHJUM=Rw@mail.gmail.com Backpatch-through: 9.4
* Rename rbtree.c functions to use "rbt" prefix not "rb" prefix.Tom Lane2018-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | The "rb" prefix is used by Ruby, so that our existing code results in name collisions that break plruby. We discussed ways to prevent that by adjusting dynamic linker options, but it seems that at best we'd move the pain to other cases. Renaming to avoid the collision is the only portable fix anyway. Fortunately, our rbtree code is not (yet?) widely used --- in core, there's only a single usage in GIN --- so it seems likely that we can get away with a rename. I chose to do this basically as s/rb/rbt/g, except for places where there already was a "t" after "rb". The patch could have been made smaller by only touching linker-visible symbols, but it would have resulted in oddly inconsistent-looking code. Better to make it look like "rbt" was the plan all along. Back-patch to v10. The rbtree.c code exists back to 9.5, but rb_iterate() which is the actual immediate source of pain was added in v10, so it seems like changing the names before that would have more risk than benefit. Per report from Pavel Raiskup. Discussion: https://postgr.es/m/4738198.8KVIIDhgEB@nb.usersys.redhat.com
* Fix memory leak in repeated SPGIST index scans.Tom Lane2018-10-31
| | | | | | | | | | | | | | | | | | | | | | | spgendscan neglected to pfree all the memory allocated by spgbeginscan. It's possible to get away with that in most normal queries, since the memory is allocated in the executor's per-query context which is about to get deleted anyway; but it causes severe memory leakage during creation or filling of large exclusion-constraint indexes. Also, document that amendscan is supposed to free what ambeginscan allocates. The docs' lack of clarity on that point probably caused this bug to begin with. (There is discussion of changing that API spec going forward, but I don't think it'd be appropriate for the back branches.) Per report from Bruno Wolff. It's been like this since the beginning, so back-patch to all active branches. In HEAD, also fix an independent leak caused by commit 2a6368343 (allocating memory during spgrescan instead of spgbeginscan, which might be all right if it got cleaned up, but it didn't). And do a bit of code beautification on that commit, too. Discussion: https://postgr.es/m/20181024012314.GA27428@wolff.to
* Correctly set t_self for heap tuples in expand_tupleAndrew Dunstan2018-10-24
| | | | | | | | | | | | Commit 16828d5c0 incorrectly set an invalid pointer for t_self for heap tuples. This patch correctly copies it from the source tuple, and includes a regression test that relies on it being set correctly. Backpatch to release 11. Fixes bug #15448 reported by Tillmann Schulz Diagnosis and test case by Amit Langote
* Simplify use of AllocSetContextCreate() wrapper macro.Tom Lane2018-10-12
| | | | | | | | | | | | | | | We can allow this macro to accept either abbreviated or non-abbreviated allocation parameters by making use of __VA_ARGS__. As noted by Andres Freund, it's unlikely that any compiler would have __builtin_constant_p but not __VA_ARGS__, so this gives up little or no error checking, and it avoids a minor but annoying API break for extensions. With this change, there is no reason for anybody to call AllocSetContextCreateExtended directly, so in HEAD I renamed it to AllocSetContextCreateInternal. It's probably too late for an ABI break like that in 11, though. Discussion: https://postgr.es/m/20181012170355.bhxi273skjt6sag4@alap3.anarazel.de
* Fix logical decoding error when system table w/ toast is repeatedly rewritten.Andres Freund2018-10-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Repeatedly rewriting a mapped catalog table with VACUUM FULL or CLUSTER could cause logical decoding to fail with: ERROR, "could not map filenode \"%s\" to relation OID" To trigger the problem the rewritten catalog had to have live tuples with toasted columns. The problem was triggered as during catalog table rewrites the heap_insert() check that prevents logical decoding information to be emitted for system catalogs, failed to treat the new heap's toast table as a system catalog (because the new heap is not recognized as a catalog table via RelationIsLogicallyLogged()). The relmapper, in contrast to the normal catalog contents, does not contain historical information. After a single rewrite of a mapped table the new relation is known to the relmapper, but if the table is rewritten twice before logical decoding occurs, the relfilenode cannot be mapped to a relation anymore. Which then leads us to error out. This only happens for toast tables, because the main table contents aren't re-inserted with heap_insert(). The fix is simple, add a new heap_insert() flag that prevents logical decoding information from being emitted, and accept during decoding that there might not be tuple data for toast tables. Unfortunately that does not fix pre-existing logical decoding errors. Doing so would require not throwing an error when a filenode cannot be mapped to a relation during decoding, and that seems too likely to hide bugs. If it's crucial to fix decoding for an existing slot, temporarily changing the ERROR in ReorderBufferCommit() to a WARNING appears to be the best fix. Author: Andres Freund Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de Backpatch: 9.4-, where logical decoding was introduced
* Advance transaction timestamp for intra-procedure transactions.Tom Lane2018-10-08
| | | | | | | | Per discussion, this behavior seems less astonishing than not doing so. Peter Eisentraut and Tom Lane Discussion: https://postgr.es/m/20180920234040.GC29981@momjian.us
* Propagate xactStartTimestamp and stmtStartTimestamp to parallel workers.Tom Lane2018-10-06
| | | | | | | | | | | | | | | | | | | | | | Previously, a worker process would establish values for these based on its own start time. In v10 and up, this can trivially be shown to cause misbehavior of transaction_timestamp(), timestamp_in(), and related functions which are (perhaps unwisely?) marked parallel-safe. It seems likely that other behaviors might diverge from what happens in the parent as well. It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure it's still possible, so back-patch to all branches containing parallel worker infrastructure. In HEAD only, mark now() and statement_timestamp() as parallel-safe (other affected functions already were). While in theory we could still squeeze that change into v11, it doesn't seem important enough to force a last-minute catversion bump. Konstantin Knizhnik, whacked around a bit by me Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
* Allow btree comparison functions to return INT_MIN.Tom Lane2018-10-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically we forbade datatype-specific comparison functions from returning INT_MIN, so that it would be safe to invert the sort order just by negating the comparison result. However, this was never really safe for comparison functions that directly return the result of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction on those library functions. Buildfarm results show that at least on recent Linux on s390x, memcmp() actually does return INT_MIN sometimes, causing sort failures. The agreed-on answer is to remove this restriction and fix relevant call sites to not make such an assumption; code such as "res = -res" should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed in a few places that just directly negated the result of memcmp or strcmp. To help find places having this problem, I've also added a compile option to nbtcompare.c that causes some of the commonly used comparators to return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be a good idea to have at least one buildfarm member running with "-DSTRESS_SORT_INT_MIN". That's far from a complete test of course, but it should help to prevent fresh introductions of such bugs. This is a longstanding portability hazard, so back-patch to all supported branches. Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de
* Fix assertion failure when updating full_page_writes for checkpointer.Amit Kapila2018-09-28
| | | | | | | | | | | | | | | | | | When the checkpointer receives a SIGHUP signal to update its configuration, it may need to update the shared memory for full_page_writes and need to write a WAL record for it. Now, it is quite possible that the XLOG machinery has not been initialized by that time and it will lead to assertion failure while doing that. Fix is to allow the initialization of the XLOG machinery outside critical section. This bug has been introduced by the commit 2c03216d83 which added the XLOG machinery initialization in RecoveryInProgress code path. Reported-by: Dilip Kumar Author: Dilip Kumar Reviewed-by: Michael Paquier and Amit Kapila Backpatch-through: 9.5 Discussion: https://postgr.es/m/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC=C2whnzBM8TAcv=stckYUw@mail.gmail.com
* Fix WAL recycling on standbys depending on archive_modeMichael Paquier2018-09-28
| | | | | | | | | | | | | | | | | | | | | | | | A restart point or a checkpoint recycling WAL segments treats segments marked with neither ".done" (archiving is done) or ".ready" (segment is ready to be archived) in archive_status the same way for archive_mode being "on" or "always". While for a primary this is fine, a standby running a restart point with archive_mode = on would try to mark such a segment as ready for archiving, which is something that will never happen except after the standby is promoted. Note that this problem applies only to WAL segments coming from the local pg_wal the first time archive recovery is run. Segments part of a self-contained base backup are the most common case where this could happen, however even in this case normally the .done markers would be most likely part of the backup. Segments recovered from an archive are marked as .ready or .done by the startup process, and segments finished streaming are marked as such by the WAL receiver, so they are handled already. Reported-by: Haruka Takatsuka Author: Michael Paquier Discussion: https://postgr.es/m/15402-a453c90ed4cf88b2@postgresql.org Backpatch-through: 9.5, where archive_mode = always has been added.
* Rework activation of commit timestamps during recoveryMichael Paquier2018-09-26
| | | | | | | | | | | | | | | | | | | | | | | | | | The activation and deactivation of commit timestamp tracking has not been handled consistently for a primary or standbys at recovery. The facility can be activated at three different moments of recovery: - The beginning, where a primary would use the GUC value for the decision-making, and where a standby relies on the contents of the control file. - When replaying a XLOG_PARAMETER_CHANGE record at redo. - The end, where both primary and standby rely on the GUC value. Using the GUC value for a primary at the beginning of recovery causes problems with commit timestamp access when doing crash recovery. Particularly, when replaying transaction commits, it could be possible that an attempt to read commit timestamps is done for a transaction which committed at a moment when track_commit_timestamp was disabled. A test case is added to reproduce the failure. The test works down to v11 as it takes advantage of transaction commits within procedures. Reported-by: Hailong Li Author: Masahiko Sawasa, Michael Paquier Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/11224478-a782-203b-1f17-e4797b39bdf0@qunar.com Backpatch-through: 9.5, where commit timestamps have been introduced.
* Fast default trigger and expand_tuple fixesAndrew Dunstan2018-09-24
| | | | | | | | | | | | | | Ensure that triggers get properly filled in tuples for the OLD value. Also fix the logic of detecting missing null values. The previous logic failed to detect a missing null column before the first missing column with a default. Fixing this has simplified the logic a bit. Regression tests are added to test changes. This should ensure better coverage of expand_tuple(). Original bug reports, and some code and test scripts from Tomas Vondra Backpatch to release 11.
* Defer restoration of libraries in parallel workers.Thomas Munro2018-09-20
| | | | | | | | | | | | | | | | | | | | | | Several users of extensions complained of crashes in parallel workers that turned out to be due to syscache access from their _PG_init() functions. Reorder the initialization of parallel workers so that libraries are restored after the caches are initialized, and inside a transaction. This was reported in bug #15350 and elsewhere. We don't consider it to be a bug: extensions shouldn't do that, because then they can't be used in shared_preload_libraries. However, it's a fairly obscure hazard and these extensions worked in practice before parallel query came along. So let's make it work. Later commits might add a warning message and eventually an error. Back-patch to 9.6, where parallel query landed. Author: Thomas Munro Reviewed-by: Amit Kapila Reported-by: Kieran McCusker, Jimmy Discussion: https://postgr.es/m/153512195228.1489.8545997741965926448%40wrigleys.postgresql.org
* Fix minor error message style guide violation.Tom Lane2018-09-19
| | | | | | | | No periods at the ends of primary error messages, please. Daniel Gustafsson Discussion: https://postgr.es/m/43E004C0-18C6-42B4-A313-003B43EB0571@yesql.se
* Attach FPI to the first record after full_page_writes is turned on.Amit Kapila2018-09-13
| | | | | | | | | | | | | | | | XLogInsert fails to attach a required FPI to the first record after full_page_writes is turned on by the last checkpoint. This bug got introduced in 9.5 due to code rearrangement in commits 2c03216d83 and 2076db2aea. Fix it by ensuring that XLogInsertRecord performs a recomputation when the given record is generated with FPW as off but found that the flag has been turned on while actually inserting the record. Reported-by: Kyotaro Horiguchi Author: Kyotaro Horiguchi Reviewed-by: Amit Kapila Backpatch-through: 9.5 where this problem was introduced Discussion: https://postgr.es/m/20180420.151043.74298611.horiguchi.kyotaro@lab.ntt.co.jp
* Repair double-free in SP-GIST rescan (bug #15378)Andrew Gierth2018-09-11
| | | | | | | | | | | | | | | | | | | | | | spgrescan would first reset traversalCxt, and then traverse a potentially non-empty stack containing pointers to traversalValues which had been allocated in those contexts, freeing them a second time. This bug originates in commit ccd6eb49a where traversalValue was introduced. Repair by traversing the stack before the context reset; this isn't ideal, since it means doing retail pfree in a context that's about to be reset, but the freeing of a stack entry is also done in other places in the code during the scan so it's not worth trying to refactor it further. Regression test added. Backpatch to 9.6 where the problem was introduced. Per bug #15378; analysis and patch by me, originally from a report on IRC by user velix; see also PostGIS ticket #4174; review by Alexander Korotkov. Discussion: https://postgr.es/m/153663176628.23136.11901365223750051490@wrigleys.postgresql.org