path: root/src/backend/access
Commit message / Author / Date
* Remove duplicate code in brin_memtuple_initialize (Tomas Vondra, 2020-11-11)

  Commit 8bf74967dab moved some of the code from brin_new_memtuple to
  brin_memtuple_initialize, but this resulted in some of the code being
  duplicated. Fix by removing the duplicate lines and backpatch to 10.

  Author: Tomas Vondra
  Backpatch-through: 10
  Discussion: https://postgr.es/m/5eb50c97-9a8e-b691-8c40-1b2a55611c4c%40enterprisedb.com

* Fix and simplify some usages of TimestampDifference(). (Tom Lane, 2020-11-10)

  Introduce TimestampDifferenceMilliseconds() to simplify callers that
  would rather have the difference in milliseconds, instead of the
  select()-oriented seconds-and-microseconds format. This gets rid of at
  least one integer division per call, and it eliminates some
  apparently-easy-to-mess-up arithmetic.

  Two of these call sites were in fact wrong:

  * pg_prewarm's autoprewarm_main() forgot to multiply the seconds by
    1000, thus ending up with a delay 1000X shorter than intended. That
    doesn't quite make it a busy-wait, but close.

  * postgres_fdw's pgfdw_get_cleanup_result() thought it needed to
    compute microseconds not milliseconds, thus ending up with a delay
    1000X longer than intended. Somebody along the way had noticed this
    problem but misdiagnosed the cause, and imposed an ad-hoc 60-second
    limit rather than fixing the units. This was relatively harmless in
    context, because we don't care that much about exactly how long this
    delay is; still, it's wrong.

  There are a few more callers of TimestampDifference() that don't have
  a direct need for seconds-and-microseconds, but can't use
  TimestampDifferenceMilliseconds() either because they do need
  microsecond precision or because they might possibly deal with
  intervals long enough to overflow 32-bit milliseconds. It might be
  worth inventing another API to improve that, but that seems outside
  the scope of this patch; so those callers are untouched here.

  Given the fact that we are fixing some bugs, and the likelihood that
  future patches might want to back-patch code that uses this new API,
  back-patch to all supported branches.

  Alexey Kondratov and Tom Lane

  Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru

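  A minimal sketch of what such a helper can look like, assuming
  TimestampTz counts microseconds (the committed function's rounding and
  overflow handling may differ):

      long
      TimestampDifferenceMilliseconds(TimestampTz start_time,
                                      TimestampTz stop_time)
      {
          TimestampTz diff = stop_time - start_time;  /* in microseconds */

          if (diff <= 0)
              return 0;
          /* Round up so a pending wait never rounds down to a busy loop. */
          return (long) ((diff + 999) / 1000);
      }
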
* In security-restricted operations, block enqueue of at-commit user code. (Noah Misch, 2020-11-09)

  Specifically, this blocks DECLARE ... WITH HOLD and firing of deferred
  triggers within index expressions and materialized view queries. An
  attacker having permission to create non-temp objects in at least one
  schema could execute arbitrary SQL functions under the identity of the
  bootstrap superuser.

  One can work around the vulnerability by disabling autovacuum and not
  manually running ANALYZE, CLUSTER, REINDEX, CREATE INDEX, VACUUM FULL,
  or REFRESH MATERIALIZED VIEW. (Don't restore from pg_dump, since it
  runs some of those commands.) Plain VACUUM (without FULL) is safe, and
  all commands are fine when a trusted user owns the target object.
  Performance may degrade quickly under this workaround, however.

  Back-patch to 9.5 (all supported versions).

  Reviewed by Robert Haas. Reported by Etienne Stalmans.

  Security: CVE-2020-25695

* Properly detoast data in brin_form_tuple (Tomas Vondra, 2020-11-07)

  brin_form_tuple failed to consider that the values may be toasted,
  inserting the toast pointer into the index. This may easily result in
  index corruption, as the toast data may be deleted and cleaned up by
  vacuum. The cleanup however does not care about indexes, leaving
  invalid toast pointers behind, which triggers errors like this:

      ERROR:  missing chunk number 0 for toast value 16433 in pg_toast_16426

  A less severe consequence are inconsistent failures due to the index
  row being too large, depending on whether brin_form_tuple operated on
  a plain or toasted version of the row. For example

      CREATE TABLE t (val TEXT);
      INSERT INTO t VALUES ('... long value ...');
      CREATE INDEX idx ON t USING brin (val);

  would likely succeed, as the row would likely include a toast pointer.
  Switching the order of INSERT and CREATE INDEX would likely fail:

      ERROR:  index row size 8712 exceeds maximum 8152 for index "idx"

  because this happens before the row values are toasted.

  The bug exists since PostgreSQL 9.5 where BRIN indexes were
  introduced. So backpatch all the way back.

  Author: Tomas Vondra
  Reviewed-by: Alvaro Herrera
  Backpatch-through: 9.5
  Discussion: https://postgr.es/m/20201001184133.oq5uq75sb45pu3aw@development
  Discussion: https://postgr.es/m/20201104010544.zexj52mlldagzowv%40development

* Reproduce debug_query_string==NULL on parallel workers. (Noah Misch, 2020-10-31)

  Certain background workers initiate parallel queries while
  debug_query_string==NULL, at which point they attempted strlen(NULL)
  and died to SIGSEGV. Older debug_query_string observers allow NULL, so
  do likewise in these newer ones.

  Back-patch to v11, where commit
  7de4a1bcc56f494acbd0d6e70781df877dc8ecb5 introduced the first of
  these.

  Discussion: https://postgr.es/m/20201014022636.GA1962668@rfd.leadboat.com

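  A hedged sketch of the kind of guard this implies (variable names
  assumed, not taken from the commit):

      /* Tolerate debug_query_string == NULL, as older observers do. */
      query_len = debug_query_string ? strlen(debug_query_string) + 1 : 0;
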
* Fix missing fsync of SLRU directories. (Thomas Munro, 2020-09-24)

  Harmonize behavior by moving responsibility for fsyncing directories
  down into slru.c. In 10 and later, only the multixact directories were
  missed (see commit 1b02be21), and in older branches all SLRUs were
  missed.

  Back-patch to all supported releases.

  Reviewed-by: Andres Freund <andres@anarazel.de>
  Reviewed-by: Michael Paquier <michael@paquier.xyz>
  Discussion: https://postgr.es/m/CA%2BhUKGLtsTUOScnNoSMZ-2ZLv%2BwGh01J6kAo_DM8mTRq1sKdSQ%40mail.gmail.com

* Update parallel BTree scan state when the scan keys can't be satisfied. (Amit Kapila, 2020-09-17)

  For a parallel btree scan to work for an array of scan keys, it should
  reach the BTPARALLEL_DONE state once for every distinct combination of
  array keys. This is required to ensure that the parallel workers don't
  try to seize blocks at the same time for different scan keys. We
  failed to update this state when we discovered that the scan keys
  can't be satisfied.

  Author: James Hunter
  Reviewed-by: Amit Kapila
  Tested-by: Justin Pryzby
  Backpatch-through: 10, where it was introduced
  Discussion: https://postgr.es/m/4248CABC-25E3-4809-B4D0-128E1BAABC3C@amazon.com

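  As a sketch, the fix for the commit above amounts to ending the
  parallel scan on that path too (hedged; the call site and variable
  names are assumed):

      /*
       * Scan keys can't be satisfied: advance the shared state to
       * BTPARALLEL_DONE for this array-key combination so workers
       * don't try to seize blocks for it.
       */
      if (!so->qual_ok)
      {
          _bt_parallel_done(scan);
          return false;
      }
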
* Fix code for re-finding scan position in a multicolumn GIN index. (Tom Lane, 2020-08-27)

  collectMatchBitmap() needs to re-find the index tuple it was
  previously looking at, after transiently dropping lock on the index
  page it's on. The tuple should still exist and be at its prior
  position or somewhere to the right of that, since ginvacuum never
  removes tuples but concurrent insertions could add one. However, there
  was a thinko in that logic, to the effect of expecting any inserted
  tuples to have the same index "attnum" as what we'd been scanning.
  Since there's no physical separation of tuples with different attnums,
  it's not terribly hard to devise scenarios where this fails, leading
  to transient "lost saved point in index" errors. (While I've
  duplicated this with manual testing, it seems impossible to make a
  reproducible test case with our available testing technology.)

  Fix by just continuing the scan when the attnum doesn't match. While
  here, improve the error message used if we do fail, so that it matches
  the wording used in btree for a similar case.

  collectMatchBitmap()'s posting-tree code path was previously not
  exercised at all by our regression tests. While I can't make a
  regression test that exhibits the bug, I can at least improve the code
  coverage here, so do that. The test case I made for this is an
  extension of one added by 4b754d6c1, so it only works in HEAD and v13;
  didn't seem worth trying hard to back-patch it.

  Per bug #16595 from Jesse Kinkead. This has been broken since
  multicolumn capability was added to GIN (commit 27cb66fdf), so
  back-patch to all supported branches.

  Discussion: https://postgr.es/m/16595-633118be8eef9ce2@postgresql.org

* Prevent concurrent SimpleLruTruncate() for any given SLRU. (Noah Misch, 2020-08-15)

  The SimpleLruTruncate() header comment states the new coding rule. To
  achieve this, add locktype "frozenid" and two LWLocks. This closes a
  rare opportunity for data loss, which manifested as "apparent
  wraparound" or "could not access status of transaction" errors. Data
  loss is more likely in pg_multixact, due to released branches' thin
  margin between multiStopLimit and multiWrapLimit. If a user's physical
  replication primary logged ": apparent wraparound" messages, the user
  should rebuild standbys of that primary regardless of symptoms. At
  less risk is a cluster having emitted "not accepting commands" errors
  or "must be vacuumed" warnings at some point. One can test a cluster
  for this data loss by running VACUUM FREEZE in every database.

  Back-patch to 9.5 (all supported versions).

  Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com

* Handle new HOT chains in index-build table scans (Alvaro Herrera, 2020-08-13)

  When a table is scanned by heapam_index_build_range_scan (née
  IndexBuildHeapScan) and the table lock being held allows concurrent
  data changes, it is possible for new HOT chains to sprout in a page
  that were unknown when the scan of the page happened. This leads to an
  error such as

      ERROR:  failed to find parent tuple for heap-only tuple at (X,Y) in table "tbl"

  because the root tuple was not present when we first obtained the list
  of the page's root tuples. This can be fixed by re-obtaining the list
  of root tuples, if we see that a heap-only tuple appears to point to a
  non-existing root.

  This was reported by Anastasia as occurring for BRIN summarization
  (which exists since 9.5), but I think it could theoretically also
  happen with CREATE INDEX CONCURRENTLY (much older) or REINDEX
  CONCURRENTLY (very recent). It seems a happy coincidence that BRIN
  forces us to backpatch this all the way to 9.5.

  Reported-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
  Diagnosed-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
  Co-authored-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
  Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
  Discussion: https://postgr.es/m/602d8487-f0b2-5486-0088-0f372b2549fa@postgrespro.ru
  Backpatch: 9.5 - master

* BRIN: Handle concurrent desummarization properly (Alvaro Herrera, 2020-08-12)

  If a page range is desummarized at just the right time concurrently
  with an index walk, BRIN would raise an error indicating index
  corruption. This is scary and unhelpful; silently returning that the
  page range is not summarized is sufficient reaction.

  This bug was introduced by commit 975ad4e602ff as additional
  protection against a bug whose actual fix was elsewhere. Backpatch
  equally.

  Reported-By: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
  Diagnosed-By: Alexander Lakhin <exclusion@gmail.com>
  Discussion: https://postgr.es/m/2588667e-d07d-7e10-74e2-7e1e46194491@postgrespro.ru
  Backpatch: 9.5 - master

* Fix typo. (Robert Haas, 2020-08-06)

  Per report from Tom Lane. Previously fixed in master by commit
  f057980149ddccd4b862d2c6b3920ed498b0d7ec.

* Fix minor problems with non-exclusive backup cleanup. (Robert Haas, 2020-08-06)

  The previous coding imagined that it could call before_shmem_exit()
  when a non-exclusive backup began and then remove the previously-added
  handler by calling cancel_before_shmem_exit() when that backup ended.
  However, this only works provided that nothing else in the system has
  registered a before_shmem_exit() hook in the interim, because
  cancel_before_shmem_exit() is documented to remove a callback only if
  it is the latest callback registered. It also only works if nothing
  can ERROR out between the time that sessionBackupState is reset and
  the time that cancel_before_shmem_exit() is called, which doesn't seem
  to be strictly true.

  To fix, leave the handler installed for the lifetime of the session,
  arrange to install it just once, and teach it to quietly do nothing if
  there isn't a non-exclusive backup in process.

  This was originally committed to master as
  303640199d0436c5e7acdf50b837a027b5726594, but I did not back-patch at
  the time because the consequences were minor. However, now there's
  been a second report of this causing trouble with a slightly different
  test case than the one I reported originally, so now I'm back-patching
  as far as v11 where JIT was introduced.

  Patch by me, reviewed by Kyotaro Horiguchi, Michael Paquier (who
  preferred a different approach, but got outvoted), Fujii Masao, and
  Tom Lane, and with comments by various others. New problem report from
  Bharath Rupireddy.

  Discussion: http://postgr.es/m/CA+TgmobMjnyBfNhGTKQEDbqXYE3_rXWpc4CM63fhyerNCes3mA@mail.gmail.com
  Discussion: http://postgr.es/m/CALj2ACWk7j4F2v2fxxYfrroOF=AdFNPr1WsV+AGtHAFQOqm_pw@mail.gmail.com

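  A hedged sketch of the resulting handler shape (the function and state
  names follow xlog.c conventions but are assumed here):

      static void
      do_pg_abort_backup(int code, Datum arg)
      {
          /*
           * Installed once for the lifetime of the session; quietly do
           * nothing unless a non-exclusive backup is actually running.
           */
          if (sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE)
              return;

          /* ... decrement the backup counter and clean up state ... */
      }
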
* Remove WARNING message from brin_desummarize_range (Alvaro Herrera, 2020-07-09)

  This message was being emitted on the grounds that only crashed
  summarization could cause it, but in reality even an aborted vacuum
  could do it ... which makes it way too noisy, particularly since it
  shows up in regression tests and makes them die.

  Reported by Tom Lane.
  Discussion: https://postgr.es/m/489091.1593534251@sss.pgh.pa.us

* Add parens to ConvertToXSegs macro (Alvaro Herrera, 2020-06-24)

  The current definition (introduced in 9a3215026bd6) is dangerous. No
  bugs exist in our code at present, but backpatch to 11 nonetheless,
  where that commit debuted.

  Author: Álvaro Herrera <alvherre@alvh.no-ip.org>

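  The hazard is ordinary macro hygiene: an unparenthesized argument such
  as a + b expands with the wrong precedence. A hedged sketch of the
  parenthesized form (the macro body shown here is assumed):

      /* Parenthesize the argument so compound expressions expand safely. */
      #define ConvertToXSegs(x, segsize) \
          ((x) / ((segsize) / (1024 * 1024)))
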
* Fix masking of SP-GiST pages during xlog consistency check (Alexander Korotkov, 2020-06-20)

  spg_mask() didn't take into account that pd_lower equal to
  SizeOfPageHeaderData is still a valid value. This commit fixes that.
  Backpatch to 11, where the spg_mask() pd_lower check was introduced.

  Reported-by: Michael Paquier
  Discussion: https://postgr.es/m/20200615131405.GM52676%40paquier.xyz
  Backpatch-through: 11

* Fix buffile.c error handling. (Thomas Munro, 2020-06-16)

  Convert buffile.c error handling to use ereport. This fixes cases
  where I/O errors were indistinguishable from EOF or not reported. Also
  remove "%m" from error messages where errno would be bogus. While
  we're modifying those strings, add block numbers and short read byte
  counts where appropriate.

  Back-patch to all supported releases.

  Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com>
  Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
  Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
  Reviewed-by: Robert Haas <robertmhaas@gmail.com>
  Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com>
  Reviewed-by: Michael Paquier <michael@paquier.xyz>
  Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com

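  A hedged sketch of the conversion pattern described above (the exact
  read call and message wording are assumed; signatures vary across
  branches):

      nread = FileRead(file, ptr, amount, offset, WAIT_EVENT_BUFFILE_READ);
      if (nread < 0)
          ereport(ERROR,
                  (errcode_for_file_access(),
                   errmsg("could not read file \"%s\": %m",
                          FilePathName(file))));
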
* Fix locking bugs that could corrupt pg_control. (Thomas Munro, 2020-06-08)

  The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire
  ControlFileLock before modifying ControlFile->checkPointCopy, or the
  checkpointer could write out a control file with a bad checksum.

  Likewise, XLogReportParameters() must acquire ControlFileLock before
  modifying ControlFile and calling UpdateControlFile().

  Back-patch to all supported releases.

  Author: Nathan Bossart <bossartn@amazon.com>
  Author: Fujii Masao <masao.fujii@oss.nttdata.com>
  Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
  Reviewed-by: Michael Paquier <michael@paquier.xyz>
  Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
  Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com

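  Schematically, the redo routines now modify the shared copy only under
  the lock (a hedged sketch, not the committed code):

      /*
       * Hold ControlFileLock while updating checkPointCopy so the
       * checkpointer cannot write out a half-updated control file.
       */
      LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
      ControlFile->checkPointCopy = checkPoint;
      LWLockRelease(ControlFileLock);
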
* Avoid killing btree items that are already dead (Alvaro Herrera, 2020-05-15)

  _bt_killitems marks btree items dead when a scan leaves the page where
  they live, but it does so with only share lock (to improve
  concurrency). This was historically okay, since killing a dead item
  has no consequences. However, with the advent of data checksums and
  wal_log_hints, this action incurs a WAL full-page-image record of the
  page. Multiple concurrent processes would write the same page several
  times, leading to WAL bloat.

  The probability of this happening can be reduced by only killing items
  if they're not already dead, so change the code to do that. The
  problem could be eliminated completely by having _bt_killitems upgrade
  to exclusive lock upon seeing a killable item, but that would reduce
  concurrency so it's considered a cure worse than the disease.

  Backpatch all the way back to 9.5, since wal_log_hints was introduced
  in 9.4.

  Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
  Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com

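  A hedged sketch of the guard (variable names assumed):

      ItemId iid = PageGetItemId(page, offnum);

      /*
       * Mark the item dead only if it isn't already, so concurrent
       * scans don't each emit a full-page image under wal_log_hints
       * or data checksums.
       */
      if (!ItemIdIsDead(iid))
      {
          ItemIdMarkDead(iid);
          killedsomething = true;
      }
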
* Prevent archive recovery from scanning non-existent WAL files. (Fujii Masao, 2020-05-09)

  Previously when there were multiple timelines listed in the history
  file of the recovery target timeline, archive recovery searched all of
  them, starting from the newest timeline to the oldest one, to find the
  segment to read. That is, archive recovery had to continuously fail
  scanning the segment until it reached the timeline that the segment
  belonged to. These scans for non-existent segments could be harmful to
  recovery performance, especially when the archival area was located on
  remote storage and each scan could take a long time.

  To address the issue, this commit changes archive recovery so that it
  skips scanning any timeline that the segment to read doesn't belong
  to.

  Per discussion, back-patch to all supported versions.

  Author: Kyotaro Horiguchi, tweaked a bit by Fujii Masao
  Reviewed-by: David Steele, Pavel Suderevsky, Grigory Smolkin
  Discussion: https://postgr.es/m/16159-f5a34a3a04dc67e0@postgresql.org
  Discussion: https://postgr.es/m/20200129.120222.1476610231001551715.horikyota.ntt@gmail.com

* Report missing wait event for timeline history file. (Fujii Masao, 2020-05-08)

  TimelineHistoryRead and TimelineHistoryWrite wait events are reported
  while waiting for a read and a write of a timeline history file,
  respectively. However, previously, the TimelineHistoryRead wait event
  was not reported while readTimeLineHistory() was reading a timeline
  history file. Also TimelineHistoryWrite was not reported while
  writeTimeLineHistory() was writing one line with the details of the
  timeline split, at the end. This commit fixes these issues.

  Back-patch to v10 where wait events for a timeline history file were
  added.

  Author: Masahiro Ikeda
  Reviewed-by: Michael Paquier, Fujii Masao
  Discussion: https://postgr.es/m/d11b0c910b63684424e06772eb844ab5@oss.nttdata.com

* Clear up issue with FSM and oldest btpo.xact. (Peter Geoghegan, 2020-05-01)

  On further reflection, code comments added by commit b0229f26 slightly
  misrepresented how we determine the oldest btpo.xact for the index.
  btvacuumpage() does not treat the btpo.xact of a page that it put in
  the FSM as a candidate to be the oldest deleted page (the
  delete-marked page that has the oldest btpo.xact XID among all pages
  encountered).

  The definition of a deleted page for the purposes of the btpo.xact
  calculation is different from the definition used by the bulk delete
  statistics. The bulk delete statistics don't distinguish between pages
  that were deleted by the current VACUUM, pages deleted by a previous
  VACUUM operation but not yet recyclable/reusable, and pages that are
  reusable (though reusable pages are counted separately).

  Backpatch: 11-, just like commit b0229f26.

* Fix undercounting in VACUUM VERBOSE output. (Peter Geoghegan, 2020-05-01)

  The logic for determining how many nbtree pages in an index are
  deleted pages sometimes undercounted pages. Pages that were deleted by
  the current VACUUM operation (as opposed to some previous VACUUM
  operation whose deleted pages have yet to be reused) were sometimes
  overlooked. The final count is exposed to users through VACUUM
  VERBOSE's "%u index pages have been deleted" output.

  btvacuumpage() avoided double-counting when _bt_pagedel() deleted more
  than one page by assuming that only one page was deleted, and that the
  additional deleted pages would get picked up during a future call to
  btvacuumpage() by the same VACUUM operation. _bt_pagedel() can
  legitimately delete pages that the btvacuumscan() scan will not visit
  again, though, so that assumption was slightly faulty.

  Fix the accounting by teaching _bt_pagedel() about its caller's
  requirements. It now only reports on pages that it knows
  btvacuumscan() won't visit again (including the current
  btvacuumpage() page), so everything works out in the end.

  This bug has been around forever. Only backpatch to v11, though, to
  keep _bt_pagedel() in sync on the branches that have today's bugfix
  commit b0229f26da. Note that this commit changes the signature of
  _bt_pagedel(), just like commit b0229f26da.

  Author: Peter Geoghegan
  Reviewed-By: Masahiko Sawada
  Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
  Backpatch: 11-

* Fix bug in nbtree VACUUM "skip full scan" feature. (Peter Geoghegan, 2020-05-01)

  Commit 857f9c36cda (which taught nbtree VACUUM to skip a scan of the
  index from btcleanup in situations where it doesn't seem worth it)
  made VACUUM maintain the oldest btpo.xact among all deleted pages for
  the index as a whole. It failed to handle all the details surrounding
  pages that are deleted by the current VACUUM operation correctly
  (though pages deleted by some previous VACUUM operation were processed
  correctly).

  The most immediate problem was that the special area of the page was
  examined without a buffer pin at one point. More fundamentally, the
  handling failed to account for the full range of _bt_pagedel()
  behaviors. For example, _bt_pagedel() sometimes deletes internal pages
  in passing, as part of deleting an entire subtree with btvacuumpage()
  caller's page as the leaf level page. The original leaf page passed to
  _bt_pagedel() might not be the page that it deletes first in cases
  where deletion can take place.

  It's unclear how disruptive this bug may have been, or what symptoms
  users might want to look out for. The issue was spotted during
  unrelated code review.

  To fix, push down the logic for maintaining the oldest btpo.xact to
  _bt_pagedel(). btvacuumpage() is now responsible for pages that were
  fully deleted by a previous VACUUM operation, while _bt_pagedel() is
  now responsible for pages that were deleted by the current VACUUM
  operation (this includes half-dead pages from a previous interrupted
  VACUUM operation that become fully deleted in _bt_pagedel()). Note
  that _bt_pagedel() should never encounter an existing deleted page.

  This commit theoretically breaks the ABI of a stable release by
  changing the signature of _bt_pagedel(). However, if any third party
  extension is actually affected by this, then it must already be
  completely broken (since there are numerous assumptions made in
  _bt_pagedel() that cannot be met outside of VACUUM). It seems highly
  unlikely that such an extension actually exists, in any case.

  Author: Peter Geoghegan
  Reviewed-By: Masahiko Sawada
  Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
  Backpatch: 11-, where the "skip full scan" feature was introduced.

* Fix handling of WAL segments ready to be archived during crash recovery (Michael Paquier, 2020-04-24)

  78ea8b5 has fixed an issue related to the recycling of WAL segments on
  standbys depending on archive_mode. However, it has introduced a
  regression with the handling of WAL segments ready to be archived
  during crash recovery, causing those files to be recycled without
  getting archived.

  This commit fixes the regression by tracking in shared memory if a
  live cluster is either in crash recovery or archive recovery, as the
  handling of WAL segments ready to be archived is different in both
  cases (those WAL segments should not be removed during crash
  recovery), and by using this new shared memory state to decide if a
  segment can be recycled or not. Previously, it was not possible to
  know if a cluster was in crash recovery or archive recovery as the
  shared state was able to track only if recovery was happening or not,
  leading to the problem.

  A set of TAP tests is added to close the gap here, making sure that
  WAL segments ready to be archived are correctly handled when a cluster
  is in archive or crash recovery with archive_mode set to "on" or
  "always", for both standby and primary.

  Reported-by: Benoît Lobréau
  Author: Jehan-Guillaume de Rorthais
  Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
  Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
  Backpatch-through: 9.5

* Fix possible crash during FATAL exit from reindexing. (Tom Lane, 2020-04-21)

  index.c supposed that it could just use a PG_TRY block to clean up the
  state associated with an active REINDEX operation. However, that code
  doesn't run if we do a FATAL exit --- for example, due to a SIGTERM
  shutdown signal --- while the REINDEX is happening. And that state
  does get consulted during catalog accesses, which makes it problematic
  if we do any catalog accesses during shutdown --- for example, to
  clean up any temp tables created in the session.

  If this combination of circumstances occurred, we could find ourselves
  trying to access already-freed memory. In debug builds that'd fairly
  reliably cause an assertion failure. In production we might often get
  away with it, but with some bad luck it could cause a core dump.

  Another possible bad outcome is an erroneous conclusion that an
  index-to-be-accessed is being reindexed; but it looks like that would
  be unlikely to have any consequences worse than failing to drop temp
  tables right away. (They'd still get dropped by the next session that
  uses that temp schema.)

  To fix, get rid of the use of PG_TRY here, and instead hook into the
  transaction abort mechanisms to clean up reindex state.

  Per bug #16378 from Alexander Lakhin. This has been wrong for a very
  long time, so back-patch to all supported branches.

  Discussion: https://postgr.es/m/16378-7a70ca41b3ec2009@postgresql.org

* Allow parallel create index to accumulate buffer usage stats. (Amit Kapila, 2020-04-09)

  Currently, we don't account for buffer usage incurred by parallel
  workers for parallel create index. This commit allows each worker to
  record its buffer usage stats and the leader backend to accumulate
  those stats at the end of the operation. This will allow
  pg_stat_statements to display correct buffer usage stats for the
  (parallel) create index command.

  Reported-by: Julien Rouhaud
  Author: Sawada Masahiko
  Reviewed-by: Dilip Kumar, Julien Rouhaud and Amit Kapila
  Backpatch-through: 11, where this was introduced
  Discussion: https://postgr.es/m/20200328151721.GB12854@nol

* Use TransactionXmin instead of RecentGlobalXmin in heap_abort_speculative(). (Andres Freund, 2020-04-05)

  There's a very low risk that RecentGlobalXmin could be far enough in
  the past to be older than relfrozenxid, or even wrapped around.
  Luckily the consequences of that having happened wouldn't be too bad -
  the page wouldn't be pruned for a while.

  Avoid that risk by using TransactionXmin instead. As that's announced
  via MyPgXact->xmin, it is protected against wrapping around (see code
  comments for details around relfrozenxid).

  Author: Andres Freund
  Discussion: https://postgr.es/m/20200328213023.s4eyijhdosuc4vcj@alap3.anarazel.de
  Backpatch: 9.5-

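  In effect (a hedged sketch; the call site in heap_abort_speculative()
  is assumed):

      /*
       * Use TransactionXmin, which cannot wrap around while this
       * backend advertises it via MyPgXact->xmin, as the prune hint.
       */
      PageSetPrunable(page, TransactionXmin);
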
* Fix bugs in gin_fuzzy_search_limit processing. (Tom Lane, 2020-04-03)

  entryGetItem()'s three code paths each contained bugs associated with
  filtering the entries for gin_fuzzy_search_limit.

  The posting-tree path failed to advance "advancePast" after having
  decided to filter an item. If we ran out of items on the current page
  and needed to advance to the next, what would actually happen is that
  entryLoadMoreItems() would re-load the same page. Eventually, the
  random dropItem() test would accept one of the same items it'd
  previously rejected, and we'd move on --- but it could take a while
  with small gin_fuzzy_search_limit. To add insult to injury, this case
  would inevitably cause entryLoadMoreItems() to decide it needed to
  re-descend from the root, making things even slower.

  The posting-list path failed to implement gin_fuzzy_search_limit
  filtering at all, so that all entries in the posting list would be
  returned.

  The bitmap-result path used a "gotitem" variable that it failed to
  update in the one place where it'd actually make a difference, ie at
  the one "continue" statement. I think this was unreachable in
  practice, because if we'd looped around then it shouldn't be the case
  that the entries on the new page are before advancePast. Still, the
  "gotitem" variable was contributing nothing to either clarity or
  correctness, so get rid of it.

  Refactor all three loops so that the termination conditions are more
  alike and less unreadable.

  The code coverage report showed that we had no coverage at all for the
  re-descend-from-root code path in entryLoadMoreItems(), which seems
  like a very bad thing, so add a test case that exercises it. We also
  had exactly no coverage for gin_fuzzy_search_limit, so add a
  simplistic test case that at least hits those code paths a little bit.

  Back-patch to all supported branches.

  Adé Heyward and Tom Lane

  Discussion: https://postgr.es/m/CAEknJCdS-dE1Heddptm7ay2xTbSeADbkaQ8bU2AXRCVC2LdtKQ@mail.gmail.com

* Revert "Skip WAL for new relfilenodes, under wal_level=minimal."Noah Misch2020-03-22
| | | | | | | | This reverts commit cb2fd7eac285b1b0a24eeb2b8ed4456b66c5a09f. Per numerous buildfarm members, it was incompatible with parallel query, and a test case assumed LP64. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
* Skip WAL for new relfilenodes, under wal_level=minimal. (Noah Misch, 2020-03-21)

  Until now, only selected bulk operations (e.g. COPY) did this. If a
  given relfilenode received both a WAL-skipping COPY and a WAL-logged
  operation (e.g. INSERT), recovery could lose tuples from the COPY. See
  src/backend/access/transam/README section "Skipping WAL for New
  RelFileNode" for the new coding rules. Maintainers of table access
  methods should examine that section.

  To maintain data durability, just before commit, we choose between an
  fsync of the relfilenode and copying its contents to WAL. A new GUC,
  wal_skip_threshold, guides that choice. If this change slows a
  workload that creates small, permanent relfilenodes under
  wal_level=minimal, try adjusting wal_skip_threshold. Users setting a
  timeout on COMMIT may need to adjust that timeout, and
  log_min_duration_statement analysis will reflect time consumption
  moving to COMMIT from commands like COPY.

  Internally, this requires a reliable determination of whether
  RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
  current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
  specification of rd_createSubid such that the field is zero when a new
  rel has an old rd_node. Make relcache.c retain entries for certain
  dropped relations until end of transaction.

  Back-patch to 9.5 (all supported versions). This introduces a new WAL
  record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As
  always, update standby systems before master systems. This changes
  sizeof(RelationData) and sizeof(IndexStmt), breaking binary
  compatibility for affected extensions. (The most recent commit to
  affect the same class of extensions was
  089e4d405d0f3b94c74a2c6a54357a84a681754b.)

  Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
  Haas. Heikki Linnakangas and Michael Paquier implemented earlier
  designs that materially clarified the problem. Reviewed, in earlier
  designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
  Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.

  Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org

* Back-patch log_newpage_range(). (Noah Misch, 2020-03-21)

  Back-patch a subset of commit 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b
  to v11, v10, 9.6, and 9.5. Include the latest repairs to this
  function. Use a new XLOG_FPI_MULTI value instead of reusing XLOG_FPI.
  That way, if an older server reads WAL from this function, that server
  will PANIC instead of applying just one page of the record. The next
  commit adds a call to this function.

  Discussion: https://postgr.es/m/20200304.162919.898938381201316571.horikyota.ntt@gmail.com

* During heap rebuild, lock any TOAST index until end of transaction. (Noah Misch, 2020-03-21)

  swap_relation_files() calls toast_get_valid_index() to find and lock
  this index, just before swapping with the rebuilt TOAST index. The
  latter function releases the lock before returning. Potential for
  mischief is low; a concurrent session can issue ALTER INDEX ... SET
  (fillfactor = ...), which is not alarming. Nonetheless, changing
  pg_class.relfilenode without a lock is unconventional. Back-patch to
  9.5 (all supported versions), because another fix needs this.

  Discussion: https://postgr.es/m/20191226001521.GA1772687@rfd.leadboat.com

* Avoid assertion failure with targeted recovery in standby mode. (Fujii Masao, 2020-03-09)

  At the end of recovery, standby mode is turned off to re-fetch the
  last valid record from archive or pg_wal. Previously, if the recovery
  target was reached and standby mode was turned off while the current
  WAL source was stream, recovery could unexpectedly try to retrieve the
  WAL file containing the last valid record from stream, even though not
  in standby mode. This caused an assertion failure; that is, the
  assertion test confirms that a WAL file should not be retrieved from
  stream if standby mode is not true.

  This commit moves the current WAL source back to archive if it's
  stream even though not in standby mode, to avoid that assertion
  failure.

  This issue doesn't cause the server to crash when built with
  assertions disabled. In this case, the attempt to retrieve the WAL
  file from stream not in standby mode just fails, and then recovery
  tries to retrieve the WAL file from archive or pg_wal.

  Back-patch to all supported branches.

  Author: Kyotaro Horiguchi
  Reviewed-by: Fujii Masao
  Discussion: https://postgr.es/m/20200227.124830.2197604521555566121.horikyota.ntt@gmail.com

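  A hedged sketch of the adjustment (identifiers assumed from xlog.c
  conventions):

      /*
       * Not in standby mode: never re-fetch from stream; fall back to
       * archive, which also covers pg_wal.
       */
      if (!StandbyMode && currentSource == XLOG_FROM_STREAM)
          currentSource = XLOG_FROM_ARCHIVE;
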
* Store the deletion horizon XID for a deleted GIN page on the right page. (Tom Lane, 2020-02-09)

  Commit b10714080 moved the GinPageSetDeleteXid() call to a spot where
  the "page" variable was pointing to the wrong page, causing the XID to
  be inserted on a page that's not being deleted, thus allowing later
  GinPageIsRecyclable tests to recycle the deleted page too soon.

  It might be a good idea to stop using the single "page" variable for
  multiple purposes in this function. But for the moment I just moved
  the GinPageSetDeleteXid() call down beside the GinPageSetDeleted()
  call, which seems like a more logical place for it anyway.

  Back-patch to v11, as the faulty patch was. (Fortunately, the bug
  hasn't made it into any release yet.)

  Discussion: https://postgr.es/m/21620.1581098806@sss.pgh.pa.us

* Force tuple conversion when the source has missing attributes. (Andrew Gierth, 2020-02-05)

  Tuple conversion incorrectly concluded that no conversion was needed
  as long as all the attributes lined up. But if the source tuple has a
  missing attribute (from addition of a column with default), then the
  destination tupdesc might not reflect the same default. The typical
  symptom was that the affected columns would be unexpectedly NULL.

  Repair by always forcing conversion if the source has missing
  attributes, which will be filled in by the deform operation. (In
  theory we could optimize for when the destination has the same
  default, but that seemed overkill.)

  Backpatch to 11 where missing attributes were added.

  Per bug #16242.

  Vik Fearing (discovery, code, testing) and me (analysis, testcase).

  Discussion: https://postgr.es/m/16242-d1c9fca28445966b@postgresql.org

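  A hedged sketch of the extra check (field names assumed from the
  TupleDesc structures):

      /*
       * If the source tupdesc carries missing attributes (columns added
       * with a default), force a deform/reform cycle so the defaults
       * get materialized, even though the attributes otherwise line up.
       */
      if (indesc->constr && indesc->constr->missing)
          same = false;
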
* Handle lack of DSM slots in parallel btree build, take 2. (Thomas Munro, 2020-02-05)

  Commit 74618e77 added a new check intended to fix a bug, but put it in
  the wrong place so that parallel btree build was always disabled. Do
  the check after we've actually tried to create a DSM segment.

  Back-patch to 11, like the earlier commit.

  Reviewed-by: Peter Geoghegan
  Discussion: https://postgr.es/m/CAH2-WzmDABkJzrNnvf%2BOULK-_A_j9gkYg_Dz-H62jzNv4eKQTw%40mail.gmail.com

* Handle lack of DSM slots in parallel btree build. (Thomas Munro, 2020-01-31)

  If no DSM slots are available, a ParallelContext can still be created,
  but its seg pointer is NULL. Teach parallel btree build to cope with
  that by falling back to a regular non-parallel build, to avoid
  crashing with a segmentation fault.

  Back-patch to 11, where parallel CREATE INDEX landed.

  Reported-by: Nicola Contu
  Reviewed-by: Peter Geoghegan
  Discussion: https://postgr.es/m/CA%2BhUKGJgJEBnkuODBVomyK3MWFvDBbMVj%3Dgdt6DnRPU-5sQ6UQ%40mail.gmail.com

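  A hedged sketch of the fallback (context handling assumed):

      InitializeParallelDSM(pcxt);

      /* No DSM segment available: back out and build serially. */
      if (pcxt->seg == NULL)
      {
          DestroyParallelContext(pcxt);
          ExitParallelMode();
          return;             /* caller proceeds without workers */
      }
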
* Fix crash in BRIN inclusion op functions, due to missing datum copy. (Heikki Linnakangas, 2020-01-20)

  The BRIN add_value() and union() functions need to make a longer-lived
  copy of the argument, if they want to store it in the BrinValues
  struct also passed as argument. The functions for the "inclusion
  operator classes" used with box, range and inet types didn't take into
  account that the union helper function might return its argument as
  is, without making a copy. Check for that case, and make a copy if
  necessary. That case arises at least with the range_union() function,
  when one of the arguments is an 'empty' range:

      CREATE TABLE brintest (n numrange);
      CREATE INDEX brinidx ON brintest USING brin (n);
      INSERT INTO brintest VALUES ('empty');
      INSERT INTO brintest VALUES (numrange(0, 2^1000::numeric));
      INSERT INTO brintest VALUES ('(-1, 0)');

      SELECT brin_desummarize_range('brinidx', 0);
      SELECT brin_summarize_range('brinidx', 0);

  Backpatch down to 9.5, where BRIN was introduced.

  Discussion: https://www.postgresql.org/message-id/e6e1d6eb-0a67-36aa-e779-bcca59167c14%40iki.fi
  Reviewed-by: Emre Hasegeli, Tom Lane, Alvaro Herrera

* Reimplement nullification of walsender timestamp (Alvaro Herrera, 2020-01-08)

  Make the value null only at pg_stat_activity-output time, as suggested
  by Tom Lane, instead of messing with the internal state. This should
  appease buildfarm members with force_parallel_mode=regress, which are
  running parallel queries on logical replication walsenders.

  The fact that walsenders can run parallel queries should perhaps be
  studied more carefully, but for the moment let's get rid of the red
  blots in buildfarm.

  Backpatch to pg10, like the previous commit.

  Discussion: https://postgr.es/m/30804.1578438763@sss.pgh.pa.us

* pg_stat_activity: show NULL stmt start time for walsenders (Alvaro Herrera, 2020-01-07)

  Returning a non-NULL time is pointless, since a walsender is not a
  process that would be running normal transactions anyway, but the code
  was unintentionally exposing the process start time intermittently,
  which was not only bogus but also confused monitoring systems looking
  for idle transactions. Fix by avoiding all updates in walsenders.

  Backpatch to 11, where walsenders started appearing in
  pg_stat_activity.

  Reported-by: Tomas Vondra
  Discussion: https://postgr.es/m/20191209234409.exe7osmyalwkt5j4@development

* Remove shadow variables linked to RedoRecPtr in xlog.c (Michael Paquier, 2019-12-18)

  This changes the routines in charge of recycling WAL segments past the
  last redo LSN to no longer use "RedoRecPtr" as a local variable, since
  it is also available in the context of the session as a static
  declaration, replacing it with "lastredoptr". This confusion has been
  introduced by d9fadbf, so backpatch down to v11 like the other commit.

  Thanks to Tom Lane, Robert Haas, Alvaro Herrera, Mark Dilger and
  Kyotaro Horiguchi for the input provided.

  Author: Ranier Vilela
  Discussion: https://postgr.es/m/MN2PR18MB2927F7B5F690065E1194B258E35D0@MN2PR18MB2927.namprd18.prod.outlook.com
  Backpatch-through: 11

* Change overly strict Assert in TransactionGroupUpdateXidStatus. (Amit Kapila, 2019-12-17)

  This Assert assumed that an overflowed transaction could never get
  registered for the group update. But that is not true, because even
  when the number of children for a transaction got reduced, the
  overflow flag is not changed. And, for group update, we only care
  about the current number of children for a transaction that is being
  committed.

  Based on comments by Andres Freund, remove a redundant Assert in
  TransactionIdSetPageStatus, as we already had a static Assert for the
  same condition a few lines earlier.

  Reported-by: Vignesh C
  Author: Dilip Kumar
  Reviewed-by: Amit Kapila
  Backpatch-through: 11
  Discussion: https://postgr.es/m/CAFiTN-s5=uJw-Z6JC9gcqtBSjXsrHnU63PXBrA=pnBjqnkm5UA@mail.gmail.com

* Avoid assertion failure with LISTEN in a serializable transaction. (Tom Lane, 2019-11-24)

  If LISTEN is the only action in a serializable-mode transaction, and
  the session was not previously listening, and the notify queue is not
  empty, predicate.c reported an assertion failure. That happened
  because we'd acquire the transaction's initial snapshot during
  PreCommit_Notify, which was called *after* predicate.c expects any
  such snapshot to have been established.

  To fix, just swap the order of the PreCommit_Notify and
  PreCommit_CheckForSerializationFailure calls during CommitTransaction.
  This will imply holding the notify-insertion lock slightly longer, but
  the difference could only be meaningful in serializable mode, which is
  an expensive option anyway.

  It appears that this is just an assertion failure, with no
  consequences in non-assert builds. A snapshot used only to scan the
  notify queue could not have been involved in any serialization
  conflicts, so there would be nothing for
  PreCommit_CheckForSerializationFailure to do except assign it a
  prepareSeqNo and set the SXACT_FLAG_PREPARED flag. And given no
  conflicts, neither of those omissions affect the behavior of
  ReleasePredicateLocks. This admittedly once-over-lightly analysis is
  backed up by the lack of field reports of trouble.

  Per report from Mark Dilger. The bug is old, so back-patch to all
  supported branches; but the new test case only goes back to 9.6, for
  lack of adequate isolationtester infrastructure before that.

  Discussion: https://postgr.es/m/3ac7f397-4d5f-be8e-f354-440020675694@gmail.com
  Discussion: https://postgr.es/m/13881.1574557302@sss.pgh.pa.us

* Fix page modification outside of critical section in GIN (Alexander Korotkov, 2019-11-20)

  By oversight, 52ac6cd2d0 made ginDeletePage() set pd_prune_xid of the
  page to be deleted before entering the critical section. It appears
  that only versions 11 and later were affected by this oversight.

  Backpatch-through: 11

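  Schematically, the rule being enforced (a hedged sketch; the actual
  sequence in ginDeletePage() has more steps):

      START_CRIT_SECTION();

      /* All page modifications belong inside the critical section. */
      GinPageGetOpaque(page)->flags = GIN_DELETED;
      GinPageSetDeleteXid(page, ReadNewTransactionId());
      MarkBufferDirty(dBuffer);

      /* ... WAL logging ... */

      END_CRIT_SECTION();
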
* Revise GIN README (Alexander Korotkov, 2019-11-20)

  We find GIN concurrency bugs from time to time. One of the problems
  here is that the concurrency of GIN isn't well-documented in the
  README. So, it might be even hard to distinguish design bugs from
  implementation bugs.

  This commit revises the concurrency section in the GIN README,
  providing more details. Some examples are illustrated in ASCII art.
  Also, this commit adds an explanation of how the tuple layout in
  internal GIN B-tree pages differs from nbtree.

  Discussion: https://postgr.es/m/CAPpHfduXR_ywyaVN4%2BOYEGaw%3DcPLzWX6RxYLBncKw8de9vOkqw%40mail.gmail.com
  Author: Alexander Korotkov
  Reviewed-by: Peter Geoghegan
  Backpatch-through: 9.4

* Fix traversing to the deleted GIN page via downlink (Alexander Korotkov, 2019-11-20)

  Current GIN code fails to handle traversing to a deleted page via a
  downlink. This commit fixes that by stepping right from the deleted
  page, like we do in nbtree.

  This commit also fixes setting of the 'deleted' flag on GIN pages. Now
  other page flags are not erased once a page is deleted. That helps to
  keep our assertions true if we arrive at a deleted page via a
  downlink.

  Discussion: https://postgr.es/m/CAPpHfdvMvsw-NcE5bRS7R1BbvA4BxoDnVVjkXC5W0Czvy9LVrg%40mail.gmail.com
  Author: Alexander Korotkov
  Reviewed-by: Peter Geoghegan
  Backpatch-through: 9.4

* Fix deadlock between ginDeletePage() and ginStepRight() (Alexander Korotkov, 2019-11-20)

  When ginDeletePage() is about to delete a page, it locks the page's
  left sibling to revise the rightlink. So, it locks pages in
  right-to-left order. At the same time, ginStepRight() locks pages in
  left-to-right order, and that could cause a deadlock.

  This commit makes ginScanToDelete() keep an exclusive lock on the left
  siblings of the currently investigated path. That eliminates the need
  to relock the left sibling in ginDeletePage(). Thus, the deadlock with
  ginStepRight() can't happen anymore.

  Reported-by: Chen Huajun
  Discussion: https://postgr.es/m/5c332bd1.87b6.16d7c17aa98.Coremail.chjischj%40163.com
  Author: Alexander Korotkov
  Reviewed-by: Peter Geoghegan
  Backpatch-through: 10

* Fix assertion failure when running pg_waldump -s. (Fujii Masao, 2019-11-07)

  If there is a WAL page that the continuation WAL record just fits
  within (i.e., the continuation record ends just at the end of the
  page) and the LSN in such a page is specified with the -s option,
  previously pg_waldump caused an assertion failure. The cause of this
  assertion failure was that XLogFindNextRecord(), which pg_waldump -s
  calls, mishandled such a special WAL page.

  This commit changes XLogFindNextRecord() so that it can handle such
  WAL pages correctly.

  Back-patch to all supported versions.

  Author: Andrey Lepikhov
  Reviewed-by: Fujii Masao, Michael Paquier
  Discussion: https://postgr.es/m/99303554-5dd5-06e6-f943-b3005ccd6edd@postgrespro.ru

* Fix failure of archive recovery with recovery_min_apply_delay enabled. (Fujii Masao, 2019-10-18)

  The recovery_min_apply_delay parameter is intended for use with
  streaming replication deployments. However, the documentation clearly
  explains that the parameter will be honored in all cases if it's
  specified, so it should take effect even in archive recovery. But,
  previously, archive recovery with recovery_min_apply_delay enabled
  always failed, and caused an assertion failure if assertions were
  enabled (--enable-cassert).

  The cause of this problem is that the ownership of
  recoveryWakeupLatch, which recovery_min_apply_delay uses, was taken
  only when standby mode is requested. So an unowned latch could be used
  in archive recovery, which caused the failure.

  This commit changes the recovery code so that the ownership of
  recoveryWakeupLatch is taken even in archive recovery, which prevents
  archive recovery with recovery_min_apply_delay from failing.

  Back-patch to v9.4 where recovery_min_apply_delay was added.

  Author: Fujii Masao
  Reviewed-by: Michael Paquier
  Discussion: https://postgr.es/m/CAHGQGwEyD6HdZLfdWc+95g=VQFPR4zQL4n+yHxQgGEGjaSVheQ@mail.gmail.com
