path: root/src/backend/access
Commit message | Author | Age
...
* Fix bulk table extension when copying into multiple partitions  (Andres Freund, 2023-10-13)
    When COPYing into a partitioned table that does not permit the use of table_multi_insert(), we could error out with
        ERROR: could not read block NN in file "base/...": read only 0 of 8192 bytes
    because BulkInsertState->next_free was not reset between partitions. This problem occurred only when not able to use table_multi_insert(), as a dedicated BulkInsertState for each partition is used in that case.

    The bug was introduced in 00d1e02be24, but it was hard to hit at that point, as commonly bulk relation extension is not used when not using table_multi_insert(). It became more likely after 82a4edabd27, which expanded the use of bulk extension.

    To fix the bug, reset the bulk relation extension state in BulkInsertState in ReleaseBulkInsertStatePin(). That was added (in b1ecb9b3fcf) to tackle a very similar issue. Obviously the name is not quite correct, but there might be external callers, and bulk insert state needs to be reset in precisely the situations in which ReleaseBulkInsertStatePin() already needed to be called. Medium term, the better fix likely is to disallow reusing BulkInsertState across relations.

    Add a test that, without the fix, reproduces #18130 in most configurations. The test also catches the problem fixed in b1ecb9b3fcf when run with small shared_buffers.

    Reported-by: Ivan Kolombet <enderstd@gmail.com>
    Analyzed-by: Tom Lane <tgl@sss.pgh.pa.us>
    Analyzed-by: Andres Freund <andres@anarazel.de>
    Bug: #18130
    Discussion: https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org
    Discussion: https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us
    Backpatch: 16-
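A minimal sketch of the fix described above, assuming BulkInsertState exposes the next_free/last_free fields named in the commit message; the exact body of ReleaseBulkInsertStatePin() here is illustrative, not the literal patch:

    void
    ReleaseBulkInsertStatePin(BulkInsertState bistate)
    {
        if (bistate->current_buf != InvalidBuffer)
            ReleaseBuffer(bistate->current_buf);
        bistate->current_buf = InvalidBuffer;

        /*
         * Also reset the bulk relation-extension state, so a stale
         * next_free block from the previous partition cannot be reused.
         */
        bistate->next_free = InvalidBlockNumber;
        bistate->last_free = InvalidBlockNumber;
    }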
* Improve the naming in wal_sync_method code.  (Nathan Bossart, 2023-10-13)
    * sync_method is renamed to wal_sync_method.
    * sync_method_options[] is renamed to wal_sync_method_options[].
    * assign_xlog_sync_method() is renamed to assign_wal_sync_method().
    * The names of the available synchronization methods are now prefixed with "WAL_SYNC_METHOD_" and have been moved into a WalSyncMethod enum.
    * PLATFORM_DEFAULT_SYNC_METHOD is renamed to PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is renamed to DEFAULT_WAL_SYNC_METHOD.

    These more descriptive names help distinguish the code for wal_sync_method from the code for DataDirSyncMethod (e.g., the recovery_init_sync_method configuration parameter and the --sync-method option provided by several frontend utilities). This change also prevents name collisions between the aforementioned sets of code. Since this only improves the naming of internal identifiers, there should be no behavior change.

    Author: Maxim Orlov
    Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com
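For reference, the renamed enum plausibly looks something like the sketch below; the member list is an assumption based on the long-standing set of WAL sync methods, not text from the commit:

    typedef enum WalSyncMethod
    {
        WAL_SYNC_METHOD_FSYNC = 0,
        WAL_SYNC_METHOD_FDATASYNC,
        WAL_SYNC_METHOD_OPEN,               /* open WAL files with O_SYNC */
        WAL_SYNC_METHOD_FSYNC_WRITETHROUGH,
        WAL_SYNC_METHOD_OPEN_DSYNC          /* open WAL files with O_DSYNC */
    } WalSyncMethod;

    extern WalSyncMethod wal_sync_method;   /* backs the wal_sync_method GUC */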
* Add wait events for checkpoint delay mechanism.  (Thomas Munro, 2023-10-13)
    When MyProc->delayChkptFlags is set to temporarily block phase transitions in a concurrent checkpoint, the checkpointer enters a sleep-poll loop to wait for the flag to be cleared. We should show that as a wait event in the pg_stat_activity view.

    Reviewed-by: Robert Haas <robertmhaas@gmail.com>
    Reviewed-by: Michael Paquier <michael@paquier.xyz>
    Discussion: https://postgr.es/m/CA%2BhUKGL7Whi8iwKbzkbn_1fixH3Yy8aAPz7mfq6Hpj7FeJrKMg%40mail.gmail.com
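A hedged sketch of what the instrumented sleep-poll loop looks like; the wait event name used here is an assumption for illustration:

    /* Wait for backends that are delaying the checkpoint start phase. */
    while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids, DELAY_CHKPT_START))
    {
        pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_START);
        pg_usleep(10000L);      /* sleep 10 ms, then poll again */
        pgstat_report_wait_end();
    }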
* Unify two isLogSwitch tests in XLogInsertRecord.  (Robert Haas, 2023-10-12)
    An upcoming patch wants to introduce an additional special case in this function. To keep that as cheap as possible, minimize the amount of branching that we do based on whether this is an XLOG_SWITCH record.

    Additionally, and also in the interest of keeping the overhead of special-case code paths as low as possible, apply likely() to the non-XLOG_SWITCH case, since only a very tiny fraction of WAL records will be XLOG_SWITCH records.

    Patch by me, reviewed by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier.

    Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
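For readers unfamiliar with likely(), PostgreSQL defines it roughly as shown below (c.h, GCC-compatible compilers), and the commit applies it so the compiler lays out the common non-switch path as the straight-line branch; the surrounding usage is a sketch, not the patch itself:

    #define likely(x)   __builtin_expect((x) != 0, 1)
    #define unlikely(x) __builtin_expect((x) != 0, 0)

    if (likely(!isLogSwitch))
    {
        /* ordinary WAL record: the hot, straight-line path */
    }
    else
    {
        /* XLOG_SWITCH: rare, so the extra work lives on the cold path */
    }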
* Reindent comment in GenericXLogFinish().  (Tom Lane, 2023-10-11)
    Restore pgindent cleanliness, per buildfarm member koel.
* Fix bug in GenericXLogFinish().  (Jeff Davis, 2023-10-10)
    Mark the buffers dirty before writing WAL.

    Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi
    Reviewed-by: Heikki Linnakangas
    Backpatch-through: 11
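The corrected ordering, sketched under the assumption of GenericXLogFinish()'s usual structure; buffer bookkeeping is simplified and the field names are illustrative:

    START_CRIT_SECTION();

    for (int i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
    {
        if (state->pages[i].buffer == InvalidBuffer)
            continue;
        /* ... apply the queued changes to the page ... */
        MarkBufferDirty(state->pages[i].buffer);        /* dirty the buffer first */
    }

    XLogRecPtr lsn = XLogInsert(RM_GENERIC_ID, 0);      /* only then emit the WAL record */

    /* ... stamp each modified page with lsn ... */

    END_CRIT_SECTION();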
* Add const to values and nulls arguments  (Peter Eisentraut, 2023-10-10)
    This excludes any changes that would change the external AM APIs.

    Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
    Discussion: https://www.postgresql.org/message-id/flat/14c31f4a-0347-0805-dce8-93a9072c05a5%40eisentraut.org
* Remove duplicate words in docs and code comments.  (Amit Kapila, 2023-10-09)
    Additionally, add a missing "the" in a couple of places.

    Author: Vignesh C, Dagfinn Ilmari Mannsåker
    Discussion: http://postgr.es/m/CALDaNm28t+wWyPfuyqEaARS810Je=dRFkaPertaLAEJYY2cWYQ@mail.gmail.com
* Fix another typo in e0b1ee17dc  (Alexander Korotkov, 2023-10-07)
    Reported-by: Richard Guo
    Discussion: https://postgr.es/m/CAMbWs4_kHMJDak75y1kBTirv-drS1-knT-7Mpg5LprAjqRJDVA%40mail.gmail.com
* Fix typos in e0b1ee17dc  (Alexander Korotkov, 2023-10-07)
    Reported-by: Alexander Lakhin
* Skip checking of scan keys required for directional scan in B-tree  (Alexander Korotkov, 2023-10-06)
    Currently, B-tree code matches every scan key to every item on the page. Imagine an ordered B-tree scan for a query like this:

        SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;

    The (col > 'a') scan key will always be matched once we find the location to start the scan. The (col < 'b') scan key will match every item on the page as long as it matches the last item on the page.

    This patch implements prechecking of the scan keys required for directional scan at the beginning of the page scan. If the precheck is successful, we can skip checking these scan keys for the items on the page. That could lead to significant acceleration, especially if the comparison operator is expensive.

    Idea from patch by Konstantin Knizhnik.

    Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru
    Reviewed-by: Peter Geoghegan, Pavel Borisov
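An illustrative sketch of the precheck idea; the identifiers and the _bt_checkkeys() signature below are assumptions, not the actual patch. The required-direction keys are tested once against the page's final item in the scan direction, and if that item satisfies them, every earlier item on the page must too, so the per-item loop can skip those keys:

    bool        continuescan;
    bool        requiredMatchedByPrecheck;

    /* precheck the last item on the page before walking the page */
    ItemId      iid = PageGetItemId(page, maxoff);
    IndexTuple  lastitup = (IndexTuple) PageGetItem(page, iid);

    requiredMatchedByPrecheck =
        _bt_checkkeys(scan, lastitup, indnatts, dir, &continuescan, false);

    for (offnum = minoff; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);
        IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);

        /* keys already known to match the whole page can be skipped here */
        if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan,
                          requiredMatchedByPrecheck))
        {
            /* tuple passes all keys: report it to the caller */
        }
    }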
* Move BuildDescForRelation() from tupdesc.c to tablecmds.c  (Peter Eisentraut, 2023-10-05)
    BuildDescForRelation()'s main job is to convert ColumnDef lists to pg_attribute/tuple descriptor arrays, which is really mostly an internal subroutine of DefineRelation() and some related functions; that is more the remit of tablecmds.c and doesn't have much to do with the basic tuple descriptor interfaces in tupdesc.c. This is also supported by observing the header includes we can remove in tupdesc.c. By moving it over, we can also (in the future) make BuildDescForRelation() use more internals of tablecmds.c that are not sensible to be exposed in tupdesc.c.

    Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da@eisentraut.org
* Push attidentity and attgenerated handling into BuildDescForRelation()  (Peter Eisentraut, 2023-10-05)
    Previously, this was handled by the callers separately, but it can be trivially moved into BuildDescForRelation() so that it is handled in a central place.

    Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
    Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da@eisentraut.org
* Tidy-up some appendStringInfo*() usages  (David Rowley, 2023-10-03)
    Make a few newish calls to appendStringInfo() which have no special formatting use appendStringInfoString() instead. Also, adjust usages of appendStringInfoString() which only append a string containing a single character to make use of appendStringInfoChar() instead.

    This makes the code marginally faster, but primarily this change is so we use the StringInfo type as it was intended to be used.

    Discussion: https://postgr.es/m/CAApHDvpXKQmL+r=VDNS98upqhr9yGBhv2Jw3GBFFk_wKHcB39A@mail.gmail.com
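A small self-contained illustration of when each of the three calls is the right tool (example code, not taken from the commit):

    #include "lib/stringinfo.h"

    static void
    describe_relation(StringInfo buf, const char *relname, unsigned blockno)
    {
        /* Plain string, no format specifiers: appendStringInfoString() skips
         * the printf machinery that appendStringInfo() would go through. */
        appendStringInfoString(buf, "relation ");
        appendStringInfoString(buf, relname);

        /* A single character is cheaper still with appendStringInfoChar(). */
        appendStringInfoChar(buf, '\n');

        /* Actual formatting is where appendStringInfo() earns its keep. */
        appendStringInfo(buf, "stopped at block %u\n", blockno);
    }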
* Fail hard on out-of-memory failures in xlogreader.c  (Michael Paquier, 2023-10-03)
    This commit changes the WAL reader routines so that a FATAL for the backend or exit(FAILURE) for the frontend is triggered if an allocation for a WAL record decode fails in walreader.c, rather than treating this case as bogus data, which would be equivalent to the end of WAL. The key is to avoid palloc_extended(MCXT_ALLOC_NO_OOM) in walreader.c, relying on plain palloc() calls.

    The previous behavior could make WAL replay finish earlier than it should. For example, crash recovery finishing earlier may corrupt clusters because not all the WAL available locally was replayed to ensure a consistent state. Out-of-memory failures would show up randomly depending on the memory pressure on the host, but one simple case would be to generate a large record, then replay this record after downsizing a host, as Ethan Mertz originally reported.

    This relies on bae868caf222, as the WAL reader routines now do the memory allocation required for a record only once its header has been fully read and validated, making xl_tot_len trustable. Making the WAL reader react differently on out-of-memory or bogus record data would require ABI changes, so this is the safest choice for stable branches. Also, it is worth noting that 3f1ce973467a has been using a plain palloc() in this code for some time now.

    Thanks to Noah Misch and Thomas Munro for the discussion. Like the other commit, backpatch down to 12, leaving out v11 that will be EOL'd soon. The behavior of considering a failed allocation as bogus data comes originally from 0ffe11abd3a0, where the record length retrieved from its header was not entirely trustable.

    Reported-by: Ethan Mertz
    Discussion: https://postgr.es/m/ZRKKdI5-RRlta3aF@paquier.xyz
    Backpatch-through: 12
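The gist of the change, sketched as a before/after; this is simplified and the variable names are illustrative, not the literal diff:

    /* Before: a failed allocation was reported like bogus data, which
     * silently ended replay. */
    decoded = (DecodedXLogRecord *)
        palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
    if (decoded == NULL)
        return false;           /* indistinguishable from end of WAL */

    /* After: plain palloc() raises FATAL (backend) or exits (frontend tools)
     * on out-of-memory instead of truncating replay. */
    decoded = (DecodedXLogRecord *) palloc(required_space);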
* Remove retry loop in heap_page_prune().  (Robert Haas, 2023-10-02)
    The retry loop is needed because heap_page_prune() calls HeapTupleSatisfiesVacuum() and then lazy_scan_prune() does the same thing again, and they might get different answers due to concurrent clog updates. But this patch makes heap_page_prune() return the HeapTupleSatisfiesVacuum() results that it computed back to the caller, which allows lazy_scan_prune() to avoid needing to recompute those values in the first place. That's nice both because it eliminates the need for a retry loop and also because it's cheaper.

    Melanie Plageman, reviewed by David Geier, Andres Freund, and me.

    Discussion: https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com
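Together with the struct-returning change a few entries below, the result handed back to lazy_scan_prune() plausibly takes a shape like this sketch; the struct and field names are assumptions for illustration:

    /* Per-page pruning results, filled in by heap_page_prune(). */
    typedef struct PruneResult
    {
        int         ndeleted;       /* number of tuples deleted from the page */
        int         nnewlpdead;     /* number of newly LP_DEAD line pointers */

        /*
         * One HeapTupleSatisfiesVacuum() result per offset number, so
         * lazy_scan_prune() can reuse them instead of recomputing them,
         * which is what lets the retry loop go away.
         */
        int8        htsv[MaxHeapTuplesPerPage + 1];
    } PruneResult;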
* Add rmgrdesc README  (Heikki Linnakangas, 2023-10-02)
    In the README, briefly explain what rmgrdesc functions are, and why they are in a separate directory. Commit c03c2eae0a added some guidelines on the preferred output format; move that to the README too.

    Reviewed-by: Melanie Plageman, Peter Geoghegan
    Discussion: https://www.postgresql.org/message-id/9159daf7-f42d-781b-458f-1b2cf32cb256%40iki.fi
* Correct assertion and comments about XLogRecordMaxSize.  (Noah Misch, 2023-10-01)
    The largest allocation, of xl_tot_len+8192, is in allocate_recordbuf().

    Discussion: https://postgr.es/m/20230812211327.GB2326466@rfd.leadboat.com
* Fix btmarkpos/btrestrpos array key wraparound bug.  (Peter Geoghegan, 2023-09-28)
    nbtree's mark/restore processing failed to correctly handle an edge case involving array key advancement and related search-type scan key state. Scans with ScalarArrayOpExpr quals requiring mark/restore processing (for a merge join) could incorrectly conclude that an affected array/scan key must not have advanced during the time between marking and restoring the scan's position.

    As a result of all this, array key handling within btrestrpos could skip a required call to _bt_preprocess_keys(). This confusion allowed later primitive index scans to overlook tuples matching the true current array keys. The scan's search-type scan keys would still have spurious values corresponding to the final array element(s) -- not values matching the first/now-current array element(s).

    To fix, remember that "array key wraparound" has taken place during the ongoing btrescan in a flag variable stored in the scan's state, and use that information at the point where btrestrpos decides if another call to _bt_preprocess_keys is required.

    Oversight in commit 70bc5833, which taught nbtree to handle array keys during mark/restore processing, but missed this subtlety. That commit was itself a bug fix for an issue in commit 9e8da0f7, which taught nbtree to handle ScalarArrayOpExpr quals natively.

    Author: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com
    Backpatch: 11- (all supported branches).
* Return data from heap_page_prune via a struct.  (Robert Haas, 2023-09-28)
    Previously, one of the values in the struct was returned as the return value, and another was returned via an output parameter. In preparation for returning more stuff, consolidate both values into a struct returned via an output parameter.

    Melanie Plageman, reviewed by Andres Freund and by me.

    Discussion: https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com
* Fix typo in src/backend/access/transam/README.  (Etsuro Fujita, 2023-09-28)
* Add TupleDescGetDefault()  (Peter Eisentraut, 2023-09-27)
    This unifies some repetitive code.

    Note: I didn't push the "not found" error message into the new function, even though all existing callers would be able to make use of it. Using the existing error handling as-is would probably require exposing the Relation type via tupdesc.h, which doesn't seem desirable. (Or even if we changed it to just report the OID, it would inject the concept of a relation containing the tuple descriptor into tupdesc.h, which might be a layering violation. Perhaps some further improvements could be considered here separately.)

    Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da%40eisentraut.org
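A sketch of what the unified helper plausibly does, close in spirit to but not necessarily identical with the committed function: look up the stored default expression for one attribute of a tuple descriptor, or return NULL if there is none.

    Node *
    TupleDescGetDefault(TupleDesc tupdesc, AttrNumber attnum)
    {
        TupleConstr *constr = tupdesc->constr;

        if (constr != NULL)
        {
            for (int i = 0; i < constr->num_defval; i++)
            {
                AttrDefault *defval = &constr->defval[i];

                if (defval->adnum == attnum)
                    return stringToNode(defval->adbin);
            }
        }

        return NULL;
    }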
* MergeAttributes() and related variable renaming  (Peter Eisentraut, 2023-09-26)
    Mainly, rename "schema" to "columns" and related changes. The previous naming has long been confusing.

    Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da%40eisentraut.org
* Fix another bug in parent page splitting during GiST index build.  (Heikki Linnakangas, 2023-09-26)
    Yet another bug in the ilk of commits a7ee7c851 and 741b88435. In 741b88435, we took care to clear the memorized location of the downlink when we split the parent page, because splitting the parent page can move the downlink. But we missed that even *updating* a tuple on the parent can move it, because updating a tuple on a gist page is implemented as a delete+insert, so the updated tuple gets moved to the end of the page.

    This commit fixes the bug in two different ways (belt and suspenders):

    1. Clear the downlink when we update a tuple on the parent page, even if it's not split. This is the same approach as in commits a7ee7c851 and 741b88435. I also noticed that gistFindCorrectParent did not clear the 'downlinkoffnum' when it stepped to the right sibling. Fix that too, as it seems like a clear bug even though I haven't been able to find a test case to hit that.

    2. Change gistFindCorrectParent so that it treats 'downlinkoffnum' merely as a hint. It now always first checks if the downlink is still at that location, and if not, it scans the page like before. That's more robust if there are still more cases where we fail to clear 'downlinkoffnum' that we haven't yet uncovered. With this, it's no longer necessary to meticulously clear 'downlinkoffnum', so this makes the previous fixes unnecessary, but I didn't revert them because it still seems nice to clear it when we know that the downlink has moved.

    Also add the test case using the same test data that Alexander posted. I tried to reduce it to a smaller test, and I also tried to reproduce this with different test data, but I was not able to, so let's just include what we have.

    Backpatch to v12, like the previous fixes.

    Reported-by: Alexander Lakhin
    Discussion: https://www.postgresql.org/message-id/18129-caca016eaf0c3702@postgresql.org
* Fix edge-case for xl_tot_len broken by bae868ca.  (Thomas Munro, 2023-09-26)
    bae868ca removed a check that was still needed. If you had an xl_tot_len at the end of a page that was too small for a record header, but not big enough to span onto the next page, we'd immediately perform the CRC check using a bogus large length. Because of arbitrary coding differences between the CRC implementations on different platforms, nothing very bad happened on common modern systems. On systems using the _sb8.c fallback we could segfault.

    Restore that check, add a new assertion and supply a test for that case. Back-patch to 12, like bae868ca.

    Tested-by: Tom Lane <tgl@sss.pgh.pa.us>
    Tested-by: Alexander Lakhin <exclusion@gmail.com>
    Discussion: https://postgr.es/m/CA%2BhUKGLCkTT7zYjzOxuLGahBdQ%3DMcF%3Dz5ZvrjSOnW4EDhVjT-g%40mail.gmail.com
* Don't trust unvalidated xl_tot_len.  (Thomas Munro, 2023-09-23)
    xl_tot_len comes first in a WAL record. Usually we don't trust it to be the true length until we've validated the record header. If the record header was split across two pages, previously we wouldn't do the validation until after we'd already tried to allocate enough memory to hold the record, which was bad because it might actually be garbage bytes from a recycled WAL file, so we could try to allocate a lot of memory. Release 15 made it worse.

    Since 70b4f82a4b5, we'd at least generate an end-of-WAL condition if the garbage 4 byte value happened to be > 1GB, but we'd still try to allocate up to 1GB of memory bogusly otherwise. That was an improvement, but unfortunately release 15 tries to allocate another object before that, so you could get a FATAL error and recovery could fail.

    We can fix both variants of the problem more fundamentally using pre-existing page-level validation, if we just re-order some logic.

    The new order of operations in the split-header case defers all memory allocation based on xl_tot_len until we've read the following page. At that point we know that its first few bytes are not recycled data, by checking its xlp_pageaddr, and that its xlp_rem_len agrees with xl_tot_len on the preceding page. That is strong evidence that xl_tot_len was truly the start of a record that was logged.

    This problem was most likely to occur on a standby, because walreceiver.c recycles WAL files without zeroing out trailing regions of each page. We could fix that too, but it wouldn't protect us from rare crash scenarios where the trailing zeroes don't make it to disk.

    With reliable xl_tot_len validation in place, the ancient policy of considering malloc failure to indicate corruption at end-of-WAL seems quite surprising, but changing that is left for later work.

    Also included is a new TAP test to exercise various cases of end-of-WAL detection by writing contrived data into the WAL from Perl.

    Back-patch to 12. We decided not to put this change into the final release of 11.

    Author: Thomas Munro <thomas.munro@gmail.com>
    Author: Michael Paquier <michael@paquier.xyz>
    Reported-by: Alexander Lakhin <exclusion@gmail.com>
    Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code)
    Reviewed-by: Michael Paquier <michael@paquier.xyz>
    Reviewed-by: Sergei Kornilov <sk@zsrv.org>
    Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
    Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org
* Fix COMMIT/ROLLBACK AND CHAIN in the presence of subtransactions.  (Tom Lane, 2023-09-21)
    In older branches, COMMIT/ROLLBACK AND CHAIN failed to propagate the current transaction's properties to the new transaction if there was any open subtransaction (unreleased savepoint). Instead, some previous transaction's properties would be restored. This is because the "if (s->chain)" check in CommitTransactionCommand examined the wrong instance of the "chain" flag and falsely concluded that it didn't need to save transaction properties.

    Our regression tests would have noticed this, except they used identical transaction properties for multiple tests in a row, so that the faulty behavior was not distinguishable from correct behavior.

    Commit 12d768e70 fixed the problem in v15 and later, but only rather accidentally, because I removed the "if (s->chain)" test to avoid a compiler warning, while not realizing that the warning was flagging a real bug.

    In v14 and before, remove the if-test and save transaction properties unconditionally; just as in the newer branches, that's not expensive enough to justify thinking harder. Add the comment and extra regression test to v15 and later to forestall any future recurrence, but there's no live bug in those branches.

    Patch by me, per bug #18118 from Liu Xiang. Back-patch to v12 where the AND CHAIN feature was added.

    Discussion: https://postgr.es/m/18118-4b72fcbb903aace6@postgresql.org
* Fix GiST README's explanation of the NSN cross-check.  (Heikki Linnakangas, 2023-09-19)
    The text got the condition backwards, it's "NSN > LSN", not "NSN < LSN". While we're at it, expand it a little for clarity.

    Reviewed-by: Daniel Gustafsson
    Discussion: https://www.postgresql.org/message-id/4cb46e18-e688-524a-0f73-b1f03ed5d6ee@iki.fi
* Rename variable for code clarity  (Daniel Gustafsson, 2023-09-15)
    When tracking IO timing for WAL, the duration is what we calculate based on the start and end timestamps; it's not what the variable contains. Rename the timestamp variable to "end" to better communicate what it contains.

    Original patch by Krishnakumar with additional hacking to fix another occurrence by me.

    Author: Krishnakumar R <kksrcv001@gmail.com>
    Discussion: https://postgr.es/m/CAPMWgZ9f9o8awrQpjo8oxnNQ=bMDVPx00NE0QcDzvHD_ZrdLPw@mail.gmail.com
* Remove unnecessary smgrimmedsync() when creating unlogged table.  (Heikki Linnakangas, 2023-09-15)
    This became safe after commit 4b4798e138. The smgrcreate() call will now register the segment for syncing at the next checkpoint, so we don't need to sync it here. If a checkpoint happens before the creation is WAL-logged, the records will be replayed when starting recovery from the checkpoint. If a checkpoint happens after the WAL logging, the checkpoint will fsync() it.

    In passing, clarify a comment in smgrDoPendingSyncs().

    Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9%40iki.fi
    Reviewed-by: Robert Haas
* Quote filenames in error messages  (Daniel Gustafsson, 2023-09-14)
    The majority of all filenames are quoted in user-facing error and log messages, but a few were still printed without quotes. While these filenames do not risk causing any ambiguity as their format is strict, quote them anyway to be consistent across all logs. Also concatenate a message to keep it on one line, to make it easier to grep for in the code.

    Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
    Reviewed-by: Michael Paquier <michael@paquier.xyz>
    Discussion: https://postgr.es/m/080EEABE-6645-4A46-AB20-6285ADAC44FE@yesql.se
* Flush logical slots to disk during a shutdown checkpoint if required.  (Amit Kapila, 2023-09-14)
    It's entirely possible for a logical slot to have a confirmed_flush LSN higher than the last value saved on disk while not being marked as dirty. Currently, it is not a major problem but a later patch adding support for the upgrade of slots relies on that value being properly flushed to disk.

    It can also help avoid processing the same transactions again in some boundary cases after the clean shutdown and restart. Say, we process some transactions for which we didn't send anything downstream (the changes got filtered) but the confirm_flush LSN is updated due to keepalives. As we don't flush the latest value of confirm_flush LSN, it may lead to processing the same changes again without this patch.

    The approach taken by this patch has been suggested by Ashutosh Bapat.

    Author: Vignesh C, Julien Rouhaud, Kuroda Hayato
    Reviewed-by: Amit Kapila, Dilip Kumar, Michael Paquier, Ashutosh Bapat, Peter Smith, Hou Zhijie
    Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
    Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
* Remove some more "snapshot too old" vestiges.  (Thomas Munro, 2023-09-08)
    Commit f691f5b8 removed the logic, but left behind some now-useless Snapshot arguments to various AM-internal functions, and missed a couple of comments.

    Reported-by: Peter Geoghegan <pg@bowt.ie>
    Discussion: https://postgr.es/m/CAH2-Wznj9qSNXZ1P1uWTUD_FeaTezbUazb416EPwi4Qr_jR_6A%40mail.gmail.com
* Remove the "snapshot too old" feature.  (Thomas Munro, 2023-09-05)
    Remove the old_snapshot_threshold setting and mechanism for producing the error "snapshot too old", originally added by commit 848ef42b. Unfortunately it had a number of known problems in terms of correctness and performance, mostly reported by Andres in the course of his work on snapshot scalability. We agreed to remove it, after a long period without an active plan to fix it. This is certainly a desirable feature, and someone might propose a new or improved implementation in the future.

    Reported-by: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/CACG%3DezYV%2BEvO135fLRdVn-ZusfVsTY6cH1OZqWtezuEYH6ciQA%40mail.gmail.com
    Discussion: https://postgr.es/m/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de
    Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
* Report syncscan position at end of scan.  (Heikki Linnakangas, 2023-08-31)
    The comment in heapgettup_advance_block() says that it reports the scan position before checking for end of scan, but that didn't match the code. The code was refactored in commit 7ae0ab0ad9, which inadvertently changed the order of the check and reporting. Change it back.

    This caused a few regression test failures with a small shared_buffers setting like 10 MB. The 'portals' and 'cluster' tests perform seqscans that are large enough that sync seqscans kick in. When the sync scan position is not updated at end of scan, the next seq scan doesn't start at the beginning of the table, and the test queries are sensitive to that.

    Reviewed-by: Melanie Plageman, David Rowley
    Discussion: https://www.postgresql.org/message-id/6f991389-ae22-d844-a9d8-9aceb7c01a9a@iki.fi
    Backpatch-through: 16
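A simplified sketch of the restored ordering in heapgettup_advance_block(); the surrounding details are trimmed and the exact flag and field handling is an assumption. The point is that the position is reported before the end-of-scan test:

    /* advance, wrapping around at the end of the relation */
    block++;
    if (block >= scan->rs_nblocks)
        block = 0;

    /* report our position so that a new synchronized seqscan can join us */
    if (scan->rs_base.rs_flags & SO_ALLOW_SYNC)
        ss_report_location(scan->rs_base.rs_rd, block);

    /* only now check whether we have come back to the starting block */
    if (block == scan->rs_startblock)
        return InvalidBlockNumber;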
* Tweak pg_promote() to report failures on kill() or postmaster failures  (Michael Paquier, 2023-08-29)
    Since its introduction in 10074651e335, pg_promote() has been returning a false status in three cases:
    - SIGUSR1 not sent to the postmaster process.
    - Postmaster death during standby promotion.
    - Standby not promoted within the specified wait time.

    An application calling this function will have a hard time understanding what a false state returned actually means.

    Per discussion, this switches the first two states to fail rather than return a "false" status, making the second case more consistent with the existing CHECK_FOR_INTERRUPTS in the wait loop. False is only returned when the promotion is not completed within the specified time (60s by default).

    Author: Ashutosh Sharma
    Reviewed-by: Fujii Masao, Laurenz Albe, Michael Paquier
    Discussion: https://postgr.es/m/CAE9k0P=QTrwptL0t4J0fuBRDDjgsT-0PVKd-ikd96i1hyL7Bcg@mail.gmail.com
* Make error messages about WAL segment size more consistent  (Peter Eisentraut, 2023-08-28)
    Make the primary messages more compact and make the detail messages uniform. In initdb.c and pg_resetwal.c, use the newish option_parse_int() to simplify some of the option parsing.

    For the backend GUC wal_segment_size, add a GUC check hook to do the verification instead of coding it in bootstrap.c. This might be overkill, but that way the check is in the right place and it becomes more self-documenting.

    In passing, make pg_controldata use the logging API for warning messages.

    Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
    Discussion: https://www.postgresql.org/message-id/flat/9939aa8a-d7be-da2c-7715-0a0b5535a1f7@eisentraut.org
* Fix _bt_allequalimage() call within critical section.  (Heikki Linnakangas, 2023-08-23)
    _bt_allequalimage() does complicated things, so it's not OK to call it in a critical section. Per buildfarm failure on 'prion', which uses -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options.

    Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi
    Backpatch-through: 16, like commit ccadf73163 that introduced this
* Use the buffer cache when initializing an unlogged index.  (Heikki Linnakangas, 2023-08-23)
    Some of the ambuildempty functions used smgrwrite() directly, followed by smgrimmedsync(). A few small problems with that:

    Firstly, one is supposed to use smgrextend() when extending a relation, not smgrwrite(). It doesn't make much difference in production builds. smgrextend() updates the relation size cache, so you miss that, but that's harmless because we never use the cached relation size of an init fork. But if you compile with CHECK_WRITE_VS_EXTEND, you get an assertion failure.

    Secondly, the smgrwrite() calls were performed before WAL-logging, so the page image written to disk had 0/0 as the LSN, not the LSN of the WAL record. That's also harmless in practice, but seems sloppy.

    Thirdly, it's better to use the buffer cache, because then you don't need to smgrimmedsync() the relation to disk, which adds latency. Bypassing the cache makes sense for bulk operations like index creation, but not when you're just initializing an empty index. Creation of unlogged tables is hardly a performance bottleneck in any real-world application, but nevertheless.

    Backpatch to v16, but no further. These issues should be harmless in practice, so better to not rock the boat in older branches.

    Reviewed-by: Robert Haas
    Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi
* Introduce macros for protocol characters.  (Nathan Bossart, 2023-08-22)
    This commit introduces descriptively-named macros for the identifiers used in wire protocol messages. These new macros are placed in a new header file so that they can be easily used by third-party code.

    Author: Dave Cramer
    Reviewed-by: Alvaro Herrera, Tatsuo Ishii, Peter Smith, Robert Haas, Tom Lane, Peter Eisentraut, Michael Paquier
    Discussion: https://postgr.es/m/CADK3HHKbBmK-PKf1bPNFoMC%2BoBt%2BpD9PH8h5nvmBQskEHm-Ehw%40mail.gmail.com
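The new header contains definitions along these lines; the byte values are the long-standing wire-protocol message characters, while the exact macro names shown here should be treated as assumptions rather than a quote of the header:

    /* messages sent by the frontend */
    #define PqMsg_Query             'Q'
    #define PqMsg_Parse             'P'
    #define PqMsg_Bind              'B'
    #define PqMsg_Execute           'E'
    #define PqMsg_Terminate         'X'

    /* messages sent by the backend */
    #define PqMsg_RowDescription    'T'
    #define PqMsg_DataRow           'D'
    #define PqMsg_ErrorResponse     'E'
    #define PqMsg_ReadyForQuery     'Z'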
* ExtendBufferedWhat -> BufferManagerRelation.  (Thomas Munro, 2023-08-23)
    Commit 31966b15 invented a way for functions dealing with relation extension to accept a Relation in online code and an SMgrRelation in recovery code. It seems highly likely that future bufmgr.c interfaces will face the same problem, and need to do something similar. Generalize the names so that each interface doesn't have to re-invent the wheel.

    Back-patch to 16. Since extension AM authors might start using the constructor macros once 16 ships, we agreed to do the rename in 16 rather than waiting for 17.

    Reviewed-by: Peter Geoghegan <pg@bowt.ie>
    Reviewed-by: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com
* Cache by-reference missing values in a long lived context  (Andrew Dunstan, 2023-08-22)
    Attribute missing values might be needed past the lifetime of the tuple descriptors from which they are extracted. To avoid possibly using pointers for by-reference values which might thus be left dangling, we cache a datumCopy'd version of the datum in the TopMemoryContext. Since we first search for the value this only needs to be done once per session for any such value.

    Original complaint from Tom Lane, idea for mitigation by Andrew Dunstan, tweaked by Tom Lane.

    Backpatch to version 11 where missing values were introduced.

    Discussion: https://postgr.es/m/1306569.1687978174@sss.pgh.pa.us
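A sketch of the caching idea (an illustrative helper, not the committed code): a by-reference attmissingval datum gets datumCopy'd into TopMemoryContext the first time it is looked up, so later references cannot dangle once the originating tuple descriptor is gone.

    static Datum
    cache_missing_value(Form_pg_attribute att, Datum value)
    {
        MemoryContext oldcxt;
        Datum       cached;

        if (att->attbyval)
            return value;       /* pass-by-value datums need no copy */

        oldcxt = MemoryContextSwitchTo(TopMemoryContext);
        cached = datumCopy(value, att->attbyval, att->attlen);
        MemoryContextSwitchTo(oldcxt);

        return cached;
    }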
* hio: Take number of prior relation extensions into account  (Andres Freund, 2023-08-14)
    The new relation extension logic, introduced in 00d1e02be24, could lead to slowdowns in some scenarios. E.g., when loading narrow rows into a table using COPY, the caller of RelationGetBufferForTuple() will only request a small number of pages. Without concurrency, we just extended using pwritev() in that case. However, if there is *some* concurrency, we switched between extending by a small number of pages and a larger number of pages, depending on the number of waiters for the relation extension logic. However, some filesystems, XFS in particular, do not perform well when switching between extending files using fallocate() and pwritev().

    To avoid that issue, remember the number of prior relation extensions in BulkInsertState and extend more aggressively if there were prior relation extensions. That not just avoids the aforementioned slowdown, but also leads to noticeable performance gains in other situations, primarily due to extending more aggressively when there is no concurrency. I should have done it this way from the get go.

    Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
    Author: Andres Freund <andres@anarazel.de>
    Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com
    Backpatch: 16-, where the new relation extension code was added
* Fix off-by-one in XLogRecordMaxSize check.  (Noah Misch, 2023-08-12)
    pg_logical_emit_message(false, '_', repeat('x', 1069547465)) failed with self-contradictory message "WAL record would be 1069547520 bytes (of maximum 1069547520 bytes)". There's no particular benefit from allowing or denying one byte in either direction; XLogRecordMaxSize could rise a few megabytes without trouble. Hence, this is just for cleanliness.

    Back-patch to v16, where this check first appeared.
* Document more assumptions of LWLock variable changes with WAL inserts  (Michael Paquier, 2023-07-26)
    This commit adds a few comments about what LWLockWaitForVar() relies on when a backend waits for a variable update on its LWLocks for WAL insertions up to an expected LSN.

    First, LWLockWaitForVar() does not include a memory barrier, relying on a spinlock taken at the beginning of WaitXLogInsertionsToFinish(). This was hidden behind two layers of routines in lwlock.c. This assumption is now documented at the top of LWLockWaitForVar(), and detailed a bit more within LWLockConflictsWithVar().

    Second, document why WaitXLogInsertionsToFinish() does not include memory barriers, relying on a spinlock at its top, which is, per Andres' input, fine for two different reasons, both depending on the fact that the caller of WaitXLogInsertionsToFinish() is waiting for an LSN up to a certain value.

    This area's documentation and assumptions could be improved more in the future, but at least that's a beginning.

    Author: Bharath Rupireddy, Andres Freund
    Reviewed-by: Michael Paquier
    Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
* Optimize WAL insertion lock acquisition and release with some atomics  (Michael Paquier, 2023-07-25)
    The WAL insertion lock variable insertingAt is currently being read and written with the help of the LWLock wait list lock to avoid any read of torn values. This wait list lock can become a point of contention on highly concurrent write workloads.

    This commit switches insertingAt to a 64b atomic variable that provides torn-free reads/writes. On platforms without 64b atomic support, the fallback implementation uses spinlocks to provide the same guarantees for the values read. LWLockWaitForVar(), through LWLockConflictsWithVar(), reads the new value to check if it still needs to wait with a u64 atomic operation. LWLockUpdateVar() updates the variable before waking up the waiters with an exchange_u64 (full memory barrier). LWLockReleaseClearVar() now also uses an exchange_u64 to reset the variable. Before this commit, all these steps relied on LWLockWaitListLock() and LWLockWaitListUnlock().

    This reduces contention on the LWLock wait list lock and improves performance of highly-concurrent write workloads. Here are some numbers using pg_logical_emit_message() (HEAD at d6677b93) with various arbitrary record lengths and clients up to 1k on a rather-large machine (64 vCPUs, 512GB of RAM, 16 cores per socket, 2 sockets), in terms of TPS numbers coming from pgbench:

    message_size_b      |  16    |  64    |  256   |  1024
    --------------------+--------+--------+--------+-------
    patch_4_clients     |  83830 |  82929 |  80478 |  73131
    patch_16_clients    | 267655 | 264973 | 250566 | 213985
    patch_64_clients    | 380423 | 378318 | 356907 | 294248
    patch_256_clients   | 360915 | 354436 | 326209 | 263664
    patch_512_clients   | 332654 | 321199 | 287521 | 240128
    patch_1024_clients  | 288263 | 276614 | 258220 | 217063
    patch_2048_clients  | 252280 | 243558 | 230062 | 192429
    patch_4096_clients  | 212566 | 213654 | 205951 | 166955
    head_4_clients      |  83686 |  83766 |  81233 |  73749
    head_16_clients     | 266503 | 265546 | 249261 | 213645
    head_64_clients     | 366122 | 363462 | 341078 | 261707
    head_256_clients    | 132600 | 132573 | 134392 | 165799
    head_512_clients    | 118937 | 114332 | 116860 | 150672
    head_1024_clients   | 133546 | 115256 | 125236 | 151390
    head_2048_clients   | 137877 | 117802 | 120909 | 138165
    head_4096_clients   | 113440 | 115611 | 120635 | 114361

    Bharath has been measuring similar improvements, where the limit of the WAL insertion lock begins to be felt when more than 256 concurrent clients are involved in this specific workload.

    An extra patch has been discussed to introduce a fast-exit path in LWLockUpdateVar() when there are no waiters, still this does not influence the write-heavy workload cases discussed as there are always waiters. This will be considered separately.

    Author: Bharath Rupireddy
    Reviewed-by: Nathan Bossart, Andres Freund, Michael Paquier
    Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
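The rough shape of the change, sketched with the atomics primitives named in the message; the struct layout and call sites are paraphrased, not the literal diff:

    /* insertingAt is now a pg_atomic_uint64 inside each WAL insertion lock. */
    pg_atomic_uint64 *valptr = &WALInsertLocks[i].l.insertingAt;

    /* Reader side (LWLockConflictsWithVar): a torn-free 64-bit read. */
    uint64      insertingat = pg_atomic_read_u64(valptr);

    /* Writer side (LWLockUpdateVar): the exchange acts as a full memory
     * barrier before the waiters are woken up. */
    pg_atomic_exchange_u64(valptr, new_insert_ptr);

    /* Release side (LWLockReleaseClearVar): reset the variable the same way. */
    pg_atomic_exchange_u64(valptr, 0);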
* Fix indentation in twophase.c  (Michael Paquier, 2023-07-18)
    This has been missed in cb0cca1, noticed before buildfarm member koel has been able to complain while poking at a different patch. Like the other commit, backpatch all the way down to limit the odds of merge conflicts.

    Backpatch-through: 11
* Fix recovery of 2PC transaction during crash recovery  (Michael Paquier, 2023-07-18)
    A crash in the middle of a checkpoint with some two-phase state data already flushed to disk by this checkpoint could cause a follow-up crash recovery to recover twice the same transaction, once from what has been found in pg_twophase/ at the beginning of recovery and a second time when replaying its corresponding record.

    This would lead to FATAL failures in the startup process during recovery, where the same transaction would have a state recovered twice instead of once:
    LOG: recovering prepared transaction 731 from shared memory
    LOG: recovering prepared transaction 731 from shared memory
    FATAL: lock ExclusiveLock on object 731/0/0 is already held

    This issue is fixed by skipping the addition of any 2PC state coming from a record whose equivalent 2PC state file has already been loaded in TwoPhaseState at the beginning of recovery by restoreTwoPhaseData(), which is OK as long as the system has not reached a consistent state.

    The timing to get a messed up recovery processing is very racy, and would very unlikely happen. The thread that has reported the issue has demonstrated the bug using injection points to force a PANIC in the middle of a checkpoint.

    Issue introduced in 728bd99, so backpatch all the way down.

    Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
    Author: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
    Author: Michael Paquier
    Discussion: https://postgr.es/m/109e6994-b971-48cb-84f6-829646f18b4c.mengjuan.cmj@alibaba-inc.com
    Backpatch-through: 11
* Rename session_auth_is_superuser to current_role_is_superuser.  (Nathan Bossart, 2023-07-12)
    This variable might've been accurately named when it was added in ea886339b8, but the name hasn't been accurate since at least the introduction of SET ROLE in e5d6b91220. The corresponding documentation was fixed in eedb068c0a. This commit renames the variable accordingly.

    Suggested-by: Joseph Koshakow
    Discussion: https://postgr.es/m/CAAvxfHc-HHzONQ2oXdvhFF9ayRnidPwK%2BfVBhRzaBWYYLVQL-g%40mail.gmail.com
* Report index vacuum progress.  (Masahiko Sawada, 2023-07-11)
    This commit adds two columns, indexes_total and indexes_processed, to the pg_stat_progress_vacuum system view to show the index vacuum progress. These numbers are reported in the "vacuuming indexes" and "cleaning up indexes" phases.

    This uses the new parallel message type for progress reporting added by be06506e7.

    Bump catversion because this changes the definition of pg_stat_progress_vacuum.

    Author: Sami Imseih
    Reviewed by: Masahiko Sawada, Michael Paquier, Nathan Bossart, Andres Freund
    Discussion: https://www.postgresql.org/message-id/flat/5478DFCD-2333-401A-B2F0-0D186AB09228@amazon.com