aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* WAL log unchanged toasted replica identity key attributes.Amit Kapila2022-02-14
| | | | | | | | | | | | | | | Currently, during UPDATE, the unchanged replica identity key attributes are not logged separately because they are getting logged as part of the new tuple. But if they are stored externally then the untoasted values are not getting logged as part of the new tuple and logical replication won't be able to replicate such UPDATEs. So we need to log such attributes as part of the old_key_tuple during UPDATE. Reported-by: Haiying Tang Author: Dilip Kumar and Amit Kapila Reviewed-by: Alvaro Herrera, Haiying Tang, Andres Freund Backpatch-through: 10 Discussion: https://postgr.es/m/OS0PR01MB611342D0A92D4F4BF26C0F47FB229@OS0PR01MB6113.jpnprd01.prod.outlook.com
* fsync pg_logical/mappings in CheckPointLogicalRewriteHeap().Andres Freund2022-01-21
| | | | | | | | | | | While individual logical rewrite files were synced to disk, the directory was not. On some filesystems that could lead to loosing directory entries after a crash. Reported-By: Tom Lane <tgl@sss.pgh.pa.us> Author: Nathan Bossart <bossartn@amazon.com> Discussion: https://postgr.es/m/867F2E29-2782-4869-970E-B984C6D35A8F@amazon.com Backpatch: 10-
* Fix one-off bug causing missing commit timestamps for subtransactionsMichael Paquier2022-01-21
| | | | | | | | | | | | | | | | | | | The logic in charge of writing commit timestamps (enabled with track_commit_timestamp) for subtransactions had a one-bug bug, where it would be possible that commit timestamps go missing for the last subtransaction committed. While on it, simplify a bit the iteration logic in the loop writing the commit timestamps, as per suggestions from Kyotaro Horiguchi and Tom Lane, so as some variable initializations are not part of the loop itself. Issue introduced in 73c986a. Analyzed-by: Alex Kingsborough Author: Alex Kingsborough, Kyotaro Horiguchi Discussion: https://postgr.es/m/73A66172-4050-4F2A-B7F1-13508EDA2144@amazon.com Backpatch-through: 10
* Fix silly mistake in AssertAlvaro Herrera2022-01-04
|
* Allow special SKIP LOCKED condition in Assert()Alvaro Herrera2022-01-04
| | | | | | | | | | | | | | | | | Under concurrency, it is possible for two sessions to be merrily locking and releasing a tuple and marking it again as HEAP_XMAX_INVALID all the while a third session attempts to lock it, miserably fails at it, and then contemplates life, the universe and everything only to eventually fail an assertion that said bit is not set. Before SKIP LOCKED that was indeed a reasonable expectation, but alas! commit df630b0dd5ea falsified it. This bug is as old as time itself, and even older, if you think time begins with the oldest supported branch. Therefore, backpatch to all supported branches. Author: Simon Riggs <simon.riggs@enterprisedb.com> Discussion: https://postgr.es/m/CANbhV-FeEwMnN8yuMyss7if1ZKjOKfjcgqB26n8pqu1e=q0ebg@mail.gmail.com
* Remove assertion for replication origins in PREPARE TRANSACTIONMichael Paquier2021-12-14
| | | | | | | | | | | | | | | | | When using replication origins, pg_replication_origin_xact_setup() is an optional choice to be able to set a LSN and a timestamp to mark the origin, which would be additionally added to WAL for transaction commits or aborts (including 2PC transactions). An assertion in the code path of PREPARE TRANSACTION assumed that this data should always be set, so it would trigger when using replication origins without setting up an origin LSN. Some tests are added to cover more this kind of scenario. Oversight in commit 1eb6d65. Per discussion with Amit Kapila and Masahiko Sawada. Discussion: https://postgr.es/m/YbbBfNSvMm5nIINV@paquier.xyz Backpatch-through: 11
* Fix corruption of toast indexes with REINDEX CONCURRENTLYMichael Paquier2021-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | REINDEX CONCURRENTLY run on a toast index or a toast relation could corrupt the target indexes rebuilt, as a backend running in parallel that manipulates toast values would directly release the lock on the toast relation when its local operation is done, rather than releasing the lock once the transaction that manipulated the toast values committed. The fix done here is simple: we now hold a ROW EXCLUSIVE lock on the toast relation when saving or deleting a toast value until the transaction working on them is committed, so as a concurrent reindex happening in parallel would be able to wait for any activity and see any new rows inserted (or deleted). An isolation test is added to check after the case fixed here, which is a bit fancy by design as it relies on allow_system_table_mods to rename the toast table and its index to fixed names. This way, it is possible to reindex them directly without any dependency on the OID of the underlying relation. Note that this could not use a DO block either, as REINDEX CONCURRENTLY cannot be run in a transaction block. The test is backpatched down to 13, where it is possible, thanks to c4a7a39, to use allow_system_table_mods in a test suite. Reported-by: Alexey Ermakov Analyzed-by: Andres Freund, Noah Misch Author: Michael Paquier Reviewed-by: Nathan Bossart Discussion: https://postgr.es/m/17268-d2fb426e0895abd4@postgresql.org Backpatch-through: 12
* Fix determination of broken LSN in OVERWRITTEN_CONTRECORDAlvaro Herrera2021-11-26
| | | | | | | | | | | | | | In commit ff9f111bce24 I mixed up inconsistent definitions of the LSN of the first record in a page, when the previous record ends exactly at the page boundary. The correct LSN is adjusted to skip the WAL page header; I failed to use that when setting XLogReaderState->overwrittenRecPtr, so at WAL replay time VerifyOverwriteContrecord would refuse to let replay continue past that record. Backpatch to 10. 9.6 also contains this bug, but it's no longer being maintained. Discussion: https://postgr.es/m/45597.1637694259@sss.pgh.pa.us
* Fix parallel operations that prevent oldest xmin from advancing.Amit Kapila2021-11-19
| | | | | | | | | | | | | | | | | While determining xid horizons, we skip over backends that are running Vacuum. We also ignore Create Index Concurrently, or Reindex Concurrently for the purposes of computing Xmin for Vacuum. But we were not setting the flags corresponding to these operations when they are performed in parallel which was preventing Xid horizon from advancing. The optimization related to skipping Create Index Concurrently, or Reindex Concurrently operations was implemented in PG-14 but the fix is the same for the Parallel Vacuum as well so back-patched till PG-13. Author: Masahiko Sawada Reviewed-by: Amit Kapila Backpatch-through: 13 Discussion: https://postgr.es/m/CAD21AoCLQqgM1sXh9BrDFq0uzd3RBFKi=Vfo6cjjKODm0Onr5w@mail.gmail.com
* Report any XLogReadRecord() error in XlogReadTwoPhaseData().Noah Misch2021-11-11
| | | | | | | | | | Buildfarm members kittiwake and tadarida have witnessed errors at this site. The site discarded key facts. Back-patch to v10 (all supported versions). Reviewed by Michael Paquier and Tom Lane. Discussion: https://postgr.es/m/20211107013157.GB790288@rfd.leadboat.com
* Fix typoAlvaro Herrera2021-11-08
| | | | | | | Introduced in 1d97d3d0867f. Co-authored-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/83641f59-d566-b33e-ef21-a272a98675aa@gmail.com
* Fix ordering of items in nbtree error message.Peter Geoghegan2021-10-27
| | | | | | Oversight in commit a5213adf. Backpatch: 13-, just like commit a5213adf.
* Further harden nbtree posting split code.Peter Geoghegan2021-10-27
| | | | | | | | | | | Add more defensive checks around posting list split code. These should detect corruption involving duplicate table TIDs earlier and more reliably than any existing check. Follow up to commit 8f72bbac. Discussion: https://postgr.es/m/CAH2-WzkrSY_kjyd1_M5xJK1uM0govJXMxPn8JUSvwcUOiHuWVw@mail.gmail.com Backpatch: 13-, where nbtree deduplication was introduced.
* Fix CREATE INDEX CONCURRENTLY for the newest prepared transactions.Noah Misch2021-10-23
| | | | | | | | | | | | | | | | | | | The purpose of commit 8a54e12a38d1545d249f1402f66c8cde2837d97c was to fix this, and it sufficed when the PREPARE TRANSACTION completed before the CIC looked for lock conflicts. Otherwise, things still broke. As before, in a cluster having used CIC while having enabled prepared transactions, queries that use the resulting index can silently fail to find rows. It may be necessary to reindex to recover from past occurrences; REINDEX CONCURRENTLY suffices. Fix this for future index builds by making CIC wait for arbitrarily-recent prepared transactions and for ordinary transactions that may yet PREPARE TRANSACTION. As part of that, have PREPARE TRANSACTION transfer locks to its dummy PGPROC before it calls ProcArrayClearTransaction(). Back-patch to 9.6 (all supported versions). Andrey Borodin, reviewed (in earlier versions) by Andres Freund. Discussion: https://postgr.es/m/01824242-AA92-4FE9-9BA7-AEBAFFEA3D0C@yandex-team.ru
* Reset properly snapshot export state during transaction abortMichael Paquier2021-10-18
| | | | | | | | | | | | | | | | | | | | | | | | During a replication slot creation, an ERROR generated in the same transaction as the one creating a to-be-exported snapshot would have left the backend in an inconsistent state, as the associated static export snapshot state was not being reset on transaction abort, but only on the follow-up command received by the WAL sender that created this snapshot on replication slot creation. This would trigger inconsistency failures if this session tried to export again a snapshot, like during the creation of a replication slot. Note that a snapshot export cannot happen in a transaction block, so there is no need to worry resetting this state for subtransaction aborts. Also, this inconsistent state would very unlikely show up to users. For example, one case where this could happen is an out-of-memory error when building the initial snapshot to-be-exported. Dilip found this problem while poking at a different patch, that caused an error in this code path for reasons unrelated to HEAD. Author: Dilip Kumar Reviewed-by: Michael Paquier, Zhihong Yu Discussion: https://postgr.es/m/CAFiTN-s0zA1Kj0ozGHwkYkHwa5U0zUE94RSc_g81WrpcETB5=w@mail.gmail.com Backpatch-through: 9.6
* Fix snapshot builds during promotion of hot standby node with 2PCMichael Paquier2021-10-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some specific logic is done at the end of recovery when involving 2PC transactions: 1) Call RecoverPreparedTransactions(), to recover the state of 2PC transactions into memory (re-acquire locks, etc.). 2) ShutdownRecoveryTransactionEnvironment(), to move back to normal operations, mainly cleaning up recovery locks and KnownAssignedXids (including any 2PC transaction tracked previously). 3) Switch XLogCtl->SharedRecoveryState to RECOVERY_STATE_DONE, which is the tipping point for any process calling RecoveryInProgress() to check if the cluster is still in recovery or not. Any snapshot taken between steps 2) and 3) would be empty, causing any transaction relying on a snapshot at this point to potentially corrupt data as there could still be some 2PC transactions to track, with RecentXmin moving backwards on successive calls to GetSnapshotData() in the same transaction. As SharedRecoveryState is the point to take into account to know if it is safe to discard KnownAssignedXids, this commit moves step 2) after step 3), so as we can never finish with empty snapshots. This exists since the introduction of hot standby, so backpatch all the way down. The window with incorrect snapshots is extremely small, but I have seen it when running 023_pitr_prepared_xact.pl, as did buildfarm member fairywren. Thomas Munro also found it independently. Special thanks to Andres Freund for taking the time to analyze this issue. Reported-by: Thomas Munro, Michael Paquier Analyzed-by: Andres Freund Discussion: https://postgr.es/m/20210422203603.fdnh3fu2mmfp2iov@alap3.anarazel.de Backpatch-through: 9.6
* Fix Portal snapshot tracking to handle subtransactions properly.Tom Lane2021-10-01
| | | | | | | | | | | | | | | | | | | | | | | Commit 84f5c2908 forgot to consider the possibility that EnsurePortalSnapshotExists could run inside a subtransaction with lifespan shorter than the Portal's. In that case, the new active snapshot would be popped at the end of the subtransaction, leaving a dangling pointer in the Portal, with mayhem ensuing. To fix, make sure the ActiveSnapshot stack entry is marked with the same subtransaction nesting level as the associated Portal. It's certainly safe to do so since we won't be here at all unless the stack is empty; hence we can't create an out-of-order stack. Let's also apply this logic in the case where PortalRunUtility sets portalSnapshot, just to be sure that path can't cause similar problems. It's slightly less clear that that path can't create an out-of-order stack, so add an assertion guarding it. Report and patch by Bertrand Drouvot (with kibitzing by me). Back-patch to v11, like the previous commit. Discussion: https://postgr.es/m/ff82b8c5-77f4-3fe7-6028-fcf3303e82dd@amazon.com
* Fix WAL replay in presence of an incomplete recordAlvaro Herrera2021-09-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Physical replication always ships WAL segment files to replicas once they are complete. This is a problem if one WAL record is split across a segment boundary and the primary server crashes before writing down the segment with the next portion of the WAL record: WAL writing after crash recovery would happily resume at the point where the broken record started, overwriting that record ... but any standby or backup may have already received a copy of that segment, and they are not rewinding. This causes standbys to stop following the primary after the latter crashes: LOG: invalid contrecord length 7262 at A8/D9FFFBC8 because the standby is still trying to read the continuation record (contrecord) for the original long WAL record, but it is not there and it will never be. A workaround is to stop the replica, delete the WAL file, and restart it -- at which point a fresh copy is brought over from the primary. But that's pretty labor intensive, and I bet many users would just give up and re-clone the standby instead. A fix for this problem was already attempted in commit 515e3d84a0b5, but it only addressed the case for the scenario of WAL archiving, so streaming replication would still be a problem (as well as other things such as taking a filesystem-level backup while the server is down after having crashed), and it had performance scalability problems too; so it had to be reverted. This commit fixes the problem using an approach suggested by Andres Freund, whereby the initial portion(s) of the split-up WAL record are kept, and a special type of WAL record is written where the contrecord was lost, so that WAL replay in the replica knows to skip the broken parts. With this approach, we can continue to stream/archive segment files as soon as they are complete, and replay of the broken records will proceed across the crash point without a hitch. Because a new type of WAL record is added, users should be careful to upgrade standbys first, primaries later. Otherwise they risk the standby being unable to start if the primary happens to write such a record. A new TAP test that exercises this is added, but the portability of it is yet to be seen. This has been wrong since the introduction of physical replication, so backpatch all the way back. In stable branches, keep the new XLogReaderState members at the end of the struct, to avoid an ABI break. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Nathan Bossart <bossartn@amazon.com> Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
* Remove overzealous index deletion assertion.Peter Geoghegan2021-09-20
| | | | | | | | | | | | | | | | | | | | A broken HOT chain is not an unexpected condition, even when the offset number points past the end of the page's line pointer array. heap_prune_chain() does not (and never has) treated this condition as unexpected, so derivative code in heap_index_delete_tuples() shouldn't do so either. Oversight in commit 4228817449. The assertion can probably only fail on Postgres 14 and master. Earlier releases don't have commit 3c3b8a4b, which taught VACUUM to truncate the line pointer array of heap pages. Backpatch all the same, just to be consistent. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/17197-9438f31f46705182@postgresql.org Backpatch: 12-, just like commit 4228817449.
* Send NOTIFY signals during CommitTransaction.Tom Lane2021-09-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Formerly, we sent signals for outgoing NOTIFY messages within ProcessCompletedNotifies, which was also responsible for sending relevant ones of those messages to our connected client. It therefore had to run during the main-loop processing that occurs just before going idle. This arrangement had two big disadvantages: * Now that procedures allow intra-command COMMITs, it would be useful to send NOTIFYs to other sessions immediately at COMMIT (though, for reasons of wire-protocol stability, we still shouldn't forward them to our client until end of command). * Background processes such as replication workers would not send NOTIFYs at all, since they never execute the client communication loop. We've had requests to allow triggers running in replication workers to send NOTIFYs, so that's a problem. To fix these things, move transmission of outgoing NOTIFY signals into AtCommit_Notify, where it will happen during CommitTransaction. Also move the possible call of asyncQueueAdvanceTail there, to ensure we don't bloat the async SLRU if a background worker sends many NOTIFYs with no one listening. We can also drop the call of asyncQueueReadAllNotifications, allowing ProcessCompletedNotifies to go away entirely. That's because commit 790026972 added a call of ProcessNotifyInterrupt adjacent to PostgresMain's call of ProcessCompletedNotifies, and that does its own call of asyncQueueReadAllNotifications, meaning that we were uselessly doing two such calls (inside two separate transactions) whenever inbound notify signals coincided with an outbound notify. We need only set notifyInterruptPending to ensure that ProcessNotifyInterrupt runs, and we're done. The existing documentation suggests that custom background workers should call ProcessCompletedNotifies if they want to send NOTIFY messages. To avoid an ABI break in the back branches, reduce it to an empty routine rather than removing it entirely. Removal will occur in v15. Although the problems mentioned above have existed for awhile, I don't feel comfortable back-patching this any further than v13. There was quite a bit of churn in adjacent code between 12 and 13. At minimum we'd have to also backpatch 51004c717, and a good deal of other adjustment would also be needed, so the benefit-to-risk ratio doesn't look attractive. Per bug #15293 from Michael Powers (and similar gripes from others). Artur Zakirov and Tom Lane Discussion: https://postgr.es/m/153243441449.1404.2274116228506175596@wrigleys.postgresql.org
* Further portability tweaks for float4/float8 hash functions.Tom Lane2021-09-04
| | | | | | | | | | Attempting to make hashfloat4() look as much as possible like hashfloat8(), I'd figured I could replace NaNs with get_float4_nan() before widening to float8. However, results from protosciurus and topminnow show that on some platforms that produces a different bit-pattern from get_float8_nan(), breaking the intent of ce773f230. Rearrange so that we use the result of get_float8_nan() for all NaN cases. As before, back-patch.
* Revert "Avoid creating archive status ".ready" files too early"Alvaro Herrera2021-09-04
| | | | | | | | | | This reverts commit 515e3d84a0b5 and equivalent commits in back branches. This solution to the problem has a number of problems, so we'll try again with a different approach. Per note from Andres Freund Discussion: https://postgr.es/m/20210831042949.52eqp5xwbxgrfank@alap3.anarazel.de
* Fix float4/float8 hash functions to produce uniform results for NaNs.Tom Lane2021-09-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The IEEE 754 standard allows a wide variety of bit patterns for NaNs, of which at least two ("NaN" and "-NaN") are pretty easy to produce from SQL on most machines. This is problematic because our btree comparison functions deem all NaNs to be equal, but our float hash functions know nothing about NaNs and will happily produce varying hash codes for them. That causes unexpected results from queries that hash a column containing different NaN values. It could also produce unexpected lookup failures when using a hash index on a float column, i.e. "WHERE x = 'NaN'" will not find all the rows it should. To fix, special-case NaN in the float hash functions, not too much unlike the existing special case that forces zero and minus zero to hash the same. I arranged for the most vanilla sort of NaN (that coming from the C99 NAN constant) to still have the same hash code as before, to reduce the risk to existing hash indexes. I dithered about whether to back-patch this into stable branches, but ultimately decided to do so. It's a clear improvement for queries that hash internally. If there is anybody who has -NaN in a hash index, they'd be well advised to re-index after applying this patch ... but the misbehavior if they don't will not be much worse than the misbehavior they had before. Per bug #17172 from Ma Liangzhu. Discussion: https://postgr.es/m/17172-7505bea9e04e230f@postgresql.org
* Report tuple address in data-corruption error messageAlvaro Herrera2021-08-30
| | | | | | | | | | | Most data-corruption reports mention the location of the problem, but this one failed to. Add it. Backpatch all the way back. In 12 and older, also assign the ERRCODE_DATA_CORRUPTED error code as was done in commit fd6ec93bf890 for 13 and later. Discussion: https://postgr.es/m/202108191637.oqyzrdtnheir@alvherre.pgsql
* Count SP-GiST index scans in pg_stat statistics.Tom Lane2021-08-27
| | | | | | | | | | | | | | | | | | | Somehow, spgist overlooked the need to call pgstat_count_index_scan(). Hence, pg_stat_all_indexes.idx_scan and equivalent columns never became nonzero for an SP-GiST index, although the related per-tuple counters worked fine. This fix works a bit differently from other index AMs, in that the counter increment occurs in spgrescan not spggettuple/spggetbitmap. It looks like this won't make the user-visible semantics noticeably different, so I won't go to the trouble of introducing an is-this- the-first-call flag just to make the counter bumps happen in the same places. Per bug #17163 from Christian Quest. Back-patch to all supported versions. Discussion: https://postgr.es/m/17163-b8c5cc88322a5e92@postgresql.org
* Fix broken snapshot handling in parallel workers.Robert Haas2021-08-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | Pengchengliu reported an assertion failure in a parallel woker while performing a parallel scan using an overflowed snapshot. The proximate cause is that TransactionXmin was set to an incorrect value. The underlying cause is incorrect snapshot handling in parallel.c. In particular, InitializeParallelDSM() was unconditionally calling GetTransactionSnapshot(), because I (rhaas) mistakenly thought that was always retrieving an existing snapshot whereas, at isolation levels less than REPEATABLE READ, it's actually taking a new one. So instead do this only at higher isolation levels where there actually is a single snapshot for the whole transaction. By itself, this is not a sufficient fix, because we still need to guarantee that TransactionXmin gets set properly in the workers. The easiest way to do that seems to be to install the leader's active snapshot as the transaction snapshot if the leader did not serialize a transaction snapshot. This doesn't affect the results of future GetTrasnactionSnapshot() calls since those have to take a new snapshot anyway; what we care about is the side effect of setting TransactionXmin. Report by Pengchengliu. Patch by Greg Nancarrow, except for some comment text which I supplied. Discussion: https://postgr.es/m/002f01d748ac$eaa781a0$bff684e0$@tju.edu.cn
* Avoid creating archive status ".ready" files too earlyAlvaro Herrera2021-08-23
| | | | | | | | | | | | | | | | | | | | | | | WAL records may span multiple segments, but XLogWrite() does not wait for the entire record to be written out to disk before creating archive status files. Instead, as soon as the last WAL page of the segment is written, the archive status file is created, and the archiver may process it. If PostgreSQL crashes before it is able to write and flush the rest of the record (in the next WAL segment), the wrong version of the first segment file lingers in the archive, which causes operations such as point-in-time restores to fail. To fix this, keep track of records that span across segments and ensure that segments are only marked ready-for-archival once such records have been completely written to disk. This has always been wrong, so backpatch all the way back. Author: Nathan Bossart <bossartn@amazon.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
* Refresh apply delay on reload of recovery_min_apply_delay at recoveryMichael Paquier2021-08-16
| | | | | | | | | | | | | | | | | This commit ensures that the wait interval in the replay delay loop waiting for an amount of time defined by recovery_min_apply_delay is correctly handled on reload, recalculating the delay if this GUC value is updated, based on the timestamp of the commit record being replayed. The previous behavior would be problematic for example with replay still waiting even if the delay got reduced or just cancelled. If the apply delay was increased to a larger value, the wait would have just respected the old value set, finishing earlier. Author: Soumyadeep Chakraborty, Ashwin Agrawal Reviewed-by: Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/CAE-ML+93zfr-HLN8OuxF0BjpWJ17O5dv1eMvSE5jsj9jpnAXZA@mail.gmail.com Backpatch-through: 9.6
* Update minimum recovery point on truncation during WAL replay of abort record.Fujii Masao2021-07-29
| | | | | | | | | | | | | | | | If a file is truncated, we must update minRecoveryPoint. Once a file is truncated, there's no going back; it would not be safe to stop recovery at a point earlier than that anymore. Commit 7bffc9b7bf changed xact_redo_commit() so that it updates minRecoveryPoint on truncation, but forgot to change xact_redo_abort(). Back-patch to all supported versions. Reported-by: mengjuan.cmj@alibaba-inc.com Author: Fujii Masao Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/b029fce3-4fac-4265-968e-16f36ff4d075.mengjuan.cmj@alibaba-inc.com
* Make the standby server promptly handle interrupt signals.Fujii Masao2021-07-25
| | | | | | | | | | | | | | | | | | | This commit changes the startup process in the standby server so that it handles the interrupt signals after waiting for wal_retrieve_retry_interval on the latch and resetting it, before entering another wait on the latch. This change causes the standby server to promptly handle interrupt signals. Otherwise, previously, there was the case where the standby needs to wait extra five seconds to shutdown when the shutdown request arrived while the startup process was waiting for wal_retrieve_retry_interval on the latch. Author: Fujii Masao, but implementation idea is from Soumyadeep Chakraborty Reviewed-by: Soumyadeep Chakraborty Discussion: https://postgr.es/m/9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com Per discussion of BUG #17073, back-patch to all supported versions. Discussion: https://postgr.es/m/17073-1a5fdaed0fa5d4d0@postgresql.org
* Advance old-segment horizon properly after slot invalidationAlvaro Herrera2021-07-16
| | | | | | | | | | | | | | | | When some slots are invalidated due to the max_slot_wal_keep_size limit, the old segment horizon should move forward to stay within the limit. However, in commit c6550776394e we forgot to call KeepLogSeg again to recompute the horizon after invalidating replication slots. In cases where other slots remained, the limits would be recomputed eventually for other reasons, but if all slots were invalidated, the limits would not move at all afterwards. Repair. Backpatch to 13 where the feature was introduced. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reported-by: Marcin Krupowicz <mk@071.ovh> Discussion: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org
* Remove dead assignment to local variable.Heikki Linnakangas2021-07-12
| | | | | | | | | | | | This should have been removed in commit 7e30c186da, which split the loop into two. Only the first loop uses the 'from' variable; updating it in the second loop is bogus. It was never read after the first loop, so this was harmless and surely optimized away by the compiler, but let's be tidy. Backpatch to all supported versions. Author: Ranier Vilela Discussion: https://www.postgresql.org/message-id/CAEudQAoWq%2BAL3BnELHu7gms2GN07k-np6yLbukGaxJ1vY-zeiQ%40mail.gmail.com
* Fix incorrect PITR message for transaction ROLLBACK PREPAREDMichael Paquier2021-06-30
| | | | | | | | | | | Reaching PITR on such a transaction would cause the generation of a LOG message mentioning a transaction committed, not aborted. Oversight in 4f1b890. Author: Simon Riggs Discussion: https://postgr.es/m/CANbhV-GJ6KijeCgdOrxqMCQ+C8QiK657EMhCy4csjrPcEUFv_Q@mail.gmail.com Backpatch-through: 9.6
* Fix race condition in TransactionGroupUpdateXidStatus().Amit Kapila2021-06-28
| | | | | | | | | | | | | | | | | | | When we cannot immediately acquire XactSLRULock in exclusive mode at commit time, we add ourselves to a list of processes that need their XIDs status update. We do this if the clog page where we need to update the current transaction status is the same as the group leader's clog page, otherwise, we allow the caller to clear it by itself. Now, when we can't add ourselves to any group, we were not clearing the current proc if it has already become a member of some group which was leading to an assertion failure when the same proc was assigned to another backend after the current backend exits. Reported-by: Alexander Lakhin Bug: 17072 Author: Amit Kapila Tested-By: Alexander Lakhin Backpatch-through: 11, where it was introduced Discussion: https://postgr.es/m/17072-2f8764857ef2c92a@postgresql.org
* Tidy up GetMultiXactIdMembers()'s behavior on errorHeikki Linnakangas2021-06-17
| | | | | | | | | | | | | | | | | | | | | | | | | One of the error paths left *members uninitialized. That's not a live bug, because most callers don't look at *members when the function returns -1, but let's be tidy. One caller, in heap_lock_tuple(), does "if (members != NULL) pfree(members)", but AFAICS it never passes an invalid 'multi' value so it should not reach that error case. The callers are also a bit inconsistent in their expectations. heap_lock_tuple() pfrees the 'members' array if it's not-NULL, others pfree() it if "nmembers >= 0", and others if "nmembers > 0". That's not a live bug either, because the function should never return 0, but add an Assert for that to make it more clear. I left the callers alone for now. I also moved the line where we set *nmembers. It wasn't wrong before, but I like to do that right next to the 'return' statement, to make it clear that it's always set on return. Also remove one unreachable return statement after ereport(ERROR), for brevity and for consistency with the similar if-block right after it. Author: Greg Nancarrow with the additional changes by me Backpatch-through: 9.6, all supported versions
* Fix outdated comment that talked about seek position of WAL file.Heikki Linnakangas2021-06-16
| | | | | | | | Since commit c24dcd0cfd, we have been using pg_pread() to read the WAL file, which doesn't change the seek position (unless we fall back to the implementation in src/port/pread.c). Update comment accordingly. Backpatch-through: 12, where we started to use pg_pread()
* Report sort phase progress in parallel btree buildAlvaro Herrera2021-06-11
| | | | | | | | | | | | | | | | | | | | | | | We were already reporting it, but only after the parallel workers were finished, which is visibly much later than what happens in a serial build. With this change we report it when the leader starts its own sort phase when participating in the build (the normal case). Now this might happen a little later than when the workers start their sorting phases, but a) communicating the actual phase start from workers is likely to be a hassle, and b) the sort phase start is pretty fuzzy anyway, since sorting per se is actually initiated by tuplesort.c internally earlier than tuplesort_performsort() is called. Backpatch to pg12, where the progress reporting code for CREATE INDEX went in. Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Greg Nancarrow <gregn4422@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/1128176d-1eee-55d4-37ca-e63644422adb
* Fix corner case failure of new standby to follow new primary.Robert Haas2021-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | This only happens if (1) the new standby has no WAL available locally, (2) the new standby is starting from the old timeline, (3) the promotion happened in the WAL segment from which the new standby is starting, (4) the timeline history file for the new timeline is available from the archive but the WAL files for are not (i.e. this is a race), (5) the WAL files for the new timeline are available via streaming, and (6) recovery_target_timeline='latest'. Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this logic and was an improvement over the previous code, but it mishandled this case. If recovery_target_timeline='latest' and restore_command is set, validateRecoveryParameters() can change recoveryTargetTLI to be different from receiveTLI. If streaming is then tried afterward, expectedTLEs gets initialized with the history of the wrong timeline. It's supposed to be a list of entries explaining how to get to the target timeline, but in this case it ends up with a list of entries explaining how to get to the new standby's original timeline, which isn't right. Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi. Discussion: http://postgr.es/m/CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com
* Avoid detoasting failure after COMMIT inside a plpgsql FOR loop.Tom Lane2021-05-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | exec_for_query() normally tries to prefetch a few rows at a time from the query being iterated over, so as to reduce executor entry/exit overhead. Unfortunately this is unsafe if we have COMMIT or ROLLBACK within the loop, because there might be TOAST references in the data that we prefetched but haven't yet examined. Immediately after the COMMIT/ROLLBACK, we have no snapshots in the session, meaning that VACUUM is at liberty to remove recently-deleted TOAST rows. This was originally reported as a case triggering the "no known snapshots" error in init_toast_snapshot(), but even if you miss hitting that, you can get "missing toast chunk", as illustrated by the added isolation test case. To fix, just disable prefetching in non-atomic contexts. Maybe there will be performance complaints prompting us to work harder later, but it's not clear at the moment that this really costs much, and I doubt we'd want to back-patch any complicated fix. In passing, adjust that error message in init_toast_snapshot() to be a little clearer about the likely cause of the problem. Patch by me, based on earlier investigation by Konstantin Knizhnik. Per bug #15990 from Andreas Wicht. Back-patch to v11 where intra-procedure COMMIT was added. Discussion: https://postgr.es/m/15990-eee2ac466b11293d@postgresql.org
* Harden nbtree deduplication posting split code.Peter Geoghegan2021-05-14
| | | | | | | | | | | | | | | | | | Add a defensive "can't happen" error to code that handles nbtree posting list splits (promote an existing assertion). This avoids a segfault in the event of an insertion of a newitem that is somehow identical to an existing non-pivot tuple in the index. An nbtree index should never have two index tuples with identical TIDs. This scenario is not particular unlikely in the event of any kind of corruption that leaves the index in an inconsistent state relative to the heap relation that is indexed. There are two known reports of preventable hard crashes. Doing nothing seems unacceptable given the general expectation that nbtree will cope reasonably well with corrupt data. Discussion: https://postgr.es/m/CAH2-Wz=Jr_d-dOYEEmwz0-ifojVNWho01eAqewfQXgKfoe114w@mail.gmail.com Backpatch: 13-, where nbtree deduplication was introduced.
* Prevent infinite insertion loops in spgdoinsert().Tom Lane2021-05-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Formerly we just relied on operator classes that assert longValuesOK to eventually shorten the leaf value enough to fit on an index page. That fails since the introduction of INCLUDE-column support (commit 09c1c6ab4), because the INCLUDE columns might alone take up more than a page, meaning no amount of leaf-datum compaction will get the job done. At least with spgtextproc.c, that leads to an infinite loop, since spgtextproc.c won't throw an error for not being able to shorten the leaf datum anymore. To fix without breaking cases that would otherwise work, add logic to spgdoinsert() to verify that the leaf tuple size is decreasing after each "choose" step. Some opclasses might not decrease the size on every single cycle, and in any case, alignment roundoff of the tuple size could obscure small gains. Therefore, allow up to 10 cycles without additional savings before throwing an error. (Perhaps this number will need adjustment, but it seems quite generous right now.) As long as we've developed this logic, let's back-patch it. The back branches don't have INCLUDE columns to worry about, but this seems like a good defense against possible bugs in operator classes. We already know that an infinite loop here is pretty unpleasant, so having a defense seems to outweigh the risk of breaking things. (Note that spgtextproc.c is actually the only known opclass with longValuesOK support, so that this is all moot for known non-core opclasses anyway.) Per report from Dilip Kumar. Discussion: https://postgr.es/m/CAFiTN-uxP_soPhVG840tRMQTBmtA_f_Y8N51G7DKYYqDh7XN-A@mail.gmail.com
* Fix query-cancel handling in spgdoinsert().Tom Lane2021-05-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Knowing that a buggy opclass could cause an infinite insertion loop, spgdoinsert() intended to allow its loop to be interrupted by query cancel. However, that never actually worked, because in iterations after the first, we'd be holding buffer lock(s) which would cause InterruptHoldoffCount to be positive, preventing servicing of the interrupt. To fix, check if an interrupt is pending, and if so fall out of the insertion loop and service the interrupt after we've released the buffers. If it was indeed a query cancel, that's the end of the matter. If it was a non-canceling interrupt reason, make use of the existing provision to retry the whole insertion. (This isn't as wasteful as it might seem, since any upper-level index tuples we already created should be usable in the next attempt.) While there's no known instance of such a bug in existing release branches, it still seems like a good idea to back-patch this to all supported branches, since the behavior is fairly nasty if a loop does happen --- not only is it uncancelable, but it will quickly consume memory to the point of an OOM failure. In any case, this code is certainly not working as intended. Per report from Dilip Kumar. Discussion: https://postgr.es/m/CAFiTN-uxP_soPhVG840tRMQTBmtA_f_Y8N51G7DKYYqDh7XN-A@mail.gmail.com
* Fix mishandling of resjunk columns in ON CONFLICT ... UPDATE tlists.Tom Lane2021-05-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's unusual to have any resjunk columns in an ON CONFLICT ... UPDATE list, but it can happen when MULTIEXPR_SUBLINK SubPlans are present. If it happens, the ON CONFLICT UPDATE code path would end up storing tuples that include the values of the extra resjunk columns. That's fairly harmless in the short run, but if new columns are added to the table then the values would become accessible, possibly leading to malfunctions if they don't match the datatypes of the new columns. This had escaped notice through a confluence of missing sanity checks, including * There's no cross-check that a tuple presented to heap_insert or heap_update matches the table rowtype. While it's difficult to check that fully at reasonable cost, we can easily add assertions that there aren't too many columns. * The output-column-assignment cases in execExprInterp.c lacked any sanity checks on the output column numbers, which seems like an oversight considering there are plenty of assertion checks on input column numbers. Add assertions there too. * We failed to apply nodeModifyTable's ExecCheckPlanOutput() to the ON CONFLICT UPDATE tlist. That wouldn't have caught this specific error, since that function is chartered to ignore resjunk columns; but it sure seems like a bad omission now that we've seen this bug. In HEAD, the right way to fix this is to make the processing of ON CONFLICT UPDATE tlists work the same as regular UPDATE tlists now do, that is don't add "SET x = x" entries, and use ExecBuildUpdateProjection to evaluate the tlist and combine it with old values of the not-set columns. This adds a little complication to ExecBuildUpdateProjection, but allows removal of a comparable amount of now-dead code from the planner. In the back branches, the most expedient solution seems to be to (a) use an output slot for the ON CONFLICT UPDATE projection that actually matches the target table, and then (b) invent a variant of ExecBuildProjectionInfo that can be told to not store values resulting from resjunk columns, so it doesn't try to store into nonexistent columns of the output slot. (We can't simply ignore the resjunk columns altogether; they have to be evaluated for MULTIEXPR_SUBLINK to work.) This works back to v10. In 9.6, projections work much differently and we can't cheaply give them such an option. The 9.6 version of this patch works by inserting a JunkFilter when it's necessary to get rid of resjunk columns. In addition, v11 and up have the reverse problem when trying to perform ON CONFLICT UPDATE on a partitioned table. Through a further oversight, adjust_partition_tlist() discarded resjunk columns when re-ordering the ON CONFLICT UPDATE tlist to match a partition. This accidentally prevented the storing-bogus-tuples problem, but at the cost that MULTIEXPR_SUBLINK cases didn't work, typically crashing if more than one row has to be updated. Fix by preserving resjunk columns in that routine. (I failed to resist the temptation to add more assertions there too, and to do some minor code beautification.) Per report from Andres Freund. Back-patch to all supported branches. Security: CVE-2021-32028
* Avoid improbable PANIC during heap_update.Tom Lane2021-04-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | heap_update needs to clear any existing "all visible" flag on the old tuple's page (and on the new page too, if different). Per coding rules, to do this it must acquire pin on the appropriate visibility-map page while not holding exclusive buffer lock; which creates a race condition since someone else could set the flag whenever we're not holding the buffer lock. The code is supposed to handle that by re-checking the flag after acquiring buffer lock and retrying if it became set. However, one code path through heap_update itself, as well as one in its subroutine RelationGetBufferForTuple, failed to do this. The end result, in the unlikely event that a concurrent VACUUM did set the flag while we're transiently not holding lock, is a non-recurring "PANIC: wrong buffer passed to visibilitymap_clear" failure. This has been seen a few times in the buildfarm since recent VACUUM changes that added code paths that could set the all-visible flag while holding only exclusive buffer lock. Previously, the flag was (usually?) set only after doing LockBufferForCleanup, which would insist on buffer pin count zero, thus preventing the flag from becoming set partway through heap_update. However, it's clear that it's heap_update not VACUUM that's at fault here. What's less clear is whether there is any hazard from these bugs in released branches. heap_update is certainly violating API expectations, but if there is no code path that can set all-visible without a cleanup lock then it's only a latent bug. That's not 100% certain though, besides which we should worry about extensions or future back-patch fixes that could introduce such code paths. I chose to back-patch to v12. Fixing RelationGetBufferForTuple before that would require also back-patching portions of older fixes (notably 0d1fe9f74), which is more code churn than seems prudent to fix a hypothetical issue. Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
* Fix potential SSI hazard in heap_update().Thomas Munro2021-04-13
| | | | | | | | | | | | | | | | | Commit 6f38d4dac38 failed to heed a warning about the stability of the value pointed to by "otid". The caller is allowed to pass in a pointer to newtup->t_self, which will be updated during the execution of the function. Instead, the SSI check should use the value we copy into oldtup.t_self near the top of the function. Not a live bug, because newtup->t_self doesn't really get updated until a bit later, but it was confusing and broke the rule established by the comment. Back-patch to 13. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/2689164.1618160085%40sss.pgh.pa.us
* Allocate access strategy in parallel VACUUM workers.Amit Kapila2021-04-12
| | | | | | | | | | | | | | | Currently, parallel vacuum workers don't use any buffer access strategy. We can fix it either by propagating the access strategy from a leader or allow each worker to use BAS_VACUUM access strategy type. We don't see much use in propagating this information from leader as both leader and workers are supposed to use the same strategy. We might want to use a different access strategy for leader and workers but that would be a separate optimization not suitable for back-branches. This has been fixed in HEAD as commit f6b8f19a08. Author: Amit Kapila Reviewed-by: Sawada Masahiko, Bharath Rupireddy Discussion: https://postgr.es/m/CAA4eK1KbmJgRV2W3BbzRnKUSrukN7SbqBBriC4RDB5KBhopkGQ@mail.gmail.com
* Don't add non-existent pages to bitmap from BRINTomas Vondra2021-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code in bringetbitmap() simply added the whole matching page range to the TID bitmap, as determined by pages_per_range, even if some of the pages were beyond the end of the heap. The query then might fail with an error like this: ERROR: could not open file "base/20176/20228.2" (target block 262144): previous segment is only 131021 blocks In this case, the relation has 262093 pages (131072 and 131021 pages), but we're trying to acess block 262144, i.e. first block of the 3rd segment. At that point _mdfd_getseg() notices the preceding segment is incomplete, and fails. Hitting this in practice is rather unlikely, because: * Most indexes use power-of-two ranges, so segments and page ranges align perfectly (segment end is also a page range end). * The table size has to be just right, with the last segment being almost full - less than one page range from full segment, so that the last page range actually crosses the segment boundary. * Prefetch has to be enabled. The regular page access checks that pages are not beyond heap end, but prefetch does not. On older releases (before 12) the execution stops after hitting the first non-existent page, so the prefetch distance has to be sufficient to reach the first page in the next segment to trigger the issue. Since 12 it's enough to just have prefetch enabled, the prefetch distance does not matter. Fixed by not adding non-existent pages to the TID bitmap. Backpatch all the way back to 9.6 (BRIN indexes were introduced in 9.5, but that release is EOL). Backpatch-through: 9.6
* Fix more confusion in SP-GiST.Tom Lane2021-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | spg_box_quad_leaf_consistent unconditionally returned the leaf datum as leafValue, even though in its usage for poly_ops that value is of completely the wrong type. In versions before 12, that was harmless because the core code did nothing with leafValue in non-index-only scans ... but since commit 2a6368343, if we were doing a KNN-style scan, spgNewHeapItem would unconditionally try to copy the value using the wrong datatype parameters. Said copying is a waste of time and space if we're not going to return the data, but it accidentally failed to fail until I fixed the datatype confusion in ac9099fc1. Hence, change spgNewHeapItem to not copy the datum unless we're actually going to return it later. This saves cycles and dodges the question of whether lossy opclasses are returning the right type. Also change spg_box_quad_leaf_consistent to not return data that might be of the wrong type, as insurance against somebody introducing a similar bug into the core code in future. It seems like a good idea to back-patch these two changes into v12 and v13, although I'm afraid to change spgNewHeapItem's mistaken idea of which datatype to use in those branches. Per buildfarm results from ac9099fc1. Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
* Fix bug in WAL replay of COMMIT_TS_SETTS record.Fujii Masao2021-03-25
| | | | | | | | | | | | | | | | | | Previously the WAL replay of COMMIT_TS_SETTS record called TransactionTreeSetCommitTsData() with the argument write_xlog=true, which generated and wrote new COMMIT_TS_SETTS record. This should not be acceptable because it's during recovery. This commit fixes the WAL replay of COMMIT_TS_SETTS record so that it calls TransactionTreeSetCommitTsData() with write_xlog=false and doesn't generate new WAL during recovery. Back-patch to all supported branches. Reported-by: lx zou <zoulx1982@163.com> Author: Fujii Masao Reviewed-by: Alvaro Herrera Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
* Avoid possible crash while finishing up a heap rewrite.Tom Lane2021-03-23
| | | | | | | | | | | | | | | | | | | | end_heap_rewrite was not careful to ensure that the target relation is open at the smgr level before performing its final smgrimmedsync. In ordinary cases this is no problem, because it would have been opened earlier during the rewrite. However a crash can be reproduced by re-clustering an empty table with CLOBBER_CACHE_ALWAYS enabled. Although that exact scenario does not crash in v13, I think that's a chance result of unrelated planner changes, and the problem is likely still reachable with other test cases. The true proximate cause of this failure is commit c6b92041d, which replaced a call to heap_sync (which was careful about opening smgr) with a direct call to smgrimmedsync. Hence, back-patch to v13. Amul Sul, per report from Neha Sharma; cosmetic changes and test case by me. Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com